Reinforcement Learning – The balance between exploration and exploitation

* Reinforcement learning studies how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

* The focus is finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

* The environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms build on dynamic programming techniques. Unlike classical dynamic programming, reinforcement learning methods do not assume knowledge of an exact mathematical model of the MDP, and they target large MDPs where exact methods become infeasible.
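
The exploration/exploitation balance described above can be illustrated with an epsilon-greedy strategy on a multi-armed bandit. This is only a minimal sketch, not a method taken from this post: the function name `epsilon_greedy_bandit`, the Gaussian reward model, and the default parameter values are all assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy on a Gaussian bandit (hypothetical setup).

    true_means: assumed mean reward of each arm, hidden from the agent.
    epsilon:    probability of exploring a random arm on each step.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm regardless of what we currently believe.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward, unit std dev
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


if __name__ == "__main__":
    estimates, counts, total = epsilon_greedy_bandit([0.2, 0.5, 1.5])
    print("estimated means:", [round(e, 2) for e in estimates])
    print("pulls per arm:  ", counts)
```

With `epsilon=0` the agent only exploits and can lock onto a poor arm early; with `epsilon=1` it only explores and never uses what it has learned. Values in between trade one against the other.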

Example:
An agent takes an action in an environment. The outcome is interpreted into a reward and a representation of the new state, both of which are fed back into the agent.
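
The feedback loop in this example can be sketched in a few lines of Python. The code below uses tabular Q-learning, one common model-free method, on a tiny corridor environment. Everything here is an assumption made for illustration: the `ChainEnv` class, its `reset`/`step` interface, the reward of 1 at the last state, and the hyperparameter values are hypothetical and do not come from any particular library.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor: start at state 0, goal at the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Q-learning; the agent never sees the environment's transition model."""
    rng = random.Random(seed)
    env = ChainEnv()
    q = [[0.0, 0.0] for _ in range(env.n_states)]  # q[state][action]

    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                # Exploit, breaking ties at random so early episodes still move.
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])

            # The environment returns the reward and the new state to the agent.
            next_state, reward, done = env.step(action)

            # Move the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])

            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    for s, (left, right) in enumerate(train()):
        print(f"state {s}: left={left:.3f} right={right:.3f}")
```

The agent only ever calls `reset` and `step`, so it learns from sampled rewards and states rather than from a known model of the MDP.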
