Markov Decision Processes

A Markov decision process (MDP) is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning.

At each time step, the process is in some state \(s\), and the decision maker may choose any action \(a\) that is available in state \(s\). The process responds at the next time step by randomly moving into a new state \(s'\), and giving the decision maker a corresponding reward \(R_{a}(s,s')\).

The probability that the process moves into its new state \(s'\) is influenced by the chosen action. Specifically, it is given by the state transition function \(P_{a}(s,s')\). Thus, the next state \(s'\) depends on the current state \(s\) and the decision maker's action \(a\), but it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov property.
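
These dynamics can be made concrete with a small sketch. The snippet below encodes a hypothetical two-state MDP (the states, actions, probabilities, and rewards are invented purely for illustration) and shows how \(P_{a}(s,s')\) and \(R_{a}(s,s')\) drive each step: the sampled next state depends only on the current state and the chosen action.

```python
import random

# Hypothetical transition function P_a(s, s'): P[a][s] maps each possible
# next state to its probability. States and numbers are made up.
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# Hypothetical reward function R_a(s, s') for each transition.
R = {
    "stay": {("s0", "s0"): 0.0, ("s0", "s1"): 1.0, ("s1", "s0"): 1.0, ("s1", "s1"): 0.0},
    "go":   {("s0", "s0"): 0.0, ("s0", "s1"): 2.0, ("s1", "s0"): 2.0, ("s1", "s1"): 0.0},
}

def step(s, a):
    """Sample s' from P_a(s, .) and return it together with R_a(s, s').

    The distribution over s' depends only on the current state s and the
    chosen action a, not on any earlier history (the Markov property).
    """
    next_states, probs = zip(*P[a][s].items())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[a][(s, s_next)]

s = "s0"
for _ in range(3):
    s, r = step(s, "go")
    print(s, r)
```

Running the loop prints a short trajectory; repeated runs differ because the transitions are stochastic.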

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.
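
As a quick illustration of this reduction, the sketch below (again with invented numbers) fixes a single action and drops the reward; what remains of the step function is simply sampling from a Markov chain's transition matrix.

```python
import random

# With only one action ("wait") and identical (zero) rewards, the MDP
# degenerates into an ordinary Markov chain: stepping it just samples from
# one row of the chain's transition matrix. Numbers are illustrative only.
P_wait = {"s0": {"s0": 0.7, "s1": 0.3}, "s1": {"s0": 0.4, "s1": 0.6}}

def step(s):
    next_states, probs = zip(*P_wait[s].items())
    return random.choices(next_states, weights=probs)[0]  # no choice, no reward
```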

MDPs are widely used in reinforcement learning to model sequential decision making. In a typical grid-world formulation, the world is divided into a grid of cells, each cell corresponding to a state; the model is then specified by the set of states, the set of actions, the transition probabilities, and the rewards. The solution to an MDP is called a policy, a mapping from states to actions, and the objective is to find the optimal policy for the task the MDP models. Thus, any reinforcement learning task composed of a set of states, actions, and rewards that satisfies the Markov property can be formulated as an MDP.
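
As a deliberately tiny worked example of solving an MDP for its optimal policy, the sketch below sets up a hypothetical 1×4 grid world and applies value iteration, a standard dynamic-programming method; the grid layout, the +1 reward at the goal cell, and the discount factor are all assumptions made for illustration.

```python
# Hypothetical 1x4 grid world: states 0..3, deterministic "left"/"right"
# moves, and a +1 reward for entering the goal cell (state 3), which is absorbing.
states = [0, 1, 2, 3]
actions = ["left", "right"]
gamma = 0.9  # discount factor, chosen arbitrarily for this example

def transition(s, a):
    """Deterministic grid move; the goal cell is absorbing."""
    if s == 3:
        return 3
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def reward(s, a, s_next):
    """+1 for entering the goal cell, 0 otherwise."""
    return 1.0 if s != 3 and s_next == 3 else 0.0

# Value iteration: repeatedly back up V(s) = max_a [ R_a(s, s') + gamma * V(s') ].
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(reward(s, a, transition(s, a)) + gamma * V[transition(s, a)]
                for a in actions)
         for s in states}

# The policy that acts greedily with respect to V is optimal for this toy MDP.
policy = {s: max(actions,
                 key=lambda a: reward(s, a, transition(s, a)) + gamma * V[transition(s, a)])
          for s in states}
print(policy)  # states 0-2 should map to "right"
```

With more states or stochastic transitions the same backup applies; only the transition and reward functions change.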
