Consider the policy iteration algorithm for a finite-state MDP, and suppose the initial policy is stochastic. Can the optimal policy be deterministic after the improvement steps? Or can we say that the optimal policy will always be a stochastic one? I am confused about this, and any ideas would be helpful. The reason I am asking is that in the absence of a model, i.e. when we need to use Monte Carlo methods, it seems that each of the improved policies must be stochastic to make sure the action-value function estimates stay close to their true means.
asked
sosha

I am afraid that I might be misunderstanding your question again. The optimal policy doesn't have to be stochastic. A deterministic policy can be considered a special case of a stochastic policy: one that puts probability 1 on a single action in each state. Mainly it depends on the problem, but even when we start from a stochastic policy, the result (the optimal policy) can turn out to be deterministic. In fact, a finite, fully observed, discounted MDP always has at least one deterministic optimal policy, and the greedy improvement step of policy iteration produces a deterministic policy. Also, Monte Carlo methods can be used for approximate dynamic programming. An MDP is a problem that is typically solved with dynamic programming; when the possible transitions (or observations) are too many to enumerate, we estimate the value function using Monte Carlo methods. You may want to look at approximate dynamic programming (see the book by Prof. Powell).
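To make the point concrete, here is a minimal sketch of policy iteration that starts from a uniformly random (stochastic) policy and ends at a deterministic one. The two-state, two-action MDP, its rewards, the discount factor `GAMMA = 0.9`, and all function names are assumptions made up for illustration; they are not from the question.

```python
GAMMA = 0.9  # assumed discount factor
N_STATES, N_ACTIONS = 2, 2

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward).
# Action 0 always leads to state 0 with reward 0; action 1 leads to state 1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def q_values(V, s):
    """One-step lookahead: expected return of each action in state s."""
    return [
        sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
        for a in range(N_ACTIONS)
    ]


def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation for a (possibly stochastic) policy."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            q = q_values(V, s)
            v_new = sum(policy[s][a] * q[a] for a in range(N_ACTIONS))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Greedy improvement: probability 1 on the best action in each state."""
    policy = []
    for s in range(N_STATES):
        q = q_values(V, s)
        best = q.index(max(q))  # ties are broken toward the lowest index
        policy.append([1.0 if a == best else 0.0 for a in range(N_ACTIONS)])
    return policy


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy


initial = [[0.5, 0.5], [0.5, 0.5]]  # stochastic starting policy
final_policy, final_V = policy_iteration(initial)
print(final_policy)  # [[0.0, 1.0], [0.0, 1.0]]
```

The starting policy is stochastic, but after the first greedy improvement every state already has a one-hot action distribution, and that deterministic policy is also where the iteration stops. In a model-free Monte Carlo setting the greedy step would usually be replaced by something like an epsilon-greedy one to keep exploring, which this sketch does not cover.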
answered
ksphil