Does an indefinite-horizon stationary MDP have a stationary optimal policy?
- Yes. Since the agent has no information about when the process will stop, it has no reason to act differently at any particular point in time, so the optimal policy is stationary (as it is for infinite-horizon problems). If the agent does have information about when the process will terminate, it has an incentive to act more greedily toward immediate reward near the termination point, which makes the optimal policy non-stationary.
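
The contrast in the answer above can be checked on a toy problem. The sketch below is illustrative only: the two-state MDP, its rewards, and the function names are all hypothetical, and the indefinite horizon is modelled under the assumption that the process stops with a constant probability `stop_prob` after every step, which is equivalent to discounting by `1 - stop_prob`. Backward induction with a known deadline produces a policy that depends on the number of steps left, while the constant-stop-probability case yields a single policy for every time step.

```python
# Hypothetical 2-state, 2-action deterministic MDP (illustrative only).
# State 0: action 0 ("grab") pays 1 and stays in state 0;
#          action 1 ("invest") pays 0 and moves to state 1.
# State 1: either action pays 3 and stays in state 1.
NEXT = [[0, 1], [1, 1]]              # NEXT[s][a] -> next state
REWARD = [[1.0, 0.0], [3.0, 3.0]]    # REWARD[s][a] -> immediate reward
STATES, ACTIONS = range(2), range(2)


def backup(values, gamma):
    """One Bellman backup: return (new values, greedy action per state)."""
    q = [[REWARD[s][a] + gamma * values[NEXT[s][a]] for a in ACTIONS]
         for s in STATES]
    policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in STATES]
    return [max(row) for row in q], policy


def known_horizon_policies(horizon):
    """Undiscounted backward induction; entry k is the policy with k+1 steps left."""
    values, policies = [0.0, 0.0], []
    for _ in range(horizon):
        values, policy = backup(values, 1.0)
        policies.append(policy)
    return policies


def indefinite_horizon_policy(stop_prob, sweeps=500):
    """Value iteration with gamma = 1 - stop_prob; one policy for all time steps."""
    values, policy = [0.0, 0.0], [0, 0]
    for _ in range(sweeps):
        values, policy = backup(values, 1.0 - stop_prob)
    return policy


by_steps_left = known_horizon_policies(3)
stationary = indefinite_horizon_policy(stop_prob=0.1)
print([p[0] for p in by_steps_left])  # action in state 0 with 1, 2, 3 steps left
print(stationary[0])                  # action in state 0 with unknown stopping time
```

In this toy problem the known-deadline agent grabs the immediate reward on the final step and invests earlier, whereas the agent facing a constant stop probability invests in state 0 regardless of the time step.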