- Experiment 1: SARSA, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0 → total reward: 1,152,181
- Experiment 2: Q-learning, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0 → total reward: 641,052
Compare the results of the first experiment to the second experiment.
- The first experiment should earn a significantly higher total reward than the second. Exploring near s2 can incur a large penalty. SARSA is on-policy: its updates account for the 20% of steps the agent takes at random, so it learns a policy that steers clear of the dangerous region. Q-learning is off-policy: it bootstraps from the greedy action as if exploration never happened, so it learns the shorter path that hugs s2; because the agent is still forced to explore 20% of the time, it repeatedly stumbles into the penalty. SARSA's safer route therefore yields the better total reward (see the update-rule sketch below).
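
The difference comes down to a single term in the update rule. Below is a minimal sketch in Python contrasting the two updates under the settings above; the function names, the NumPy table layout, and the derivation ε = 1 − 0.8 = 0.2 are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

# Hyperparameters from the experiments above; "Greedy Exploit = 80%"
# is assumed to mean epsilon-greedy with epsilon = 1 - 0.8 = 0.2.
ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.2

def epsilon_greedy(Q, s, n_actions, rng):
    """Pick the greedy action 80% of the time, a random action 20%."""
    if rng.random() < EPSILON:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action the agent will actually take
    next, so the 20% chance of a random (possibly catastrophic) step
    lowers Q along paths that pass near dangerous states."""
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy: bootstrap from the greedy action, ignoring the
    exploration risk entirely."""
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
```

The only difference is the bootstrap target: `Q[s_next, a_next]` (the sampled next action) versus `np.max(Q[s_next])` (the greedy maximum). That one term is why SARSA's value estimates reflect exploration risk near s2 while Q-learning's do not.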