AIspace

Experiment 1: SARSA, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 1152181
Experiment 2: Q-learning, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 641052

Compare the results of the first experiment to the second experiment.

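The first experiment should earn a significantly higher total reward than the second. Exploring near s₂ can incur a large penalty. SARSA is on-policy: each update uses the action the agent actually takes next, including the random exploratory ones, so its Q-values near s₂ reflect the risk of stumbling into the penalty, and it learns a policy that stays away from the dangerous area. Q-learning is off-policy: each update assumes the best action will be taken next, so it learns the values of the purely greedy policy and keeps moving close to s₂. Since the agent is forced to explore 20% of the time, SARSA's strategy of avoiding the dangerous area results in a better total reward.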
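The difference between the two update rules can be sketched in a few lines of Python. This is only an illustration, not the applet's code: the state names `far` and `near_s2`, the action names, and the Q-value of -100 for the risky action are hypothetical; only α=0.1 and γ=0.5 come from the experiments above.

```python
ALPHA = 0.1  # learning rate, as in both experiments
GAMMA = 0.5  # discount factor, as in both experiments


def sarsa_update(q, s, a, r, s_next, a_next, alpha=ALPHA, gamma=GAMMA):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """Off-policy update: bootstrap from the best action available next."""
    target = r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def make_q():
    # Hypothetical Q-table: one state far from s2, one state next to it.
    # The "risky" action is assumed to have already been learned as bad.
    return {
        "far": {"toward": 0.0},
        "near_s2": {"safe": 0.0, "risky": -100.0},
    }


# Same transition for both learners: the agent moves toward s2, gets
# reward 0, and then (through a random exploratory step) picks "risky".
q_sarsa = make_q()
q_qlearn = make_q()
sarsa_update(q_sarsa, "far", "toward", 0.0, "near_s2", "risky")
q_learning_update(q_qlearn, "far", "toward", 0.0, "near_s2")

print("SARSA      Q(far, toward):", q_sarsa["far"]["toward"])
print("Q-learning Q(far, toward):", q_qlearn["far"]["toward"])
```

In this sketch SARSA lowers the value of moving toward s₂ (to about -5) because the exploratory step that followed was costly, while Q-learning leaves it at 0 because it assumes the safe action would have been chosen. Repeated over many steps, this is why SARSA steers away from the risky region when exploration is forced.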