- Experiment 1: SARSA, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 1152181
- Experiment 2: Q-learning, α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 641052
- Experiment 3: SARSA, α=0.1, γ=0.5, Greedy Exploit = 100%, and initial Q-value = 20: 2664318
- Experiment 4: Q-learning, α=0.1, γ=0.5, Greedy Exploit = 100%, and initial Q-value = 20: 2664659
Compare the results of the first and second experiments to the third and fourth experiments.
- The difference between the first and second experiments should be much larger than the difference between the third and fourth. When the Greedy Exploit parameter is set to 100%, SARSA is exactly equivalent to Q-learning: the action SARSA bootstraps from is always the greedy action, which is the same action Q-learning's max picks out. In the first and second experiments, however, the agent explores 20% of the time, so SARSA sometimes updates toward the value of an exploratory action while Q-learning always updates toward the greedy one, which causes the two methods to generate different policies (see the sketch below).
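The difference is easiest to see in the two update rules themselves. The following is a minimal, hypothetical tabular sketch (the state/action space sizes, the `epsilon_greedy` helper, and the `Q` table are illustrative assumptions, not taken from the experiments above): with ε = 0, i.e. Greedy Exploit = 100%, the next action SARSA bootstraps from is always the argmax, so both updates become identical.

```python
import random

# Illustrative tabular setup: 5 states, 4 actions (assumed, not from the experiments).
alpha, gamma, epsilon = 0.1, 0.5, 0.2   # epsilon = 1 - Greedy Exploit (80% exploit)
Q = {(s, a): 0.0 for s in range(5) for a in range(4)}

def epsilon_greedy(state, eps):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if random.random() < eps:
        return random.randrange(4)
    return max(range(4), key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the agent will actually take next,
    # which may be an exploratory (non-greedy) action when epsilon > 0.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap from the best (greedy) next action, regardless of
    # which action the behaviour policy actually takes next.
    best_next = max(Q[(s_next, a)] for a in range(4))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

With `epsilon = 0`, `epsilon_greedy` always returns the argmax action, so `Q[(s_next, a_next)]` in `sarsa_update` equals `best_next` in `q_learning_update` and the two updates coincide, which is why experiments 3 and 4 produce nearly identical results.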