AIspace

Experiment 1: α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 0: 641052
Experiment 2: α=0.1, γ=0.5, Greedy Exploit = 80%, and initial Q-value = 20: 648575
Experiment 3: α=0.1, γ=0.5, Greedy Exploit = 100%, and initial Q-value = 0: -90
Experiment 4: α=0.1, γ=0.5, Greedy Exploit = 100%, and initial Q-value = 20: 2664659

Compare the results of the first and second experiments to the third and fourth experiments.

The difference in reward should be much smaller between the first and second experiments than between the third and fourth. Here the first pair differs by only 7,523 (641052 vs. 648575), while the second pair differs by 2,664,749 (-90 vs. 2664659). In the third experiment the agent always takes the greedy action and all Q-values start at 0, so it has no incentive to explore and tends to get stuck following a bad policy. Raising the initial Q-values to 20 in the fourth experiment makes unexplored state-action pairs look better than those already tried, which causes the agent to explore early on. Changing the initial Q-values has a comparatively small effect in the first and second experiments because the agent takes a random action 20% of the time regardless of the initial Q-values.
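
The effect of optimistic initial Q-values on a purely greedy agent can be reproduced in a few lines. The sketch below is a minimal illustration, not the AIspace applet: the five-state chain world, its rewards (1 for bumping the left wall, 10 for landing on the rightmost state), the tie-breaking rule, and the names `step` and `run` are all assumptions made for this example. Only α=0.1, γ=0.5, the exploit percentages, and the initial Q-values of 0 and 20 come from the experiments above.

```python
import random

N_STATES = 5            # hypothetical chain world: states 0..4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic toy dynamics (an assumption, not the AIspace grid)."""
    if action == LEFT:
        next_state = max(state - 1, 0)
        # Small reward for bumping into the left wall from state 0.
        reward = 1.0 if state == 0 else 0.0
    else:
        next_state = min(state + 1, N_STATES - 1)
        # Large reward whenever the move lands on the rightmost state.
        reward = 10.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def run(alpha, gamma, exploit, q_init, steps=1000, seed=0):
    """Tabular Q-learning; returns the total reward collected."""
    rng = random.Random(seed)
    q = [[float(q_init)] * 2 for _ in range(N_STATES)]
    state, total = 0, 0.0
    for _ in range(steps):
        if rng.random() < exploit:
            # Greedy action; ties are broken in favour of LEFT.
            action = LEFT if q[state][LEFT] >= q[state][RIGHT] else RIGHT
        else:
            # Exploration: pick an action uniformly at random.
            action = rng.choice((LEFT, RIGHT))
        next_state, reward = step(state, action)
        # Standard Q-learning update toward the one-step target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
        total += reward
    return total


if __name__ == "__main__":
    # Same four parameter settings as the experiments above.
    for exploit, q_init in [(0.8, 0), (0.8, 20), (1.0, 0), (1.0, 20)]:
        total = run(alpha=0.1, gamma=0.5, exploit=exploit, q_init=q_init)
        print(f"exploit={exploit:.0%} q_init={q_init}: total reward {total}")
```

In this toy world the fully greedy agent with Q-values of 0 takes the first rewarded action it finds and never tries anything else, while the fully greedy agent with Q-values of 20 is drawn toward untried actions until it reaches the large reward. The two runs with 80% exploitation depend on the random seed, so no particular outcome is claimed for them.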