Choose the speed at which the agent will move. Very fast only renders every 1,000 steps or when the goal is achieved
The classics - Alpha: how quickly to learn
The learning rate or step size determines to what extent the newly acquired information will override the old information.
A factor of 0 will make the agent learn nothing, while a factor of 1 will make it consider only the most recent information.
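For the curious, here is a minimal sketch of where alpha sits in the standard Q-learning update (Python, with made-up names; not necessarily the exact code this demo runs):

```python
# Sketch of the standard Q-learning update (hypothetical names).
# alpha = 0 keeps old_q unchanged; alpha = 1 replaces it with the new target.
def updated_q(old_q, reward, best_next_q, alpha, gamma):
    # gamma (explained in the next tip) discounts the best score of the next cell
    target = reward + gamma * best_next_q
    return (1 - alpha) * old_q + alpha * target
```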
The classics - Gamma: how much do we value future rewards?
The discount factor gamma determines the importance of future rewards.
A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for a high long-term reward.
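As an illustration (a small sketch, not part of the demo itself): a reward r received t steps in the future is worth gamma**t * r today, so the same stream of rewards looks very different at different gammas:

```python
# Hypothetical illustration: how gamma discounts a fixed stream of rewards.
rewards = [10, 10, 10, 10]            # one reward per future step
for gamma in (0.0, 0.5, 0.9):
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))
    print(gamma, discounted)          # 10.0, 18.75, 34.39
```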
Is this all there is?
Our agent can find a good route but soon find itself stuck in a rut! Just like in life, if we want a chance of enjoying the better things, we need to get off the beaten track. Don't be deceived though: every reward requires some sort of risk, and danger lurks there!
Also known as epsilon!
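Under the hood, this kind of exploration is usually implemented as an epsilon-greedy policy. Here is a minimal sketch (Python, assumed names; the demo's actual code may differ):

```python
import random

# Epsilon-greedy action selection, assuming q_values maps each of the
# 8 directions to the learned score stored in the current cell.
def choose_direction(q_values, epsilon):
    if random.random() < epsilon:              # explore: off the beaten track
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)     # exploit: best known move
```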
Cheat!
In life, our problems aren't constant and neither are our goals. It's just the same for our special agent. Choose how much the death zones, obstacles and the goal wander about. It's not cheating; it's just that sometimes bad things do happen. And sometimes we stumble unexpectedly onto our goal!
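One simple way such wandering could work (a sketch under assumed names, not necessarily how this demo implements it) is to nudge each moving cell one square in a random direction at its own rate:

```python
import random

# Sketch: every `period` steps, nudge a wandering cell one square in a
# random direction, clamped to the grid. Faster wanderers get a smaller period.
def wander(x, y, step, period, grid_size):
    if step % period != 0:
        return x, y
    dx, dy = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
    return (min(max(x + dx, 0), grid_size - 1),
            min(max(y + dy, 0), grid_size - 1))
```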
Take me away from here!
Want a new life? Try one of these pretrained models
On my command!
It's great that the agent moves quickly, but if you want to control the exact pace of the moves, click here
The agent starts at a random location and needs to get to the goal at the bottom right. It sounds easy, of course, but the agent has zero knowledge of the grid, its own location, the locations of the obstacles or the location of the goal.
The goal will give the agent a reward of 500
The pink cells give it a penalty of -50
The red cells (deathZones) give it a penalty of -500
If that's not bad enough, the death zones kill the agent (a new one is teleported in immediately)
And if that's still not bad enough, the death zones wander around, some quicker and some slower
To move vertically or horizontally incurs a cost of 10
To move diagonally incurs a cost of 14.14 (Thanks Pythagoras!)
To land on any cell incurs a cost of 1 (all of these numbers are pulled together in the sketch below)
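Putting those numbers together, the per-move payoff might be computed along these lines (a sketch with assumed names; the demo's real code may differ):

```python
import math

# Sketch of the reward scheme described above (hypothetical structure).
GOAL_REWARD   = 500
PINK_PENALTY  = -50
DEATH_PENALTY = -500                 # the agent also dies and respawns
STEP_COST     = 10                   # vertical or horizontal move
DIAG_COST     = 10 * math.sqrt(2)    # ~14.14, thanks Pythagoras
LAND_COST     = 1                    # landing on any cell

def move_reward(diagonal, cell_type):
    cost = (DIAG_COST if diagonal else STEP_COST) + LAND_COST
    bonus = {"goal": GOAL_REWARD, "pink": PINK_PENALTY,
             "death": DEATH_PENALTY}.get(cell_type, 0)
    return bonus - cost
```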
In short, the agent can store in each cell it has just visited a score that blends the rewards it received and the costs it incurred in moving to the current cell with the costs, penalties and rewards it expects from its next move.
Unless the agent decides to go exploring, it chooses which direction to move in a rational way, by looking at the scores for a move in each of the 8 possible directions that are stored in the current cell location.
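Concretely, that storage might look like a table mapping each cell to eight directional scores (a sketch with assumed names, not the demo's actual code):

```python
# Sketch: a Q-table keyed by cell, holding one learned score per direction.
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

q_table = {}   # (row, col) -> {direction: score}

def scores_for(cell):
    # New cells start with all eight scores at zero.
    return q_table.setdefault(cell, {d: 0.0 for d in DIRECTIONS})

def best_direction(cell):
    scores = scores_for(cell)
    return max(scores, key=scores.get)   # the rational, greedy choice
```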
I used the following video to understand the details of Q-Learning once the door had been opened by Siraj.