# States and Actions
Most games involve an iterative process of observations and actions, as well as a positive or negative result.
The state is the current view of the game.
In Breakout, for example, the actions are move left, move right, or stay still.
The reward is 1 point for hitting a brick, and the game ends if the ball is missed.
An action taken on a state generates a derived state.
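As a rough sketch (a toy stand-in, not the course's code), the observe → act → reward loop for such a game can be written like this:

```python
import random

# Toy stand-in for a Breakout-like environment: the "state" is the paddle and
# ball positions, the actions are left / stay / right, and the reward is +1
# when the paddle is under the ball as it reaches the bottom (then the episode ends).
class ToyBreakout:
    def reset(self):
        self.paddle, self.ball_x, self.ball_y = 2, random.randint(0, 4), 5
        return (self.paddle, self.ball_x, self.ball_y)

    def step(self, action):                      # action: -1 left, 0 stay, +1 right
        self.paddle = min(4, max(0, self.paddle + action))
        self.ball_y -= 1                         # the ball falls one row per step
        done = self.ball_y == 0
        reward = 1 if (done and self.paddle == self.ball_x) else 0
        return (self.paddle, self.ball_x, self.ball_y), reward, done

env = ToyBreakout()
state, done = env.reset(), False
while not done:
    action = random.choice([-1, 0, 1])           # observe, then act
    next_state, reward, done = env.step(action)  # derived state + reward
    state = next_state
```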
# Markov Decision Process
It's a way to formalize a stochastic sequential decision problem.
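Formally (standard textbook notation, not taken from the lecture itself), an MDP is the tuple of states, actions, transition probabilities, rewards, and a discount factor:

```latex
% S: states, A: actions, P: state transition probabilities,
% R: rewards, \gamma: discount factor applied to future rewards
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
\qquad P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s,\; A_t = a)
```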
# Action Policy
It's a mapping from states to actions.
In sequential decision processes the reward might come far in the future; it is not necessarily immediate.
An **optimal policy** maximises the reward over a sequence of actions.
---
The **state transition matrix** tells us how the state will change over time based on our chosen actions.
The action policy dictates which action is taken in each state, by choosing the highest-value action.
This means we need to know the complete transition matrix and the associated rewards to build an action policy; this is the first problem we encounter and need to solve.
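As a minimal sketch of that ideal case, assuming we did know the full transition matrix `P[s, a, s']` and the rewards `R[s, a]` (a made-up 3-state toy problem, not anything from the course), value iteration gives the state values and the policy simply picks the highest-value action in each state:

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[s, a, s'] and R[s, a] are assumed known --
# exactly the assumption we usually cannot make, which is what DQN addresses.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.8, 0.2], [0.0, 0.2, 0.8]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
gamma = 0.9

V = np.zeros(3)
for _ in range(100):             # value iteration
    Q = R + gamma * (P @ V)      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)        # the highest-value action in each state
print(policy)
```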
# Q-learning
Q-learning tries to find the best approximation of the value function.
The Q function is an approximation of the value function.
However, Q-learning with an explicit table doesn't scale to most real-world problems, where the state space is far too large. That's why we use a deep network to approximate the value function: the **Deep Q Network** (DQN).
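For reference, the standard tabular Q-learning update rule (textbook form, not course-specific), applied to a single observed transition, looks like this:

```python
from collections import defaultdict

# Tabular Q-learning update:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
Q = defaultdict(float)          # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99
actions = [-1, 0, 1]

def q_update(s, a, r, s_next, done):
    # The target is the observed reward plus the discounted best value of the
    # derived state (zero if the episode has ended).
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example: one observed transition
q_update(s=(2, 3, 5), a=1, r=0, s_next=(3, 3, 4), done=False)
```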
# DQN Agent Architecture
### What is an agent?
It's an entity that can observe and act autonomously.
The agent architecture needs to solve two problems:
* No state transition matrix
* No action policy
The DQN agent has a replay buffer that saves tuples of state, action, derived state, reward, and game-done flag: (s, a, s', r, d).
Initially, actions are taken at random to fill the replay buffer, because the Q function has not been trained yet.
After training begins, epsilon-greedy exploration uses the Q function. Epsilon (the probability of acting at random) starts at 1 and is reduced over time as we become more confident in the Q function.
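A minimal sketch of those two pieces, the replay buffer and epsilon-greedy action selection (class and function names here are illustrative, not the course's exact code):

```python
import random
from collections import deque

# Replay buffer: stores (s, a, s', r, d) tuples and samples random minibatches.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Epsilon-greedy: act at random with probability epsilon, otherwise pick the
# action the Q function currently thinks is best.
def epsilon_greedy(q_values, epsilon, n_actions):
    if random.random() < epsilon:
        return random.randrange(n_actions)                            # explore
    return int(max(range(n_actions), key=lambda a: q_values[a]))      # exploit

def epsilon_at(step, start=1.0, end=0.05, decay_steps=100_000):
    # Linear annealing from fully random towards (mostly) greedy.
    # This schedule is one common choice, not necessarily the course's.
    frac = min(1.0, step / decay_steps)
    return start + frac * (end - start)
```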
# The loss function for DQN
In DQN, the Q function is a neural network.
We train it on the state-transition data, but it does not predict the next state; it predicts the value of taking an action in a given state, taking future rewards into account.
With neural networks we usually have a labeled dataset of inputs and outputs, train on 80% of it, and test on the remaining 20%.
In DQN we instead have the replay buffer.
For the loss function, we keep two versions of the network: one is a copy from a previous round of training, which we use as a proxy for the correct answer (A, the target network); the other is the network we keep training over time (B, the online network). The A network is updated periodically by copying B's weights into it.
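A sketch of that loss, assuming a small fully connected Q network in PyTorch (the real DQN uses a convolutional network over stacked frames; the names here are mine, not the course's):

```python
import torch
import torch.nn as nn

# Two copies of the Q network: `online` (B, the one being trained) and
# `target` (A, a frozen earlier copy used as a proxy for the correct answer).
def make_q_net(n_inputs, n_actions):
    return nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

online = make_q_net(4, 3)
target = make_q_net(4, 3)
target.load_state_dict(online.state_dict())      # A starts as a copy of B

def dqn_loss(batch, gamma=0.99):
    s, a, s_next, r, d = batch                               # tensors from the replay buffer
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_B(s, a)
    with torch.no_grad():                                    # no gradients through A
        max_next = target(s_next).max(dim=1).values          # max_a' Q_A(s', a')
        y = r + gamma * (1 - d) * max_next                   # Bellman target
    return nn.functional.mse_loss(q_sa, y)

# Example batch of two transitions (state dim 4, 3 actions):
s = torch.randn(2, 4); a = torch.tensor([0, 2])
s_next = torch.randn(2, 4); r = torch.tensor([1.0, 0.0]); d = torch.tensor([0.0, 1.0])
loss = dqn_loss((s, a, s_next, r, d))
loss.backward()

# Periodically refresh A with B's weights:
# target.load_state_dict(online.state_dict())
```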
# Visual processing and states in DQN
For the Atari games, a state includes 3 frames of the game. This provides a sense of time.
Also, each of the three frames in a state is made up of two in-game frames, because the Atari hardware placed different sprites on odd and even frames due to hardware limitations.
For data reduction, the RGB values are converted to luminance values, giving 1 value per pixel instead of 3.
The next data reduction step is downscaling the frames from the full Atari resolution (210 x 160) to 84 x 84.
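A sketch of that preprocessing, using the standard luminance weights and OpenCV's resize (the exact sizes and libraries may differ from the course's code):

```python
import numpy as np
import cv2                      # OpenCV, used here only for the resize step

def preprocess(frame_a, frame_b):
    """Combine two consecutive emulator frames into one 84x84 luminance image."""
    # Take the per-pixel maximum of the two in-game frames, since the Atari
    # hardware alternated some sprites between odd and even frames.
    frame = np.maximum(frame_a, frame_b).astype(np.float32)
    # RGB -> luminance: one value per pixel instead of three.
    lum = 0.299 * frame[..., 0] + 0.587 * frame[..., 1] + 0.114 * frame[..., 2]
    # Downscale to 84 x 84.
    return cv2.resize(lum, (84, 84), interpolation=cv2.INTER_AREA)

def stack_state(processed_frames):
    """Stack the last few processed frames to give the state a sense of time."""
    return np.stack(processed_frames, axis=0)    # shape: (n_frames, 84, 84)

# Example with a random fake frame:
frame = np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8)
obs = preprocess(frame, frame)                   # shape: (84, 84)
```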