Tags: ml, study, notes
State: None
https://tmoer.github.io/AlphaZero/ https://web.stanford.edu/~surag/posts/alphazero.html
- s = state of the board
- neural net $f_\theta(s)$ with two outputs:
- $v_\theta(s)$: continuous value estimate of the board state
- $\vec{p}_\theta(s)$: policy, a probability vector over all possible actions
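A minimal sketch of such a two-headed network, assuming PyTorch, a flattened board input, and placeholder sizes (none of these specifics come from the note):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Two-headed net: a policy over actions and a scalar value for state s."""
    def __init__(self, board_size: int = 64, num_actions: int = 64, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(board_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # logits over all actions
        self.value_head = nn.Linear(hidden, 1)             # scalar value of the state

    def forward(self, s: torch.Tensor):
        h = self.body(s)
        p = F.softmax(self.policy_head(h), dim=-1)  # probability vector over actions
        v = torch.tanh(self.value_head(h))          # value squashed into [-1, 1]
        return p, v
```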
Training examples are of the form $(s_t, \pi_t, z_t)$:
- $\pi_t$: the estimate of the policy from state $s_t$
- $z_t$: the final outcome of the game from the perspective of the player at $t$ ($-1$ for a loss, $+1$ for a win)
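As a tiny illustrative example (the board shape and values are made up, not from the note), one such training example in Python might look like:

```python
import numpy as np

# One training example (s_t, pi_t, z_t) for a hypothetical 3x3 board game
s_t  = np.zeros((3, 3), dtype=np.int8)   # board state (empty board here)
pi_t = np.full(9, 1.0 / 9)               # MCTS-improved policy over the 9 possible moves
z_t  = +1.0                              # final outcome from this player's perspective
example = (s_t, pi_t, z_t)
```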
Loss #todo
Terms:
This is excluding regularization terms.
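From the cited Stanford post, the loss (excluding regularization) is roughly, in the notation above:

$$
\ell = \sum_t \big( v_\theta(s_t) - z_t \big)^2 \;-\; \vec{\pi}_t \cdot \log \vec{p}_\theta(s_t)
$$

The first term is the squared error between the value head's prediction and the actual game outcome; the second is the cross-entropy between the MCTS-improved policy $\vec{\pi}_t$ and the network's policy $\vec{p}_\theta(s_t)$.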
MCTS Search for Policy Improvement #todo
- Given state $s$, the network gives the policy estimate $\vec{p}_\theta(s)$
- During training, the estimate of the policy is improved via Monte Carlo Tree Search
During the tree search, the following statistics are maintained (see the sketch after this list):
- $Q(s, a)$: the expected reward for taking action $a$ from state $s$
- $N(s, a)$: the number of times action $a$ was taken from state $s$ across simulations
- $P(s, a)$: the initial estimate of taking action $a$ from state $s$ according to the policy returned by the current network
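A minimal sketch of how these per-node statistics might be stored (the dict layout, and deriving $Q$ from an accumulated total $W$, are implementation assumptions, not from the note):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    """Per-state statistics tracked during MCTS (assumed layout)."""
    P: Dict[int, float] = field(default_factory=dict)  # prior p_theta(s)[a] for each legal action a
    N: Dict[int, int] = field(default_factory=dict)    # visit count of (s, a)
    W: Dict[int, float] = field(default_factory=dict)  # total value accumulated through (s, a)

    def Q(self, a: int) -> float:
        """Expected reward Q(s, a) as the mean value over visits (0 if unvisited)."""
        return self.W[a] / self.N[a] if self.N.get(a, 0) > 0 else 0.0
```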
Compute $U(s, a)$, the upper confidence bound on Q-values, as: