AlphaZero


Tags: ml, study, notes
State: None

https://tmoer.github.io/AlphaZero/ https://web.stanford.edu/~surag/posts/alphazero.html

  • $s$ = state of the board
  • $f_\theta(s)$: neural net with two outputs
    • $v_\theta(s)$, a continuous value of the board state
    • $\vec{p}_\theta(s)$, a probability vector over all possible actions
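
A minimal PyTorch sketch of such a two-headed network (my own illustration, not code from the posts above; the MLP trunk, `board_size`, and `n_actions` are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlphaZeroNet(nn.Module):
    """Two-headed network: a shared trunk with a value head and a policy head."""

    def __init__(self, board_size: int, n_actions: int, hidden: int = 256):
        super().__init__()
        # Trunk is a small MLP purely for illustration; AlphaZero itself uses a deep residual CNN.
        self.trunk = nn.Sequential(
            nn.Linear(board_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)           # scalar value of the board state
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over all possible actions

    def forward(self, s: torch.Tensor):
        h = self.trunk(s)
        v = torch.tanh(self.value_head(h))               # continuous value in [-1, 1]
        p = F.softmax(self.policy_head(h), dim=-1)       # probability vector p_theta(s)
        return v, p
```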

Training examples are of the form $(s_t, \pi_t, z_t)$:

  • $\pi_t$ is the estimate of the policy from state $s_t$ (improved by the tree search, see below)
  • $z_t \in \{-1, +1\}$ is the final outcome of the game from the perspective of the player at $s_t$ (-1 for a loss, +1 for a win)
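
For concreteness, a single training example could look like this (the 3x3 board and the shapes are purely illustrative):

```python
import numpy as np

# One illustrative training example (s_t, pi_t, z_t) for a 3x3 board.
s_t = np.zeros(9, dtype=np.float32)           # s_t: flattened board encoding
pi_t = np.full(9, 1.0 / 9, dtype=np.float32)  # pi_t: MCTS-improved policy over the 9 actions
z_t = +1.0                                    # z_t: +1 for a win, -1 for a loss at s_t
example = (s_t, pi_t, z_t)
```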

Loss #todo

$$\sum_t \left( v_\theta(s_t) - z_t \right)^2 - \vec{\pi}_t \cdot \log \vec{p}_\theta(s_t)$$

Terms:

  • the squared error between the predicted value $v_\theta(s_t)$ and the actual game outcome $z_t$
  • the cross-entropy between the search policy $\vec{\pi}_t$ and the network policy $\vec{p}_\theta(s_t)$

This excludes regularization terms.
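
A PyTorch sketch of this loss (my own, assuming `v_pred`/`p_pred` come from a network like the one sketched above and `z`/`pi` are the self-play targets; the batch mean and the epsilon inside the log are additions for stability):

```python
import torch

def alphazero_loss(v_pred: torch.Tensor,  # (B, 1) predicted values v_theta(s_t)
                   p_pred: torch.Tensor,  # (B, A) predicted action probabilities p_theta(s_t)
                   z: torch.Tensor,       # (B,)   game outcomes z_t in {-1, +1}
                   pi: torch.Tensor       # (B, A) MCTS policies pi_t
                   ) -> torch.Tensor:
    value_loss = (v_pred.squeeze(-1) - z).pow(2)            # (v_theta(s_t) - z_t)^2
    policy_loss = -(pi * torch.log(p_pred + 1e-8)).sum(-1)  # -pi_t . log p_theta(s_t)
    return (value_loss + policy_loss).mean()                # regularization omitted
```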

MCTS Search for Policy Improvement #todo

  • Given state $s$, the network gives us the policy $\vec{p}_\theta(s)$
  • During training, this estimate of the policy is improved via Monte Carlo Tree Search (MCTS)

During the tree search the following is maintained:

  • $Q(s, a)$: the expected reward for taking action $a$ from state $s$
  • $N(s, a)$: the number of times action $a$ was taken from state $s$ across simulations
  • $P(s, \cdot) = \vec{p}_\theta(s)$: the initial estimate of the probability of taking each action from state $s$, according to the policy returned by the current network
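
One way to hold these statistics (a sketch under my own assumptions, not the exact code from the linked posts) is a set of dictionaries keyed by `(state, action)`, with `Q` maintained as a running mean of the values backed up from simulations:

```python
from collections import defaultdict

class MCTSStats:
    """Per-edge statistics maintained during the tree search."""

    def __init__(self):
        self.Q = defaultdict(float)  # Q[(s, a)]: mean value of simulations through (s, a)
        self.N = defaultdict(int)    # N[(s, a)]: number of times a was taken from s
        self.P = {}                  # P[s]: prior policy vector p_theta(s) from the network

    def backup(self, s, a, value):
        """Fold one simulation result into the running mean for edge (s, a)."""
        self.N[(s, a)] += 1
        self.Q[(s, a)] += (value - self.Q[(s, a)]) / self.N[(s, a)]
```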

Compute $U(s, a)$, the upper confidence bound on the Q-values, as: