Training GPT-2 on Stockfish Games
TL;DR: I trained a GPT-2 model on Stockfish self-play games in the most naive way, with no search, and it plays decently. The model is trained to output the next move given the FEN string of the board (a single state). While I present some gotchas and caveats, the results are quite acceptable for the amount of work and compute invested. I also present a basic attention visualiser that parses the attention over the text tokens back onto the board.
# Context
DeepMind Paper
DeepMind recently released a paper where they achieved grandmaster level without search @1. They used a decoder-only model trained on the board state (FEN string) to predict either the value of the board, the value of an action ($Q$ values), or the best action. I decided to take a similar approach but make it as simple as possible: I only trained the model to predict the next action (not necessarily the best one).
With this approach, there is no need to compute tons of evals with the Stockfish engine, no need to filter out illegal moves, and no need for a fancy RL scheme (yes, I have been traumatised). There are some catches that come with this oversimplification, which I present throughout the post, but I think it was a nice reframing.
Personal Background
For the record, this project only took me 1 full day of work to set up and around 3 more days to release everything. The overhead is mostly making this blog post, but it’s a good exercise for me, so it’s definitely worth it. I have already trained LLMs in the past, although I am far from being an expert. I have also worked on chess models, most recently lc0, and I am familiar with XAI / MI.
This project was not really a challenge for me; it was more of a hobby project I wanted to share.
# Training Method
Technical Details
General Framing: Fine-tuning of a pre-trained model.
The training code is available in the associated GitHub repository training-gpt2-on-stockfish-games (mostly quick and dirty). It is really straightforward and self-explanatory, as I relied on the Hugging Face Trainer (all the complexity is under the hood). Perhaps the most valuable contribution is the Apptainer code to launch the training (not that difficult), which is cluster-friendly.
For the dataset I used the simple format of prompt + completion:
{"prompt": "FEN: {fen}\nMOVE:", "completion": " {move}"}
The run details:
- 1 epoch of the dataset
- LR of $10^{-5}$
- 12h on one A100 (40GB)
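For completeness, a rough sketch of what the fine-tuning loop looks like with the Hugging Face Trainer; the actual (quick and dirty) code lives in the repository, and everything here except the epoch count and learning rate (e.g. the batch size and sequence length) is an illustrative assumption.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    # Concatenate prompt and completion into a single causal-LM sequence.
    texts = [p + c for p, c in zip(batch["prompt"], batch["completion"])]
    return tokenizer(texts, truncation=True, max_length=128)

dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2-stockfish",
    num_train_epochs=1,               # 1 epoch of the dataset
    learning_rate=1e-5,               # LR of 1e-5
    per_device_train_batch_size=32,   # illustrative batch size
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```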
Tokenisation Gotcha: For simplicity, I didn't change the tokeniser, so the tokenisation (BPE) is not tailored for chess modelling. For example, `_r` (`_` for whitespace) and `r` get encoded differently, and `nb` is encoded as a single token while representing a knight and a bishop. Also note that I used the raw FEN string, so `8` actually represents 8 empty squares; definitely problematic for the model to grasp a spatial representation of the board with positional encodings (it still manages somewhat).
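To see the gotcha concretely, you can inspect how the stock GPT-2 BPE tokeniser splits a prompt; a quick check, nothing specific to the trained model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
fen = "rnbqkbnr/pppp1ppp/8/4p3/8/4PQ2/PPPP1PPP/RNB1KBNR b KQkq - 1 2"
print(tokenizer.tokenize(f"FEN: {fen}\nMOVE:"))
# The BPE merges can group several board characters (pieces, digits encoding
# runs of empty squares, slashes) into a single token, so one text token may
# cover more than one square.
```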
As discussed in the next section, changing the tokenisation is a low-hanging fruit to improve the model.
Straightforward Improvements
Prompt Engineering: Similarly to using your favourite chatbot, prompt engineering could be a simple way of making the model better. The model could be prompted (during training) with the strength (ELO) of the player to move and the final result. At inference time, the model can then be prompted with the desired ELO and outcome. It is obviously limited; don't expect your model to reach ELO 10,000 with this method.
Conditioned Model: Recent framings of RL as a (self-)supervised learning problem showed that training an agent on sub-optimal trajectories can lead to optimal strategies, provided the agent is conditioned on a reward signal. For example, this is the setup of Imitation Learning @2 and the Decision Transformer @3.
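As a concrete (and purely hypothetical) example of what such a conditioned format could look like, mirroring the prompt + completion format above, with placeholder `{elo}` and `{result}` fields so both can be chosen freely at inference time:

{"prompt": "ELO: {elo}\nRESULT: {result}\nFEN: {fen}\nMOVE:", "completion": " {move}"}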
Tokenisation: I think one of the best ways to encode the board would be to represent only the board tokens, with 7 different tokens (`P`, `N`, `B`, `R`, `Q`, `K` and `X` for empty squares), then use vertical and horizontal positional encodings (like in ViT), as well as encodings for the meta information: the piece colour (a different encoding for each colour), castling rights, the player to move, en passant, the half-move clock and the full move number.
Model: A text encoder with a special pooling token, `[CLS]`, and an input sequence of length 64 representing the board using the aforementioned tokenisation. The output of the encoder would be fed to a classification head of size `1858`, a minimal but still complete move encoding (DeepMind didn't get this right ^^ (neither did AlphaZero, for the record ^^)).
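To make the proposal more concrete, here is a rough PyTorch sketch of such an encoder; everything except the 7+1 piece vocabulary and the 1858-way head (the sizes named above) is an illustrative assumption, and the meta-information encodings are left out for brevity.

```python
import torch.nn as nn

class BoardEncoder(nn.Module):
    """Sketch: 64 board tokens + a [CLS] pooling token, 2D positional
    encodings (rank/file), and a 1858-way move classification head."""

    def __init__(self, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.piece_emb = nn.Embedding(8, d_model)    # P, N, B, R, Q, K, X, [CLS]
        self.colour_emb = nn.Embedding(3, d_model)   # white, black, none
        self.rank_emb = nn.Embedding(9, d_model)     # 8 ranks + a slot for [CLS]
        self.file_emb = nn.Embedding(9, d_model)     # 8 files + a slot for [CLS]
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1858)         # minimal but complete move encoding

    def forward(self, pieces, colours, ranks, files):
        # Each input is (batch, 65) of indices, position 0 being [CLS].
        x = (self.piece_emb(pieces) + self.colour_emb(colours)
             + self.rank_emb(ranks) + self.file_emb(files))
        x = self.encoder(x)
        return self.head(x[:, 0])  # pool with the [CLS] token
```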
# Inspecting the Model
A Gradio space is available to interact with the model and its attention; check it out on Hugging Face.
Attention as an Interpretation
Attention is the simplest interpretation you can think of when studying Transformers. It is built into the model, so it does reflect some internal knowledge of the model. In the autoregressive case (our case), you are often interested in “Why did my model predict this token?”. Basically, looking at a row of the attention coefficients $A$, defined by equation $\ref{eq:attention}$, for the predicted query token and the past key tokens gives you an importance distribution over the past tokens.
\[\begin{equation} \label{eq:attention} A={\rm softmax}\left( QK^T \right) \end{equation}\]

Interpretation Gotcha: While attention can be used for simple interpretation, the claims need to be tempered. Indeed, when interpreting the attention coefficients, you say nothing about the output values $O=AV$. This weighted sum needs to be interpreted as a whole for the conclusion to be more rigorous.
While @4 claimed that attention is not an explanation on its own, @5 conducted a more in-depth study pointing out the limitations of attention-based interpretation. You should especially keep in mind the computation graph as a whole. Namely, attention heads get merged and, in addition, composed via the residual stream, so the computation graph quickly becomes a mess.
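In practice, pulling a row of $A$ out of the model is straightforward with `output_attentions=True`; a minimal sketch (swap `gpt2` for the fine-tuned checkpoint, and note that the layer and head indices here are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

fen = "rnbqkbnr/pppp1ppp/8/4p3/8/4PQ2/PPPP1PPP/RNB1KBNR b KQkq - 1 2"
inputs = tokenizer(f"FEN: {fen}\nMOVE:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
# The last row is the attention of the final query token over all past tokens,
# i.e. the importance distribution used to predict the next move token.
layer, head = 1, 7
row = out.attentions[layer][0, head, -1]
```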
Parsing Text Attention on the board
In order to project attention onto the board, I simply parsed the attention per token and kept only the attention on pieces. The attention on the last part of the FEN string is summed into `Meta`, while the rest of the tokens are summed into `Dump`. This is a simple workaround and proxy to circumvent the unadapted tokenisation.
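A minimal sketch of one way to implement this grouping (not the exact code behind the visualiser): classify each token by the character span it covers, using the fast tokeniser's offset mapping. Note that the actual visualiser further restricts the piece category to piece characters, whereas this sketch groups the whole placement field.

```python
from transformers import AutoTokenizer

def group_attention(fen, attn_row, tokenizer):
    """Split one row of attention into Config / Meta / Dump masses."""
    prompt = f"FEN: {fen}\nMOVE:"
    enc = tokenizer(prompt, return_offsets_mapping=True)
    fen_start = prompt.index(fen)
    placement_end = fen_start + len(fen.split(" ")[0])  # end of piece placement
    fen_end = fen_start + len(fen)

    masses = {"Config": 0.0, "Meta": 0.0, "Dump": 0.0}
    for (start, end), weight in zip(enc["offset_mapping"], attn_row):
        if fen_start <= start and end <= placement_end:
            masses["Config"] += float(weight)   # piece-placement tokens
        elif fen_start <= start and end <= fen_end:
            masses["Meta"] += float(weight)     # rest of the FEN string
        else:
            masses["Dump"] += float(weight)     # prompt scaffolding, stragglers
    return masses
```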
Measure Gotcha: Always think twice about what you're actually measuring. Here, the heatmaps only carry information about the piece configuration, so my heuristic would be to leave out heatmaps where less than 50% of the attention is put on pieces (the `Config` metric).
Always look at the distribution among the three categories of tokens and the raw attention figure. The raw attention can give you insight into when not to trust the grouped attention, i.e. when a token refers to more than one square. This is especially important since it creates illusory patterns, so you should refrain from your deepest instinct to interpret these as meaningful patterns (everyone loves to see patterns ^^).
Simple Interpretation
Disclaimer: The following section contains rough preliminary results, experiments and takes. Thus, everything should be taken with care.
Configuration attention: In terms of configuration attention, the crux is predicting the first letter of the move, so I would mostly aim to interpret the attention of this token. While I didn't conduct an in-depth analysis, it seems that the later tokens are mostly doing a pooling of the previous representation.
Discovering simple heads: The simplest heads will be located in the early layers. Interpreting later heads is harder due to polysemanticity, convoluted circuits, a high degree of composition, you name it.
[!success] Advice If you're not familiar with chess notation or attention, the best is probably to interact with the associated Hugging Face space. By copying the PGN (sequence of moves) or the FEN (state of the board) you can visualise the plots of this post and more. It's especially important if you disagree with my claims or don't trust me (you shouldn't ^^).
L1 H7: Focuses on the knights, but only when they are paired inside a token; see figure 1. This is where this tokenisation gets annoying, since it's hard to say what the model might have learned from these paired tokens.
Figure 1: Configuration attention of head 7 in layer 1. [FEN] `r1b1k1nr/pppp1ppp/2n1pq2/8/1b1PP3/2NQ4/PPP2PPP/R1B1KBNR w KQkq - 5 5`, [Distribution] `Config`: 0.78, `Meta`: 0.10, `Dump`: 0.13.
e2e3 opening: This opening is by far my favourite (not a strong player here ^^), but out of the `262k` training games, only `13k` open with `e2e3`. The main response to this opening is `e7e5`, with 4k games. I then like to push my queen to `f3`, but since this combination is not in the training games, it seems that the model blunders by playing `c7c5`. Pushing the bishop to `c4` leads the model to respond by moving the knight to `c6`, letting the queen go for the checkmate (so, mate in 4). This feels expected: as a predictor, the LLM most likely learned in a Bayesian way, and there is no prior on this sequence.
Counting Experiment: I think that counting all the games present in the dataset and comparing them to the learned distribution could be a nice low-hanging-fruit experiment; a sketch of the counting part follows. The aim would be to examine the learned distribution on boards where the model has a prior against boards where it has none. This toy experiment could be framed in the larger picture of prediction generalisation and coherence.
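Here is what that counting could look like, tallying the replies to `e2e3` across the training games; the `games.pgn` file name is a hypothetical stand-in for the self-play dump.

```python
import collections

import chess.pgn  # pip install python-chess

# Count the replies to 1. e3 across the training games.
replies = collections.Counter()
with open("games.pgn") as pgn:
    while (game := chess.pgn.read_game(pgn)) is not None:
        moves = list(game.mainline_moves())
        if len(moves) >= 2 and moves[0].uci() == "e2e3":
            replies[moves[1].uci()] += 1

total = sum(replies.values())
for move, count in replies.most_common(5):
    print(f"{move}: {count / total:.2%}")
# These empirical frequencies can then be compared to the model's next-move
# distribution on the corresponding FEN (prior vs. no-prior boards).
```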
c7c5 blunder: I am not giving the reason for the aforementioned blunder here, but mostly pointing out the limitations of doing interpretability with attention. Inspecting the attention of the last layer after `1. e3 e5 2. Qf3`, most heads are strongly attending to the pawn on `e5` (see figure 2), but in the end, the pawn on `c7` is moved. By the way, L12 H2 seems to focus on the "player to move's best-placed pawn".
Figure 2: Configuration attention of head 2 in layer 12. [FEN] `rnbqkbnr/pppp1ppp/8/4p3/8/4PQ2/PPPP1PPP/RNB1KBNR b KQkq - 1 2`, [Distribution] `Config`: 0.95, `Meta`: 0.01, `Dump`: 0.04.
b8c6 second blunder: First, the model seems to fail to see any danger or threat and continues to focus on pieces that would develop its position (like the knight move). Then, as White, before the kill, the model doesn't find the checkmate with the queen (obviously out-of-distribution here). It is important to note that all training games are long, since self-play means two players of approximately the same level.
Adaptive Strength Experiment: It could be interesting to see whether such a predictor model adapts its level to the adversary. Did it blunder because I played such weak moves (w.r.t. the meta)? Does it play better if you use book openings? This might not be straightforward, since it correlates with out-of-distribution evaluation.
This analysis is only preliminary, so feel free to continue it and send me feedback; the discussion happens in this Discord thread on Eleuther AI.
# Resources
With this blog post, I release everything needed to reproduce, explore, understand and extend this work. In summary:
References
- Ruoss, Anian, et al. “Grandmaster-Level Chess Without Search.” ArXiv, 2024, /abs/2402.04494.
- Ho, Jonathan, et al. “Generative Adversarial Imitation Learning.” Advances in Neural Information Processing Systems, 2016.
- Chen, Lili, et al. “Decision Transformer: Reinforcement Learning via Sequence Modeling.” ArXiv, 2021, /abs/2106.01345.
- Jain, Sarthak, and Byron C. Wallace. “Attention Is Not Explanation.” ArXiv, 2019, /abs/1902.10186.
- Wiegreffe, Sarah, and Yuval Pinter. “Attention Is Not Not Explanation.” ArXiv, 2019, /abs/1908.04626.