Tagged #XAI
# Stories
23 September 2024 | 11 min read | tags: Experience Research AIS XAI MARL
My PhD
A short story about why I decided to do a PhD on explainable multi-agent reinforcement learning. I detail how I weighed this decision and how I created my proposal. I also try to describe what I plan to do to make the best of my PhD.
# Projects
29 February 2024 | 12 min read | tags: Chess LLM XAI Attention Training
Training GPT-2 on Stockfish Games
I trained a GPT-2 model on Stockfish self-play games in the most naive way, with no search, and it can play decently. The model is trained to output the next move given the FEN string of the board (a single state). While I present some gotchas and caveats, the results are quite acceptable for the amount of work and compute invested. I also present a basic attention visualiser that maps the attention over the text tokens onto the board.
# Articles
5 October 2024 | 14 min read | tags: AIS XAI FHE Eval
FHE for Open Model Audits
Thanks to recent developments, FHE can now be applied easily and scalably to deep neural networks. Like many, I think these advancements are a real opportunity to improve AI safety. I therefore outline possible applications of FHE in model evaluation and interpretability, which are, in my opinion, the most mature safety tools available today.
Layer-Wise Relevance Propagation
Layer-Wise Relevance Propagation (LRP) is a propagation method that produces relevances for a given input with regard to a target output. Technically, the computation uses a single backpropagation pass, similar to deconvolution. I propose to illustrate this method on an Alpha-Zero network trained to play Othello.
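To give a flavour of the single backward pass, here is a minimal sketch of the epsilon rule, one common LRP variant, on a toy ReLU network in plain Python. The network, its weights and the names `forward` and `lrp_epsilon` are illustrative assumptions; this is not the Othello model or the code from the post.

```python
def forward(x, layers):
    """Run a tiny ReLU MLP and keep the input activations of every layer."""
    activations = [x]
    for k, (W, b) in enumerate(layers):
        a = activations[-1]
        z = [sum(a[i] * W[i][j] for i in range(len(a))) + b[j]
             for j in range(len(b))]
        # ReLU on hidden layers, raw scores on the last layer.
        if k < len(layers) - 1:
            z = [max(0.0, v) for v in z]
        activations.append(z)
    return activations


def lrp_epsilon(x, layers, target, eps=1e-9):
    """Propagate the target output score back to the inputs (epsilon rule)."""
    activations = forward(x, layers)
    out = activations[-1]
    # All relevance starts on the chosen output neuron.
    R = [out[j] if j == target else 0.0 for j in range(len(out))]
    # Walk the layers from last to first, pairing each with its input.
    for (W, b), a in zip(reversed(layers), reversed(activations[:-1])):
        z = [sum(a[i] * W[i][j] for i in range(len(a))) + b[j]
             for j in range(len(b))]
        # Stabilised ratio: relevance divided by the pre-activation.
        s = [R[j] / (z[j] + (eps if z[j] >= 0 else -eps))
             for j in range(len(z))]
        # Each input gets relevance in proportion to its contribution.
        R = [a[i] * sum(W[i][j] * s[j] for j in range(len(s)))
             for i in range(len(a))]
    return R


# Toy 2-2-2 network with zero biases; W[i][j] connects input i to output j.
layers = [
    ([[1.0, -1.0], [0.5, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [2.0, 1.0]], [0.0, 0.0]),
]
relevance = lrp_epsilon([1.0, 2.0], layers, target=0)
print(relevance)
```

With zero biases, the input relevances sum (up to `eps`) to the target output score, which is the conservation property that makes the relevances readable as a decomposition of the prediction.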
# Publications
5 June 2024 | 25 min read | tags: Chess XAI
Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
We propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focused on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.