Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents

ENS de Lyon
Workshop on Interpretable Policies in Reinforcement Learning @ RLC-2024

Abstract

AI has led chess systems to a superhuman level, yet these systems heavily rely on black-box algorithms. This is unsustainable when it comes to ensuring transparency for the end user, particularly when these systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. In this respect, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focus on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.
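To make the framework concrete, below is a minimal sketch of what a contrastive sparse autoencoder objective can look like: a standard SAE reconstruction and L1 sparsity loss on the activations of two paired trajectories, plus a contrastive term on their sparse codes. The single-layer architecture, the coefficients, and the cosine-similarity contrast used here are illustrative placeholders, not the exact formulation from the paper.

import torch
import torch.nn.functional as F

# Illustrative CSAE sketch (not the paper's implementation).
# Assumptions: a single-hidden-layer autoencoder, and paired hidden
# activations (x_a, x_b) taken from two related game trajectories.
class CSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction
        return f, x_hat

def csae_loss(model: CSAE, x_a, x_b, l1_coef=1e-3, contrast_coef=1.0):
    """Reconstruction + L1 sparsity on both trajectories, plus a
    contrastive term on the paired sparse codes (here: cosine similarity,
    pushing the two codes apart; purely illustrative)."""
    f_a, x_hat_a = model(x_a)
    f_b, x_hat_b = model(x_b)
    recon = F.mse_loss(x_hat_a, x_a) + F.mse_loss(x_hat_b, x_b)
    sparsity = f_a.abs().mean() + f_b.abs().mean()
    contrast = F.cosine_similarity(f_a, f_b, dim=-1).mean()
    return recon + l1_coef * sparsity + contrast_coef * contrast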

Reviews

Mechanistic Interpretability Workshop @ ICML-2024
Main criticisms:
  • Not enough training details: The training of the CSAE is under-detailed, particularly concerning the choice of hyperparameters, data generation, and evaluation metrics. Specific layers used for training and the integration of the contrast loss are not clearly explained, making it difficult to replicate or understand the methodology fully.
  • Lack of comparison with other methods: The paper fails to compare the proposed CSAE method with other feature extraction techniques, such as standard Sparse Autoencoders (SAE), Independent Component Analysis (ICA), and other clustering or probing methods. This comparison is crucial to validate the efficacy and novelty of the CSAE over existing methods.
  • Lack of feature interpretation: The interpretation of features generated by CSAE is inadequate. The paper does not convincingly demonstrate that the identified features correspond to meaningful chess concepts, as only a few cherry-picked examples are provided without thorough validation of monosemanticity or broader representativeness.
  • No utilisation of the proposed clustering: Although clustering and dendrogram techniques are mentioned, they are not effectively used to enhance the understanding of the feature space. The paper does not provide labeled clusters or investigate the similarity within clusters to help readers understand the model’s internal representation.
  • Insufficient qualitative and quantitative evaluations: The qualitative assessments do not scale well with human effort, and there's a lack of extensive qualitative evidence to support the interpretability of learned features. Quantitatively, the performance metrics like F1, precision, and recall are barely above threshold levels, and no robust statistical analysis is provided to support the findings. Moreover, there's a gap in demonstrating how these features impact chess-playing decisions in practical scenarios.
  • Interesting idea: Despite the aforementioned criticisms, the idea of using CSAE to interpret the planning of chess-playing agents was overall praised. It was considered a novel approach that could potentially provide valuable insights should it be validated by a deeper study.

Roadmap

I propose the following roadmap to address the reviews:

  • Feature ablation study: I will carry out an ablation study to assess the impact of individual features extracted by the CSAE on the decision-making process of the chess agent. This will help quantify the contribution of each feature towards enhancing the agent's planning capabilities (a sketch of one possible ablation is given after this list).
  • Expose more features: In addition to the Huggingface space created for the paper, I will provide a more detailed analysis of the features extracted by the CSAE.
  • Remove or rethink the clustering approach: While the clustering approach is theoretically interesting for scaling human analysis, it is not clear how it would be used in practice. I have no clear idea yet of how to address this issue and might remove it from the paper.
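As a sketch of the feature ablation mentioned above, the snippet below zeroes out a single CSAE feature in the agent's intercepted activations and compares the resulting policy to the unmodified one. It assumes a PyTorch chess agent whose intermediate activations can be patched with a forward hook and a trained CSAE; agent, csae, target_module, and boards are hypothetical placeholders, not names from the paper or its codebase.

import torch

def ablate_feature(agent, csae, target_module, feature_idx, boards):
    """Zero out one CSAE feature in the intercepted activations and
    compare the agent's policy with and without the ablation."""
    def hook(module, inputs, output):
        # Encode the activations, ablate one feature, decode back.
        f = torch.relu(csae.encoder(output))
        f[..., feature_idx] = 0.0
        return csae.decoder(f)

    with torch.no_grad():
        baseline = agent(boards)                      # unmodified policy logits
        handle = target_module.register_forward_hook(hook)
        try:
            ablated = agent(boards)                   # policy with the feature removed
        finally:
            handle.remove()
    # One possible effect measure: KL divergence between move distributions.
    kl = torch.nn.functional.kl_div(
        ablated.log_softmax(-1), baseline.softmax(-1), reduction="batchmean"
    )
    return baseline, ablated, kl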

What I might leave out for further work:

  • Enhanced method evaluation: Currently, the paper lacks comparative analysis with other methods. To address this, I plan to conduct evaluations against simple heuristic models to determine how effectively my method extracts meaningful features. Additionally, I will compare the performance of Contrastive Sparse Autoencoders (CSAEs) with standard Sparse Autoencoders (SAE) and other relevant techniques to establish a clear benchmark (a simple scoring sketch follows this list).
  • Establishing a proper benchmark: Setting up a robust benchmark, particularly in the context of chess, would address many criticisms and provide a more objective evaluation of the CSAE. However, due to the significant work required, I consider this an essential next step for future research rather than for inclusion in the current paper revision.
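For the comparative evaluation in the first item, one simple way to score a feature against a labelled chess concept (e.g. "the side to move is in check") is to binarise its activations and compute precision/recall/F1; the same scoring can then be reported identically for CSAE, standard SAE, ICA, or probing baselines. The sketch below assumes per-position feature activations and binary concept labels are already available; it is illustrative, not the paper's evaluation code.

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def score_feature(activations: np.ndarray, labels: np.ndarray, threshold: float = 0.0):
    """Binarise one feature's activations and score them against
    binary concept labels with precision, recall, and F1."""
    predictions = (activations > threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}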

Poster

BibTeX

@misc{poupart2024contrastivesparseautoencodersinterpreting,
  title={Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents}, 
  author={Yoann Poupart},
  year={2024},
  eprint={2406.04028},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2406.04028}, 
}