CSSLab/maia2-skill-adaptation

Mechanisms of Skill Adaptation in Generative Models: Chess as a Model System

We study "knowing but not doing": whether skill-adaptive models acquire knowledge of concepts at all skill levels but selectively externalize that knowledge depending on the conditioning signal. We use Maia-2 and chess as a model system, as skill levels are precisely quantified by Elo, concepts are formally definable, and move quality is objectively measurable.

Two competing hypotheses: (H1) the model dynamically adjusts internal concept awareness per skill level; (H2) concept awareness is consistent across skill levels, and the skill gap stems from differential externalization. We find strong evidence for H2.

Setup

conda env create -f environment.yml
conda activate maiainterp

Pretrained Maia-2 (Rapid) checkpoint: weights.v2.pt (download)

SAE on residual streams: sae/best_jrsaes_2023-11-16384-1-res.pt (download)


Pipeline

1. Knowledge Acquisition

To test H1, we ask whether internal concept awareness (across a set of 172 fundamental and measurable chess concepts) in Maia-2 varies with the conditioned Elo level. We train linear probes per concept, per Elo level, per layer:

python train/train_probes.py --layer_key "transformer block 0 hidden states" --output_dir probes/layer0_efficient
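Schematically, each probe is a linear classifier trained on frozen activations for one (concept, Elo level, layer) triple. A minimal sketch with synthetic stand-in data (the hidden size, sample count, and labels below are illustrative assumptions, not the repo's real activations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Probe one binary chess concept (e.g. "has passed pawn") from
# residual-stream activations at one layer, for one conditioned Elo level.
rng = np.random.default_rng(0)
d_model = 64                                   # assumed hidden size
acts = rng.normal(size=(2000, d_model))        # stand-in activations
w_true = rng.normal(size=d_model)              # latent concept direction
labels = (acts @ w_true > 0).astype(int)       # synthetic concept labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.3f}")            # high when concept is linearly decodable
```

Comparing probe accuracy across conditioned Elo levels is what distinguishes H1 (accuracy varies with skill conditioning) from H2 (accuracy is flat).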

Result. The internal representations encode chess concepts equally well regardless of the conditioned skill level, refuting H1.


2. Knowledge Externalization

2a. Externalization via policy head fine-tuning

We fine-tune only the policy head fc_1 on the Blundered Transitional Dataset. This is the most stringent test of H2: policy-head-only fine-tuning introduces no new knowledge into the model backbone; it only recalibrates when that knowledge is acted upon.

cd extern/policy_distillation
python ft_per_concept_head_only.py

Configuration: extern/policy_distillation/finetune_config.yaml
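The freezing pattern above can be sketched as follows. This is a toy stand-in model, not the real Maia-2 architecture; only the head name fc_1 is taken from the repo:

```python
import torch
import torch.nn as nn

# Toy model: a frozen backbone plus a trainable policy head fc_1.
class ToyModel(nn.Module):
    def __init__(self, d_model=32, n_moves=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, d_model), nn.ReLU())
        self.fc_1 = nn.Linear(d_model, n_moves)   # policy head (name from the repo)

    def forward(self, x):
        return self.fc_1(self.backbone(x))

model = ToyModel()
# Freeze the backbone so fine-tuning cannot inject new knowledge into it.
for p in model.backbone.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(model.fc_1.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 8)                            # stand-in batch
y = torch.randint(0, 10, (16,))                   # stand-in move labels
opt.zero_grad()
loss_fn(model(x), y).backward()
opt.step()

# Only the policy head remains trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['fc_1.weight', 'fc_1.bias']
```

Because gradients never reach the backbone, any behavioral change after fine-tuning must come from re-weighting knowledge the backbone already encodes.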


2b. Targeted externalization via SAE feature steering

We train a set of SAEs on Maia-2 and identify the SAE features most predictive of each concept, then surgically amplify those concept-relevant features in the residual stream at inference time, in the spirit of feature steering.

python extern/feature_steering/select_sae_features.py

python extern/feature_steering/sae_intervention.py --layer 0 --mode salient
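The intervention amounts to decoding the residual stream with the SAE, scaling a few selected features, and writing the result back. A hedged sketch; the tiny untrained SAE, dimensions, scale, and feature indices below are illustrative assumptions, not the trained checkpoint:

```python
import torch
import torch.nn as nn

# Toy SAE: encoder/decoder pair over a residual-stream activation.
d_model, d_sae = 16, 64
enc = nn.Linear(d_model, d_sae)        # SAE encoder (toy, untrained)
dec = nn.Linear(d_sae, d_model)        # SAE decoder (toy, untrained)

def steer(resid, feature_ids, scale=5.0):
    """Amplify selected SAE features in a residual-stream activation."""
    f = torch.relu(enc(resid))         # sparse feature activations
    f[..., feature_ids] *= scale       # boost the concept-predictive features
    return dec(f)                      # reconstruct the steered activation

resid = torch.randn(2, d_model)
steered = steer(resid, feature_ids=[3, 7])
print(steered.shape)                   # same shape as the input residual stream
```

At inference time the steered activation replaces the original residual stream at the chosen layer, e.g. via a forward hook on that block.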
