We study knowing but not doing: whether skill-adaptive models acquire knowledge of concepts at all skill levels but selectively externalize that knowledge depending on the conditioning signal. We use Maia-2 and chess as a model system, as skill levels are precisely quantified by Elo, concepts are formally definable, and move quality is objectively measurable.
We consider two competing hypotheses: (H1) the model dynamically adjusts internal concept awareness per skill level; (H2) concept awareness is consistent across skill levels, and the skill gap stems from differential externalization. We find strong evidence for H2.
conda env create -f environment.yml
conda activate maiainterp

Pretrained Maia-2 (Rapid) checkpoint: weights.v2.pt — download
SAE on residual streams: sae/best_jrsaes_2023-11-16384-1-res.pt — download
To test H1, we ask whether internal concept awareness (across a set of 172 fundamental and measurable chess concepts) in Maia-2 varies with the conditioned Elo level. We train linear probes per concept, per Elo level, per layer:
python train/train_probes.py --layer_key "transformer block 0 hidden states" --output_dir probes/layer0_efficient
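As a toy illustration of the probing step, the sketch below trains one linear probe on synthetic residual-stream activations. The dimensions, data, and probe are stand-ins for one (concept, Elo level, layer) cell; the repo's actual extraction and training pipeline lives in train/train_probes.py.

```python
# Hypothetical stand-in for one probe: a binary concept label that is
# linearly decodable from activations of shape (positions, d_model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 64, 2000

# Synthetic activations with a planted linear concept direction.
concept_direction = rng.normal(size=d_model)
acts = rng.normal(size=(n, d_model))
labels = (acts @ concept_direction > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Held-out probe accuracy = how decodable the concept is at this layer/Elo.
acc = probe.score(X_te, y_te)
```

Comparing this accuracy across conditioned Elo levels (at fixed concept and layer) is what distinguishes H1 from H2.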
Result. The internal representations encode chess concepts equally well regardless of the conditioned skill level, refuting H1.
We fine-tune only the policy head fc_1 on the Blundered Transitional Dataset. This is the most stringent test of H2: policy-head-only fine-tuning introduces no new knowledge to the model backbone, only recalibrates when that knowledge is acted upon.
cd extern/policy_distillation
python ft_per_concept_head_only.py

Configuration: extern/policy_distillation/finetune_config.yaml
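The freezing pattern can be sketched as follows. The `ToyModel` below is illustrative, not the Maia-2 architecture; only the policy-head name `fc_1` is taken from the repo.

```python
# Sketch of policy-head-only fine-tuning: freeze the backbone, train fc_1.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, d=32, n_moves=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.fc_1 = nn.Linear(d, n_moves)  # policy head

    def forward(self, x):
        return self.fc_1(self.backbone(x))

model = ToyModel()

# Freeze everything, then unfreeze only the policy head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc_1.parameters():
    p.requires_grad = True

opt = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One toy training step: gradients flow only into fc_1.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```

Because the backbone's parameters never receive gradients, any behavioral change after fine-tuning must come from recalibrated readout, not new knowledge.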
We train a set of SAEs on Maia-2 and identify the SAE features most predictive of each concept, then surgically amplify those concept-relevant features in the residual stream at inference time, in the spirit of feature steering.
python extern/feature_steering/select_sae_features.py
python extern/feature_steering/sae_intervention.py --layer 0 --mode salient
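The amplification step can be sketched as below: encode the residual stream with the SAE, scale up the selected features, and add the decoded delta back in. All shapes, weights, and the feature indices are hypothetical; the actual intervention is driven by sae_intervention.py with the trained SAE checkpoint.

```python
# Sketch of inference-time amplification of concept-relevant SAE features.
import torch

torch.manual_seed(0)
d_model, d_sae = 16, 64

# Stand-ins for a trained SAE's encoder/decoder weights.
W_enc = torch.randn(d_model, d_sae)
W_dec = torch.randn(d_sae, d_model)
b_enc = torch.zeros(d_sae)

salient = [3, 17, 42]  # hypothetical concept-predictive feature indices
alpha = 4.0            # amplification factor (illustrative)

def steer(resid):
    """Amplify salient SAE features in a residual-stream activation."""
    f = torch.relu(resid @ W_enc + b_enc)     # SAE feature activations
    delta = torch.zeros_like(f)
    delta[..., salient] = (alpha - 1.0) * f[..., salient]
    return resid + delta @ W_dec              # add decoded amplification

resid = torch.randn(2, d_model)
steered = steer(resid)
```

Adding only the decoded delta (rather than reconstructing the full stream through the SAE) leaves the rest of the residual stream untouched, so the intervention is confined to the amplified features.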