From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

Understanding the internal mechanisms of Vision-Language Models like CLIP is becoming increasingly critical as they see broader deployment. While mechanistic interpretability has made great strides, most existing methods still rely on model activations collected over a dataset. This makes them inherently dataset-dependent, vulnerable to data bias, and often restricted to providing only coarse, head-level explanations.

To address these challenges, we introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free and training-free framework that analyzes CLIP’s vision transformer directly in weight space. By applying Singular Value Decomposition (SVD) to the value-output matrices of attention heads, SITH uncovers the principal semantic directions the model uses to process information. We then interpret these directions using COMP (Coherent Orthogonal Matching Pursuit), a new algorithm we developed to decompose weights into sparse, semantically coherent sets of human-interpretable concepts.
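The weight-space pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the SITH implementation: the names `head_singular_directions` and `omp`, the matrix shapes, and the row-vector convention are hypothetical, the concept `dictionary` stands in for a set of unit-norm text embeddings, and the sparse step is plain Orthogonal Matching Pursuit rather than COMP, whose coherence criterion is not specified here.

```python
import numpy as np

def head_singular_directions(W_V, W_O, k=5):
    """Top-k singular triplets of one head's combined value-output matrix.

    Assumed shapes: W_V is (d_model, d_head), W_O is (d_head, d_model).
    """
    W_VO = W_V @ W_O  # (d_model, d_model), rank at most d_head
    U, S, Vt = np.linalg.svd(W_VO, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k]

def omp(direction, dictionary, n_concepts=3):
    """Plain Orthogonal Matching Pursuit (a stand-in for COMP).

    Greedily picks the dictionary row most correlated with the current
    residual, then refits all selected rows by least squares.
    """
    residual = direction.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(n_concepts):
        scores = np.abs(dictionary @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same concept twice
        selected.append(int(np.argmax(scores)))
        A = dictionary[selected].T  # (d, number of selected concepts)
        coeffs, *_ = np.linalg.lstsq(A, direction, rcond=None)
        residual = direction - A @ coeffs
    return selected, coeffs, residual
```

In use, each row of `Vt` returned by `head_singular_directions` would be passed to `omp` together with the concept dictionary. The selected indices then name the concepts that sparsely explain that direction, and the residual norm indicates how much of it is left unexplained.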

This weight-based perspective allows us to move beyond passive observation. With SITH, we can perform precise, interpretable model edits, such as suppressing spurious correlations or removing undesired concepts, entirely without retraining. SITH also sheds light on model adaptation: it reveals that fine-tuning works primarily by reweighting a stable semantic basis rather than by learning entirely new features from scratch.
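Both ideas have a simple weight-space reading, sketched below under stated assumptions. The function names `suppress_directions` and `basis_overlap` are hypothetical, the choice of which singular directions encode an undesired concept is taken as given, and this is one plausible way to realize such an edit and comparison rather than the procedure used in the paper.

```python
import numpy as np

def suppress_directions(W_VO, indices):
    """Zero the chosen singular values of W_VO and rebuild the matrix.

    The edited matrix no longer responds along the suppressed singular
    directions; all other directions are left unchanged.
    """
    U, S, Vt = np.linalg.svd(W_VO, full_matrices=False)
    S_edit = S.copy()
    S_edit[list(indices)] = 0.0
    return (U * S_edit) @ Vt

def basis_overlap(W_pre, W_ft, k=5):
    """Best absolute cosine match for each top-k right singular vector.

    Values near 1 mean each pretrained direction still exists after
    fine-tuning, even if its singular value (its weight) has changed.
    """
    _, _, Vt_pre = np.linalg.svd(W_pre, full_matrices=False)
    _, _, Vt_ft = np.linalg.svd(W_ft, full_matrices=False)
    overlap = np.abs(Vt_pre[:k] @ Vt_ft[:k].T)  # (k, k) cosine matrix
    return overlap.max(axis=1)
```

Taking the maximum over each row, instead of comparing vectors index by index, keeps the measure meaningful when fine-tuning changes the ordering of the singular values.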

For a more in-depth look at our findings, including interactive demos and visualizations, please visit the Project Page.
