Guarantees-driven Mechanistic Interpretability
Mentor
Jason Gross
Jason leads a project in formalizing transformers funded by ARC Theory. He’s excited about approaches that might allow us to confidently scale interpretability. Previously he was a researcher at MIRI and got his Programming Languages PhD at MIT.
Before all of the AI stuff, he developed Fiat Cryptography, which enables the majority of secure connections to the internet (the big prime numbers behind the https lock icon), reported the plurality of all-time bugs in the proof-assistant Coq (and joined the dev team to help handle them!), and contributed category theory to the Coq-HoTT library (hoorah for beautiful, unifying mathematics).
Project
My team is developing a guarantees-driven approach to mechanistic interpretability, where we use compact formal proofs as a metric for evaluating mechanistic interpretability analyses. Our goal is to make interpretability scalable, which requires a level of rigor that we can trust enough to create automation for.
There are several possible projects that are useful to this agenda, and I would be happy to scope out a project with individuals depending on their interests and skills. Some ideas are:
Making convexity arguments for multiple attention heads or combinations of MLP neurons
Extending current analyses to different proof approaches and different theorem statements
Proving guarantees about accuracy / loss at various stages of training
At the end of program, you should have:
Intuition for and experience with toggling the level of rigor of mechanistic interpretability analyses,
Explored hypotheses around the advantages and limits of the compact guarantees framing,
Picked up soft skills relevant to working on a collaborative research project, and
Written up your results.
Personal Fit
Must-haves
Have read through A Mathematical Framework for Transformer Circuits
Familiarity with basic asymptotic complexity analysis
Nice-to-haves
Have gone through MLAB or equivalent (e.g., ARENA, Mechanistic Interpretability Quickstart Guide by Neel Nanda)
Familiarity with formal proofs
Mentorship style:
Weekly pair coding or code review
Biweekly group meeting for project updates and check-ins
Slack response time < 48 hours. Happy to point out resources that will help unstick you
Expected time commitment: ~10 hr/week
Selection Questions
Please see the following Google Doc.