Representation Engineering
Mentor
Andy Zou
Andy is a second-year PhD student at CMU and a co-founder of the Center for AI Safety (CAIS), working on AI safety. Recently, he led the papers "Universal and Transferable Adversarial Attacks on Aligned Language Models" and "Representation Engineering: A Top-Down Approach to AI Transparency". For more information, please check out his website.
Project
The project will focus on representation engineering research (ai-transparency.org/). See project document.
In general, we aim to understand and control high-level phenomena in neural networks, such as deception, vulnerability to adversarial attacks, power-seeking tendencies, and memorization. Specific research directions will be discussed when the program starts.
Personal Fit
Must-haves
Familiarity with PyTorch and the Transformers library
Experience implementing a transformer from scratch
Nice-to-haves
Experience working with LLMs
Mentorship style: Scheduling is relatively flexible, from one to several syncs per week depending on your preference, totaling 0.5-1.5 hr/week.
Expected time commitment: minimum 8 hr/week, ideally 12+ hr/week.