Research and Publication

Insert text/description here about CAISH’s research output

Scroll below to see some of the papers produced by CAISH members.

October 2024

Evaluating Language Model Character Traits

Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Colverd, Louis Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan


September 2024

NGD converges to less degenerate solutions than SGD

Moosa Saghir, N. R. Raghavendra, Zihe Liu, Evan Ryan Gunter


June 2024

Compact Proofs of Model Performance via Mechanistic Interpretability

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan


June 2024

IDs for AI Systems

Alan Chan, Noam Kolt, Peter Wills, Usman Anwar, Christian Schroeder de Witt, Nitarshan Rajkumar, Lewis Hammond, David Krueger, Lennart Heim, Markus Anderljung


May 2024

Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger


May 2024

Implicit meta-learning may lead language models to trust more reliable sources

Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Kacper Mlodozeniec, Tegan Maharaj, David Krueger


April 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Aleksandar Petrov, Christian Schroeder de Witt, Sumeet Ramesh Motwan, Yoshua Bengio, Danqi Chen, Philip H.S. Torr, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger


October 2023

Reward Model Ensembles Help Mitigate Overoptimization

Thomas Coste, Usman Anwar, Robert Kirk, David Krueger


October 2023

Detecting Backdoors with Meta-Models

Lauro Langosco, Neel Alex, William Baker, David Quarel, Herbie Bradley, David Krueger


July 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell


June 2023

Harms from Increasingly Agentic Algorithmic Systems

Alan Chan, Rebecca Salganik, Alva Markelius, Chris Pang, Nitarshan Rajkumar, Dmitrii Krasheninnikov, Lauro Langosco, Zhonghao He, Yawen Duan, Micah Carroll, Michelle Lin, Alex Mayhew, Katherine Collins, Maryam Molamohammadi, John Burden, Wanru Zhao, Shalaleh Rismani, Konstantinos Voudouris, Umang Bhatt, Adrian Weller, David Krueger, Tegan Maharaj