Evaluating LM Agency
Mentor
Francis Rhys Ward
Francis Rhys Ward is a final year PhD student at Imperial College London, a member of Tom Everitt's causal incentives group, a recent GovAI summer fellow, and previously a CLR summer fellow. His technical work focuses on evaluating deception and agency in frontier AI systems. In addition, he organizes a number of community initiatives in London focused on AI existential safety, and is a lecturer on the Ethics of AI module at Imperial.
Project
The primary threat-models for AI x-risk depend on AI systems that are agents or “agentic”. Although there is no complete theory of agency, there are several important dimensions of agency, such as the coherence of a system's goals or beliefs. This project aims to empirically evaluate key dimensions of agency in frontier LMs, for example, by designing a method for evaluating the extent to which OpenAI models pursue the goal of being “helpful and harmless”.
Personal Fit
Proficient in Python
Strong background in mathematics and/or philosophy
Experience evaluating and fine-tuning LMs
Mentorship style: flexible, minimum: weekly meetings.
Selection Questions
What properties of agency are particularly important and tractable to measure and evaluate? Design a concrete evaluation method for measuring "agency" (or a key dimension of agency) in language models. (Answer in <45 mins)