Jesse Hoogland

  • Area: Singular learning theory for AI safety

    • Project: Simplicity biases and deceptive alignment. In their recent “sleeper agents” paper, Hubinger et al. show that adversarial training fails to remove a backdoor behavior trained into an LLM and instead causes the model to learn a more specific backdoor. They point out that this is highly relevant to safety training, since “once a model develops a harmful or unintended behavior, training on examples where the model exhibits the harmful behavior might serve only to hide the behavior rather than remove it entirely”. They conjecture an explanation in terms of a bias of SGD towards simpler modifications. Both the phenomenon and the conjectured bias can be understood in terms of the free energy formula within SLT (sketched below), and we will investigate the use of restricted local learning coefficients (LLCs) to study it.
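      As background for this project (standard SLT asymptotics rather than a new result), the local free energy formula makes the conjectured simplicity bias precise: among modifications that fit the new training data comparably well, those with a lower local learning coefficient have lower free energy and are therefore preferred. A minimal statement:

```latex
% Local free energy of a neighborhood W of a parameter w_0 (SLT asymptotics):
%   L_n        = empirical loss on n samples
%   \lambda    = local learning coefficient (LLC) at w_0, a measure of the
%                effective complexity / degeneracy of the solution
F_n(W) \;=\; -\log \int_W e^{-n L_n(w)}\,\varphi(w)\,dw
       \;\approx\; n\,L_n(w_0) \;+\; \lambda(w_0)\,\log n
% One reading of the conjecture: hiding the backdoor behind a narrower trigger
% may cost less in \lambda than removing the backdoor circuit outright, so the
% "more specific backdoor" solution can have lower free energy than true removal.
```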

    • Project: Backdoor detection with data-restricted LLCs. A backdoored model may “compute differently” on inputs drawn from slightly different distributions, in ways that are hard to anticipate. Different input distributions determine different population loss landscape geometries, and the data-restricted LLC reflects these differences; consequently, we expect it to be sensitive to changes in model computation caused by small changes in the input distribution. We will investigate how this can be used to detect backdoors (see the sketch below).
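      A minimal sketch of the comparison we have in mind, assuming a standard SGLD-based LLC estimator in the style of Lau et al.; the model, data loaders, and hyperparameters below are illustrative placeholders rather than our actual pipeline:

```python
import copy
import math
import torch
import torch.nn.functional as F


def estimate_restricted_llc(model, loader, n_total, steps=1000, eps=1e-4,
                            gamma=100.0, param_filter=None, device="cpu"):
    """Estimate a restricted LLC at the current parameters w* by sampling the
    local tempered posterior with SGLD. The restriction is to the data
    distribution represented by `loader` (data-restricted) and, optionally,
    to the parameters selected by `param_filter` (weight-restricted)."""
    model = model.to(device)
    sampler = copy.deepcopy(model)
    named = [(n, p) for n, p in sampler.named_parameters() if p.requires_grad]
    if param_filter is not None:
        named = [(n, p) for n, p in named if param_filter(n)]
    params = [p for _, p in named]
    w_star = [p.detach().clone() for p in params]   # reference point w*
    beta = 1.0 / math.log(n_total)                   # inverse temperature

    def batch_loss(m, batch):
        x, y = (t.to(device) for t in batch)
        return F.cross_entropy(m(x), y)

    # Loss at w* on the restricted data distribution.
    loss_at_w_star = float(torch.stack(
        [batch_loss(model, b).detach() for b in loader]).mean())

    losses, data_iter = [], iter(loader)
    for step in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            batch = next(data_iter)
        loss = batch_loss(sampler, batch)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(params, w_star):
                # Localized SGLD: tempered loss gradient + pull toward w* + noise.
                drift = beta * n_total * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        losses.append(float(loss.detach()))

    expected_loss = sum(losses[steps // 2:]) / (steps - steps // 2)  # drop burn-in
    return n_total * beta * (expected_loss - loss_at_w_star)


# Compare estimates across two nearby input distributions: if a backdoor makes
# the model compute differently on the shifted inputs, the local geometry (and
# hence the data-restricted LLC) should change.
# llc_clean   = estimate_restricted_llc(model, clean_loader,   n_total=n_clean)
# llc_shifted = estimate_restricted_llc(model, shifted_loader, n_total=n_shifted)
```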

    • Project: Understanding algorithm change in many-shot jailbreaking. With the advent of long-context LLMs, the study of many-shot in-context learning has become increasingly important to AI safety. As a follow-up to our ongoing work using SLT and DevInterp to study the fine-tuning process (described below), we will study many-shot jailbreaking. In the forthcoming Carroll et al., we study how the algorithms learned by transformers change over training, according to the free energy formula in SLT, and we have ideas on how to adapt this to study how the mode of computation varies over long contexts (see the sketch below).
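      One way we might adapt the data-restricted LLC to this setting, reusing the estimate_restricted_llc sketch above; the prompt-construction helper and shot counts here are hypothetical placeholders:

```python
# Trace the data-restricted LLC as the number of in-context examples grows.
# Abrupt changes in the estimate across shot counts would suggest a change in
# the mode of computation, analogous to the over-training transitions studied
# in Carroll et al. `make_manyshot_loader(k)` is a hypothetical helper that
# builds a dataset of prompts with k in-context demonstrations.
shot_counts = [0, 4, 16, 64, 256]
llc_by_shots = {}
for k in shot_counts:
    loader_k, n_k = make_manyshot_loader(k)
    llc_by_shots[k] = estimate_restricted_llc(model, loader_k, n_total=n_k)
print(llc_by_shots)
```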

    • Project: A minimal understanding-based eval. By scaling data- and weight-restricted LLCs to large open-weight models, and combining this with the above projects using these tools to reason about the effects of safety fine-tuning and adversarial training, we will produce an eval that goes beyond behavioral evaluation and demonstrate its application to simple examples of engineered deceptive alignment and backdoors (see the sketch below).
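      A sketch of the weight-restricted ingredient of such an eval, again reusing estimate_restricted_llc from above; the module names, loader, and sample count are hypothetical placeholders:

```python
# Weight-restricted LLC: perturb only the parameters of one sub-module while
# holding the rest frozen at w*, so the estimate reflects degeneracy localized
# to that part of the network. Comparing such estimates before and after
# safety fine-tuning or adversarial training is the kind of non-behavioral
# signal the eval would aggregate.
llc_block10_mlp = estimate_restricted_llc(
    model, eval_loader, n_total=n_eval,
    param_filter=lambda name: name.startswith("blocks.10.mlp"),
)
```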

  • I run Timaeus, a nonprofit AI safety research organization working on singular learning theory (SLT) for alignment. SLT establishes a connection between the geometry of the loss landscape and internal structure in models, which we are using to develop scalable, rigorous tools for evaluating, interpreting, and aligning neural networks.

  • Ideally, a theoretical background (~PhD) in physics, mathematics, or a related technical field, plus the ability to code and at least ARENA-level ML knowledge.

Executive Director at Timaeus