Stefan Heimersheim

  • Area: Mechanistic Interpretability

    I'm excited to mentor both fundamental and applied mechanistic interpretability projects. Fundamental projects could include:

    • Why are there "stable regions" in activation space, and what can this tell us about features and representations in neural networks?

    • Can we create useful toy models for mechanistic interpretability? In particular I'm currently thinking of toy models of modularity.

    In general, I want such projects to be aimed at solving a problem in interpreting LLMs (plus reward models, etc.), because this is what matters in practice (in the short term). I do work with non-LLM toy models, but only when I expect the insights to be useful on the path to interpreting state-of-the-art models. I am less interested in Sparse Autoencoder projects (e.g. applying SAEs to <new model type>) unless there's a reason to expect this to be useful.

    In terms of applied projects, I am interested in:

    • What is stopping internals-based "LLM lie detectors" from working well? Why are probe scores all over the place?

    • Can 2025-interpretability solve the 2023 trojan detection challenge? (Anthropic may have just done this; haven't read the paper yet)

    I'm also excited to work out new project ideas together with promising candidates.

    You can find a list of project ideas here.

  • I’m an interpretability researcher at Apollo Research. Previously, I completed a PhD in Astronomy and Neel Nanda’s MATS program. My research interests lie mostly in fundamental mech interp, new methods, and good toy models. You can find some of my takes in my short-form comments. I try to focus on neglected ideas, which in practice means I avoid plain SAE projects. You can find my past projects on LessWrong as well as arXiv. I’ve mentored around 9 projects across various programs over the past year [1,2,3,4,5, more in preparation], including LASR, SPAR, MARS, and Pivotal.

  • I typically run projects with a team of 1-4 mentees. I hold a regular weekly 1h call with the team to answer questions, discuss progress, and give advice on next steps. We also communicate via Slack, and I can typically respond to questions within 24h (faster for urgent and blocking issues). I encourage mentees to share research results, plots, etc. throughout the week.


    I expect mentees to lead the research project, running the experiments and writing up the results. I encourage mentees to take agency by thinking about next steps or new directions and by critically challenging my project plans; I’m always happy to give feedback on this. I expect mentees to prepare for our weekly meeting and to coordinate within their team. Usually this preparation takes the form of a simple slide deck (see here for what I’m aiming for) with a recap of last week’s goals, results from the current week, questions and surprises, and ideas for next week. I expect mentees to meet internally at least once a week to update each other and to prepare for the weekly call, and I recommend running coworking sessions throughout the week.

Research Scientist, Apollo Research