TL;DR

Podcast Summary by NotebookLM - [Apple] [Spotify]

Get started on running Neuronpedia: GitHub or Instant Deploy
Download 4+ terabytes of interpretability data: Public Datasets
Review a summary of Neuronpedia's features and submit feature requests.
Reach out to us via Slack #neuronpedia, email, or GitHub Issues
Check out the tweet thread with demos

The new open source Neuronpedia - still supporting researchers, but now open sourced.

Current Capabilities

A summary of Neuronpedia's current capabilities - not all features are listed:

Neuronpedia's current functionality map, including API, search, auto-interp, exploration tools, and more.

Our Purpose: Accelerate Interpretability Research
How We Got Here: A Brief History
Why Open Source Now?
What Do You Get From an Open Source Neuronpedia?
What Does Open Sourcing Not Change?
- Unchanged: Supporting the Newest Interpretability Research
- Unchanged: We're Still Your Collaborator
How to Get Started
- GitHub and Datasets
Remaining Work + WIPs
Support and Contact
Acknowledgements

Our Purpose: Accelerate Interpretability Research

Interpretability is an unsolved problem, especially with new models being released so frequently - anyone who tells you otherwise is either trying to sell you something or misinformed.

Is interpretability needed? While it's possible that advanced AI is somehow "naturally aligned" to be pro-human and pro-Earth, there's no benefit to assuming that this is true. It seems unlikely that all advanced AI would be fully aligned in all the possible scenarios and edge cases.

Neuronpedia's role is to accelerate understanding of AI models, so that when they get powerful enough, we have a better chance of aligning them. If we can increase the probability of a good outcome by even 0.01%, that's an expected value of saving many, many current and future lives - certainly a worthwhile and meaningful endeavor.

How We Got Here: A Brief History

Neuronpedia was created in the summer of 2023 as a reference for GPT2-Small's neurons, using data and code from OpenAI's Superalignment initiative and informally advised by William Saunders. Combined with data from Neel Nanda's Neuroscope, we tried to answer the question: how do you create a useful interpretability platform?

One early experiment was crowdsourcing human explanations for neurons. The hope was that, by using LLMs to score those explanations, we could fine-tune models that accurately explain neurons. This became a full-fledged graphical RPG:

The Neuronpedia game: complete with pixel art, animations, collectibles, leaderboards, and even an item shop. Neuronpedia collected over 10,000 user-submitted explanations, half of which scored higher than GPT-4.

However, the game, while interesting, was not the most impactful use of an interpretability platform - it was catering to one facet of a specific research question. We wanted to serve more researchers and their various areas of interest.

So in early 2024, thanks to the direction of Joseph Bloom, Neuronpedia sunset its game and went all-in on accelerating and supporting interpretability researchers, including hosting the world's first interpretability and steering API. Since then, we've collaborated with independent researchers, large organizations, and academic research labs.

Why Open Source Now?

It takes a significant amount of effort to make an app like Neuronpedia (frontend, backend, infrastructure, dependent services) ready for open source - there are considerations of security, backward compatibility, extensibility, etc. Not to mention, the database schema and architecture were changing significantly from week to week.

Two months ago, we decided that the longer-term benefits of open sourcing outweighed the short-term priority of quickly adding new functionalities, so we put most major changes on hiatus and focused on refactoring, cleaning up, and documenting code/processes for open sourcing. Over the next few weeks, we'll keep adding more features/guides for running, customizing, and extending Neuronpedia.

What Do You Get From an Open Source Neuronpedia?

New: Host Neuronpedia Yourself

For the impatient, you can instantly deploy a custom Neuronpedia by clicking Deploy:

Custom Neuronpedias are as simple as a few clicks. And you can git pull new changes when we deploy them, to stay updated with the latest fixes/features.

You can now host Neuronpedia yourself, with your own model and data, on your own cloud or hardware. And you can choose to make it public or private. For example - did you train new sparse coders on a particular dataset? You can load that into a local database and access all of Neuronpedia's features, like an API, steering, dashboards, search, lists, and more.
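As a rough illustration of what "access all of Neuronpedia's features, like an API" could look like against a self-hosted instance, here is a minimal Python sketch. It assumes that a local instance listens on `http://localhost:3000` and exposes a feature route shaped like `/api/feature/{model}/{source}/{index}`. Both the port and the route shape are assumptions for illustration, and so are the model, source, and index values; check the README and API documentation of your own deployment for the actual values.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def feature_url(base_url: str, model_id: str, source_id: str, index: int) -> str:
    """Build the URL for one feature.

    The /api/feature/{model}/{source}/{index} shape is an assumption,
    not a documented guarantee for self-hosted instances.
    """
    # Percent-encode each path segment so unusual IDs cannot break the path.
    parts = [quote(str(p), safe="") for p in (model_id, source_id, index)]
    return f"{base_url.rstrip('/')}/api/feature/" + "/".join(parts)


def fetch_feature(base_url: str, model_id: str, source_id: str, index: int,
                  timeout: float = 10.0) -> dict:
    """Fetch one feature as parsed JSON from a running instance."""
    with urlopen(feature_url(base_url, model_id, source_id, index),
                 timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical model, source, and index; replace with your own data.
    print(feature_url("http://localhost:3000", "gpt2-small", "0-res-jb", 14057))
```

Only `feature_url` runs without a live server; `fetch_feature` needs a running instance at the base URL you pass in.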

New: Fork Neuronpedia For Research, Projects, and Startups

We're grateful for the people and organizations that came before us to make Neuronpedia possible, and we're happy to have lit the fire to inspire various projects, both public and private. Now, you can customize Neuronpedia to your heart's content - everything from updating colors to adding new features.

For example, given a single prompt, Claude Code builds a new app, "Steerify", using Neuronpedia components, with no intervention or fixes:

Neuronpedia is composed of multiple services that can run independently, so you can also fork just the parts of Neuronpedia that you want and replace the other parts entirely. For example, if you want just the frontend of Neuronpedia but want to plug in a different system for inference, replace the Neuronpedia inference server, ensuring that your new inference server matches the defined OpenAPI spec.
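To make "matches the defined OpenAPI spec" a bit more concrete, here is a minimal Python sketch of a stand-in request handler that checks its own output against a response contract before returning it. Everything named here is hypothetical: the `SteerRequest` and `SteerResponse` shapes, their field names, and the toy backend are placeholders, not Neuronpedia's actual schema. The real contract is whatever the OpenAPI spec in the repository defines.

```python
import json
from dataclasses import asdict, dataclass


# Hypothetical request/response shapes, for illustration only.
@dataclass
class SteerRequest:
    prompt: str
    feature_index: int
    strength: float


@dataclass
class SteerResponse:
    default_output: str
    steered_output: str


# Hypothetical response contract: field name -> required type.
REQUIRED_RESPONSE_FIELDS = {"default_output": str, "steered_output": str}


def my_inference_backend(req: SteerRequest) -> SteerResponse:
    """Stand-in for your own inference system (a toy echo, no model)."""
    steered = f"{req.prompt} [feature {req.feature_index} x{req.strength}]"
    return SteerResponse(default_output=req.prompt, steered_output=steered)


def handle_steer(raw_body: str) -> tuple[int, str]:
    """Parse a JSON request, call the backend, and enforce the contract."""
    try:
        payload = json.loads(raw_body)
        req = SteerRequest(
            prompt=str(payload["prompt"]),
            feature_index=int(payload["feature_index"]),
            strength=float(payload["strength"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return 400, json.dumps({"error": f"bad request: {exc}"})

    resp = asdict(my_inference_backend(req))
    # Reject any response that drifts from the contract, so the frontend
    # never receives a payload it does not expect.
    for name, expected_type in REQUIRED_RESPONSE_FIELDS.items():
        if not isinstance(resp.get(name), expected_type):
            return 500, json.dumps({"error": f"field {name!r} violates contract"})
    return 200, json.dumps(resp)
```

In practice you would more likely generate the server stubs or validators from the OpenAPI spec itself rather than hand-writing the field list. The hand-written check above only shows the idea of validating a replacement server at its boundary.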

Subpoint: Applied Interpretability

This opens up many possibilities for applied interpretability projects, both safety-related and not. For example, we experimented with building "Safety Dashboards" for specific models - with just a few tweaks to the steering interface, we built a red-teaming prototype, which could have been easily extended to specific topics like deception and violence.

New: Contributing Code to the Global Neuronpedia

With every research collaboration, Neuronpedia adds new functionality for everyone to use - not just that specific research project. For example, when we added AxBench, we added the ability to explore and search "concepts", an alternative to SAE features.

But collaborations like that don't scale well because they're bottlenecked on our engineering resources, as well as the time it takes to communicate and verify implementations with the researcher. By open sourcing, we can each focus on our respective strengths - the researcher can implement the technical parts related to their project, and we fill in the gaps for UI, scaling, etc. where needed. Finally, merging this into the global neuronpedia.org enables everyone to benefit.

Subpoint: Using AI Assist and Coding Agents

We're optimistic about using AI coding agents to accelerate the development of new features for Neuronpedia, and encourage you to try it. We've added some Cursor Rules files to give AI the proper context, and have done some early experiments to gauge AI capability in adding new functionality to Neuronpedia.

New: Roadmap (WIP) + Public Feature Requests

You can submit feature requests, bug reports, and questions on GitHub issues.

We're currently working on porting our TODOs and roadmap from our private task management lists to our GitHub issues and discussions. GitHub is also the place where we'll list known issues and workarounds.

New: Global Datasets - Export and Import (WIP)

Let's say you came up with a new type of sparse coder, called the Sparse PuppyEncoder (SPE), and you want not only to host it yourself, but also to let others load Sparse PuppyEncoders (and your generated data) into their own local Neuronpedia and tinker with them. You can do that, too - just use the neuronpedia-utils export tools to export your activations, explanations, and metadata. Then, we upload your data to the public neuronpedia-datasets, so that anyone running a local instance can import it.

Here's importing datasets from the global Neuronpedia bucket to a localhost instance:

What Does Open Sourcing Not Change?

Unchanged: Supporting the Newest Interpretability Research

A key to Neuronpedia is moving quickly to adopt the latest in interpretability science - and updating our interfaces, schemas, and APIs to match. Nothing changes about our own development of new Neuronpedia features - we have a long list of functionalities we're itching to get back to adding and fixing.

What slightly changes is that we'll be more cautious and better document significant changes, especially with database schema updates, to ensure stability for other instances of Neuronpedia.

Unchanged: We're Still Your Collaborator

By open sourcing Neuronpedia, we make it easier for researchers to get a head start on changes they wish to add or make to Neuronpedia. However, we fully expect that in the near term, while the documentation is still getting fleshed out, we'll still do the majority of the work in building out collaborations. There's still work for us to do in restructuring Neuronpedia to be more easily extensible for someone not intimately familiar with the codebase.

How to Get Started

All documentation for getting started is in the READMEs in our public repository. All methods have been tested, but some details are a work-in-progress. Feel free to contact us.

GitHub and Datasets

GitHub
- Neuronpedia is a monorepo with multiple services and packages
- Multiple tutorials / methods for setting it up, depending on your level of customization
- Demo environment is publicly available to connect to (database, inference instances)
Datasets (4+ TB)
- Includes all current public Neuronpedia data
- 11 models
- 60+ million latents/features/concepts, 50+ million explanations, 3+ billion activations

Remaining Work + WIPs

We wanted to make the open source Neuronpedia available ASAP so that people can start tinkering with it. The biggest risk is that the code is not well-documented or well-structured enough for people to make use of it. To that end, our immediate near-term goals are:

Better documentation of code and processes (eg tutorials on how to load your own sparse coders and dashboards)
- Examples of contributions and templated issues
Automation and pipelining of processes in a resilient way (eg generating and uploading activation data, and automatically retrying on failure)
Easier extensibility (eg refactoring/documenting so that it's obvious where/how to add a new visualization to Neuronpedia)
- Design System + Cleanup
Tests for all services, and integrating them into the CI/CD

Support and Contact

Neuronpedia is a platform that's constantly evolving for the latest in interpretability science, so your feedback, feature requests, and bug reports are critical to its success. Reach us via:

GitHub Issues: Good for longer-form feature requests and specific bugs
Slack #neuronpedia or #general: Good for quick questions and general tips
Email: Good for high priority issues, privacy-sensitive communication, and collaborations

Acknowledgements

We're always standing on the shoulders of giants, and we're grateful for every one of our collaborators, supporters, and advisors. Here are some people and organizations that have made especially important contributions to the Neuronpedia project. Thank you!

William Saunders
Joseph Bloom
Neel Nanda
The Long Term Future Fund
AISTOF
OpenAI's Superalignment Team

We hope that by open sourcing Neuronpedia, we can advance research and go further together in building public, shared interpretability tools - and maybe have some fun along the way.

The Residual Stream

Neuronpedia's Blog
