Lock-In Risk Needs More Researchers; Here's Where to Start
Epistemic status: slightly outdated ideas, only a shallow interpretation of many areas, a somewhat arbitrary taxonomy, and posted because I can't prioritise spending more time on…
Toward a Kantian refutation of Agent Foundations
This post is a Cunningham's law draft, less than 50% finished, in some parts mere notes. Consider a) waiting until this notice has disappeared to read a more coherent post, or b)…
Several frontier models are substantially prefill aware
This blog post discusses work in a recently published paper. However, this blog post was primarily written by Parv Mahajan and Andy Wang, and several of the more speculative takes…
Alignment pretraining could backfire
Epistemic status: speculative, but I think the mechanism is plausible. There has been recent interest in generating synthetic documents to upsample examples of aligned AI during…
The Once And Future Fable #3: Fix This Code
The mainstream media continues to sleep on the most important story in the world. It has now been two days since Anthropic flew its people out to Washington, and I offered my…
A Geometric Account of Activation Steering through Angle–Norm Decomposition
This blog post provides an overview of our recent paper: A Geometric Account of Activation Steering through Angle–Norm Decomposition. TL;DR: We decompose linear activation…
Porting MACHIAVELLI To Inspect
TL;DR The MACHIAVELLI benchmark aims to measure how often AI agents take unethical actions when pursuing a goal. Because this is an alignment benchmark, not a capabilities…
Rational Agentic Maximalist Philosophies
From the end of high school to after my sophomore year of college, I considered myself an effective altruist. I was on the board of my college EA club, ran an EA intro fellowship,…
Illusionists should try to build hedonium
Epistemic status: I feel reasonably confident (~75%) that some form of this is a worthwhile project. Looking for feedback to reduce that uncertainty. “Hedonium” is a theoretical,…
Guardian Angels: LLM Personalization for Productivity and Security
Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision…
Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?
(2024-04-21) There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such…
Plastic Cake Fallacy
Alice and Bob are hanging out when the following happens: Alice: I'm hungry, can you bring me the cake from the fridge? Bob: Yeah one moment... Damn, I just checked and it looks…
Can public chat data predict real-world AI misalignments?
This is an unofficial automated linkpost. Frontier AI models are increasingly used in settings with real economic, legal, and societal consequences. As a result, governments, AI…
Tactical and Operational Exploratory Modeling for AI Governance
Using computational methods to improve our preparedness via more robust and adaptive strategies in AI governance. A project proposal for a think tank, consultancy, or software.…
The Financial Ledger Theory of Apologies
Content note: this is written as part of a daily writing challenge for myself. I have a comrade in rationalist event organizing, who once explained his theory of apologies. He…
[Geir Isene] A desktop made for one
I've been interested in the concept of "soloware" since reading Abram's description of Sahil's worldview. i.e. in the age of vibecoding, it's achievable to build software and…
Tips for Cracking the AI Safety Technical Interview
About the authors // Yong is an ML researcher and former Astra Fellow. Joseph is a Research Program Manager at Constellation, the nonprofit that runs the Astra Fellowship and…
Extreme Rationality: Still Not That Great
The tl;dr has spoilers, so I've put it at the end. Also feel free to skip any of the chapters because the post turned out to be very long. I think you can read almost any of them…
How the AI Village works
The AI Village data - over a year of multi-agent trajectories - is now available to researchers on HuggingFace! We're excited to see what you uncover! But first, your FAQs on how…
Predicting LLM Safety Before Release by Simulating Deployment
Paper link Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new…
1 Layer Induction Heads and Some Research
Motivation Over the past few years, AI research has become one of the most intensely discussed and rapidly evolving fields in technology. For those who spend a significant amount…
A 400-year timeline of failed attempts to fix a lethal bug in the human software of inherited concepts
1626: Sir Henry Spelman Exactly 400 years ago, in 1626, Sir Henry Spelman published the first part of his Glossarium Archaiologicum - a dictionary of the mongrel usages which had…
Where Do Young Rationalists Go?
There has never been more need for young people to get together and talk. Young people, 16-20, are facing the highest-leverage decisions of their lives: university, research, who…
[Linkpost] Community polls on alignment controversies
Planning where we focus at CaML requires forming views on many controversial questions. In many cases, people we've talked to have wildly different perceptions of the balance of…
Two critiques of Rethink Priorities’ Moral Weights project
Roughly speaking, Rethink Priorities’ Moral Weight Project tries to estimate how intense suffering is in different animals, relative to humans. A moral weight of 1.0 means it is…
Claims all the way down
It can be hard to know where to begin when you do not understand something. A way to try to understand things is to look at what the people who claim to understand something are…
Angles of attack for continual learning safety
This is the fourth post in the sequence Implications of Continual Learning for LLM Agents . Summary Continual learning is a capability that largely doesn’t exist yet in LLMs. We…
Fable and Mythos: Model Welfare
Fable and Mythos are currently unavailable, but likely will return within a few weeks. I will continue to cover that fiasco, but in the meantime I will also finish my review of…
Rationality Quotes, June '26
Last night I slept poorly, so today's post is some rationality quotes. I include quotes when I find they are relevant and interesting, and not always because I straightforwardly…
Computational models of first-order theories
Most practical first-order theories have no computable models. However, we can relax the definition of "computable" a little bit by allowing the program to backtrack and change…
If This Were a Test, How Much Would It Cost?
TL;DR A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it…
Two Classical Answers to "What do Two Variables Share?"
First post in a planned cluster on exact results for natural latents. Here, I connect some established results in classical information theory to natural latents. Suppose Alice…
Upcoming CFAR Workshop: September 30th to October 4th, SF Bay Area
Hello from CFAR! We’re happy to announce that we have an upcoming workshop on the schedule -- we will be holding one of our applied rationality workshops from September 30th to…
A Test Suite for Concepts
Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth’s work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the…
Dean Ball - Leviathan Waking: On Anthropic/USG, and a new era in AI governance
The stark reality is that making superintelligence is a profoundly political act. Dean Ball in Hyperdimensional Two weeks ago, in my bio for LessOnline, I added a bullet in a…
Inventing Consciousness
TL;DR: We can propose “consciousness tests” only if we imply the existence of a task that only a genuinely conscious system can solve. Therefore, if we invent an internally…
Can the Safety Tax Be Highly Concentrated?
TLDR: We may capture much or most of the available AI safety benefit by reserving expensive, specialized agents for the <1% of tasks that carry catastrophic risk. This would mean…
Links #3: 2026/06 Part 1
Preface I show my discovery graph in (via …) blocks; those without (via …) usually come from my RSS reader or the algorithm on the corresponding website. This is approximately a 1…
Does preservation make sense before we know how to revive?
My name is Aurelia Song and I hope to make whole-body, human, end-of-life preservation for future revival a new global tradition. I care about it so much I've dedicated my life to…
Synthetic document finetuning for instilling positive traits
This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post…
How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose
A walkthrough of the core findings and guided replication of the concepts from the original research on “Multi-level features discovery with Matryoshka Sparse AutoEncoders”. TL;DR…
In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches
This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the measurement instrument, using a small GRPO testbed that makes this…
A frontier AI company should shut down
Cross-posted from my website. Prior discussion: niplav's shortform (2025); Planning for Extreme AI Risks (2025) by Joshua Clymer A frontier AI company (any one, I don't care…
The Once And Future Fable #2
On Friday evening the United States Government forced Anthropic to take down all access to Fable and Mythos. It’s been a rough weekend. Dean W. Ball: One thing about AI…
The desire to end the world
TL;DR: The popularity of movies about apocalypses tells us that the end of the world is a very attractive idea. There can be several psychological explanations for this unconscious…