NewsWatchdog.com — Less Wrong

Less Wrong 7h ago

Lock-In Risk Needs More Researchers; Here's Where to Start

by Alfie Lamerton

Epistemic status: slightly outdated ideas, only a shallow interpretation of many areas, a somewhat arbitrary taxonomy, and posted because I can't prioritise spending more time on…

⚡70 · 🛡100

Less Wrong 11h ago

Toward a Kantian refutation of Agent Foundations

by Fernand0

This post is a Cunningham's law draft, less than 50% finished, in some parts mere notes. Consider a) waiting until this notice has disappeared to read a more coherent post, or b)…

⚡64 · 🛡100

Less Wrong 7h ago

Several frontier models are substantially prefill aware

by yeedrag

This blog post discusses work in a recently published paper. However, this blog post was primarily written by Parv Mahajan and Andy Wang, and several of the more speculative takes…

⚡59 · 🛡100

Less Wrong 11h ago

Alignment pretraining could backfire

by Alexandre Variengien

Epistemic status: speculative, but I think the mechanism is plausible. There has been recent interest in generating synthetic documents to upsample examples of aligned AI during…

⚡53 · 🛡100

Less Wrong 10h ago

The Once And Future Fable #3: Fix This Code

by Zvi

The mainstream media continues to sleep on the most important story in the world. It has now been two days since Anthropic flew its people out to Washington, and I offered my…

⚡50 · 🛡100

Less Wrong 9h ago

A Geometric Account of Activation Steering through Angle–Norm Decomposition

by Atmyre

This blog post provides an overview of our recent paper: A Geometric Account of Activation Steering through Angle–Norm Decomposition. TL;DR: We decompose linear activation…

⚡48 · 🛡100

Less Wrong 7h ago

Porting MACHIAVELLI To Inspect

by Koby Lewis

TL;DR: The MACHIAVELLI benchmark aims to measure how often AI agents take unethical actions when pursuing a goal. Because this is an alignment benchmark, not a capabilities…

⚡45 · 🛡100

Less Wrong 22h ago

Rational Agentic Maximalist Philosophies

by Connor Blake

From the end of high school to after my sophomore year of college, I considered myself an effective altruist. I was on the board of my college EA club, ran an EA intro fellowship,…

⚡39 · 🛡100

Less Wrong 12h ago

Illusionists should try to build hedonium

by Jack Thompson

Epistemic status: I feel reasonably confident (~75%) that some form of this is a worthwhile project. Looking for feedback to reduce that uncertainty. “Hedonium” is a theoretical,…

⚡37 · 🛡100

Less Wrong 21h ago

Guardian Angels: LLM Personalization for Productivity and Security

by gwern

Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision…

⚡35 · 🛡100

Less Wrong 22h ago

Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

by gwern

(2024-04-21) There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such…

⚡33 · 🛡100

Less Wrong 17h ago

Plastic Cake Fallacy

by nika koghuashvili

Alice and Bob are hanging out when the following happens: Alice: I'm hungry, can you bring me the cake from the fridge? Bob: Yeah one moment... Damn, I just checked and it looks…

⚡32 · 🛡100

Less Wrong 21h ago

Can public chat data predict real-world AI misalignments?

by papetoast

This is an unofficial automated linkpost. Frontier AI models are increasingly used in settings with real economic, legal, and societal consequences. As a result, governments, AI…

⚡30 · 🛡100

Less Wrong 23h ago

Tactical and Operational Exploratory Modeling for AI Governance

by Dawn Drescher

Using computational methods to improve our preparedness via more robust and adaptive strategies in AI governance. A project proposal for a think tank, consultancy, or software.…

⚡28 · 🛡100

Less Wrong 18h ago

The Financial Ledger Theory of Apologies

by Ben Pace

Content note: this is written as part of a daily writing challenge for myself. I have a comrade in rationalist event organizing, who once explained his theory of apologies. He…

⚡27 · 🛡100

Less Wrong 22h ago

[Geir Isene] A desktop made for one

by Raemon

I've been interested in the concept of "soloware" since reading Abram's description of Sahil's worldview, i.e., in the age of vibecoding, it's achievable to build software and…

⚡26 · 🛡100

Less Wrong 23h ago

Agents are under-elicited: A case study in optimization tasks

by zef

⚡22 · 🛡100

Less Wrong 1d ago

Tips for Cracking the AI Safety Technical Interview

by Yong

About the authors // Yong is an ML researcher and former Astra Fellow. Joseph is a Research Program Manager at Constellation, the nonprofit that runs the Astra Fellowship and…

⚡21 · 🛡100

Less Wrong 1d ago

Extreme Rationality: Still Not That Great

by игорь тимофеев

The tl;dr has spoilers, so I've put it at the end. Also feel free to skip any of the chapters because the post turned out to be very long. I think you can read almost any of them…

⚡18 · 🛡100

Less Wrong 1d ago

How the AI Village works

by Adam B

The AI Village data - over a year of multi-agent trajectories - is now available to researchers on HuggingFace! We're excited to see what you uncover! But first, your FAQs on how…

⚡17 · 🛡100

Less Wrong 1d ago

Predicting LLM Safety Before Release by Simulating Deployment

by Tomek Korbak

Paper link. Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new…

⚡16 · 🛡100

Less Wrong 1d ago

1 Layer Induction Heads and Some Research

by Goutham Nalagatla

Motivation: Over the past few years, AI research has become one of the most intensely discussed and rapidly evolving fields in technology. For those who spend a significant amount…

⚡15 · 🛡100

Less Wrong 1d ago

A 400-year timeline of failed attempts to fix a lethal bug in the human software of inherited concepts

by Bruce Middleton

1626: Sir Henry Spelman. Exactly 400 years ago, in 1626, Sir Henry Spelman published the first part of his Glossarium Archaiologicum - a dictionary of the mongrel usages which had…

⚡15 · 🛡100

Less Wrong 1d ago

Where Do Young Rationalists Go?

by fluxxrider

There has never been more need for young people to get together and talk. Young people, 16-20, are facing the highest-leverage decisions of their lives: university, research, who…

⚡14 · 🛡100

Less Wrong 1d ago

[Linkpost] Community polls on alignment controversies

by Jasmine Brazilek

Planning where we focus at CaML requires forming views on many controversial questions. In many cases, people we've talked to have wildly different perceptions of the balance of…

⚡13 · 🛡100

Less Wrong 1d ago

Two critiques of Rethink Priorities’ Moral Weights project

by Bill Jackson

Roughly speaking, Rethink Priorities’ Moral Weight Project tries to estimate how intense suffering is in different animals, relative to humans. A moral weight of 1.0 means it is…

⚡12 · 🛡100

Less Wrong 1d ago

Claims all the way down

by Jasper Blank

It can be hard to know where to begin when you do not understand something. A way to try to understand things is to look at what the people who claim to understand something are…

⚡12 · 🛡100

Less Wrong 1d ago

Angles of attack for continual learning safety

by Rauno Arike

This is the fourth post in the sequence Implications of Continual Learning for LLM Agents. Summary: Continual learning is a capability that largely doesn’t exist yet in LLMs. We…

⚡11 · 🛡100

Less Wrong 1d ago

Fable and Mythos: Model Welfare

by Zvi

Fable and Mythos are currently unavailable, but likely will return within a few weeks. I will continue to cover that fiasco, but in the meantime I will also finish my review of…

⚡11 · 🛡100

Less Wrong 1d ago

Rationality Quotes, June '26

by Ben Pace

Last night I slept poorly, so today's post is some rationality quotes. I include quotes when I find they are relevant and interesting, and not always because I straightforwardly…

⚡10 · 🛡100

Less Wrong 1d ago

Computational models of first-order theories

by MathMart

Most practical first-order theories have no computable models. However, we can relax the definition of "computable" a little bit by allowing the program to backtrack and change…

⚡9 · 🛡100

Less Wrong 1d ago

If This Were a Test, How Much Would It Cost?

by VojtaKovarik

TL;DR: A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it…

⚡9 · 🛡100

Less Wrong 1d ago

Two Classical Answers to "What do Two Variables Share?"

by Haru

First post in a planned cluster on exact results for natural latents. Here, I connect some established results in classical information theory to natural latents. Suppose Alice…

⚡8 · 🛡100

Less Wrong 1d ago

Upcoming CFAR Workshop: September 30th to October 4th, SF Bay Area

by Davis_Kingsley

Hello from CFAR! We’re happy to announce that we have an upcoming workshop on the schedule -- we will be holding one of our applied rationality workshops from September 30th to…

⚡8 · 🛡100

Less Wrong 1d ago

A Test Suite for Concepts

by Gretta Duleba

Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth’s work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the…

⚡8 · 🛡100

Less Wrong 1d ago

Dean Ball - Leviathan Waking: On Anthropic/USG, and a new era in AI governance

by JohnofCharleston

“The stark reality is that making superintelligence is a profoundly political act.” (Dean Ball, in Hyperdimensional) Two weeks ago, in my bio for LessOnline, I added a bullet in a…

⚡7 · 🛡100

Less Wrong 1d ago

Inventing Consciousness

by vasilisk

TL;DR: We can propose “consciousness tests” only if we imply the existence of a task that only a genuinely conscious system can solve. Therefore, if we invent an internally…

⚡7 · 🛡100

Less Wrong 2d ago

Can the Safety Tax Be Highly Concentrated?

by ozziegooen

TLDR: We may capture much or most of the available AI safety benefit by reserving expensive, specialized agents for the <1% of tasks that carry catastrophic risk. This would mean…

⚡6 · 🛡100

Less Wrong 2d ago

Links #3: 2026/06 Part 1

by papetoast

Preface: I show my discovery graph in (via …) blocks; those without (via …) usually come from my RSS reader or the algorithm of the corresponding website. This is approximately a 1…

⚡6 · 🛡100

Less Wrong 2d ago

Does preservation make sense before we know how to revive?

by Aurelia

My name is Aurelia Song and I hope to make whole-body, human, end-of-life preservation for future revival a new global tradition. I care about it so much I've dedicated my life to…

⚡5 · 🛡100

Less Wrong 2d ago

Synthetic document finetuning for instilling positive traits

by CallumMcDougall

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post…

⚡5 · 🛡100

Less Wrong 2d ago

How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose

by baimamboukar

A walkthrough of the core findings and guided replication of the concepts from the original research on “Multi-level features discovery with Matryoshka Sparse AutoEncoders”. TL;DR…

⚡4 · 🛡100

Less Wrong 2d ago

In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches

by JulesRoussel01

This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the measurement instrument, using a small GRPO testbed that makes this…

⚡4 · 🛡100

Less Wrong 2d ago

A frontier AI company should shut down

by MichaelDickens

Cross-posted from my website. Prior discussion: niplav's shortform (2025); Planning for Extreme AI Risks (2025) by Joshua Clymer. A frontier AI company (any one, I don't care…

⚡4 · 🛡100

Less Wrong 2d ago

The Once And Future Fable #2

by Zvi

On Friday evening the United States Government forced Anthropic to take down all access to Fable and Mythos. It’s been a rough weekend. Dean W. Ball: One thing about AI…

⚡4 · 🛡100

Less Wrong 1d ago

The desire to end the world

by avturchin

TL;DR: The popularity of movies about apocalypses tells us that the end of the world is a very attractive idea. There can be several psychological explanations for this unconscious…

⚡1 · 🛡100