Opus 4.6 and GPT-5.3-Codex were both released last week. And beyond the hooha and the kerfuffle and the METR benchmarks and the posturing over who is going to develop AGI first, I thought of a now quite famous scene from The Office.
Michael Scott has left Dunder Mifflin to found his own paper company. It's going well — suspiciously well. He's stealing clients left and right, because his prices are impossibly low. What Dunder Mifflin doesn't know is that the Michael Scott Paper Company is about to go broke. When Michael set his prices, he treated his costs as fixed and didn't realise they'd grow as he scaled. That's how his prices got so low, and that's how he stole so many clients.
But Dunder Mifflin has some idea of this. So David Wallace, the CFO, makes his move. He offers Michael a buyout and leans on the obvious leverage:
"Your company is four weeks old. I know this business. I know what suppliers are charging. I know you can't be making very much money. I don't know how your prices are so low, but I know it can't keep up that way. I'm sure you're scared. Probably in debt. This is the best offer you're gonna get."
And Michael (who is, underneath all the buffoonery, occasionally the most perceptive person in the room) says:
"I'll see your situation and I'll raise you a situation. Your company is losing clients left and right. You have a stockholder meeting coming up and you're going to have to explain to them why your most profitable branch is bleeding. So they may be looking for a little change in the CFO. So I don't think I need to wait out Dunder Mifflin. I think I just have to wait out you."
Michael wins. Not because his position is strong (it isn't, he's weeks from bankruptcy) but because he correctly identifies that Wallace's position is weaker than it looks on a shorter timescale. Wallace has a stockholder meeting. Michael just has to outlast the clock.
I think about this when I think about open-source AI.
The Intelligence Ceiling
Every model has an intelligence ceiling: the fraction of economically valuable work it can automate at a cost and quality where you'd be irrational not to deploy it. This is a lower bar than human-level performance — a model that's 80% as good as a domain expert but 200x cheaper and runs in seconds has cleared it comfortably for an enormous category of work. And this ceiling isn't fixed. It rises along two independent axes: models get smarter, and inference gets cheaper. Both expand the set of tasks worth automating, and they compound.
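The ceiling can be made concrete with a toy deployment rule. The sketch below is my own illustrative model, not something from any lab's pricing: a task clears the ceiling when the model is good enough relative to the human alternative and cheap enough that not deploying it would be irrational. All thresholds and dollar figures are invented assumptions.

```python
# Toy model of the "intelligence ceiling": a task is worth automating when a
# model clears a quality bar at a cost sufficiently below the human
# alternative. Every number here is an illustrative assumption.

def worth_automating(model_quality: float,
                     model_cost_per_task: float,
                     human_quality: float = 1.0,
                     human_cost_per_task: float = 20.0,
                     min_relative_quality: float = 0.8,
                     min_cost_advantage: float = 10.0) -> bool:
    """A task clears the ceiling if the model is good enough relative to a
    human AND cheap enough that deploying it is the obviously rational move."""
    good_enough = model_quality >= min_relative_quality * human_quality
    cheap_enough = human_cost_per_task >= min_cost_advantage * model_cost_per_task
    return good_enough and cheap_enough

# The 80%-as-good, 200x-cheaper case from the text clears comfortably:
print(worth_automating(model_quality=0.8, model_cost_per_task=0.1))  # True
# The same quality at near-human cost does not:
print(worth_automating(model_quality=0.8, model_cost_per_task=5.0))  # False
```

Both axes in the text map directly onto this rule: smarter models raise `model_quality`, cheaper inference lowers `model_cost_per_task`, and either movement alone flips new tasks over the threshold.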
But most of that ceiling can only be reached through specialisation. A general-purpose frontier model, accessed through an API, will clear the bar for some tasks out of the box — the ones that look like the internet, the ones where general capability is enough. But most enterprise work lives in the long tail: proprietary edge cases, domain-specific data formats, error modes that only surface in production, workflows where "good enough" means something precise and unforgiving. For that work, you need to train the model on your domain. You need to run reinforcement learning against your own reward signal, on your own data. You need open weights.
This is worth being precise about. I'm not making a vague claim that fine-tuning is useful. I'm saying that for the majority of economically valuable tasks, the gap between what a general-purpose model can do via prompting and what a specialised model can do via post-training is large.
This is what I saw building specialised models at Parsed, and what I continue to see at Baseten: a sub-80B model fine-tuned on a client's claims data will outperform GPT-5.2 on that client's claims workflow, not because it's a smarter model, but because it's been trained on the actual distribution it needs to perform on. And it will do it faster, cheaper, and without sending proprietary data to a third party.
Sholto Douglas put it well:
"Even if algorithmic progress stalls out, and we just never figure out how to keep progress going — which I don't think is the case, that hasn't stalled out yet, it seems to be going great — the current suite of algorithms are sufficient to automate white collar work provided you have enough of the right kinds of data."
The key phrase is the right kinds of data. The algorithms exist. The base models are good enough. The bottleneck is task-specific training data and the RL loops to learn from it. That's specialisation. And specialisation, structurally, is where open-source wins.
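The loop in question is conceptually simple: sample from a model, score outputs against a domain-specific reward, and upweight what scored well. Here is a deliberately miniature sketch of that shape, with the "model" reduced to a weighted distribution over canned outputs and the reward checking a made-up in-house claims-ID convention; none of this is a real training API, just the structure of RL against your own reward signal.

```python
import random

# Miniature sketch of RL against a domain-specific reward signal. The
# "policy" is a weighted distribution over canned outputs; the reward checks
# a made-up domain convention (claim IDs must look like "CLM-<digits>").
# Purely illustrative assumptions throughout.

random.seed(0)

outputs = ["CLM-1042", "claim #1042", "CLM-77", "ticket 77"]
weights = [1.0, 1.0, 1.0, 1.0]  # uniform "policy" before training

def domain_reward(text: str) -> float:
    """Reward 1.0 iff the output follows the (invented) in-house format."""
    prefix, sep, digits = text.partition("-")
    return 1.0 if prefix == "CLM" and sep and digits.isdigit() else 0.0

def train_step(lr: float = 0.5) -> None:
    """Sample from the policy, score with the domain reward, and upweight
    rewarded outputs — a reward-weighted update in miniature."""
    for _ in range(32):
        i = random.choices(range(len(outputs)), weights=weights)[0]
        weights[i] += lr * domain_reward(outputs[i])

for _ in range(20):
    train_step()

best = outputs[max(range(len(outputs)), key=lambda i: weights[i])]
print(best)  # the policy concentrates on the domain-conformant outputs
```

The point of the toy: nothing in the loop is proprietary except `domain_reward` and the data it encodes. That is exactly the ingredient a frontier lab behind an API cannot supply and an open-weights deployment can.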
The Timing Asymmetry
The conventional narrative is that the frontier labs will always be ahead. They have the capital, the talent, the data, and the compute. Open-source will lag behind by some number of months. And the Chinese labs that have been subsidising open-source development at enormous financial loss will eventually run out of money, at which point the gap widens and the game is over.
I think this narrative is wrong, but not for the reason most people think. I don't think open-source needs to match frontier models on raw capability. It needs to match them on the intelligence ceiling, which is the fraction of economically valuable work that can actually be automated. And because most of that ceiling lives in the specialisation layer, open-source is already ahead in the metric that matters.
But grant the premise for a moment. Grant that open-source base models will always lag closed-source by some margin. Grant that the Chinese labs eventually stop subsidising. The question is: what will the intelligence ceiling be when this happens?
This is the timing asymmetry. The frontier labs have a capital-dependency clock. To train the next model and stay ahead, they need to raise billions from VCs who need returns. The only way to raise that capital is to convince investors that the margin between closed-source and open-source capability is worth paying for. That margin is their entire business model.
Open-source has a different clock. Every new base model release, every hardware generation, every inference optimisation expands the set of tasks that a specialised open-source model can automate profitably. And critically, this clock ticks along two independent axes. Model capability improves: GLM-5 is better than GLM-4.7, and the fine-tuning ceiling rises with it. And serving costs fall: GB200-class systems are resetting the economics of GPU compute, dragging token pricing downward and making it viable to serve models that would have been cost-prohibitive a year ago. SemiAnalysis has quantified this dynamic extensively; each new hardware generation doesn't just make existing deployments cheaper, it makes entirely new deployments economically rational for the first time ("You've made a classic blunder! You forgot to consider Jevons paradox!").
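Because the two axes are independent, they compound multiplicatively. A back-of-envelope version, with growth rates invented purely for illustration:

```python
# Back-of-envelope sketch of the "two independent axes" compounding.
# Growth rates below are invented assumptions, not measurements.

quality = 1.0  # normalised model quality today
cost = 1.0     # normalised serving cost per token today

for year in range(3):
    quality *= 1.3  # assumed: open models get ~30% better per year
    cost *= 0.5     # assumed: serving cost halves per year

capability_per_dollar = quality / cost
print(round(capability_per_dollar, 2))  # roughly a 17.6x gain in three years
```

Neither a 30% annual quality gain nor a 2x annual cost drop is dramatic on its own, but their product moves the set of profitably automatable tasks far faster than either axis suggests in isolation.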
So the frontier labs need the gap to stay wide enough to justify their capital requirements. Open-source just needs the ceiling to keep rising. These are very different pressures, and time favours the latter.
Why the Big Labs Can't Just Pivot
The obvious counter-argument: if specialisation is where the value lives, why can't OpenAI or Anthropic or Google just offer fine-tuning? They already have fine-tuning APIs. What stops them from capturing the specialisation layer too?
The answer is that capturing the specialisation layer requires a fundamentally different organisational shape. Training a single frontier model is a concentrated effort: one massive compute allocation, one research team, one training run, one model that you then sell to everyone. This is what the Big Labs are built to do, and they are extraordinarily good at it.
Specialisation is the opposite. It's a thousand small efforts, each embedded in a different customer's workflow, iterating on their data, learning their edge cases, building training loops around their specific failure modes. It requires domain expertise the lab doesn't have, data the lab can't access, and iteration cycles that happen at the customer's pace, not the lab's. You can't do this from behind an API. You need open weights, custom training infrastructure, and people who understand both the ML and the domain deeply. In addition, I've talked at length here about why large models are hard to specialise at a general level.
This is the classic innovator's dilemma. The Big Labs are structured to do one thing at enormous scale, and that thing is becoming less differentiated over time. For instance, Kimi K2.5, Minimax M2.5, and GLM-5 were all released in the last week, with Deepseek-V4 and Qwen-3.5 on the horizon. The specialisation game, in contrast, requires doing a thousand things at small scale, and each of those things creates compounding value that's hard to replicate. OpenAI offering an API fine-tuning endpoint is to genuine specialisation what a hotel concierge recommending a restaurant is to actually knowing how to cook.
As Sarah Guo has pointed out, the economically efficient outcome is that a company should be willing to pay anything up to its profit on a given task for the most model intelligence it can get on that task, because if you won't, someone else will, and they'll provide a better customer experience. This cuts against the "one model to rule them all" approach: if a specialised model delivers more capability on your specific task, and the open-source ecosystem makes it economically viable to build and serve that model, the frontier model is overpriced and underperforming for your use case. The rational buyer leaves.
Dario Amodei describes the AI industry settling into a Cournot equilibrium: a small number of firms, high barriers to entry, positive margins, like cloud. Maybe. But that equilibrium describes the pre-training layer, which is indeed expensive and concentrated. The specialisation layer has the opposite structure: low barriers to entry, thousands of players, value created through domain depth rather than compute scale. The frontier labs might sustain a comfortable oligopoly on base model training while the majority of economic value accrues to the specialisation layer, the way Intel was profitable making chips while the real wealth was captured by the companies building on top of them.
The RL Generalisation Counter
The strongest version of the case against open-source isn't that the Big Labs will pivot to specialisation. It's that they won't need to.
The argument goes like this. RL scaling is showing the same log-linear improvements that pre-training showed. As frontier labs train on a broader and broader set of RL tasks — first math, then code, then a wide variety of agentic work — the models will generalise. Just as GPT-2 generalised beyond its training distribution when trained on a broad enough internet scrape, RL-trained models will generalise beyond their specific training tasks when trained on a broad enough set of RL environments. At some point, the frontier model clears the automation-worthiness threshold for most enterprise tasks without any fine-tuning, and the specialisation gap closes from above.
I take this seriously. It's probably the strongest argument that the frontier labs are going to win. But I think it's wrong, for two reasons.
The first is empirical. Generalisation from broad training gets you remarkably far on tasks that look like the training distribution, which is why frontier models are so good at coding, at general knowledge tasks, at anything that resembles the internet. But enterprise work has a specific character that resists this kind of generalisation. The edge cases in an insurance claims workflow aren't hard because they require more intelligence. They're hard because they require knowledge of this specific insurer's policy language, this specific state's regulatory requirements, this specific client's historical claims patterns. No amount of general RL training will teach a model these things, because they exist only in the client's proprietary data. Generalisation gets you from 0 to 70. Specialisation gets you from 70 to 95. And the economic value lives disproportionately in that last 25, because that's where "good enough to demo" becomes "good enough to deploy in production without a human in the loop."
The second reason is that the RL generalisation argument actually supports the open-source thesis, even though it's deployed against it. If RL scaling is log-linear and broadly applicable, if there's no secret sauce, just scale and data and well-designed reward signals, then the same scaling laws apply to specialised fine-tuning. An open-source model fine-tuned with RL on domain-specific tasks benefits from exactly the same log-linear improvements that frontier labs see on their general RL training. The algorithms aren't proprietary. The scaling behaviour isn't proprietary. The only proprietary ingredient is the pre-training compute. And the argument of this piece is that, for the majority of economically valuable work, the specialisation layer on top of a good-enough base model is worth more than the marginal capability of the biggest base model. The frontier labs have announced the scaling laws, and those scaling laws work for everyone.
What About the Hard Stuff?
I want to be honest about where this argument doesn't hold.
For genuinely hard tasks such as automating novel scientific research, the kind of work that requires sustained reasoning over massive context windows with real creativity, I think the frontier labs will win, at least for a while. These tasks require raw capability that currently only comes from models trained at enormous scale, and the specialisation layer doesn't help as much when the task is fundamentally about general intelligence rather than domain expertise.
But many tasks that look like they require superhuman general intelligence actually require deep domain specialisation in disguise. The insurance claim that seems to need "reasoning" actually needs a model trained on 50,000 of that insurer's claims that knows the edge cases. The clinical note that seems to need "medical knowledge" actually needs a model trained on that health system's documentation standards and coding conventions. Leading AI application companies like OpenEvidence and Abridge have proven this. The compliance review that seems to need "legal reasoning" actually needs a model trained on that jurisdiction's specific regulatory framework. Over the course of numerous customer engagements, we've found that a surprising amount of what we call "hard" work is actually "specific" work, and specificity is what specialisation buys you.
The counter-counter, though, is about sequencing. The revenue from automating the 90% of currently valuable work (the insurance claims, the clinical notes, the compliance reviews, the customer support, the back-office workflows that make up the unglamorous bulk of the knowledge economy) funds the push toward the remaining 10%. Open-source companies that capture specialisation revenue can reinvest in training better base models, can contribute to open research, can keep the frontier within reach. This is how the open-source ecosystem sustains itself: not by matching the labs dollar for dollar on pre-training, but by generating enough value from the specialisation layer to stay in the game.
There's a more sophisticated version of the objection, which is that the value created by very strong models, value from entirely new tasks that don't exist today, will dwarf the value of automating current work. Maybe. But this is a sequencing argument too. New-task value only materialises after the infrastructure is built, the use cases are discovered, and the ecosystem matures. That takes time. And during that time, the specialisation layer is generating real revenue and funding the next generation of open models. The future is not a single moment; it's a sequence of investments and returns, and open-source has a viable path through that sequence.
We Don't Have to Wait Out AGI
So here is the full argument.
Open-source AI is the Michael Scott Paper Company. The conventional wisdom says it can't last — the frontier labs have more money, more talent, more compute, and open-source will always trail behind. And maybe it will, on raw pre-training benchmarks. But that's not the metric that matters. The metric that matters is the intelligence ceiling: how much economically valuable work can you actually automate? And most of that ceiling lives in the specialisation layer, which open-source owns structurally.
The frontier labs are Dunder Mifflin. They need their moat to hold. They need the gap between closed-source and open-source to remain wide enough to justify billions in capital raises and trillion-dollar valuations. Every quarter that the intelligence ceiling rises — through better open models, through cheaper serving, through better fine-tuning techniques — that moat narrows. The RL scaling laws that the frontier labs discovered work for everyone. The algorithms aren't secret. The scaling is log-linear and domain-general. The only question is what data you feed the machine, and for most economically valuable work, the most valuable data is your own.
And we don't have to wait forever. We don't have to wait until AGI, or until open-source matches the frontier on some abstract benchmark. We just have to wait until the intelligence ceiling from fine-tuning open-source models covers something like 90% of economically valuable work. At that point, the specialisation advantage creates trillions of dollars in value for the open-source ecosystem, and the Big Labs have no choice but to grapple with a world where their most profitable layer, serving general-purpose intelligence at premium margins, is being eaten from below. I think this future, an ecosystem of specialised models built on open source, is closer than most people realise. And the ceiling is rising every quarter.
So I don't think open-source needs to wait out the frontier labs. I think it just has to wait out the gap.