
The AI conversation keeps swinging back to “bubble” talk: too much money, too many models, too little real value. But that framing misses what’s actually constraining progress right now. The limiting factor isn’t capital or GPUs. It’s high-quality human signal, the kind that reliably improves models once they leave the lab and collide with real workflows.
For a while, scale was the unlock. Pretraining rewarded volume: more tokens, more modalities, bigger mixtures. But as models get deployed into products, the frontier shifts. The improvements that matter most increasingly come from post-training: shaping behavior against specific tasks, tools, constraints, and decision boundaries. And post-training is, at its core, a human-data problem. Not "humans in the loop" as a slogan, but humans as the source of the signal that tells a system what "better" looks like in contexts where correctness isn't obvious.
That difference matters because not all human data is the same. Early-stage training often leans heavily on what you might call state-based data: static inputs paired with labels, annotations, or preference judgments. This is familiar, well-understood, and scalable, especially when verification is straightforward. But the kind of signal that moves models in high-value, real-world settings is increasingly process-based. It looks like trajectories through tasks, decompositions, rubrics that encode expert judgment, tool-using workflows, scenario construction, and environment design. It’s less like a dataset you download once and more like an ongoing training process that you operate.
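To make the contrast concrete, here is a toy sketch of the two data shapes. The field names are hypothetical, not any particular lab's schema; the point is only that a process-based item carries the trajectory, rubric, and per-step checks that make it judgeable, not just an input and a verdict.

```python
# State-based item: one static input, one judgment, verifiable in isolation.
state_item = {
    "prompt": "Summarize this contract clause.",
    "chosen": "...",    # preferred completion (content elided)
    "rejected": "...",  # dispreferred completion
}

# Process-based item: a trajectory through a task, plus the rubric and
# per-step outcomes that make each step auditable later.
process_item = {
    "task": "Resolve a refund request using the support tooling.",
    "rubric_version": "refunds-v3",  # hypothetical identifier
    "steps": [
        {"action": "lookup_order", "observation": "...", "ok": True},
        {"action": "issue_refund", "observation": "...", "ok": False},
    ],
}
```

The second shape is heavier to produce, but it is the one that supports the verification and instrumentation discussed below: you can replay it, re-grade it under a new rubric, and trace exactly where a trajectory went wrong.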
This is where many human data pipelines start to break. A lot of the industry was built on throughput: moving work across large pools, standardizing QA, and optimizing for delivery. That works when tasks are cheap to verify, or at least to approximate with consistency checks and aggregation. But post-training quickly pushes you into domains where verification is asymmetric: it's easier to produce an answer than to reliably judge whether it's good. The tasks get longer-horizon, more contextual, more entangled with domain nuance, and more exposed to reward hacking. In those regimes, quality cannot be "inspected in" at the end. If the task framing is off, if the rubric is misaligned, if incentives are poorly designed, noise compounds fast, and the model faithfully learns the wrong thing.
A useful way to think about this: verifiability matters more than difficulty for training signal. The question isn’t “can a frontier model solve this today?” but “can we reliably detect incremental improvement over time?” A task can be impressive, hard, and expensive and still be a bad training target if you can’t tell whether the model is getting better in a stable way. Conversely, a task can be modest in surface complexity but extremely valuable if it supports reliable hillclimbing: clear feedback loops, consistent evaluation, and signals that generalize rather than overfit to a particular benchmark.
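The cost of weak verification can be made quantitative with a back-of-the-envelope sketch. Assume a toy model where the verifier flips its pass/fail judgment at random with some probability: a symmetric-noise assumption, not a claim about any real eval. Under that assumption, a true quality gap shrinks by a factor of (1 - 2p), and the eval-set size needed to detect it grows with the square of the shrinkage.

```python
def observed_gap(true_gap: float, verifier_noise: float) -> float:
    """With symmetric flip noise p, an observed pass rate is
    p + q * (1 - 2p), so any true gap shrinks by (1 - 2p)."""
    return true_gap * (1 - 2 * verifier_noise)

def samples_to_detect(true_gap: float, verifier_noise: float, z: float = 1.96) -> float:
    """Rough per-comparison eval size so the shrunken gap clears ~95%
    sampling noise on a difference of pass rates (worst-case variance
    0.25 per judgment, two arms). A crude power calculation, not a
    substitute for a real one."""
    g = observed_gap(true_gap, verifier_noise)
    if g <= 0:
        return float("inf")  # the gap is invisible at any eval size
    return (z ** 2) * 2 * 0.25 / g ** 2

# A true 5-point gain under a 2%-noise verifier vs. a 40%-noise one:
n_clean = samples_to_detect(0.05, 0.02)  # hundreds of eval items
n_noisy = samples_to_detect(0.05, 0.40)  # tens of thousands
```

Under these assumptions, pushing verifier noise from 2% to 40% inflates the required eval roughly twentyfold, and at 50% noise the gap vanishes entirely. That is the sense in which a hard-but-unverifiable task can be a worse training target than an easy-but-checkable one.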
That shifts what “good data” means. The goal becomes creating or capturing signals in ways that make evaluation tractable. Sometimes that means designing environments and tools that automatically surface checks. Sometimes it means breaking expert judgment into components that can be audited and compared. Sometimes it means being honest about where only true domain experts can judge quality, and designing the pipeline so those experts spend time where it actually matters, rather than being diluted across endless review layers. And it nearly always means treating reward design and evaluation as engineering problems, not admin tasks.
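One way to picture "breaking expert judgment into components": represent a rubric as weighted criteria, run every criterion that has an automatic check, and route only the rest to domain experts. This is a minimal sketch with invented criteria and trivial lambda checks, not a production grader; the structure, not the checks, is the point.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Criterion:
    name: str
    weight: float
    # Automatic check if one exists; None means only a domain
    # expert can judge this dimension.
    auto_check: Optional[Callable[[str], bool]] = None

def triage(answer: str, rubric: list) -> tuple:
    """Score every automatable criterion; queue the rest for experts."""
    auto_score, expert_queue = 0.0, []
    for c in rubric:
        if c.auto_check is not None:
            auto_score += c.weight * c.auto_check(answer)
        else:
            expert_queue.append(c.name)
    return auto_score, expert_queue

# Hypothetical rubric for a code-review task.
rubric = [
    Criterion("defines_function", 0.3, auto_check=lambda a: "def " in a),
    Criterion("has_assertion",    0.2, auto_check=lambda a: "assert" in a),
    Criterion("idiomatic",        0.5, auto_check=None),  # expert-only
]
score, queue = triage("def f():\n    assert f", rubric)
```

The design choice this encodes is the one from the paragraph above: expert time is spent only on the dimensions machines cannot judge, and every automatic component leaves an auditable trail.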
This is also why QA can’t be treated as an ops afterthought. At scale, quality requires automation and instrumentation, systems that enforce constraints consistently and observability that tells you where the signal is drifting. Automation here means building repeatable checks into the pipeline so the baseline quality bar isn’t dependent on manual heroics. Instrumentation means tracing and measuring the end-to-end lineage of signal: which rubric version was used, what checks ran, how reviewers differed, where disagreement clusters, how quality shifts as tasks change, and how contributor behavior evolves over time. Without that, you don’t have a learning loop. You have a production line that can silently degrade while everyone hits their delivery metrics.
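The instrumentation point can be sketched in a few lines. Suppose each reviewed item records its rubric version and every reviewer's verdict (a hypothetical schema, not a real pipeline's); then per-rubric disagreement rates fall out directly, and a rubric whose disagreement spikes flags exactly the drift described above.

```python
from collections import defaultdict

def disagreement_by_rubric(records: list) -> dict:
    """For each rubric version, the share of items where reviewers split.
    Each record: {"rubric_version": str, "verdicts": list of verdicts}."""
    split, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["rubric_version"]] += 1
        if len(set(r["verdicts"])) > 1:  # reviewers disagreed
            split[r["rubric_version"]] += 1
    return {v: split[v] / total[v] for v in total}

# Toy trace: v3 reviewers split on half their items; v4 agree so far.
records = [
    {"rubric_version": "v3", "verdicts": ["pass", "pass"]},
    {"rubric_version": "v3", "verdicts": ["pass", "fail"]},
    {"rubric_version": "v4", "verdicts": ["fail", "fail"]},
]
rates = disagreement_by_rubric(records)  # {"v3": 0.5, "v4": 0.0}
```

None of this is sophisticated; the point is that without this lineage recorded at all, the question "where is our signal drifting?" has no answer at any level of sophistication.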
There's an additional trap: when verification is weak and incentives are misaligned, it becomes possible to manufacture the appearance of progress. If the same party defines the evaluation and supplies the training signal, you have to assume coupling risk. You can end up with benchmarks that drift toward what's easiest to produce, or "hillclimb" narratives that exist to keep work flowing rather than to reflect real capability gains. The model improves on the thing you measured, but the value doesn't transfer.
All of this points to the same conclusion: the next era of AI progress won’t be decided purely by who has the biggest models or the most compute. It will be decided by who can run post-training as a reliable, continuous system, one that repeatedly assembles the right humans, structures their expertise into verifiable signals, and compounds improvements over time instead of resetting every cycle.
If there's a bubble risk in AI, it's not that models won't get smarter. It's that organizations will keep pouring effort into noisy pipelines that can't reliably turn human judgment into durable training signals. The way out isn't hoarding more data. It's better signal: process-based, verifiable, instrumented, and built to hillclimb in the real world.