
🛫Agent runtimes are the new layer in data stacks

THIS WEEK: Orchestration, evaluation loops, and costs now look like infrastructure.

Dear Reader…

Anthropic’s Code with Claude 2026 keynote made a quiet but consequential claim: the winners will not be the teams with the cleverest prompts, but the ones who can run model work like any other workload, with sessions, artefacts, quality gates, and cost controls. There was no new flagship model announcement, no breathless benchmark parade, no claim that general intelligence is around the corner. Instead, the company used its main stage to argue that the next wave of value will come from harnesses, orchestration, evaluation loops, and compute plumbing.

For data engineers, that shift is not cosmetic. It is the difference between an LLM being an occasionally useful assistant and an LLM becoming a schedulable, observable component in a production system. The keynote’s core message was that frontier models are now good enough for longer-running, tool-using work. The bottleneck has moved to the surrounding machinery.

The compute story signals a move from novelty to operations

The opening keynote made three concrete points that matter to anyone building data platforms. First, Anthropic says usage of its API platform has grown 17x year on year. Second, it announced higher usage limits for Claude Code and raised API rate limits for Claude Opus models. Third, it put a name and product surface on something many teams have been assembling from scratch: a managed agent runtime that looks increasingly like data infrastructure.

The compute story is both mundane and telling. Anthropic published a separate announcement describing a partnership with SpaceX to use the full compute capacity of the Colossus 1 data centre, which it says will provide more than 300 megawatts of new capacity and over 220,000 NVIDIA GPUs within the month. That additional headroom is being translated directly into higher limits. Claude Code’s five-hour rate limits are being doubled for Pro, Max, Team and seat-based Enterprise plans. Peak-hour limit reductions are being removed for Pro and Max. API rate limits are being raised considerably for Claude Opus models.

For engineers used to building around quotas and contention, the headline is not that rate limits moved. It is that Anthropic is publicly framing reliability and throughput as the substrate for adoption. That is a familiar progression in the data world. The technologies that become foundational are the ones that become boring, predictable, and well-instrumented, not the ones that win the flashiest demo.

Claude Managed Agents looks like a job runner for model work

The more consequential announcements lived in the Claude Platform rather than the consumer product. Anthropic’s Claude Managed Agents is positioned as a pre-built, configurable agent harness that runs in managed infrastructure and is best for long-running tasks and asynchronous work. The implication is straightforward. Instead of each team building its own agent loop, sandbox, tool execution layer, and state management, the vendor now offers an opinionated runtime with persistent sessions, a filesystem, built-in tools, and event streaming.

If you have built data systems, this is immediately legible. An agent harness is effectively an execution environment plus a control plane. It is a place where work can run for minutes or hours, where artefacts are written to storage, where events can be streamed and persisted, and where a client can interrupt, resume, or steer execution. This is not chat with tools. It is a job runner for model-driven tasks, with many of the same failure modes and operational concerns as any other workload.

Anthropic’s design choices reinforce that reading. Managed Agents is structured around four concepts: an agent definition, an environment (a configured container template), a session (a running agent instance), and events (messages, tool results, and status updates). Sessions persist server-side and can be fetched in full. There is explicit language about stateful sessions, persistent filesystems, and long-running execution. It is easy to imagine a future architecture diagram where Managed Agents sits alongside orchestrators, warehouses, and streaming systems, invoked as just another service with contracts and limits.
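
To make the shape concrete, here is a minimal sketch of those four concepts as plain Python types. None of these names come from Anthropic’s SDK; they simply model the structure described above.

```python
# Hypothetical sketch of the four Managed Agents concepts as plain Python
# types: an agent definition, a container environment, a persistent
# session, and a stream of events. Illustrative names only.
from dataclasses import dataclass, field


@dataclass
class AgentDefinition:
    name: str
    instructions: str                    # what the agent is for
    tools: list[str] = field(default_factory=list)


@dataclass
class Environment:
    container_image: str                 # configured container template
    filesystem_root: str = "/workspace"  # persistent filesystem


@dataclass
class Event:
    kind: str                            # "message", "tool_result", "status"
    payload: dict


@dataclass
class Session:
    """A running agent instance: persists server-side, fetchable in full."""
    session_id: str
    agent: AgentDefinition
    environment: Environment
    events: list[Event] = field(default_factory=list)

    def record(self, event: Event) -> None:
        # Events accumulate so a client can replay, interrupt, or resume.
        self.events.append(event)
```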

Multi-agent orchestration is controlled parallelism, not distributed compute

This framing becomes clearer when you look at the advanced orchestration features highlighted in the keynote. Multi-agent orchestration, delivered as Multiagent sessions in the documentation, allows one agent to coordinate others within a single session. The mechanism is not hand-wavy. All agents share the same container and filesystem, but each runs in its own context-isolated session thread with its own conversation history and event stream. Threads are persistent, and the coordinator can send follow-ups to agents it called earlier.

This matters because it defines the shape of parallelism that is actually available. It is not distributed compute in the Spark sense. It is controlled fan-out within a managed session, with a hard limit of 25 concurrent threads. For data engineers, that still unlocks a useful class of workflows: parallel investigation, parallel extraction, multi-surface analysis, and specialised sub-tasks that write outputs to a shared filesystem for later synthesis. Think of it as a structured scatter-gather pattern for knowledge work and code work, with the filesystem acting as the intermediate store.
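
As an illustration of that scatter-gather shape, here is a minimal sketch using asyncio. The run_subtask function is a stand-in for spawning a sub-agent thread, and the semaphore mirrors the 25-thread cap; nothing here is Anthropic’s API.

```python
# Scatter-gather sketch: bounded fan-out with a shared filesystem as
# the intermediate store. run_subtask stands in for a sub-agent thread.
import asyncio
import tempfile
from pathlib import Path

MAX_CONCURRENT_THREADS = 25  # the documented hard limit per session


async def run_subtask(task_id: int, workdir: Path, sem: asyncio.Semaphore) -> Path:
    async with sem:  # bound the fan-out to the session's thread cap
        # Stand-in for real sub-agent work (extraction, analysis, codegen).
        out = workdir / f"result_{task_id}.txt"
        out.write_text(f"output of subtask {task_id}\n")
        return out


async def scatter_gather(n_tasks: int, workdir: Path) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT_THREADS)
    paths = await asyncio.gather(
        *(run_subtask(i, workdir, sem) for i in range(n_tasks))
    )
    # Synthesis step: the coordinator reads results back from the shared
    # filesystem, which acts as the intermediate store in this pattern.
    return [p.read_text() for p in paths]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = asyncio.run(scatter_gather(40, Path(tmp)))
    print(f"gathered {len(results)} results")
```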

Outcomes brings quality gates and retries into the harness

The second feature, Outcomes, is where the harness starts to look like production-grade infrastructure rather than a developer convenience. An Outcome elevates a session “from conversation to work”, in Anthropic’s phrasing. You define what done looks like and how to measure quality. The harness then provisions a separate grader to evaluate the artefact against a rubric, iterating until the outcome is met or a maximum number of iterations is reached. The grader uses a separate context window, reducing the risk that the same reasoning that produced an artefact will also be used to excuse its flaws.
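
The control flow is easy to picture. Below is a minimal sketch of such a quality-gated retry loop; produce_artifact and grade are illustrative stand-ins for the worker and grader model calls, not Anthropic’s API.

```python
# Sketch of an Outcomes-style loop: a worker produces an artefact, a
# separate grader scores it against a rubric, and the harness retries
# until the rubric passes or a maximum iteration count is reached.
from dataclasses import dataclass


@dataclass
class GradeReport:
    passed: bool
    failures: dict[str, str]  # per-criterion breakdown of what failed and why


def produce_artifact(task: str, feedback: dict[str, str] | None) -> str:
    # Stand-in for the worker model call; a real version would fold the
    # grader's feedback into the next attempt.
    return f"draft for {task!r}, addressing: {sorted(feedback or {})}"


def grade(artifact: str, rubric: list[str]) -> GradeReport:
    # Stand-in for a *separate* grader model with its own context window.
    failures = {c: "criterion not demonstrated" for c in rubric if c not in artifact}
    return GradeReport(passed=not failures, failures=failures)


def run_outcome(task: str, rubric: list[str], max_iterations: int = 5) -> str:
    feedback: dict[str, str] | None = None
    for _ in range(max_iterations):
        artifact = produce_artifact(task, feedback)
        report = grade(artifact, rubric)
        if report.passed:
            return artifact         # the outcome was met
        feedback = report.failures  # structured, inspectable feedback loop
    raise RuntimeError(f"outcome not met after {max_iterations} iterations")
```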

Outcomes is a direct response to the operational problem that has plagued early LLM integrations: outputs that look plausible but fail silently. Data engineers solved analogous problems years ago with data quality checks, test-driven development, and contract-based pipelines. Outcomes is not identical to Great Expectations or dbt tests, but it rhymes with them: an explicit quality gate and a retry loop that turn “ask the model again” into a structured, inspectable iteration process with a per-criterion breakdown of what failed and why.

If agent harnesses are becoming data infrastructure, Outcomes is the piece that makes them governable. It is the difference between ad hoc automation and something you can trust to run overnight without waking you up for every edge case.

Dreaming treats memory as data, which creates governance questions

The third feature, Dreaming, is a research preview in Anthropic’s terminology and is implemented in the docs as Dreams. It reflects on past sessions and an existing memory store to produce a new reorganised memory store, merging duplicates, replacing stale or contradicted entries, and surfacing new insights. The input store is not modified. This process is explicitly asynchronous and designed to curate memory for future sessions.

From a data engineering perspective, Dreaming is both promising and a red flag. The promise is that it treats memory as an asset that needs compaction, deduplication, and freshness management, which is closer to how real systems behave than the naïve “just keep everything” approach. The red flag is governance. A derived memory store is still derived. It can introduce errors, bias, or accidental optimisation that erases nuance. In data systems, we handle this with lineage, auditability, and clear definitions of authoritative sources. If Dreaming is to be used in production contexts, teams will need equivalent controls: versioning, diffing, approval workflows, and the ability to roll back.
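
Those controls do not need to be exotic. As a minimal sketch of the idea, the store below versions every consolidation run rather than overwriting, keeps previous versions addressable for rollback, and exposes a diff for review. It is illustrative, not a description of Dreaming’s internals.

```python
# Versioned memory store sketch: each consolidation run writes a new
# version, the old one stays addressable, and diffs support review,
# approval workflows, and rollback.
import json
from pathlib import Path


class VersionedMemoryStore:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, version: int) -> Path:
        return self.root / f"memory_v{version:04d}.json"

    def latest_version(self) -> int:
        versions = sorted(self.root.glob("memory_v*.json"))
        return int(versions[-1].stem.split("_v")[1]) if versions else 0

    def write(self, entries: dict[str, str]) -> int:
        # Never overwrite: a derived store gets a fresh version number,
        # so rollback is just pointing back at an earlier version.
        version = self.latest_version() + 1
        self._path(version).write_text(json.dumps(entries, indent=2))
        return version

    def read(self, version: int) -> dict[str, str]:
        return json.loads(self._path(version).read_text())

    def diff(self, old: int, new: int) -> dict[str, tuple[str | None, str | None]]:
        # What changed between runs: the audit trail a human can approve.
        a, b = self.read(old), self.read(new)
        keys = set(a) | set(b)
        return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```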

Cost routing is becoming a default architecture pattern

Anthropic also used the keynote to formalise a cost and architecture pattern it calls the advisor strategy. The idea is to route most work to a cheaper, faster model and consult a more capable model only when necessary. In keynote notes, one customer is cited as achieving frontier model quality at 5x lower cost by having a smaller model call Opus as an advisor.

This is immediately familiar to data engineers. We already design systems that reserve expensive compute for the cases that need it. Inference routing is shaping up to be the LLM equivalent of tiered storage and workload management. It will likely become more formal, with automated escalation triggered by failed rubrics, uncertainty scores, or business impact thresholds.
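
In code, the advisor strategy reduces to a routing function with an escalation condition. The sketch below is illustrative: the model stand-ins, the confidence signal, and the threshold are all assumptions, not Anthropic’s implementation.

```python
# Advisor-strategy sketch: default to a cheap model, escalate to a
# frontier model only when the confidence signal falls below threshold.
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float  # e.g. a self-score, rubric pass rate, or uncertainty metric


def cheap_model(task: str) -> Answer:
    # Stand-in for the fast, inexpensive default model.
    return Answer(text=f"fast answer to {task!r}", confidence=0.72)


def frontier_model(task: str, draft: Answer) -> Answer:
    # The "advisor" call: consulted only on escalation, with the draft attached.
    return Answer(text=f"reviewed: {draft.text}", confidence=0.95)


def route(task: str, min_confidence: float = 0.8) -> Answer:
    draft = cheap_model(task)
    if draft.confidence >= min_confidence:
        return draft  # most traffic stops here, which is where the savings come from
    return frontier_model(task, draft)
```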

The compliance subtext is real

There is an investigative angle that should not be ignored. Anthropic’s compute expansion story leans heavily on enormous new capacity, including the SpaceX Colossus deal. Compute at that scale comes with physical, regulatory and reputational constraints. Data engineers working in regulated industries already know that data residency, in-region infrastructure, and supply chain security are not footnotes. Anthropic’s announcement nods to this, noting increasing demand for in-region infrastructure and referencing capacity expansion into Asia and Europe via other partnerships. The subtext is that agent infrastructure will inherit the same compliance and operational considerations as the data platforms it increasingly resembles.

What data engineering leaders should do next

Anthropic may have avoided a flashy model announcement, but that restraint is part of the story. The keynote suggested that the market is ready for something more operational. If agent harnesses are becoming data infrastructure, the winners will not be the teams that prompt best. They will be the teams that build the most reliable, testable, observable and governable agent pipelines, and that is a competition data engineers are well placed to win.

There are five practical takeaways:

🛑 Stop treating LLM integration as a pure application-layer problem. The vendor is offering an agent runtime with containers, sessions, filesystems and event streams. That is infrastructure, and it will need the same discipline you apply to any other production system.

🤔 Adopt evaluation loops as a first-class design constraint. If you cannot express what done looks like and test for it, you do not have a pipeline. Outcomes is an explicit attempt to productise that principle for model-driven work.

👁️ Be realistic about parallelism. Multi-agent orchestration will help with throughput and quality for complex tasks, but it is bounded and it is not a substitute for distributed processing engines. Use it where it fits: investigation, code generation, documentation, and multi-surface analysis, not bulk record-level transformation.

🎗️ Treat memory as data. If you allow systems to rewrite their own state, you need governance, lineage, and rollback. Dreaming is intriguing precisely because it introduces the need for data management practices inside the agent layer.

💰 Assume cost routing will become standard. The advisor strategy is not a clever trick. It is the economic foundation for running model-driven workloads at scale.

If you are already building Airflow DAGs, dbt tests, and observability for your pipelines, you have most of the muscle needed for this next layer. The question is whether you will treat agent harnesses as production infrastructure from day one, or let them sprawl into an ungoverned shadow platform.

That’s a wrap for this week
Happy Engineering, Data Pros