Context Engineering: The Skill Every Data Engineer Needs Right Now
THIS WEEK: The shift from prompt tinkering to infrastructure thinking is here, and it changes everything about how we build with AI.

Dear Reader…
If you have been building agentic workflows, or planning to, you have probably hit the same wall where your AI agent hallucinates, loses track of instructions mid-task, or returns wildly inconsistent results. The instinct is to blame the model. The reality? It is almost never the model. It is the context.
Welcome to the era of context engineering: the most important new discipline for data engineers to get their heads around in 2026.
What Is Context Engineering, Exactly?
Think of your agent's context window like RAM. It is finite, it is expensive, and what you load into it determines everything about how well the system performs.
Context engineering is the art and science of filling that context window with precisely the right information at precisely the right moment, covering instructions, memory, knowledge, and tool outputs. It is a deliberate step beyond prompt engineering, which was designed for single-turn, stateless interactions. Prompts are brittle in multi-step agentic workflows. Context engineering is what replaces them.
Anthropic's framework breaks this down into four pillars: Write, Select, Compress, and Isolate. Here is what each one means for you as a data engineer.
The Four Pillars and Why They Matter to Data Engineers
1. Write — Engineering Persistent Memory
LLMs are stateless by default. Every session, they start fresh. The Write pillar is about building external memory systems (scratchpads and long-term stores) that persist what agents learn across interactions.
For data engineers, this is not just a software problem. It is a data management problem. Writing context well means:
- Capturing entity and topic metadata during memory creation
- Tagging memories with semantic, temporal, and relational metadata
- Scoring importance to decide what gets retained (critical keywords, recency, usage patterns)
- Running cleanup routines to prevent "context rot", where stale or irrelevant memories corrupt future reasoning
Think of it as building a data lake for your agent's brain. Schema design, versioning, and governance apply here just as much as they do to your production pipelines.
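The Write pillar above can be sketched in a few lines. This is a minimal illustration, not a production memory store: the class name, the scoring weights, and the cleanup threshold are all assumptions chosen to show the shape of metadata tagging, importance scoring, and rot prevention.

```python
import time

class MemoryStore:
    """Sketch of the Write pillar: persistent agent memory with metadata
    tagging, importance scoring, and cleanup. Weights are illustrative."""

    def __init__(self):
        self.entries = []

    def write(self, text, entities=None, topics=None, critical=False):
        # Tag each memory with semantic, temporal, and relational metadata.
        self.entries.append({
            "text": text,
            "entities": entities or [],
            "topics": topics or [],
            "critical": critical,
            "created_at": time.time(),
            "access_count": 0,
        })

    def importance(self, entry, now=None):
        # Blend recency, usage, and critical flags into a single score.
        now = now or time.time()
        age_hours = (now - entry["created_at"]) / 3600
        recency = 1.0 / (1.0 + age_hours)            # decays with age
        usage = min(entry["access_count"] / 10, 1.0)  # capped usage signal
        return 0.5 * recency + 0.3 * usage + (0.2 if entry["critical"] else 0.0)

    def cleanup(self, threshold=0.05):
        # Prevent "context rot": drop stale, unused, non-critical memories.
        self.entries = [e for e in self.entries if self.importance(e) >= threshold]
```

The same governance questions you would ask of a data lake apply here: what schema do entries follow, how are they versioned, and who decides the retention threshold.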
2. Select — Precision Retrieval Over Information Dumps
Dumping your entire knowledge base into the context window is the fastest route to poor outputs. The Select pillar is about retrieving the minimum information needed for the agent's next step with high confidence, what the framework calls "sufficient context."
Modern selection strategies go well beyond keyword search:
- Hybrid search: Combines semantic similarity with keyword filtering
- Graph-aware retrieval: Traverses data lineage graphs to trace dependencies, useful for following a pipeline upstream to find the root cause of a data quality issue
- Iterative sufficiency scoring: Keeps adding context until a confidence threshold (e.g. relevance probability > 0.8) is met, then stops
The data engineering angle here is operational context. Your retrieval layer needs to understand data freshness, schema volatility, and reliability signals. A financial forecasting agent should not select data that is 48 hours stale. Your metadata layer has to expose those trust signals at query time.
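Putting iterative sufficiency scoring and the freshness gate together, a selection loop might look like the sketch below. The candidate shape, the relevance function, and the confidence-accumulation rule are assumptions for illustration; in practice the relevance score would come from your retrieval model.

```python
def select_context(candidates, relevance_fn, max_age_hours=48, threshold=0.8):
    """Sketch of the Select pillar: rank fresh candidates by relevance and
    stop adding context once a confidence threshold is met.
    `relevance_fn` is assumed to return a probability in [0, 1]."""
    # Freshness gate: a forecasting agent should never see stale records.
    fresh = [c for c in candidates if c["age_hours"] <= max_age_hours]
    fresh.sort(key=relevance_fn, reverse=True)

    selected, confidence = [], 0.0
    for c in fresh:
        if confidence >= threshold:
            break  # "sufficient context" reached: stop adding
        selected.append(c)
        # Each new item closes part of the remaining confidence gap.
        confidence += (1 - confidence) * relevance_fn(c)
    return selected
```

The point of the early `break` is the whole discipline in miniature: the minimum information needed for the next step, and not a token more.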
3. Compress — Token Economics Are Real
Context windows are expensive. Every unnecessary token is wasted compute and wasted money. The Compress pillar is about distilling information into its most essential form without losing critical meaning.
The numbers speak for themselves: one documented implementation reduced a raw tool response from 220,000 tokens down to 1,555 tokens, a 99% reduction, whilst improving accuracy.
Compression strategies to know:
- Extractive: Preserves exact sentences, vital for legal, compliance, or financial contexts
- Abstractive: LLM-driven rewriting into concise summaries
- Hierarchical/Recursive: Summarises at natural task boundaries, keeping recent events raw and condensing older history
- Query-focused distillation: Compresses content relative to what the agent is actually trying to do right now
For data engineers, this is an evolution of ETL. Your pipelines need to generate contextual summaries, statistical profiles, semantic annotations, anomaly flags, optimised for model consumption rather than human dashboards. Agents do not need to read every row. They need the texture of the data.
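Hierarchical compression, the third strategy above, can be sketched as follows. The `summarize` callable stands in for an LLM summariser; the default here is a crude extractive stand-in, and `keep_recent=3` is an arbitrary illustrative boundary.

```python
def compress_history(events, keep_recent=3, summarize=None):
    """Sketch of hierarchical compression: keep recent events verbatim,
    collapse older history into one summary entry."""
    # Stand-in summariser; a real system would call an LLM here.
    summarize = summarize or (lambda texts: "SUMMARY: " + "; ".join(t[:40] for t in texts))
    if len(events) <= keep_recent:
        return events[:]  # nothing old enough to condense
    older, recent = events[:-keep_recent], events[-keep_recent:]
    # Older history is distilled to one compact entry; recent stays raw.
    return [summarize(older)] + recent
```

Applied recursively at task boundaries, this is how a 200K-token history stays inside the window without losing the recent detail the agent actually needs.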
4. Isolate — Sandboxing for Reliability and Scale
A single monolithic agent with a massive, cluttered context window is a reliability disaster waiting to happen. The Isolate pillar is about breaking work into specialised sub-agents, each with a lean, focused, sandboxed context.
Anthropic's own multi-agent research system demonstrated this clearly: isolating specialised sub-agents (researcher, citation agent, industry mapper) produced a 90.2% performance improvement over a single agent trying to do everything.
Beyond performance, isolation prevents context poisoning, where one hallucination or tool error contaminates the entire session. When a sub-agent fails, the error is contained. The lead agent re-routes. The system keeps moving.
For data engineers, isolation means designing scoped data access at the agent level, with each agent provisioned with only the tools, tables, and policies it needs for its specific sub-goal. Governance shifts from static, manual approvals to dynamic, executable policy rules that travel with the data itself.
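The containment-and-re-routing behaviour described above can be sketched like this. The class names, the tool-provisioning scheme, and the re-route flag are all illustrative assumptions, not a real multi-agent framework.

```python
class SubAgent:
    """Sketch of the Isolate pillar: a sub-agent sees only the tools it
    is provisioned with, and its failures are contained, not propagated."""

    def __init__(self, name, tools):
        self.name = name
        self.tools = tools  # scoped access: only these tools exist here

    def run(self, tool_name, *args):
        if tool_name not in self.tools:
            # Out-of-scope tools are simply not provisioned.
            return {"agent": self.name, "ok": False, "error": "tool not provisioned"}
        try:
            return {"agent": self.name, "ok": True, "result": self.tools[tool_name](*args)}
        except Exception as e:
            # Containment: the error stays inside this sub-agent's result.
            return {"agent": self.name, "ok": False, "error": str(e)}

class LeadAgent:
    def __init__(self, sub_agents):
        self.sub_agents = {a.name: a for a in sub_agents}

    def dispatch(self, agent_name, tool_name, *args):
        result = self.sub_agents[agent_name].run(tool_name, *args)
        if not result["ok"]:
            # Re-route or retry instead of letting the failure cascade.
            result["rerouted"] = True
        return result
```

Notice that the lead agent never sees the sub-agent's raw context, only a structured result. That boundary is what keeps one hallucination from poisoning the whole session.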
The Three Things You Should Build Right Now
1. Treat your metadata layer as agent infrastructure.
Semantic definitions, lineage graphs, data quality signals, and policy rules are not just for human analysts anymore. They need to be machine-readable, queryable, and structured for agent consumption. If your data catalogue cannot answer "is this table fresh enough to trust?" at runtime, your agents cannot either.
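A runtime freshness check of the kind described above might look like the sketch below. The catalogue shape (a dict of table metadata with a `last_loaded` timestamp) is an assumption for illustration, not a real catalogue API.

```python
from datetime import datetime, timedelta, timezone

def is_fresh_enough(catalog, table, max_staleness_hours):
    """Sketch of a machine-readable freshness check an agent can call
    at query time. Catalogue entry shape is illustrative."""
    meta = catalog.get(table)
    if meta is None or "last_loaded" not in meta:
        return False  # unknown or undocumented tables are untrusted by default
    age = datetime.now(timezone.utc) - meta["last_loaded"]
    return age <= timedelta(hours=max_staleness_hours)
```

The key design choice is the default: an agent that cannot verify freshness should treat the table as untrusted, not assume the best.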
2. Design compression pipelines, not just data pipelines.
Your transformation layer needs a new output type: contextual summaries. Statistical profiles, anomaly flags, schema digests — pre-compressed representations of your data that agents can reason over without loading raw records into a 200K token window.
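A statistical profile as a pipeline output might look like the sketch below, using only the standard library. The field names and the three-sigma anomaly rule are illustrative assumptions; the point is the shape of the output, a compact digest instead of raw rows.

```python
import statistics

def profile_column(name, values):
    """Sketch of a compression-pipeline output: a statistical profile an
    agent can reason over without loading raw records."""
    numeric = [v for v in values if isinstance(v, (int, float))]
    profile = {
        "column": name,
        "row_count": len(values),
        "null_count": sum(1 for v in values if v is None),
    }
    if numeric:
        mu = statistics.mean(numeric)
        sd = statistics.pstdev(numeric)
        profile.update({
            "min": min(numeric),
            "max": max(numeric),
            "mean": round(mu, 3),
            # Crude anomaly flag: values more than 3 standard deviations out.
            "anomaly_count": sum(1 for v in numeric if sd and abs(v - mu) > 3 * sd),
        })
    return profile
```

A few dozen tokens of profile can stand in for millions of rows: that is the "texture of the data" the agent actually needs.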
3. Implement context health monitoring.
Track token efficiency, relevance scores, and context utilisation rates. The "55% rule" is worth knowing: hallucination rates spike when context utilisation exceeds 55% of the available window due to attention dilution. Treat this like query performance — instrument it, alert on it, tune it.
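Instrumenting the 55% rule is straightforward; a minimal sketch, assuming you can read used and total token counts from your framework, might look like this. The alert shape is illustrative.

```python
def check_context_health(used_tokens, window_tokens, utilisation_limit=0.55):
    """Sketch of context health monitoring. The 0.55 default reflects the
    55% rule discussed above; wire the alert into your existing observability."""
    utilisation = used_tokens / window_tokens
    return {
        "utilisation": round(utilisation, 3),
        # Alert before attention dilution drives hallucination rates up.
        "alert": utilisation > utilisation_limit,
    }
```

Run this check before each model call, the same way you would check a query plan before an expensive join.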
On the Horizon: Context Unification
The 2026 roadmap points to "Context Unification", consolidating business definitions, structural lineage, and governance guardrails into a single contextual fabric accessible by every agent in your ecosystem. The transition is from context for humans (documentation, wikis, Confluence) to context for agents (executable, machine-readable knowledge graphs).
Data engineers who build that fabric now will be the ones whose AI systems actually work reliably at scale. Everyone else will keep blaming the model.
Quick Reference: The Four Pillars at a Glance
| Pillar | What It Solves | Data Engineering Equivalent |
|---|---|---|
| Write | Statelessness & memory loss | Data persistence & metadata tagging |
| Select | Irrelevant context & token waste | Precision retrieval & trust metadata |
| Compress | Token limits & cost | Contextual ETL & statistical profiling |
| Isolate | Context poisoning & scale | Scoped access control & agent governance |
Further Reading
The full breakdown of Anthropic's Four Pillars, the 12-Factor Agents framework, Model Context Protocol (MCP), and the Contextual Semantic Layer are worth a deep dive. If you are architecting agentic systems today, these are the specs to understand.


