datapro.news
The Warehouse Wars Are Over. The Real Fight Is for Your Agents.

THIS WEEK: The five leading cloud data platforms are converging on the same design, and it changes what a data engineer should be optimising for.

Samuel Williams
July 01, 2026

Dear Reader…

Two purchases from mid-2025 tell you more about where data engineering is heading than any product keynote. In May 2025, Databricks paid roughly a billion dollars for Neon, a serverless Postgres company. Weeks later, Snowflake bought Crunchy Data, another enterprise Postgres outfit, for around 250 million dollars. Neither Neon nor Crunchy is an analytics tool. Both are transactional databases, the unglamorous workhorses that sit behind ordinary applications. So why are the two most prominent analytics platforms on the planet suddenly buying the very thing they spent a decade telling us to keep separate?

Because the cloud data warehouse has quietly stopped being a place you query and become the runtime your AI agents run inside. That is the single shift behind the 2026 vendor landscape, and it means most teams are still choosing a platform on criteria that stopped mattering about a year ago.

Everyone is racing to the same place

Look across the top five and the pattern is impossible to miss. Snowflake added Unistore for transactional tables, Cortex for SQL-native AI, and, with Crunchy Data, a fully managed Postgres for building agents. Databricks put Neon's serverless Postgres at the heart of Lakebase, runs queries through its Photon engine, and ships Agent Bricks as an execution runtime for autonomous agents inside Unity Catalog. Google wraps BigQuery in BigLake storage virtualisation, Gemini-powered agents, and a Data Engineering Agent that writes pipelines for you. Amazon Redshift decoupled compute and storage with RA3, added native S3 Vectors, and leans on the wider AWS stack. Microsoft Fabric collapses the lot into OneLake, streams operational data in through Mirroring, and serves Power BI straight off the lake with Direct Lake.

Rank	Platform	Standout strength	Best fit	The catch
1	Snowflake	~$5.0bn revenue, 126% net retention; Cortex AI, Unistore, Postgres	SQL-first BI teams adopting generative AI natively, plus cross-cloud sharing	Premium pricing
2	Databricks SQL	$5.4bn ARR run-rate, 140%+ retention; Photon, Lakebase, Agent Bricks	Python and Spark teams unifying ML, streaming and operational data	Photon is closed-source, so local tests differ from production
3	Google BigQuery	Part of GCP's ~$80bn run-rate; serverless, Gemini agents, BQML	GCP shops with spiky, event-driven and real-time workloads	On-demand per-terabyte scans can produce surprise bills
4	Amazon Redshift	Inside AWS's ~$150bn run-rate; 11 years a Gartner leader; RA3, S3 Vectors	AWS-committed teams with steady, predictable query profiles	Operationally rigid, with no multi-cloud sharing
5	Microsoft Fabric	$2bn ARR in about 2.5 years, 31,000+ organisations; OneLake, Direct Lake	Microsoft and Power BI heavy estates wanting SaaS simplicity	Deep Azure lock-in

Ranked on combined incumbency, scale and advanced functionality (2026).

Strip away the brand names and they are all building the same thing: one open storage layer, whether that is Apache Iceberg, Delta Lake or OneLake, with both transactional and analytical workloads sitting on top of it, an AI runtime layered over that, and a single governance catalogue across the whole estate. The neat categories that defined the last decade, the warehouse for analytics, the lake for raw files, the operational database for the app, are collapsing into one converged platform. The "modern data stack" is quietly becoming the modern data singular.

Why the convergence is happening now

The forcing function is agents. A dashboard is happy to wait for the nightly batch. An autonomous agent is not. Agents need real-time state, instant vector retrieval, and the ability to provision, clone and branch a database in the moment, all inside a single low-latency loop. Batch ETL, the high-latency plumbing that has defined data engineering careers for fifteen years, is precisely the bottleneck that breaks them.

One statistic, reported by Databricks when it announced the Neon deal, makes the point better than any architecture diagram. Inside Neon, more than 80 per cent of all databases were created not by human developers but by AI agents and code-generation tools spinning up isolated environments on demand. That is the workload the platforms are now building for, and it is why the prize everyone is chasing is "zero-ETL": operational data that lands in the same open storage the analytics and the models already read, with no pipeline, no change-data-capture lag, and no copy to keep in sync.

The obvious objection, and why it is not enough

Here is the fair counter-argument. The engines are still wildly different on raw speed and cost, so surely the old question, which one is fastest and cheapest, still rules. And the spread is genuinely enormous. On a 100-billion-row benchmark, a specialised columnar engine ran the job for under 20 dollars. The same workload on Snowflake came out about 32 times more expensive on a cost-performance basis, and BigQuery on its on-demand, per-terabyte pricing came out more than a thousand times more expensive than the baseline. Anyone who tells you the engine no longer matters has never been handed an unexpected five-figure bill for one unoptimised full-table scan.
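
To put those multipliers in dollar terms, here is a rough back-of-the-envelope sketch. It assumes the baseline cost exactly 20 dollars (the benchmark figure is only "under 20"), treats both ratios as plain cost multipliers, and rounds "more than a thousand times" down to 1,000. All three are simplifications, so read the output as orders of magnitude rather than invoices.

```python
# Illustrative arithmetic only: the baseline and multipliers are rounded
# readings of the benchmark figures quoted above, not measured values.
BASELINE_USD = 20.0  # assumption: the quoted figure is "under 20 dollars"

MULTIPLIERS = {
    "specialised columnar engine": 1,
    "Snowflake": 32,                # roughly 32x on cost-performance
    "BigQuery (on-demand)": 1_000,  # "more than a thousand times", floored
}


def implied_cost(baseline_usd: float, multiplier: float) -> float:
    """Scale the baseline cost by a relative cost multiplier."""
    return baseline_usd * multiplier


implied = {name: implied_cost(BASELINE_USD, m) for name, m in MULTIPLIERS.items()}

for name, cost in implied.items():
    print(f"{name:30s} ~${cost:>10,.0f}")
```

Under these assumptions the three runs land at roughly 20, 640 and 20,000 dollars, which is where the five-figure bill comes from.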

So yes, performance and pricing still matter, enormously. But two things follow from that benchmark, and neither is "pick the winner". First, a spread that large is an argument for matching the engine to the workload, not for crowning a single champion across an enterprise. Second, and more important, for the workloads that will define the next three years the binding constraint is rarely raw scan speed. It is pipeline latency, governance, and the simple question of where your operational data already lives. A query that runs in one second instead of three is a rounding error next to an agent that cannot act because the data it needs is still trapped in last night's batch.
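
The rounding-error claim can be made concrete with a small sketch. Every number in it is an assumption chosen for illustration: a batch that refreshes once every 24 hours, a hypothetical near-real-time path that refreshes every 5 seconds, and queries that take either one or three seconds.

```python
# Illustrative only: every number below is an assumption, not a measurement.

def average_data_age_s(refresh_interval_s: float, query_time_s: float) -> float:
    """Average age of the data an agent sees when its answer arrives.

    If data is refreshed every `refresh_interval_s` seconds and requests
    arrive uniformly at random, the data is on average half an interval old,
    plus the time the query itself takes.
    """
    return refresh_interval_s / 2 + query_time_s


NIGHTLY_BATCH_S = 24 * 60 * 60  # assumed: one refresh per day
STREAMING_INTERVAL_S = 5        # assumed: hypothetical near-real-time refresh

slow_query_on_batch = average_data_age_s(NIGHTLY_BATCH_S, query_time_s=3)
fast_query_on_batch = average_data_age_s(NIGHTLY_BATCH_S, query_time_s=1)
slow_query_on_stream = average_data_age_s(STREAMING_INTERVAL_S, query_time_s=3)

print(f"batch, 3 s query:  {slow_query_on_batch:>8.1f} s")
print(f"batch, 1 s query:  {fast_query_on_batch:>8.1f} s")
print(f"stream, 3 s query: {slow_query_on_stream:>8.1f} s")
```

With these placeholder values, tuning the query from three seconds to one shaves two seconds off an average data age of about twelve hours, while removing the nightly batch brings it down to a few seconds even with the slower query.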

So what should actually decide it

The decision has moved up the stack. Three things now matter more than the benchmark.

The first is data gravity and lock-in. Each platform pulls hardest inside its own orbit: Fabric for organisations already living in Microsoft 365 and Azure, Redshift for AWS-committed shops, BigQuery for Google Cloud, with Snowflake and Databricks positioning themselves as the more neutral, cross-cloud options. The convergence does not remove lock-in; it relocates it from the query engine to the ecosystem.

The second is governance. When transactional records, analytical tables, unstructured files, vectors and autonomous agents all sit on one storage layer, the catalogue that governs them becomes the most important component you own. Snowflake Horizon, Databricks Unity Catalog and Microsoft Purview are no longer back-office features. They are the control plane. If you cannot trace lineage and enforce policy across that estate, your agents are a liability, not an asset.
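
What "trace lineage and enforce policy" means in practice can be shown with a toy, in-memory model. This is not the API of Horizon, Unity Catalog or Purview. The `Asset` and `Catalogue` classes, the tag names and the clearance check are all hypothetical, and they illustrate just one idea: a sensitivity tag on a source table should follow the data into anything derived from it, including a vector index an agent might query.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One governed object, such as a table or vector index (hypothetical model)."""
    name: str
    kind: str
    tags: set[str] = field(default_factory=set)
    upstream: list[str] = field(default_factory=list)  # direct lineage parents


class Catalogue:
    """Toy governance catalogue holding assets, their tags and their lineage."""

    def __init__(self) -> None:
        self.assets: dict[str, Asset] = {}

    def register(self, asset: Asset) -> None:
        self.assets[asset.name] = asset

    def lineage(self, name: str) -> set[str]:
        """Return every upstream ancestor of `name`, walked transitively."""
        seen: set[str] = set()
        stack = list(self.assets[name].upstream)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(self.assets[parent].upstream)
        return seen

    def effective_tags(self, name: str) -> set[str]:
        """Tags on the asset plus tags inherited from everything upstream."""
        tags = set(self.assets[name].tags)
        for ancestor in self.lineage(name):
            tags |= self.assets[ancestor].tags
        return tags

    def agent_may_read(self, clearances: set[str], name: str) -> bool:
        """Allow the read only if the agent holds a clearance for every tag."""
        return self.effective_tags(name) <= clearances


catalogue = Catalogue()
catalogue.register(Asset("orders", "table", tags={"pii"}))
catalogue.register(Asset("order_embeddings", "vector_index", upstream=["orders"]))

uncleared = catalogue.agent_may_read(set(), "order_embeddings")
cleared = catalogue.agent_may_read({"pii"}, "order_embeddings")
print(uncleared, cleared)
```

The embeddings carry no tag of their own, yet the uncleared agent is refused because the `pii` tag is inherited through lineage. Without the lineage walk the same index would look safe to read.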

The third is open formats as your hedge. Native support for Iceberg and Delta is the one genuine protection against the lock-in the convergence creates. Keeping your data in an open table format is what lets you change your mind later, and in a market moving this fast, the option to change your mind is worth real money.

What this means for your week, and your career

If you take one practical step from this issue, retire the benchmark beauty contest as your primary selection criterion. Evaluate platforms on the things that will actually bind you: the maturity of their agent runtime, whether they genuinely deliver zero-ETL between operational and analytical data, the strength of their governance catalogue, and their commitment to open storage. Run the cost benchmark second, on your real workload, not on a vendor's slide.
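
One way to make that evaluation repeatable is a simple weighted scorecard. The sketch below uses the four criteria named above, but the weights are assumptions, and the two candidates and their 0 to 5 scores are made-up placeholders, not assessments of any real vendor.

```python
# Assumed weights: the criteria come from the text, their relative importance does not.
WEIGHTS = {
    "agent_runtime_maturity": 0.30,
    "zero_etl": 0.25,
    "governance_catalogue": 0.25,
    "open_storage": 0.20,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores; raises if a criterion is missing."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


# Placeholder scores (0-5) for two made-up candidates.
candidates = {
    "Platform A": {"agent_runtime_maturity": 4, "zero_etl": 3,
                   "governance_catalogue": 5, "open_storage": 2},
    "Platform B": {"agent_runtime_maturity": 3, "zero_etl": 4,
                   "governance_catalogue": 3, "open_storage": 5},
}

ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], WEIGHTS),
    reverse=True,
)

for name in ranking:
    print(f"{name}: {weighted_score(candidates[name], WEIGHTS):.2f}")
```

The cost benchmark then becomes a second-pass filter on the shortlist this produces, run against your own workload.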

The career implication is sharper still. The skill that is quietly losing value is tuning a single warehouse to win a query race. The skill that is appreciating is designing and governing the unified data-to-AI control plane: deciding where transactional state, analytical tables, vectors and agents should sit, and who is accountable for each. The engineer who can answer that becomes the architect in the room. The one who can only make a query run faster becomes a commodity, because in 2026 the platforms are increasingly making the query fast on your behalf.

The warehouse wars, the years of benchmark one-upmanship, are effectively over, not because one vendor won but because all of them are arriving at the same architecture. The contest that matters now is whose platform becomes the operating system for your agents. That is a question about your data's gravity and your governance, not about who wins a scan-speed race. Choose on the new criteria, because the old ones are already behind you.