The Data Lakehouse: A Middle Ground for Data Management
The humble data lakehouse emerged about eight years ago as organizations sought a middle ground between the anything-goes messiness of data lakes and the locked-down fussiness of data warehouses. The architectural pattern attracted some followers, but the growth wasn’t spectacular. However, as we kick off 2025, the data lakehouse is poised to grow quite robustly, thanks to a confluence of factors.
The Rise of Data Lakes
As the big data era dawned back in 2010, Hadoop was the hottest technology around. It provided a way to build large clusters of inexpensive, industry-standard x86 servers that could store and process petabytes of data far more cheaply than the pricey data warehouses and appliances, built on specialized hardware, that came before it.
Schema on Read vs. Schema on Write
By allowing customers to dump large amounts of semi-structured and unstructured data into a distributed file system, Hadoop clusters earned the nickname "data lakes." Customers could process and transform the data for their particular analytical needs on demand, in what is called a "schema on read" approach. This was quite different from the "schema on write" approach used by the typical data warehouse of the day.
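The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration of the concept, not any particular engine's API; the schema, function names, and record shapes are invented for the example:

```python
import json

# Schema on write: the store validates every record against a fixed
# schema before accepting it, the way a data warehouse does.
WAREHOUSE_SCHEMA = {"user_id": int, "amount": float}

def write_to_warehouse(table: list, record: dict) -> None:
    for field, ftype in WAREHOUSE_SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"rejected: {field!r} must be {ftype.__name__}")
    table.append(record)

# Schema on read: the lake accepts raw data as-is. Structure is imposed
# only when a query runs, so malformed rows surface at read time.
def write_to_lake(lake: list, raw: str) -> None:
    lake.append(raw)  # no validation on ingest

def read_from_lake(lake: list) -> list:
    rows = []
    for raw in lake:
        rec = json.loads(raw)
        rows.append({"user_id": int(rec["user_id"]),
                     "amount": float(rec["amount"])})
    return rows
```

The trade-off is visible in where the error shows up: the warehouse rejects a bad record at write time, while the lake happily stores anything and defers both parsing and failures to query time.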
From Data Lakes to Data Swamps
As the Hadoop experiment progressed, many customers discovered that their data lakes had turned into data swamps. While dumping raw data into HDFS or S3 radically increased the amount of data they could retain, it came at the cost of lower-quality data. Specifically, Hadoop lacked the controls customers needed to effectively manage their data, which eroded trust in Hadoop analytics.
The Solution: Table Formats
By the mid-2010s, several independent teams were working on a solution. The first was led by Vinoth Chandar, an engineer at Uber who needed to handle fast-changing data for the ride-sharing company. Chandar led the development of a table format that would allow Hadoop to process data more like a traditional database. He called it Hudi, which stood for Hadoop upserts, deletes, and incrementals. Uber deployed Hudi in 2016.
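What "upserts" and "deletes" buy you on top of an append-only file system can be shown with a toy keyed table in Python. This is a sketch of the concept, not Hudi's actual API; the record keys and field names are invented for the example:

```python
def upsert(table: dict, records: list) -> dict:
    """Insert new rows and overwrite existing ones by primary key,
    the way a Hudi-style table format reconciles an incoming batch.
    Plain HDFS files are append-only, so without a table format an
    update means rewriting whole files or partitions."""
    for rec in records:
        table[rec["key"]] = rec  # last write for a given key wins
    return table

def delete(table: dict, keys: list) -> dict:
    """Remove rows by key without touching unrelated data."""
    for k in keys:
        table.pop(k, None)
    return table
```

For example, a corrected trip fare arrives as a new record with the same key: an upsert replaces the stale row in place, instead of leaving two conflicting copies for every reader to reconcile.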
The Rise of Table Formats
A year later, two other teams launched similar solutions for HDFS and S3 data lakes. Netflix engineer Ryan Blue and Apple engineer Daniel Weeks worked together to create a table format called Iceberg that sought to bring ACID-like transaction capabilities and rollbacks to Apache Hive tables. The same year, Databricks launched Delta Lake, which melded the data management capabilities of data warehouses with its cloud data lake, bringing a tiered "good, better, best" approach to data management and data quality.
The Impact of Polaris and Tabular
The battle between Apache Iceberg and Delta Lake for table format dominance was at a stalemate. Then in June of 2024, Snowflake bolstered its support for Iceberg by launching a metadata catalog for Iceberg called Polaris (now Apache Polaris). A day later, Databricks responded by announcing the acquisition of Tabular, the Iceberg company founded by Blue, Weeks, and former Netflix engineer Jason Reid, for between $1 billion and $2 billion.
The State of the Data Lakehouse
Seven months later, that momentum is still going strong. Last week, Dremio published a new report, titled "State of the Data Lakehouse in the AI Era," which found growing support for data lakehouses, which are now assumed to be Iceberg-based by default.
Conclusion
The data lakehouse is poised for robust growth in 2025. The rise of open, Iceberg-based lakehouse platforms gives enterprises the freedom to choose the best query engine for their specific needs, rather than being locked into monolithic cloud platforms. As the data architecture landscape continues to evolve, demand for data lakehouses should only grow.
FAQs
Q: What is a data lakehouse?
A: A data lakehouse is an architecture that combines the cheap, flexible storage of a data lake with the transactional guarantees, governance, and data quality controls of a data warehouse.
Q: What are the benefits of a data lakehouse?
A: Data lakehouses provide a scalable and affordable way to store and process large amounts of data, while also providing the controls and governance needed to ensure data quality and trust.
Q: What are the key players in the data lakehouse market?
A: The key players in the data lakehouse market include Databricks, Snowflake, and AWS.
Q: What is the future of data lakehouses?
A: The future of data lakehouses is bright, with growing support for open, Iceberg-based lakehouse platforms and increasing demand for data lakehouses in the enterprise.

