Data Engineering | Tech News, Tutorials & Expert Insights

13 Apr 2026

5 min read

The Small-File Tax: How Compaction, Clustering, and Pruning Change Lakehouse Cost

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights, where every byte tells a story. Subscribe here to stay ahead in data engineering.

Introduction

Same data, same engine, before and after tuning: what changes when hot partitions stop paying a per-file penalty.

A lakehouse can look cheap in storage and still be expensive to read.

The clue is usually a query that should be routine: yesterday’s data, one region, one status, a few columns. It hangs longer than it should, not because the engine is doing sophisticated analytics, but because it is working through too many files first. That overhead shows up in file listing, metadata evaluation, file-open cost, and the work required to decide what can be skipped.

That is the small-file tax. It builds quietly in the systems we actually run: micro-batches, CDC pipelines, frequent upserts, and incremental merges. Those patterns keep data fresh, but they also fragment the hottest part of the table. The storage bill may barely notice. The read path does.

Teams often misdiagnose this as a compute problem. They add more workers, and the query still spends too much time deciding what to read. Bigger clusters help less than they should when the table layout reflects ingest cadence more than query shape.

Why small files are expensive

Every file comes with fixed overhead.

Before the engine reads much useful data, it has to discover files, inspect metadata, use statistics, and decide whether partition pruning or file-level skipping can eliminate work. When a table contains thousands of undersized files, that fixed work starts to dominate.

The effect is easy to underestimate because it often hides in planning. Small-file tables spend more time getting ready to scan than they should. That leads to higher latency, more files touched, and more bytes read than the query really needed.

Predicate pushdown helps inside a file.
Pruning decides which files never needed to be read in the first place. If hot partitions are packed with tiny, poorly organized files, pushdown can only do so much.

The practical point is simple: the small-file problem is often a planning problem before it becomes a scan problem.

Benchmark setup

This piece is best read as a benchmark-informed engineering analysis, not a fresh benchmark report. I am not claiming new measured results here. The goal is to isolate layout as the variable and show how I would structure the comparison honestly.

Keep the engine the same. Keep the dataset the same. Change only the table layout.

A realistic setup would use one Spark-based fact table with columns such as event_ts, event_date, customer_id, region, event_type, order_status, and amount, partitioned by event_date. Then simulate frequent ingest into recent partitions so the table develops the same failure mode many production systems do: hot partitions filled with small files.

Run the same query set across three versions of the table:

● Baseline: many small files, no layout maintenance
● After compaction: fewer, better-sized files
● After clustering: same data, reorganized around common filter paths

The cleanest metrics are the ones operators already watch in production:

● file count in hot partitions
● average file size
● planning time
● total query runtime
● files scanned
● bytes read
● maintenance job runtime or rewritten bytes

That gives you an apples-to-apples way to ask the right question: how much of the query bill is really a file-layout problem?

Before tuning: what goes wrong

Before tuning, physical layout usually follows write cadence, not query shape.

Data lands every few minutes. Recent partitions collect another pile of small Parquet files. Analysts filter by event_date, region, customer_id, or order_status, while the table is effectively organized by when each write arrived.

Partition pruning still helps. It may eliminate older days quickly.
But that only gets you down to the hot partitions, which are often the messiest part of the table. If those partitions still contain too many small files, the engine has too many candidates to inspect.

That is why small-file tables often feel worse than their raw size suggests. A very large table can behave well if recent partitions are healthy. A much smaller table can feel slow if recent partitions are fragmented and badly laid out.

After tuning: what changes with compaction, clustering, and pruning

Once you separate the mechanics, the roles of the three controls become clearer.

Compaction reduces file count.

This is the first fix because it attacks the per-file penalty directly. Delta’s OPTIMIZE can compact small files into larger ones, and Delta’s auto compaction can do that automatically after writes. Iceberg’s rewrite_data_files does the same class of work through bin-packing. In Hudi, small-file management is broader: write-time auto-sizing and clustering address file layout generally, while compaction in the Hudi-specific sense applies to Merge-on-Read tables and merges log files back into base files.

Clustering improves locality.

Compaction alone can still leave you with a table that is neat but not selective. Clustering reorganizes data so values that are commonly filtered together live closer together. Delta supports ZORDER, and newer Delta versions also support liquid clustering for incrementally clustering data over time. Iceberg exposes sort-based and zorder(...) layouts through rewrite_data_files. Hudi supports clustering inline or asynchronously, including background operation while ingestion continues.

Pruning is where the engine collects the savings.

Delta uses automatically collected data-skipping statistics such as min and max values. Iceberg uses hidden partition transforms and metadata-driven planning so queries do not have to know the table’s physical layout.
Hudi’s metadata table exists in part to avoid expensive file listing and to expose metadata such as file listings and column statistics for planning. Better layout improves all three paths. The gains will vary by workload. Broad scans often benefit first from file-count reduction. More selective queries often benefit more when layout and statistics align with the columns people actually filter on.

What this means in practice

The operational lesson is not “run maintenance everywhere.” It is “run the right maintenance where the query bill is being generated.”

A few rules hold up well in practice:

● Measure hot partitions first. Whole-table size often hides where the pain actually lives.
● Fix file count before chasing elaborate layout. If the table is badly fragmented, compaction or file sizing is usually the first lever.
● Cluster around repeated predicates, not theoretical ones. Layout should follow the workload you really have.
● Treat maintenance as a workload. Compaction, clustering, and rewrite jobs consume real compute and rewrite real bytes.

One recurring mistake is trying to solve everything with partitioning alone. Delta’s clustering docs explicitly call out cases where a typical partition column would leave the table with too many or too few partitions. Iceberg’s hidden partitioning model exists in part to decouple query logic from rigid physical partition layout.

That is the real trade-off: not maintenance versus no maintenance, but where you want the cost to land.

Differences across Delta / Iceberg / Hudi

All three open table formats help with the same broad problem, but they expose different control surfaces.

Delta Lake exposes layout maintenance directly through OPTIMIZE, auto compaction, data skipping, and ZORDER. In newer Delta releases, liquid clustering adds an incremental clustering model for suitable tables, though it comes with its own feature and layout constraints.

Apache Iceberg leans heavily on metadata-driven planning.
Hidden partitioning, partition evolution, and metadata/manifests help the engine avoid work, while rewrite_data_files gives you bin-packing and sort-based rewrite paths, including zorder(...) support in Spark procedures.

Apache Hudi attacks the problem from both sides: it avoids small files during writes where possible, offers clustering as a table service, uses a metadata table to reduce file-listing bottlenecks, and on Merge-on-Read tables uses compaction to merge log files into base files. That makes Hudi especially natural in write-heavy and CDC-style systems.

Bottom line

A slow lakehouse is often a file-layout problem wearing a compute bill.

Compaction reduces file count. Clustering improves locality. Pruning is where the engine realizes the savings. Put together, they do more than speed up queries. They make read cost more predictable, especially on the hot partitions where modern pipelines do most of their damage.

That is why the small-file tax is such a useful way to frame the problem. It gives you a clean question: same data, same engine, before and after layout tuning, what changed in planning overhead, files scanned, and bytes read?

If you are working through those trade-offs now, I go deeper on these patterns in Engineering Lakehouses with Open Table Formats.

References

● Chapter 8 of Engineering Lakehouses with Open Table Formats
● Delta Lake Optimizations
● Delta Lake Liquid Clustering
● Apache Iceberg Partitioning and Hidden Partitioning
● Apache Iceberg Spark Procedures (rewrite_data_files)
● Apache Hudi Table Metadata
● Apache Hudi Compaction
● Apache Hudi File Sizing
● Apache Hudi Clustering

Author Bio

Vinoth Govindarajan is a seasoned data expert and staff software engineer at Apple Inc., where he spearheads data platforms using open-source technologies like Iceberg, Spark, Trino, and Flink. Before this, he worked on designing incremental ETL frameworks for real-time data processing at Uber.
He is a dedicated contributor to the open source community in projects such as Apache Hudi and dbt-spark. As a thought leader, Vinoth has shared his expertise through speaking engagements at conferences such as dbt Coalesce and Hudi OSS community meetups. He has published several blogs on building open lakehouses. Holding a bachelor's degree in information technology, Vinoth has also authored multiple research papers published in journals like IEEE.
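
A companion sketch for the small-file article above: the two simplest benchmark metrics it lists (file count and average file size per partition) and the bin-packing idea behind compaction can be illustrated in a few lines of Python. Everything here is hypothetical and for illustration only. The names DataFile, partition_stats, and plan_bin_pack are invented, the 128 MB target is an assumed value, and the greedy first-fit-decreasing heuristic is a stand-in, not the algorithm that Delta's OPTIMIZE, Iceberg's rewrite_data_files, or Hudi's file sizing actually use.

```python
from dataclasses import dataclass
from typing import Dict, List

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical record for one data file in a table listing."""
    partition: str   # e.g. "event_date=2026-04-12"
    size_bytes: int


def partition_stats(files: List[DataFile]) -> Dict[str, Dict[str, float]]:
    """Return file count and average file size (MB) for each partition."""
    grouped: Dict[str, List[int]] = {}
    for f in files:
        grouped.setdefault(f.partition, []).append(f.size_bytes)
    return {
        part: {
            "file_count": len(sizes),
            "avg_size_mb": sum(sizes) / len(sizes) / MB,
        }
        for part, sizes in grouped.items()
    }


def plan_bin_pack(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group input file sizes into output files of at most target_bytes.

    Greedy first-fit-decreasing: take files largest first and place each one
    in the first output group that still has room. A file larger than the
    target gets a group of its own.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each output group
    for size in sorted(sizes, reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            bins.append([size])
            free.append(max(target_bytes - size, 0))
    return bins


# Assumed toy data: a fragmented hot partition and a healthy older one.
hot = [DataFile("event_date=2026-04-12", 4 * MB) for _ in range(64)]
cold = [DataFile("event_date=2026-03-01", 128 * MB) for _ in range(2)]

for part, stats in partition_stats(hot + cold).items():
    print(part, stats)

plan = plan_bin_pack([f.size_bytes for f in hot], target_bytes=128 * MB)
print("input files:", len(hot), "planned output files:", len(plan))
```

In this toy case the hot partition holds 256 MB spread over 64 files, so a 128 MB target yields two planned output files. The data volume is unchanged; only the number of files the planner has to consider drops, which is the per-file overhead the article describes.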

Augusto Rosa

12 Jun 2026

5 min read

Loco for CoCo: What Snowflake Summit 2026 Was Really About

By Augusto Rosa, Snowflake Data Superhero and Head of Data, Cloud and Security Architecture at Archetype Consulting

TL;DR

Summit 2026 was a victory lap for Snowflake CoCo, the coding agent that went from launch to more than 7,100 accounts in four months. The bigger story is what sits underneath it: a stable platform that now carries an enterprise agentic layer easy enough for anyone to use, and an application platform that enterprises are already running internal tools on. The uncomfortable questions on the floor were about BI tools and standalone data catalogs. In my view, BI survives at least through next year. Executives still need their KPI dashboards. Standalone catalogs have a harder conversation coming.

CoCo's Breakout Year

More than 20,000 people came through Moscone Center over four days, and the energy was the best I have felt at a Summit. Walk the expo floor, and almost every booth led with the same word: agentic. When every vendor reaches for the same adjective, it stops carrying information, but the repetition tells you what everyone is talking about. The side events reflected the same thing. The AI sessions I attended were full of legitimate questions: less "what model should I use?" and more "where is the business benefit, how do I get started, and how do I prove it?"

The product Snowflake chose to celebrate was CoCo. Cortex Code launched in February and grew to more than 7,100 accounts in four months, the fastest-growing product in the company's history. At Summit, it officially picked up the CoCo name, which insiders had been using for a while.
The Summit announcements were about meeting builders wherever they work: a desktop app, an Excel plugin, a VS Code extension, a Claude Code marketplace entry, and a Slack bot. For me, CoCo is even better than Cloud Agents. Tasks run in isolated containers inside Snowflake's perimeter, async and scheduled, so a pipeline build keeps going after you close your laptop. That was the difference between the agent that helps me code and the agent I can deploy on calls. I am already busy planning an agent that will do a lot for me when I engage with my clients and make me even more efficient.

Easy agents, boring platform

The rebrand pair tells you the strategy. CoCo is the control plane for builders. Snowflake Intelligence became CoWork, the control plane for everyone else: one personal agent with routing, memory, scheduled tasks, and governed artifacts you can certify and publish, with Deep Research reaching GA soon.

CoWork is easy to use because the hard parts are embedded into the platform. Horizon AI guardrails went GA with protection against prompt injection and jailbreaking across both agents. Agent identity, in preview, gives every agent action a traceable identity in the audit log, so you can tell "an analyst ran this" from "an agent ran this" at a glance. Intent-driven governance lets you state "protect all PII" and have Snowflake write and maintain the policies. Underneath all of it sit the platform improvements: Adaptive Compute, which sizes warehouses from a performance target, and a new query compiler with roughly 40x faster compile times.

The Snowflake product mantra of making the product easy to implement still applies, and it was clear across the announcements. I still found myself asking the product teams to push even further in places like Iceberg. They are.

Snowflake's Application Platform Takes Shape

The least flashy announcements were neat and useful. App Runtime, now in preview, runs Node.js and full React apps next to the data, deployed with a one-line command.
Streamlit in Snowflake went GA on the container runtime. Snowflake Postgres is GA, with managed mirroring into the analytical engine in preview. Put it together, and you have data, transformation, agents, and the application itself inside one security perimeter. Enterprises are already using this for internal tools, and that is the right first market, because internal tools rely more heavily on internal data and need to be secured.

That progression explains the question I was asked more than once on the floor: what is the point of BI tools now? My answer is that they will still be around next year, and not just out of inertia. Tools like Sigma are useful precisely because they are moving in the same direction, letting customers build applications on top of the spreadsheet interface. I have seen teams replace accounting workflows that lived in Excel with Sigma applications. BI may not be dying, but it is being squeezed from two sides: agents are taking the ad hoc questions, and application platforms are absorbing the operational workflows. The middle that remains is smaller than vendors would like, but it is still big.

Why Context Is Becoming the Real Moat

Shravan Deolalikar posted three takeaways from the Summit that are worth mentioning as well. First, governance is shifting from "can this user access this data?" to "should this agent perform this action?", which is a different question requiring different machinery. Second, everyone is converging on the same destination: Snowflake, Atlan, ServiceNow, and Salesforce are all positioned as the context orchestration and governance layer for agents. Third, metadata extraction is commoditizing, and the hard part is encoding the business model, so platforms with opinionated industry ontologies will win.

One exhibit caught my attention at Summit: "The Battle for the Dataverse" captured a theme that showed up repeatedly throughout the event: context, interoperability, and who ultimately owns the layer that helps agents understand business data.

I agree with all three takeaways, and I would push the second one further. Snowflake is betting on keeping context inside the platform. Horizon Context collects semantic views and metadata from dbt, Tableau, and Airflow so agents know what the data means, not just the schema. Cortex Sense enriches that context at runtime from query history and activity, and Snowflake claims it lifts agent accuracy on complex queries from 47% to 83%. The Natoma acquisition adds governed MCP access to more than 100 business systems without leaving the security perimeter.

That is a structural problem for vendors whose entire product is a data catalog. If the context layer lives where the data and the agents live, a catalog that only mirrors that context is a feature, not a company. Atlan, for example, now calls itself a Context company, not a catalog. Horizon is not yet a business data catalog. At the pace Snowflake shipped this year, I expect it to get there within twelve months.

I see Summit 2026 as Snowflake answering everyone who doubted it could do AI for the enterprise. The agentic platform is live, easy to use, and being adopted fast. The application platform is well on its way and already getting used by enterprises. And CoCo lets you build on both in a quarter of the time it used to take, maybe less.

If Snowflake Summit 2026 left one message behind, it is that Snowflake is no longer just a data warehouse. It is becoming a platform for governed data, AI, agents, and applications.
For readers who want to go deeper into building on that platform, the upcoming Snowflake Cookbook, Second Edition from Packt offers practical recipes for designing governed, intelligent, AI-ready data platforms in the Snowflake AI Data Cloud. You can explore the book here:

Author Bio

Augusto Rosa is a technology leader with 20+ years of experience building and scaling software, data, cloud, and security capabilities. He’s recognized in the Snowflake community as a Snowflake Data Superhero and Snowflake Subject Matter Expert, and he regularly shares practical patterns for modern data engineering and governance.

Across consulting and product environments, Augusto has led teams delivering cloud platforms and data solutions across industries, including financial services, telecom, media, and technology. He contributes heavily to the community as a Toronto Snowflake User Group organizer and as a mentor with Rogers Cybersecure Catalyst at Toronto Metropolitan University, supporting cybersecurity and fintech startups in Canada.

Laurent Leturgez, Lead product specialist - Data Warehouse Modernization, Databricks

24 Jun 2026

5 min read

Back from Data + AI Summit 2026: The Announcements That Matter for a Data Warehouse Modernization Program

By Laurent Leturgez, Lead Product Specialist - Data Warehouse Modernization, Databricks

More than 30,000 attendees filled Moscone this year. Most of the coverage went to agents, but for anyone responsible for modernizing a data warehouse the more useful story was elsewhere: transactional, real-time and analytical workloads converging onto one governed copy of data, open table formats reaching general availability, governance and cost moving to the center of the platform, and sharing becoming an open protocol. The event made the target architecture clearer than it has been for some time, and it lowered several of the practical barriers that usually stall a migration.

The mood at Moscone

Most booths on the expo floor opened with the same word: agentic. The conversations that held my attention concerned something more practical, namely the infrastructure that makes agents safe in production: where the data lives, which identities (human or machine) are allowed to act on it, and how those actions get audited afterward. The co-founders carried most of the stage, with guest appearances from Satya Nadella, OpenAI's Greg Brockman, and PepsiCo on the customer side. The audience has clearly broadened, with application and platform teams now sitting alongside the data engineers.

What follows is a grouped summary of the announcements, with a focus on the ones that change the calculus for a warehouse migration or modernization program.

Architecture: one governed copy of data

The headline architectural theme was consolidation.
LTAP, Lake Transactional/Analytical Processing, unifies transactional, analytical and operational workloads on a single copy of storage in the lake under one governance model. The component that makes the idea concrete is Lakebase, serverless Postgres on open object storage, which reached general availability as the low-latency, transactional read/write layer. Alongside it, Lakehouse//RT brought real-time analytics directly onto the same governed data, and Reyden was introduced as the fastest query engine Databricks has built.

For two decades the standard pattern placed an OLTP database on one side, a warehouse on the other, ETL in between and copies of data scattered across both. This set of announcements aims at that separation. The direction is consistent: fewer copies, fewer pipelines and one place to apply governance.

Open formats and interoperability

For anyone worried about lock-in, this was the most reassuring part of the week. Iceberg v3 and Managed Iceberg both reached general availability, geospatial types in Delta and Iceberg v3 went GA, and external read access to managed Delta tables entered public preview. OpenSharing, an open protocol contributed to the Linux Foundation, extends the zero-copy approach of Delta Sharing to the agent era, letting agent skills, models and unstructured data move across organizations and platforms without copying files or depending on a proprietary marketplace.

The pattern across all of it is that the industry is settling on open table formats and open protocols as the shared substrate, then competing on what gets built above them. That is a healthy signal whichever platform you currently run, and it matters directly for migration, which I come back to below.

Governance and control

Governance was where the heaviest general-availability work landed. Attribute-based access control reached GA for row filtering and column masking, and external lineage went GA across upstream sources and downstream BI tools.
A new Governance Hub entered private preview as a single place to monitor posture across data, AI, cost and performance. Unity Catalog also continues to operate as a single governance plane over external catalogs through catalog federation, querying that data in place with consistent access control, lineage and audit. Mastercard presented this running across Databricks and AWS.

Cost as a first-class concern

A more candid theme this year was cost. The message was that agentic workloads will get expensive, and that teams need visibility and control before the bills arrive. The Unity AI Gateway, now in beta with contextual service policies, governs every model, tool and agent through one set of access controls, cost monitoring and smart routing across both Databricks-hosted and external models. Treating cost discipline as a platform feature rather than an afterthought was a notable shift in tone.

Agents and Genie, in brief

On the agent side, Genie Ontology was announced as a self-improving context layer that learns business knowledge from data, documents and workplace apps, and Genie One reached general availability as an agentic coworker that answers questions against governed data through SQL and produces reports and artifacts. Omnigent was introduced as an open layer for supervising agents that orchestrate other agents. These are relevant to modernization mainly because they raise the value of having clean, governed, well-modeled data underneath, which is exactly what a good migration delivers.

What this means for a migration or modernization program

This is the part closest to my own work. Several of the announcements change the economics and the risk profile of moving off a traditional data warehouse.

First, open formats lower the cost of moving. With Managed Iceberg and Iceberg v3 generally available, tables are no longer tied to a single engine, so a migration becomes a staged exercise instead of a single high-risk cutover.
Second, catalog federation removes the need to lift and shift on day one. You can place a single governance plane over your existing catalogs and query the data in place while you migrate one workload at a time, which carries far less risk than the big-bang approach that has stalled so many programs.

Third, consolidation through LTAP, Lakebase and Lakehouse//RT removes part of the original rationale for a separate warehouse. When transactional, real-time and analytical workloads share one governed copy of data, much of the pipeline sprawl that justified a standalone warehouse no longer needs to exist.

Fourth, the governance and cost work matters more than it first appears. A migration is rarely blocked by technology alone. It stalls on the inability to predict spend and to prove control. ABAC at GA, external lineage, the Governance Hub and the Unity AI Gateway's cost routing give a program the guardrails that finance and security teams ask for before they sign off.

None of this makes migration trivial. The hard parts remain, especially the translation of SQL and procedural SQL dialects. The new agentic capabilities of Lakebridge presented at DAIS will ease the code migration process and reduce the time to migrate.

Many of the themes announced at Data + AI Summit 2026, such as open formats, governance, federation, and workload consolidation, are the same patterns organizations are using today to modernize their analytics platforms. For readers interested in a deeper dive into the migration strategies and architectural trade-offs behind these shifts, my book, Modernizing Analytics Beyond the Data Warehouse, explores the topic in detail.

Closing thoughts

Setting the agent narrative aside, Data + AI Summit 2026 was about removing seams: between transactional and analytical processing, between batch and real time, between platforms through open sharing, and between separate catalogs through federation.
For a modernization program the practical advice holds whatever your vendor preference: commit to open formats, plan the governance and cost layer early, and treat migration as a staged journey rather than a single event.

Author Bio

Laurent Léturgez is a data platform specialist with over 20 years of experience in database systems. He works as a Product Specialist for data warehouse migrations at Databricks, where he helps organizations modernize their data warehouses on Databricks. Previously, as an Oracle Certified Master and Oracle ACE, he spent years working with Oracle technologies, from database administration and architecture to performance tuning and consulting across Europe. This rare combination of deep legacy database expertise and modern data engineering knowledge gives him a unique practitioner’s perspective on the challenges and opportunities of data warehouse modernization. He is based in Lille, France.
