Introduction

Same data, same engine, before and after tuning: what changes when hot partitions stop paying a per-file penalty.

A lakehouse can look cheap in storage and still be expensive to read.

The clue is usually a query that should be routine: yesterday’s data, one region, one status, a few columns. It hangs longer than it should, not because the engine is doing sophisticated analytics, but because it is working through too many files first. That overhead shows up in file listing, metadata evaluation, file-open cost, and the work required to decide what can be skipped.

That is the small-file tax. It builds quietly in the systems we actually run: micro-batches, CDC pipelines, frequent upserts, and incremental merges. Those patterns keep data fresh, but they also fragment the hottest part of the table. The storage bill may barely notice. The read path does.

Teams often misdiagnose this as a compute problem. They add more workers, and the query still spends too much time deciding what to read. Bigger clusters help less than they should when the table layout reflects ingest cadence more than query shape.

Why small files are expensive

Every file comes with fixed overhead.

Before the engine reads much useful data, it has to discover files, inspect metadata, use statistics, and decide whether partition pruning or file-level skipping can eliminate work. When a table contains thousands of undersized files, that fixed work starts to dominate.

The effect is easy to underestimate because it often hides in planning. Small-file tables spend more time getting ready to scan than they should. That leads to higher latency, more files touched, and more bytes read than the query really needed.

Predicate pushdown helps inside a file.
Pruning decides which files never needed to be read in the first place. If hot partitions are packed with tiny, poorly organized files, pushdown can only do so much.

The practical point is simple: the small-file problem is often a planning problem before it becomes a scan problem.

Benchmark setup

This piece is best read as a benchmark-informed engineering analysis, not a fresh benchmark report. I am not claiming new measured results here. The goal is to isolate layout as the variable and show how I would structure the comparison honestly.

Keep the engine the same. Keep the dataset the same. Change only the table layout.

A realistic setup would use one Spark-based fact table with columns such as event_ts, event_date, customer_id, region, event_type, order_status, and amount, partitioned by event_date. Then simulate frequent ingest into recent partitions so the table develops the same failure mode many production systems do: hot partitions filled with small files.

Run the same query set across three versions of the table:

● Baseline: many small files, no layout maintenance
● After compaction: fewer, better-sized files
● After clustering: same data, reorganized around common filter paths

The cleanest metrics are the ones operators already watch in production:

● file count in hot partitions
● average file size
● planning time
● total query runtime
● files scanned
● bytes read
● maintenance job runtime or rewritten bytes

That gives you an apples-to-apples way to ask the right question: how much of the query bill is really a file-layout problem?

Before tuning: what goes wrong

Before tuning, physical layout usually follows write cadence, not query shape.

Data lands every few minutes. Recent partitions collect another pile of small Parquet files. Analysts filter by event_date, region, customer_id, or order_status, while the table is effectively organized by when each write arrived.

Partition pruning still helps. It may eliminate older days quickly.
But that only gets you down to the hot partitions, which are often the messiest part of the table. If those partitions still contain too many small files, the engine has too many candidates to inspect.

That is why small-file tables often feel worse than their raw size suggests. A very large table can behave well if recent partitions are healthy. A much smaller table can feel slow if recent partitions are fragmented and badly laid out.

After tuning: what changes with compaction, clustering, and pruning

Once you separate the mechanics, the roles of the three controls become clearer.

Compaction reduces file count.

This is the first fix because it attacks the per-file penalty directly. Delta’s OPTIMIZE can compact small files into larger ones, and Delta’s auto compaction can do that automatically after writes. Iceberg’s rewrite_data_files does the same class of work through bin-packing. In Hudi, small-file management is broader: write-time auto-sizing and clustering address file layout generally, while compaction in the Hudi-specific sense applies to Merge-on-Read tables and merges log files back into base files.

Clustering improves locality.

Compaction alone can still leave you with a table that is neat but not selective. Clustering reorganizes data so values that are commonly filtered together live closer together. Delta supports ZORDER, and newer Delta versions also support liquid clustering for incrementally clustering data over time. Iceberg exposes sort-based and zorder(...) layouts through rewrite_data_files. Hudi supports clustering inline or asynchronously, including background operation while ingestion continues.

Pruning is where the engine collects the savings.

Delta uses automatically collected data-skipping statistics such as min and max values. Iceberg uses hidden partition transforms and metadata-driven planning so queries do not have to know the table’s physical layout.
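To make the min/max idea concrete, here is a minimal sketch of file skipping in plain Python. It is a toy model, not how Delta, Iceberg, or Hudi implement pruning: the names FileStats, build_stats, and files_to_scan are hypothetical, and the customer_id values are made up for illustration. It shows why the same rows in the same number of files can prune very differently depending on how they are laid out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def build_stats(values, rows_per_file, prefix):
    """Cut values into fixed-size files, in the given order, and record min/max."""
    stats = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        name = f"{prefix}-{start // rows_per_file:03d}.parquet"
        stats.append(FileStats(name, min(chunk), max(chunk)))
    return stats


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter range [lo, hi]."""
    return [f for f in files if f.max_value >= lo and f.min_value <= hi]


# Assumed workload: 10 micro-batches, each carrying customer_id 0..99.
arrival_order = [cid for _ in range(10) for cid in range(100)]

# Same 1,000 rows, same 10 files; only the row order differs.
unclustered = build_stats(arrival_order, 100, "unclustered")
clustered = build_stats(sorted(arrival_order), 100, "clustered")

# Filter: customer_id BETWEEN 40 AND 44
unclustered_hits = files_to_scan(unclustered, 40, 44)
clustered_hits = files_to_scan(clustered, 40, 44)

print(len(unclustered_hits), "of", len(unclustered))  # 10 of 10
print(len(clustered_hits), "of", len(clustered))      # 1 of 10
```

In arrival order, every file spans the full customer_id range, so the statistics cannot rule anything out. After sorting, each file covers a narrow range and the filter touches a single file.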
Hudi’s metadata table exists in part to avoid expensive file listing and to expose metadata such as file listings and column statistics for planning.

Better layout improves all three paths. The gains will vary by workload. Broad scans often benefit first from file-count reduction. More selective queries often benefit more when layout and statistics align with the columns people actually filter on.

What this means in practice

The operational lesson is not “run maintenance everywhere.” It is “run the right maintenance where the query bill is being generated.”

A few rules hold up well in practice:

● Measure hot partitions first. Whole-table size often hides where the pain actually lives.
● Fix file count before chasing elaborate layout. If the table is badly fragmented, compaction or file sizing is usually the first lever.
● Cluster around repeated predicates, not theoretical ones. Layout should follow the workload you really have.
● Treat maintenance as a workload. Compaction, clustering, and rewrite jobs consume real compute and rewrite real bytes.

One recurring mistake is trying to solve everything with partitioning alone. Delta’s clustering docs explicitly call out cases where a typical partition column would leave the table with too many or too few partitions. Iceberg’s hidden partitioning model exists in part to decouple query logic from rigid physical partition layout.

That is the real trade-off: not maintenance versus no maintenance, but where you want the cost to land.

Differences across Delta / Iceberg / Hudi

All three open table formats help with the same broad problem, but they expose different control surfaces.

Delta Lake exposes layout maintenance directly through OPTIMIZE, auto compaction, data skipping, and ZORDER. In newer Delta releases, liquid clustering adds an incremental clustering model for suitable tables, though it comes with its own feature and layout constraints.

Apache Iceberg leans heavily on metadata-driven planning.
Hidden partitioning, partition evolution, and metadata/manifests help the engine avoid work, while rewrite_data_files gives you bin-packing and sort-based rewrite paths, including zorder(...) support in Spark procedures.

Apache Hudi attacks the problem from both sides: it avoids small files during writes where possible, offers clustering as a table service, uses a metadata table to reduce file-listing bottlenecks, and on Merge-on-Read tables uses compaction to merge log files into base files. That makes Hudi especially natural in write-heavy and CDC-style systems.

Bottom line

A slow lakehouse is often a file-layout problem wearing a compute bill.

Compaction reduces file count. Clustering improves locality. Pruning is where the engine realizes the savings. Put together, they do more than speed up queries. They make read cost more predictable, especially on the hot partitions where modern pipelines do most of their damage.

That is why the small-file tax is such a useful way to frame the problem. It gives you a clean question: same data, same engine, before and after layout tuning, what changed in planning overhead, files scanned, and bytes read?

If you are working through those trade-offs now, I go deeper on these patterns in Engineering Lakehouses with Open Table Formats.

References

● Chapter 8 of Engineering Lakehouses with Open Table Formats
● Delta Lake Optimizations
● Delta Lake Liquid Clustering
● Apache Iceberg Partitioning and Hidden Partitioning
● Apache Iceberg Spark Procedures (rewrite_data_files)
● Apache Hudi Table Metadata
● Apache Hudi Compaction
● Apache Hudi File Sizing
● Apache Hudi Clustering

Author Bio

Vinoth Govindarajan is a seasoned data expert and staff software engineer at Apple Inc., where he spearheads data platforms using open-source technologies like Iceberg, Spark, Trino, and Flink. Before this, he worked on designing incremental ETL frameworks for real-time data processing at Uber.
He is a dedicated contributor to the open source community in projects such as Apache Hudi and dbt-spark. As a thought leader, Vinoth has shared his expertise through speaking engagements at conferences such as dbt Coalesce and Hudi OSS community meetups. He has published several blogs on building open lakehouses. Holding a bachelor's degree in information technology, Vinoth has also authored multiple research papers published in journals like IEEE.