
Author Posts - Data Engineering

2 Articles
Augusto Rosa
12 Jun 2026
5 min read

Loco for CoCo: What Snowflake Summit 2026 Was Really About

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights, where every byte tells a story. Subscribe here to stay ahead in data engineering.

By Augusto Rosa, Snowflake Data Superhero and Head of Data, Cloud and Security Architecture at Archetype Consulting

TL;DR

Summit 2026 was a victory lap for Snowflake CoCo, the coding agent that went from launch to more than 7,100 accounts in four months. The bigger story is what sits underneath it: a stable platform that now carries an enterprise agentic layer easy enough for anyone to use, and an application platform that enterprises are already running internal tools on. The uncomfortable questions on the floor were about BI tools and standalone data catalogs. In my view, BI survives at least through next year, because executives still need their KPI dashboards. Standalone catalogs have a harder conversation coming.

CoCo's Breakout Year

More than 20,000 people came through Moscone Center over four days, and the energy was the best I have felt at a Summit. Walk the expo floor, and almost every booth led with the same word: agentic. When every vendor reaches for the same adjective, it stops carrying information, but the repetition tells you what everyone is talking about. The side events reflected the same thing. The AI sessions I attended were full of legitimate questions: less "what model should I use?" and more "where is the business benefit, how do I get started, and how do I prove it?"

The product Snowflake chose to celebrate was CoCo. Cortex Code launched in February and grew to more than 7,100 accounts in four months, making it the fastest-growing product in the company's history. At Summit, it officially picked up the CoCo name, which insiders had been using for a while.
The Summit announcements were about meeting builders wherever they work: a desktop app, an Excel plugin, a VS Code extension, a Claude Code marketplace entry, and a Slack bot. For me, CoCo is even better than Cloud Agents. Tasks run in isolated containers inside Snowflake's perimeter, async and scheduled, so a pipeline build keeps going after you close your laptop. That is the difference between the agent that helps me code and the agent I can deploy on calls. I am already busy planning an agent that will do a lot for me when I engage with my clients and make me even more efficient.

Easy Agents, Boring Platform

The rebrand pair tells you the strategy. CoCo is the control plane for builders. Snowflake Intelligence became CoWork, the control plane for everyone else: one personal agent with routing, memory, scheduled tasks, and governed artifacts you can certify and publish, with Deep Research going GA soon.

CoWork is easy to use because the hard parts are embedded in the platform. Horizon AI guardrails went GA with protection against prompt injection and jailbreaking across both agents. Agent identity, in preview, gives every agent action a traceable identity in the audit log, so you can tell "an analyst ran this" from "an agent ran this" at a glance. Intent-driven governance lets you state "protect all PII" and have Snowflake write and maintain the policies.

Underneath all of it sit the platform improvements: Adaptive Compute, which sizes warehouses from a performance target, and a new query compiler with roughly 40x faster compile times. The Snowflake product mantra of making the product easy to implement still applies, and it was clear across the announcements. I still found myself asking the product teams to push even further in places like Iceberg. They are.

Snowflake's Application Platform Takes Shape

The least flashy announcements were very neat and useful. App Runtime, now in preview, runs Node.js and full React apps next to the data, deployed with a one-line command.
Streamlit in Snowflake went GA on the container runtime. Snowflake Postgres is GA, with managed mirroring into the analytical engine in preview. Put it together, and you have data, transformation, agents, and the application itself inside one security perimeter. Enterprises are already using this for internal tools, and that is the right first market, as internal tools require more internal data and need to be secured.

That progression explains the question I was asked more than once on the floor: what is the point of BI tools now? My answer is that they will still be around next year, and not just out of inertia. Tools like Sigma are useful precisely because they are moving in the same direction, letting customers build applications on top of the spreadsheet interface. I have seen teams replace accounting workflows that lived in Excel with Sigma applications. BI may not be dying, but it is being squeezed from two sides: agents are taking the ad hoc questions, and application platforms are absorbing the operational workflows. The middle that remains is smaller than vendors would like, but it is still big.

Why Context Is Becoming the Real Moat

Shravan Deolalikar posted three takeaways from the Summit that are worth mentioning as well. First, governance is shifting from "can this user access this data?" to "should this agent perform this action?", which is a different question requiring different machinery. Second, everyone is converging on the same destination: Snowflake, Atlan, ServiceNow, and Salesforce are all positioned as the context orchestration and governance layer for agents. Third, metadata extraction is commoditizing, and the hard part is encoding the business model, so platforms with opinionated industry ontologies will win.

One exhibit caught my attention at Summit. "The Battle for the Dataverse" captured a theme that showed up repeatedly throughout the event: context, interoperability, and who ultimately owns the layer that helps agents understand business data.

I agree with all three takeaways, and I would push the second one further. Snowflake is betting on keeping context inside the platform. Horizon Context collects semantic views and metadata from dbt, Tableau, and Airflow so agents know what the data means, not just the schema. Cortex Sense enriches that context at runtime from query history and activity, and Snowflake claims it lifts agent accuracy on complex queries from 47% to 83%. The Natoma acquisition adds governed MCP access to more than 100 business systems without leaving the security perimeter.

That is a structural problem for vendors whose entire product is a data catalog. If the context layer lives where the data and the agents live, a catalog that only mirrors that context is a feature, not a company. Atlan, for example, now calls itself a context company, not a catalog. Horizon is not yet a business data catalog. At the pace Snowflake shipped this year, I expect it to get there within twelve months.

I see Summit 2026 as Snowflake answering everyone who doubted it could do AI for the enterprise. The agentic platform is live, easy to use, and being adopted fast. The application platform is well on its way and already being used by enterprises. And CoCo lets you build on both in a quarter of the time it used to take, maybe less.

If Snowflake Summit 2026 left one message behind, it is that Snowflake is no longer just a data warehouse. It is becoming a platform for governed data, AI, agents, and applications.
For readers who want to go deeper into building on that platform, the upcoming Snowflake Cookbook, Second Edition from Packt offers practical recipes for designing governed, intelligent, AI-ready data platforms in the Snowflake AI Data Cloud. You can explore the book here:

Author Bio

Augusto Rosa is a technology leader with 20+ years of experience building and scaling software, data, cloud, and security capabilities. He’s recognized in the Snowflake community as a Snowflake Data Superhero and Snowflake Subject Matter Expert, and he regularly shares practical patterns for modern data engineering and governance.

Across consulting and product environments, Augusto has led teams delivering cloud platforms and data solutions across industries, including financial services, telecom, media, and technology. He contributes heavily to the community as a Toronto Snowflake User Group organizer and as a mentor with Rogers Cybersecure Catalyst at Toronto Metropolitan University, supporting cybersecurity and fintech startups in Canada.

Vinoth Govindarajan
13 Apr 2026
5 min read

The Small-File Tax: How Compaction, Clustering, and Pruning Change Lakehouse Cost

Introduction

Same data, same engine, before and after tuning: what changes when hot partitions stop paying a per-file penalty.

A lakehouse can look cheap in storage and still be expensive to read.

The clue is usually a query that should be routine: yesterday’s data, one region, one status, a few columns. It hangs longer than it should, not because the engine is doing sophisticated analytics, but because it is working through too many files first. That overhead shows up in file listing, metadata evaluation, file-open cost, and the work required to decide what can be skipped.

That is the small-file tax. It builds quietly in the systems we actually run: micro-batches, CDC pipelines, frequent upserts, and incremental merges. Those patterns keep data fresh, but they also fragment the hottest part of the table. The storage bill may barely notice. The read path does.

Teams often misdiagnose this as a compute problem. They add more workers, and the query still spends too much time deciding what to read. Bigger clusters help less than they should when the table layout reflects ingest cadence more than query shape.

Why small files are expensive

Every file comes with fixed overhead.

Before the engine reads much useful data, it has to discover files, inspect metadata, use statistics, and decide whether partition pruning or file-level skipping can eliminate work. When a table contains thousands of undersized files, that fixed work starts to dominate.

The effect is easy to underestimate because it often hides in planning. Small-file tables spend more time getting ready to scan than they should. That leads to higher latency, more files touched, and more bytes read than the query really needed.

Predicate pushdown helps inside a file.
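The fixed per-file overhead described so far can be pictured with a toy cost model. This is a minimal sketch, not a benchmark: the class name `ReadCostModel`, the 20 ms per-file overhead, and the 200 MiB/s scan rate are hypothetical values chosen only to show the shape of the effect, and the model deliberately ignores parallelism, caching, and pruning.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ReadCostModel:
    """Toy model: each file costs a fixed overhead before any bytes are scanned."""

    per_file_overhead_s: float   # assumed cost to list, open, and inspect one file
    scan_throughput_bps: float   # assumed bytes scanned per second

    def fixed_seconds(self, file_count: int) -> float:
        # Overhead that grows with the number of files, not with the data volume.
        return file_count * self.per_file_overhead_s

    def estimate_seconds(self, file_count: int, total_bytes: int) -> float:
        # Total read time = per-file overhead + time to scan the bytes.
        return self.fixed_seconds(file_count) + total_bytes / self.scan_throughput_bps

    def overhead_share(self, file_count: int, total_bytes: int) -> float:
        # Fraction of the total time spent on per-file overhead.
        return self.fixed_seconds(file_count) / self.estimate_seconds(file_count, total_bytes)


# Hypothetical numbers, for illustration only.
model = ReadCostModel(per_file_overhead_s=0.02, scan_throughput_bps=200 * MIB)
partition_bytes = 10 * 1024 * MIB          # one 10 GiB hot partition

# Same bytes, two layouts: 1 MiB files versus 128 MiB files.
fragmented_s = model.estimate_seconds(file_count=10_240, total_bytes=partition_bytes)
compacted_s = model.estimate_seconds(file_count=80, total_bytes=partition_bytes)

fragmented_share = model.overhead_share(10_240, partition_bytes)
compacted_share = model.overhead_share(80, partition_bytes)

print(f"fragmented: {fragmented_s:.1f}s, overhead share {fragmented_share:.0%}")
print(f"compacted:  {compacted_s:.1f}s, overhead share {compacted_share:.0%}")
```

Because the byte count is identical in both calls, the only term that moves is the file-count term, which mirrors the idea of holding the engine and dataset constant and changing only the layout.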
Pruning decides which files never needed to be read in the first place. If hot partitions are packed with tiny, poorly organized files, pushdown can only do so much.

The practical point is simple: the small-file problem is often a planning problem before it becomes a scan problem.

Benchmark setup

This piece is best read as a benchmark-informed engineering analysis, not a fresh benchmark report. I am not claiming new measured results here. The goal is to isolate layout as the variable and show how I would structure the comparison honestly.

Keep the engine the same. Keep the dataset the same. Change only the table layout.

A realistic setup would use one Spark-based fact table with columns such as event_ts, event_date, customer_id, region, event_type, order_status, and amount, partitioned by event_date. Then simulate frequent ingest into recent partitions so the table develops the same failure mode many production systems do: hot partitions filled with small files.

Run the same query set across three versions of the table:

●  Baseline: many small files, no layout maintenance
●  After compaction: fewer, better-sized files
●  After clustering: same data, reorganized around common filter paths

The cleanest metrics are the ones operators already watch in production:

●  file count in hot partitions
●  average file size
●  planning time
●  total query runtime
●  files scanned
●  bytes read
●  maintenance job runtime or rewritten bytes

That gives you an apples-to-apples way to ask the right question: how much of the query bill is really a file-layout problem?

Before tuning: what goes wrong

Before tuning, physical layout usually follows write cadence, not query shape.

Data lands every few minutes. Recent partitions collect another pile of small Parquet files. Analysts filter by event_date, region, customer_id, or order_status, while the table is effectively organized by when each write arrived.

Partition pruning still helps. It may eliminate older days quickly.
But that only gets you down to the hot partitions, which are often the messiest part of the table. If those partitions still contain too many small files, the engine has too many candidates to inspect.

That is why small-file tables often feel worse than their raw size suggests. A very large table can behave well if recent partitions are healthy. A much smaller table can feel slow if recent partitions are fragmented and badly laid out.

After tuning: what changes with compaction, clustering, and pruning

Once you separate the mechanics, the roles of the three controls become clearer.

Compaction reduces file count. This is the first fix because it attacks the per-file penalty directly. Delta’s OPTIMIZE can compact small files into larger ones, and Delta’s auto compaction can do that automatically after writes. Iceberg’s rewrite_data_files does the same class of work through bin-packing. In Hudi, small-file management is broader: write-time auto-sizing and clustering address file layout generally, while compaction in the Hudi-specific sense applies to Merge-on-Read tables and merges log files back into base files.

Clustering improves locality. Compaction alone can still leave you with a table that is neat but not selective. Clustering reorganizes data so values that are commonly filtered together live closer together. Delta supports ZORDER, and newer Delta versions also support liquid clustering for incrementally clustering data over time. Iceberg exposes sort-based and zorder(...) layouts through rewrite_data_files. Hudi supports clustering inline or asynchronously, including background operation while ingestion continues.

Pruning is where the engine collects the savings. Delta uses automatically collected data-skipping statistics such as min and max values. Iceberg uses hidden partition transforms and metadata-driven planning so queries do not have to know the table’s physical layout.
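The interaction between clustering and min/max statistics can be illustrated with a small sketch. It assumes a generic skipping rule (keep a file only if the filter value falls inside its recorded range) and is not the actual planner logic of Delta, Iceberg, or Hudi; `FileStats`, `files_to_scan`, and the key ranges are hypothetical names and values made up for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Per-file column statistics, as a planner might read them from metadata."""

    path: str
    min_customer_id: int
    max_customer_id: int


def files_to_scan(files: List[FileStats], customer_id: int) -> List[FileStats]:
    # A file can be skipped when the value lies outside its [min, max] range.
    return [f for f in files if f.min_customer_id <= customer_id <= f.max_customer_id]


# Unclustered layout: rows arrived in write order, so every file spans
# almost the whole key range and the statistics cannot rule anything out.
unclustered = [
    FileStats(f"part-{i}.parquet", 1 + i, 9_000 + i) for i in range(4)
]

# Clustered layout: same number of files, but each holds a narrow,
# non-overlapping slice of customer_id values.
clustered = [
    FileStats(f"part-{i}.parquet", i * 2_500 + 1, (i + 1) * 2_500) for i in range(4)
]

unclustered_hits = files_to_scan(unclustered, customer_id=4_200)
clustered_hits = files_to_scan(clustered, customer_id=4_200)

print(len(unclustered_hits), "files scanned without clustering")
print(len(clustered_hits), "files scanned with clustering")
```

The file count is the same in both layouts; what changes is how much each file's statistics overlap, which is why compaction alone can leave a table tidy but not selective.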
Hudi’s metadata table exists in part to avoid expensive file listing and to expose metadata such as file listings and column statistics for planning. Better layout improves all three paths. The gains will vary by workload. Broad scans often benefit first from file-count reduction. More selective queries often benefit more when layout and statistics align with the columns people actually filter on.

What this means in practice

The operational lesson is not “run maintenance everywhere.” It is “run the right maintenance where the query bill is being generated.”

A few rules hold up well in practice:

●  Measure hot partitions first. Whole-table size often hides where the pain actually lives.
●  Fix file count before chasing elaborate layout. If the table is badly fragmented, compaction or file sizing is usually the first lever.
●  Cluster around repeated predicates, not theoretical ones. Layout should follow the workload you really have.
●  Treat maintenance as a workload. Compaction, clustering, and rewrite jobs consume real compute and rewrite real bytes.

One recurring mistake is trying to solve everything with partitioning alone. Delta’s clustering docs explicitly call out cases where a typical partition column would leave the table with too many or too few partitions. Iceberg’s hidden partitioning model exists in part to decouple query logic from rigid physical partition layout.

That is the real trade-off: not maintenance versus no maintenance, but where you want the cost to land.

Differences across Delta / Iceberg / Hudi

All three open table formats help with the same broad problem, but they expose different control surfaces.

Delta Lake exposes layout maintenance directly through OPTIMIZE, auto compaction, data skipping, and ZORDER. In newer Delta releases, liquid clustering adds an incremental clustering model for suitable tables, though it comes with its own feature and layout constraints.

Apache Iceberg leans heavily on metadata-driven planning.
Hidden partitioning, partition evolution, and metadata/manifests help the engine avoid work, while rewrite_data_files gives you bin-packing and sort-based rewrite paths, including zorder(...) support in Spark procedures.

Apache Hudi attacks the problem from both sides: it avoids small files during writes where possible, offers clustering as a table service, uses a metadata table to reduce file-listing bottlenecks, and on Merge-on-Read tables uses compaction to merge log files into base files. That makes Hudi especially natural in write-heavy and CDC-style systems.

Bottom line

A slow lakehouse is often a file-layout problem wearing a compute bill.

Compaction reduces file count. Clustering improves locality. Pruning is where the engine realizes the savings. Put together, they do more than speed up queries. They make read cost more predictable, especially on the hot partitions where modern pipelines do most of their damage.

That is why the small-file tax is such a useful way to frame the problem. It gives you a clean question: same data, same engine, before and after layout tuning, what changed in planning overhead, files scanned, and bytes read?

If you are working through those trade-offs now, I go deeper on these patterns in Engineering Lakehouses with Open Table Formats.

References

●  Chapter 8 of Engineering Lakehouses with Open Table Formats
●  Delta Lake Optimizations
●  Delta Lake Liquid Clustering
●  Apache Iceberg Partitioning and Hidden Partitioning
●  Apache Iceberg Spark Procedures (rewrite_data_files)
●  Apache Hudi Table Metadata
●  Apache Hudi Compaction
●  Apache Hudi File Sizing
●  Apache Hudi Clustering

Author Bio

Vinoth Govindarajan is a seasoned data expert and staff software engineer at Apple Inc., where he spearheads data platforms using open-source technologies like Iceberg, Spark, Trino, and Flink. Before this, he worked on designing incremental ETL frameworks for real-time data processing at Uber.
He is a dedicated contributor to the open source community in projects such as Apache Hudi and dbt-spark. As a thought leader, Vinoth has shared his expertise through speaking engagements at conferences such as dbt Coalesce and Hudi OSS community meetups. He has published several blogs on building open lakehouses. Holding a bachelor's degree in information technology, Vinoth has also authored multiple research papers published in journals such as those from IEEE.