Data Engineering | Tech News, Tutorials & Expert Insights

article-image-mastering-slowly-changing-dimensions-in-snowflake-data-warehouses

17 Feb 2026

10 min read

Mastering Slowly Changing Dimensions in Snowflake Data Warehouses

17 Feb 2026

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights—where every byte tells a story.Subscribe here to stay ahead in data engineeringIntroductionIn modern analytics environments, few modeling challenges are as important or as misunderstood as Slowly Changing Dimensions (SCDs). Business attributes do not stand still. Customers change status, relocate, upgrade tiers, and evolve over time. The real challenge is not just recording those changes. It is preserving the right version of history so that facts remain analytically correct.In Snowflake-powered data warehouses, modeling dimensions properly means balancing performance, maintainability, and historical accuracy. Should attributes be overwritten, versioned, split into mini-dimensions, or tracked with effective dates and surrogate keys?This article explores the full spectrum of SCD strategies from Type 0 through Type 7 using a practical CUSTOMER example. You will see how different change-tracking techniques impact reporting, how surrogate keys and validity intervals maintain point-in-time accuracy, and how Snowflake-native features can be used to implement these patterns efficiently.If you would like to follow along, the scripts used in these examples are available here: https://github.com/PacktPublishing/Data-Modeling-with-Snowflake2E/tree/main/ch14— Recommended Workshop —If you're building with LLMs, agents, or GenAI workflows, most failures aren’t model failures - they’re system design failures. The Machine Learning & Generative AI System Design Workshop by Sairam Sundaresan is a 4.5-hour live session focused on designing AI systems that survive production.Workshop Live on 28 February 2026. Learn practical system design frameworks, cost/latency/reliability trade-offs, and evaluation beyond simple accuracy metrics.Use code ML35 for 35% off →https://www.eventbrite.com/e/machine-learning-and-generative-ai-system-design-workshop-tickets-1975103644168This section is separate from the editorial content.Dimensions overviewA dimension unifies (or conforms) similar attributes from one or various source systems into a single table under a common unique identifier known as a business key. A single surrogate key can also be used in place of multi-column business or primary keys. The unique key of a dimension table plays a critical role in identifying dimension records and allows the database team to track and maintain changes over time. A dimension table can be structured in predetermined ways to allow for different types of change tracking depending on the business requirement.SCD typesAttributes within a dimension have differing requirements for durability and change tracking. Some attributes are updated directly, while others require historical snapshots, yet others cannot change at all. This section will cover the types of SCDs, or update profiles, that a given attribute in a dimension can have.It’s important to note that the dimension type may not necessarily apply across all dimension attributes equally. Within the same dimension table, some attributes may be overwritten while others may not. By understanding SCD types and when to use them, database developers can implement the proper table structure and update techniques to satisfy the organization’s reporting and analytics requirements.Example scenarioTo explain the various SCD types, we will use a simplified CUSTOMER dimension as an example and track the change as it would appear under each configuration.Suppose our fact table stores order details from customer X, made on the first of every month in2022. Thanks to X’s patronage, their customer status went from new to active midway through the year. Not only do we want to track when the change occurred but we want to tie the correct status to the recorded sales facts (that is, the customer is active today, but half their orders were made with a status of new).The change in customer status is displayed here as it currently appears in the source system and data warehouse landing area:Figure: A changed record in the source system and data warehouse raw schema With this scenario in mind, let’s explore the SCD types.Type 0: Maintain originalIronically, the first SCD—Type 0—does not change. Type 0 dimensions are intended for durable attributes that cannot change due to their business nature. Examples of Type 0 attributes include birth dates, calendar dates, and any attribute recorded at record creation that needs to be tracked as a baseline, such as original price, weight, or date of first login.Type 1: OverwriteType 1 attributes do not require historical tracking and may be directly overwritten with an UPDATE statement. Sometimes, the latest attribute value is all that the business cares about. For example, our organization demands the latest customer status, and previous values are irrelevant. Maintaining a Type 1 dimension is relatively simple—for example, if the status changes, it is updated directly in the customer dimension, as illustrated here:Figure: New updated for STATUS change in a Type 1 SCDHowever, overwriting values is often not enough—a historical value must also be preserved.Type 2: Add a new rowFor some attributes, an organization must register the latest value and maintain prior historical records. Type 2 attributes generate a new row every time a change is recorded. Generating new rows for a given business key means that uniqueness is violated unless a time dimension (the effective date) is added to the primary key. The effective date of a Type 2 SCD not only separates historical values for a given business key but also allows those records to be tied to fact tables at a given point in time.Maintaining a Type 2 SCD requires creating new rows when record changes are detected and additional metadata columns to track them. A single record in our example would generate the following change in a Type 2 table:Figure: New row generated for a change in a Type 2 SCDThe following metadata fields make working with Type 2 attributes easier:Validity intervals: Because the business key is being duplicated with each change, another column must be added to the primary key to maintain uniqueness. Validity intervals (also named valid_from/to, start/end_date) provide the additional unique value for the primary key and timestamp when the change occurred, allowing facts to be linked with the correct point-in-time dimension value. The TO_DATE column also provides a flag for identifying the latest record using the standard surrogate high date of 9999-12-31.Hash: Using a hashing function, such as MD5, provides a quick and standard way to identify when record changes occur. This concept is borrowed from Data Vault (discussed in Chapter 18, Scaling Data Models Through Modern Techniques). When there are many Type2 attributes in a table, instead of checking for changes one by one, hash all of them into a single column and compare them in a single go, as follows:Create the hash field: SELECT MD5 (Col1 || Col2 || ... || ColN) AS hashCompare the hash field: IFF(hash_new = hash_old, 'same', 'changed')Type 3: Add a new columnType 3 dimensions track changes by adding a new column to store the previous value when a change occurs. The original column is updated and not renamed to avoid breaking any existing downstream references. An effective date metadata column records the time of the change, allowing analytics processes to use the new or historical value based on their validity period.An example of a status update in a Type 3 attribute is given here:Figure: New row column created for a change in a Type 3 SCDAlthough Type 3 is easier to maintain than Type 2, the limitation is storing multiple changes. While Type 2 attributes can change as often as needed, generating new rows each time, Type 3 can only show one change without creating additional columns—not a scalable design if regular changes occur.Type 4: Add a mini dimensionWhen SCDs become quickly changing dimensions—due to rapidly changing attributes—the number of records that Type 2 dimensions generate can cause performance issues. This is especially true with dimensions containing many records—as in millions of rows or more.In a Type 4 scenario, the solution is to split the frequently changing attributes into a separate mini dimension. To further curtail the number of records, the values in the mini dimension can be banded within business-agreed value ranges that provide a meaningful breakdown for analysis. The mini dimension has its own surrogate key and does not contain the main dimension foreign key—allowing both to retain a relatively low cardinality. However, to tie the main dimension to the mini, the mini dimension foreign key must be included in the fact table (as the main dimension appears at the time of the generated fact).On a diagram, the arrangement of a Type 4 dimension would look like this:Figure: A Type 4 SCD on a relational diagramFor our example, the business wants to track the length in months for how long a customer has been active, as well as their total yearly spend at the time of each sale. To avoid generating a record for each month and order placed, the business teams have agreed to group the MONTHS_ACTIVE attribute into two categories (less than or greater than five months) and band the sales volume into three groups. The mini dimension would need to contain every possible (or allowable by existing business rules) combination of groupings.Our example would look like this (notice how the profile ID changes throughout the year as a function of the customer’s attributes):Figure: Mini dimension and foreign key in a fact table in a Type 4 SCDWhile this arrangement satisfies the reporting requirement, bridging dimension tables via a fact encumbers analysis on the dimension itself. To unify the main and mini dimensions into one, a Type 5 SCD is used.Type 5: Type 4 mini dimension + Type 1A Type 5 SCD is an extension of the Type 4 mini-dimension technique—adding the mini-dimension key as a Type 1 attribute in the main dimension (hence the name, 4+1 = 5). This approach affords the performance gains of a Type 4 dimension by avoiding the explosive growth of rapidly changing Type 2 records and gives users a simple way to unify the main dimension with the mini dimension through a common join column.On a diagram, the arrangement of a Type 5 dimension would look like this:Figure: A Type 5 SCD and related view on a relational diagramNotice that to further simplify the user experience, a view is created over the main and mini dimensions to give the users a single entity to work with. Analysis of the fact table becomes more versatile by allowing users to join on one entity (the view) instead of the main and mini dimensions if historical values are not required.The same scenario described in the section on Type 4 would look like this under Type 5:Figure: Mini-dimension and a related view in a Type 5 SCDUnfortunately, Type 4, and by extension, Type 5, suffer from the inconvenience of calculating the mini-dimension value to include it as part of each fact. The performance implications involved in adding the mini-dimension foreign key to the fact table should outweigh the performance gain in reducing the number of dimension records through the use of the mini dimension.Type 6: The Type 1, 2, and 3 hybridA Type 6 SCD is so named because it combines the techniques of Type 1, 2, and 3 (1+2+3 = 6) dimensions into one table. Based on business needs, users will demand different levels of historical values to achieve a balance of detail and flexibility in their analytics.Suppose our customer X from previous examples began to relocate—moving headquarters to Mexico in 2023, then to Brazil in 2024. A Type 6 approach yields a dimension table that gives analysts every possible temporal attribute value in every snapshot: a Type 1 current value, a Type 2 effective dated value, and a Type 3 previous value.To recap the status and country changes mentioned in this example, a snapshot of the source system over time is presented here:Figure: Source system showing changes for customer XIn a business scenario where the customer status needed Type 2 and the country was presented as Type 1, 2, and 3, the resulting table would look like this (the HASH column is now calculated as a function of STATUS and COUNTRY):Figure: Type 1, 2, and 3 columns combine in a Type 6 SCDType 7: Complete as-at flexibilityBusiness users across all cultures and industries have a penchant for changing their minds. The Type 7 approach gives database modelers a way to deliver the needed historical attribute no matter the criteria or temporal reference point requested.A Type 7 dimension (unimaginatively named as the number that follows 6) includes a natural key and a surrogate key in a Type 2 table structure and embeds both in the fact table.A method for generating surrogate keysAn efficient—and data vault-inspired—way to generate a surrogate key for Type 2 records is to use an MD5 hash on the compound primary key (in this example, CUSTOMER_ID and FROM_DATE):SELECT MD5(customer_id || from_date) AS customer_skeyIn a Type 7 configuration, a surrogate key is added to an otherwise Type 2 structure and is embedded in the fact (the latest SKEY as of the creation of each fact record). Based on the example scenario from the Type 6 section, the tables would look like this:Figure: A Type 7 SCD offers complete analytical flexibilityA Type 7 SCD allows business users to select the appropriate customer attributes based on the following criteria:The most recent or current information (that is, TO_DATE = '9999-12-31')The primary effective date on the fact record (that is, LOAD_DATE)When the user changes their mind, any date associated with the fact record (that is, ORDER_ DATE or SHIPMENT_DATE)Here is how those queries might look:--get current SELECT < fact and attribute fields > FROM order o INNER JOIN customer c USING(customer_id) WHERE c.to_date = '9999-12-31' --get dimension values as at the primary effective date on the fact record SELECT < fact and attribute fields > FROM order o INNER JOIN customer c USING(customer_skey) --get dimension values as-at any date on the fact record --example will use SHIPMENT_DATE SELECT < fact and attribute fields > FROM order o INNER JOIN customer c USING(customer_skey) AND o.shipment_date BETWEEN c.from_date AND c.to_dateNow that you have a general understanding of the different SCD types, let’s recap before detailing the Snowflake recipes used to construct them.Overview of SCD typesThe following screenshot summarizes the seven SCD types covered in the previous section, including their maintenance strategy and usage. While eight (including Type 0) SCDs may seem like a lot, most database designs rarely go beyond Type 3, as the first four SCD types strike an acceptable balance of performance, maintainability, and historical reporting needs:Figure: A comparison of SCD typesNow, let’s see how to build SCDs with maximal efficiency using Snowflake-specific features.ConclusionSlowly Changing Dimensions are not a one-size-fits-all solution. Some attributes should never change (Type 0), others can be safely overwritten (Type 1), and many require full historical tracking (Type 2). As business complexity grows, hybrid approaches such as Types 4, 5, 6, and 7 offer increasing levels of analytical flexibility. They help organizations answer not only what is true now, but also what was true at the time of each transaction.Choosing the right SCD type is ultimately a business decision supported by sound modeling principles. The goal remains consistent: preserve the integrity of historical analysis while keeping data structures scalable and performant.For a deeper dive into implementing SCDs in Snowflake, including practical SQL recipes, hashing strategies, dynamic approaches, and performance considerations, you can learn more in the book Data Modeling with Snowflake, Second Edition by Serge Gershkovich. Modeling guides are often steeped in theory. This book’s innovative approach combines practical modeling concepts with Snowflake best practices and unique features - allowing you to create efficient designs that leverage the power of the Data Cloud. Author BioSerge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.

0
0

article-image-acid-transactions-in-lakehouse-architectures-mvcc-occ-and-conflict-resolution-with-iceberg-hudi-and-delta-lake

Dipankar Mazumdar, Vinoth Govindarajan

29 Jan 2026

10 min read

ACID Transactions in Lakehouse Architectures: MVCC, OCC, and Conflict Resolution with Iceberg, Hudi, and Delta Lake

Dipankar Mazumdar, Vinoth Govindarajan

29 Jan 2026

10 min read

0
0

article-image-airflow-ops-best-practices-observation-and-monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey

12 Nov 2024

15 min read

Airflow Ops Best Practices: Observation and Monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey

12 Nov 2024

15 min read

This article is an excerpt from the book, "Apache Airflow Best Practices", by Dylan Intorf, Kendrick van Doorn, Dylan Storey. With practical approach and detailed examples, this book covers newest features of Apache Airflow 2.x and it's potential for workflow orchestration, operational best practices, and data engineering.IntroductionIn this article, we will continue to explore the application of modern “ops” practices within Apache Airflow, focusing on the observation and monitoring of your systems and DAGs after they’ve been deployed.We’ll divide this observation into two segments – the core Airflow system and individual DAGs. Each segment will cover specific metrics and measurements you should be monitoring for alerting and potential intervention.When we discuss monitoring in this section, we will consider two types of monitoring – active and suppressive.In an active monitoring scenario, a process will actively check a service’s health state, recording its state and potentially taking action directly on the return value.In a suppressive monitoring scenario, the absence of a state (or state change) is usually meaningful. In these scenarios, the monitored application sends an active schedule to a process to inform it that it is OK, usually suppressing an action (such as an alert) from occurring.This chapter covers the following topics:Monitoring core Airflow componentsMonitoring your DAGsTechnical requirementsBy now, we expect you to have a good understanding of Airflow and its core components, along with functional knowledge in the deployment and operation of Airflow and Airflow DAGs.We will not be covering specific observability aggregators or telemetry tools; instead, we will focus on the activities you should be keeping an eye on. We strongly recommend that you work closely with your ops teams to understand what tools exist in your stack and how to configure them for capture and alerting your deployments.Monitoring core Airflow componentsAll of the components we will discuss here are critical to ensuring a functioning Airflow deployment. Generally, all of them should be monitored with a bare minimum check of Is it on? and if a component is not, an alert should surface to your team for investigation. The easiest way to check this is to query the REST API on the web server at `/health/`; this will return a JSON object that can be parsed to determine whether components are healthy and, if not, when they were last seen.SchedulerThis component needs to be running and working effectively in order for tasks to be scheduled for execution.When the scheduler service is started, it also starts a `/health` endpoint that can be checked by an external process with an active monitoring approach.The returned signal does not always indicate that the scheduler is working properly, as its state is simply indicative that the service is up and running. There are many scenarios where the scheduler may be operating but unable to schedule jobs; as a result, many deployments will include a canary dag to their deployment that has a single task, acting to suppress an external alert from going off.Import metrics that airflow exposes for you include the following:scheduler.scheduler_loop_duration: This should be monitored to ensure that your scheduler is able to loop and schedule tasks for execution. As this metric increases, you will see tasks beginning to schedule more slowly, to the point where you may begin missing SLAs because tasks fail to reach a schedulable state.scheduler.tasks.starving: This indicates how many tasks cannot be scheduled because there are no slots available. Pools are a mechanism that Airflow uses to balance large numbers of submitted task executions versus a finite amount of execution throughput. It is likely that this number will not be zero, but being high for extended periods of time may point to an issue in how DAGs are being written to schedule work.scheduler.tasks.executable: This indicates how many tasks are ready for execution (i.e., queued). This number will sometimes not be zero, and that is OK, but if the number increases and stays high for extended periods of time, it indicates that you may need additional computer resources to handle the load. Look at your executor to increase the number of workers it can run. Metadata databaseThe metadata database is used to store and track all of the metadata for your Airflow deployments’ previous DAG/task executions, along with information about your environment’s roles and permissions. Losing data from this database can interrupt normal operations and cause unintended consequences, with DAG runs being repeated.While critical, because it is architecturally ubiquitous, the database is also least likely to encounter issues, and if it does, they are absolutely catastrophic in nature.We generally suggest you utilize a managed service for provisioning and operating your backing database, ensuring that a disaster recovery plan for your metadata database is in place at all times.Some active areas to monitor on your database include the following:Connection pool size/usage: Monitor both the connection pool size and usage over time to ensure appropriate configuration, and identify potential bottlenecks or resource contention arising from Airflow components’ concurrent connections.Query performance: Measure query latency to detect inefficient queries or performance issues, while monitoring query throughput to ensure effective workload handling by the database.Storage metrics: Monitor the disk space utilization of the metadata database to ensure that it has sufficient storage capacity. Set up alerts for low disk space conditions to prevent database outages due to storage constraints.Backup status: Monitor the status of database backups to ensure that they are performed regularly and successfully. Verify backup integrity and retention policies to mitigate the risk of data loss if there is a database failure.TriggererThe Triggerer instance manages all of the asynchronous operations of deferrable operators in a deferred state. As such, major operational concerns generally relate to ensuring that individual deferred operators don’t cause major blocking calls to the event loop. If this occurs, your deferrable tasks will not be able to check their state changes as frequently, and this will impact scheduling performance.Import metrics that airflow exposes for you include the following:triggers.blocked_main_thread: The number of triggers that have blocked the main thread. This is a counter and should monotonically increase over time; pay attention to large differences between recording (or quick acceleration) counts, as it’s indicative of a larger problem.triggers.running: The number of triggers currently on a triggerer instance. This metric should be monitored to determine whether you need to increase the number of triggerer instances you are running. While the official documentation claims that up to tens of thousands of triggers can be on an instance, the common operational number is much lower. Tune at your discretion, but depending on the complexity of your triggers, you may need to add a new instance for every few hundred consistent triggers you run.Executors/workersDepending on the executor you use, you will need to monitor your executors and workers a bit differently.The Kubernetes executor will utilize the Kubernetes API to schedule tasks for execution; as such, you should utilize the Kubernetes events and metrics servers to gather logs and metrics for your task instances. Common metrics to collect on an individual task are CPU and memory usage. This is crucial for tuning requests or mutating individual task resource requests to ensure that they execute safely.The Celery worker has additional components and long-lived processes that you need to metricize. You should monitor an individual Celery worker’s memory and CPU utilization to ensure that it is not over- or under-provisioned, tuning allocated resources accordingly. You also need to monitor the message broker (usually Redis or RabbitMQ) to ensure that it is appropriately sized. Finally, it is critical to measure the queue length of your message broker and ensure that too much “back pressure” isn’t being created in the system. If you find that your tasks are sitting in a queued state for a long period of time and the queue length is consistently growing, it’s a sign that you should start an additional Celery worker to execute on scheduled tasks. You should also investigate using the native Celery monitoring tool Flower (https://flower.readthedocs.io/en/latest/) for additional, more nuanced methods of monitoring.Web serverThe Airflow web server is the UI for not just your Airflow deployment but also the RESTful interface. Especially if you happen to be controlling Airflow scheduling behavior with API calls, you should keep an eye on the following metrics:Response time: Measure the time taken for the API to respond to requests. This metric indicates the overall performance of the API and can help identify potential bottlenecks.Error rate: Monitor the rate of errors returned by the API, such as 4xx and 5xx HTTP status codes. High error rates may indicate issues with the API implementation or underlying systems.Request rate: Track the rate of incoming requests to the API over time. Sudden spikes or drops in request rates can impact performance and indicate changes in usage patterns.System resource utilization: Monitor resource utilization metrics such as CPU, memory, disk I/O, and network bandwidth on the servers hosting the API. High resource utilization can indicate potential performance bottlenecks or capacity limits.Throughput: Measure the number of successful requests processed by the API per unit of time. Throughput metrics provide insights into the API’s capacity to handle incoming traffic.Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of an application, we need to monitor the actual DAGs themselves to ensure that they function as intended.Monitoring your DAGsThere are multiple aspects to monitoring your DAGs, and while they’re all valuable, they may not all be necessary. Take care to ensure that your monitoring and alerting stack match your organizational needs with regard to operational parameters for resiliency and, if there is a failure, recovery times. No matter how much or how little you choose to implement, knowing that your DAGs work and if and how they fail is the first step in fixing problems that will arise.LoggingAirflow writes logs for tasks in a hierarchical structure that allows you to see each task’s logs in the Airflow UI. The community also provides a number of providers to utilize other services for backing log storage and retrieval. A complete list of supported providers is available at https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/logging.html.Airflow uses the standard Python logging framework to write logs. If you’re writing custom operators or executing Python functions with a PythonOperator, just make sure that you instantiate a Python logger instance, and then the associated methods will handle everything for you.AlertingAirflow provides mechanisms for alerting on operational aspects of your executing workloads that can be configured within your DAG:Email notifications: Email notifications can be sent if a task is put into a marked or retry state with the `email_on_failure` or `email_on_retry` state, respectively. These arguments can be provided to all tasks in the DAG with the `default_args` key work in the DAG, or individual tasks by setting the keyword argument individually.Callbacks: Callbacks are special actions that are executed if a specific state change occurs. Generally, these callbacks should be thoughtfully leveraged to send alerts that are critical operationally:on_success_callback: This callback will be executed at both the task and DAG levels when entering a successful state. Unless it is critical that you know whether something succeeds, we generally suggest not using this for alerting.on_failure_callback: This callback is invoked when a task enters a failed state. Generally, this callback should always be set and, in critical scenarios, alert on failures that require intervention and support.on_execute_callback: This is invoked right before a task executes and only exists at the task level. Use sparingly for alerting, as it can quickly become a noisy alert when overused.on_retry_callback: This is invoked when a task is placed in a retry state. This is another callback to be cautious about as an alert, as it can become noisy and cause false alarms.sla_miss_callback: This is invoked when a DAG misses its defined SLA. This callback is only executed at the end of a DAG’s execution cycle so tends to be a very reactive notification that something has gone wrong.SLA monitoringAs awesome of a tool as Airflow is, it is a well-known fact in the community that SLAs, while largely functional, have some unfortunate details with regard to implementation that can make them problematic at best, and they are generally regarded as a broken feature in Airflow. We suggest that if you require SLA monitoring on your workflows, you deploy a CRON job monitoring tool such as healthchecks (https://github.com/healthchecks/healthchecks) that allows you to create suppressive alerts for your services through its rest API to manage SLAs. By pairing this third- party service with either HTTP operators or simple requests from callbacks, you can ensure that your most critical workflows achieve dynamic and resilient SLA alerting.Performance profilingThe Airflow UI is a great tool for profiling the performance of individual DAGs:The Gannt chart view: This is a great visualization for understanding the amount of time spent on individual tasks and the relative order of execution. If you’re worried about bottlenecks in your workflow, start here.Task duration: This allows you to profile the run characteristics of tasks within your DAG over a historical period. This tool is great at helping you understand temporal patterns in execution time and finding outliers in execution. Especially if you find that a DAG slows down over time, this view can help you understand whether it is a systemic issue and which tasks might need additional development.Landing times: This shows the delta between task completion and the start of the DAG run. This is an un-intuitive but powerful metric, as increases in it, when paired with stable task durations in upstream tasks, can help identify whether a scheduler is under heavy load and may need tuning.Additional metrics that have proven to be useful (but may need to be calculated) include the following:Task startup time: This is an especially useful metric when operating with a Kubernetes executor. To calculate this, you will need to calculate the difference between `start_date` and `execution_date` on each task instance. This metric will especially help you identify bottlenecks outside of Airflow that may impact task run times.Task failure and retry counts: Monitoring the frequency of task failures and retries can help identify information about the stability and robustness of your environment. Especially if these types of failure can be linked back to patterns in time or execution, it can help debug interactions with other services.DAG parsing time: Monitoring the amount of time a DAG takes to parse is very important to understand scheduler load and bottlenecks. If an individual DAG takes a long time to load (either due to heavy imports or long blocking calls being executed during parsing), it can have a material impact on the timeliness of scheduling tasks.ConclusionIn this article, we covered some essential strategies to effectively monitor both the core Airflow system and individual DAGs post-deployment. We highlighted the importance of active and suppressive monitoring techniques and provided insights into the critical metrics to track for each component, including the scheduler, metadata database, triggerer, executors/workers, and web server. Additionally, we discussed logging, alerting mechanisms, SLA monitoring, and performance profiling techniques to ensure the reliability, scalability, and efficiency of Airflow workflows. By implementing these monitoring practices and leveraging the insights gained, operators can proactively manage and optimize their Airflow deployments for optimal performance and reliability.Author BioDylan Intorf is a solutions architect and data engineer with a BS from Arizona State University in Computer Science. He has 10+ years of experience in the software and data engineering space, delivering custom tailored solutions to Tech, Financial, and Insurance industries.Kendrick van Doorn is an engineering and business leader with a background in software development, with over 10 years of developing tech and data strategies at Fortune 100 companies. In his spare time, he enjoys taking classes at different universities and is currently an MBA candidate at Columbia University.Dylan Storey has a B.Sc. and M.Sc. from California State University, Fresno in Biology and a Ph.D. from University of Tennessee, Knoxville in Life Sciences where he leveraged computational methods to study a variety of biological systems. He has over 15 years of experience in building, growing, and leading teams; solving problems in developing and operating data products at a variety of scales and industries.

2
0
62994

article-image-essential-sql-for-data-engineers

Kedeisha Bryan, Taamir Ransome

31 Oct 2024

10 min read

Essential SQL for Data Engineers

Kedeisha Bryan, Taamir Ransome

31 Oct 2024

10 min read

This article is an excerpt from the book, Cracking the Data Engineering Interview, by Kedeisha Bryan, Taamir Ransome. The book is a practical guide that’ll help you prepare to successfully break into the data engineering role. The chapters cover technical concepts as well as tips for resume, portfolio, and brand building to catch the employer's attention, while also focusing on case studies and real-world interview questions.Introduction In the world of data engineering, SQL is the unsung hero that empowers us to store, manipulate, transform, and migrate data easily. It is the language that enables data engineers to communicate with databases, extract valuable insights, and shape data to meet their needs. Regardless of the nature of the organization or the data infrastructure in use, a data engineer will invariably need to use SQL for creating, querying, updating, and managing databases. As such, proficiency in SQL can often the difference between a good data engineer and a great one. Whether you are new to SQL or looking to brush up your skills, this chapter will serve as a comprehensive guide. By the end of this chapter, you will have a solid understanding of SQL as a data engineer and be prepared to showcase your knowledge and skills in an interview setting. In this article, we will cover the following topics: Must-know foundational SQL concepts Must-know advanced SQL concepts Technical interview questions Must-know foundational SQL concepts In this section, we will delve into the foundational SQL concepts that form the building blocks of data engineering. Mastering these fundamental concepts is crucial for acing SQL-related interviews and effectively working with databases. Let’s explore the critical foundational SQL concepts every data engineer should be comfortable with, as follows: SQL syntax: SQL syntax is the set of rules governing how SQL statements should be written. As a data engineer, understanding SQL syntax is fundamental because you’ll be writing and reviewing SQL queries regularly. These queries enable you to extract, manipulate, and analyze data stored in relational databases. SQL order of operations: The order of operations dictates the sequence in which each of the following operators is executed in a query: FROM and JOIN WHERE GROUP BY HAVING SELECT DISTINCT ORDER BY LIMIT/OFFSET Data types: SQL supports a variety of data types, such as INT, VARCHAR, DATE, and so on. Understanding these types is crucial because they determine the kind of data that can be stored in a column, impacting storage considerations, query performance, and data integrity. As a data engineer, you might also need to convert data types or handle mismatches. SQL operators: SQL operators are used to perform operations on data. They include arithmetic operators (+, -, *, /), comparison operators (>, <, =, and so on), and logical operators (AND, OR, and NOT). Knowing these operators helps you construct complex queries to solve intricate data-related problems. Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) commands: DML commands such as SELECT, INSERT, UPDATE, and DELETE allow you to manipulate data stored in the database. DDL commands such as CREATE, ALTER, and DROP enable you to manage database schemas. DCL commands such as GRANT and REVOKE are used for managing permissions. As a data engineer, you will frequently use these commands to interact with databases. Basic queries: Writing queries to select, filter, sort, and join data is an essential skill for any data engineer. These operations form the basis of data extraction and manipulation. Aggregation functions: Functions such as COUNT, SUM, AVG, MAX, MIN, and GROUP BY are used to perform calculations on multiple rows of data. They are essential for generating reports and deriving statistical insights, which are critical aspects of a data engineer’s role. The following section will dive deeper into must-know advanced SQL concepts, exploring advanced techniques to elevate your SQL proficiency. Get ready to level up your SQL game and unlock new possibilities in data engineering! Must-know advanced SQL concepts This section will explore advanced SQL concepts that will elevate your data engineering skills to the next level. These concepts will empower you to tackle complex data analysis, perform advanced data transformations, and optimize your SQL queries. Let’s delve into must-know advanced SQL concepts, as follows: Window functions: These do a calculation on a group of rows that are related to the current row. They are needed for more complex analyses, such as figuring out running totals or moving averages, which are common tasks in data engineering. Subqueries: Queries nested within other queries. They provide a powerful way to perform complex data extraction, transformation, and analysis, often making your code more efficient and readable. Common Table Expressions (CTEs): CTEs can simplify complex queries and make your code more maintainable. They are also essential for recursive queries, which are sometimes necessary for problems involving hierarchical data. Stored procedures and triggers: Stored procedures help encapsulate frequently performed tasks, improving efficiency and maintainability. Triggers can automate certain operations, improving data integrity. Both are important tools in a data engineer’s toolkit. Indexes and optimization: Indexes speed up query performance by enabling the database to locate data more quickly. Understanding how and when to use indexes is key for a data engineer, as it affects the efficiency and speed of data retrieval. Views: Views simplify access to data by encapsulating complex queries. They can also enhance security by restricting access to certain columns. As a data engineer, you’ll create and manage views to facilitate data access and manipulation. By mastering these advanced SQL concepts, you will have the tools and knowledge to handle complex data scenarios, optimize your SQL queries, and derive meaningful insights from your datasets. The following section will prepare you for technical interview questions on SQL. We will equip you with example answers and strategies to excel in SQL-related interview discussions. Let’s further enhance your SQL expertise and be well prepared for the next phase of your data engineering journey. Technical interview questions This section will address technical interview questions specifically focused on SQL for data engineers. These questions will help you demonstrate your SQL proficiency and problem-solving abilities. Let’s explore a combination of primary and advanced SQL interview questions and the best methods to approach and answer them, as follows: Question 1: What is the difference between the WHERE and HAVING clauses? Answer: The WHERE clause filters data based on conditions applied to individual rows, while the HAVING clause filters data based on grouped results. Use WHERE for filtering before aggregating data and HAVING for filtering after aggregating data. Question 2: How do you eliminate duplicate records from a result set? Answer: Use the DISTINCT keyword in the SELECT statement to eliminate duplicate records and retrieve unique values from a column or combination of columns. Question 3: What are primary keys and foreign keys in SQL? Answer: A primary key uniquely identifies each record in a table and ensures data integrity. A foreign key establishes a link between two tables, referencing the primary key of another table to enforce referential integrity and maintain relationships. Question 4: How can you sort data in SQL? Answer: Use the ORDER BY clause in a SELECT statement to sort data based on one or more columns. The ASC (ascending) keyword sorts data in ascending order, while the DESC (descending) keyword sorts it in descending order. Question 5: Explain the difference between UNION and UNION ALL in SQL. Answer: UNION combines and removes duplicate records from the result set, while UNION ALL combines all records without eliminating duplicates. UNION ALL is faster than UNION because it does not involve the duplicate elimination process. Question 6: Can you explain what a self join is in SQL? Answer: A self join is a regular join where a table is joined to itself. This is often useful when the data is related within the same table. To perform a self join, we have to use table aliases to help SQL distinguish the left from the right table. Question 7: How do you optimize a slow-performing SQL query? Answer: Analyze the query execution plan, identify bottlenecks, and consider strategies such as creating appropriate indexes, rewriting the query, or using query optimization techniques such as JOIN order optimization or subquery optimization. Question 8: What are CTEs, and how do you use them? Answer: CTEs are temporarily named result sets that can be referenced within a query. They enhance query readability, simplify complex queries, and enable recursive queries. Use the WITH keyword to define CTEs in SQL. Question 9: Explain the ACID properties in the context of SQL databases. Answer: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These are basic properties that make sure database operations are reliable and transactional. Atomicity makes sure that a transaction is handled as a single unit, whether it is fully done or not. Consistency makes sure that a transaction moves the database from one valid state to another. Isolation makes sure that transactions that are happening at the same time don’t mess with each other. Durability makes sure that once a transaction is committed, its changes are permanent and can survive system failures. Question 10: How can you handle NULL values in SQL? Answer: Use the IS NULL or IS NOT NULL operator to check for NULL values. Additionally, you can use the COALESCE function to replace NULL values with alternative non-null values. Question 11: What is the purpose of stored procedures and functions in SQL? Answer: Stored procedures and functions are reusable pieces of SQL code encapsulating a set of SQL statements. They promote code modularity, improve performance, enhance security, and simplify database maintenance. Question 12: Explain the difference between a clustered and a non-clustered index. Answer: The physical order of the data in a table is set by a clustered index. This means that a table can only have one clustered index. The data rows of a table are stored in the leaf nodes of a clustered index. A non-clustered index, on the other hand, doesn’t change the order of the data in the table. After sorting the pointers, it keeps a separate object in a table that points back to the original table rows. There can be more than one non-clustered index for a table. Prepare for these interview questions by understanding the underlying concepts, practicing SQL queries, and being able to explain your answers. ConclusionThis article explored the foundational and advanced principles of SQL that empower data engineers to store, manipulate, transform, and migrate data confidently. Understanding these concepts has unlocked the door to seamless data operations, optimized query performance, and insightful data analysis. SQL is the language that bridges the gap between raw data and valuable insights. With a solid grasp of SQL, you possess the skills to navigate databases, write powerful queries, and design efficient data models. Whether preparing for interviews or tackling real-world data engineering challenges, the knowledge you have gained in this chapter will propel you toward success. Remember to continue exploring and honing your SQL skills. Stay updated with emerging SQL technologies, best practices, and optimization techniques to stay at the forefront of the ever-evolving data engineering landscape. Embrace the power of SQL as a critical tool in your data engineering arsenal, and let it empower you to unlock the full potential of your data. Author BioKedeisha Bryan is a data professional with experience in data analytics, science, and engineering. She has prior experience combining both Six Sigma and analytics to provide data solutions that have impacted policy changes and leadership decisions. She is fluent in tools such as SQL, Python, and Tableau.She is the founder and leader at the Data in Motion Academy, providing personalized skill development, resources, and training at scale to aspiring data professionals across the globe. Her other works include another Packt book in the works and an SQL course for LinkedIn Learning.Taamir Ransome is a Data Scientist and Software Engineer. He has experience in building machine learning and artificial intelligence solutions for the US Army. He is also the founder of the Vet Dev Institute, where he currently provides cloud-based data solutions for clients. He holds a master's degree in Analytics from Western Governors University.

2
0
44542

article-image-connecting-cloud-object-storage-with-databricks-unity-catalog

Pulkit Chadha

22 Oct 2024

10 min read

Connecting Cloud Object Storage with Databricks Unity Catalog

Pulkit Chadha

22 Oct 2024

10 min read

0
0
47916

article-image-mastering-semi-structured-data-in-snowflake

Serge Gershkovich

27 Jun 2024

7 min read

Mastering Semi-Structured Data in Snowflake

Serge Gershkovich

27 Jun 2024

7 min read

This article is an excerpt from the book, Data Modeling with Snowflake, by Serge Gershkovich. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.Introduction In the era of big data, the ability to efficiently manage and analyze semi-structured data is crucial for businesses. Snowflake, a leading cloud-based data platform, offers robust features to handle semi-structured data formats like JSON, Avro, and Parquet. This article explores the benefits of using the VARIANT data type in Snowflake and provides a hands-on guide to managing semi-structured data.The Benefits of Semi-Structured Data in Snowflake Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, where a precise entity structure must be predefined, semi-structured data can adapt to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects. For example, consider the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. Some contacts may have multiple phone numbers, while others have only one. Some entries might include an email address and street address, while others have just a number and a vague description. To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational table. Snowflake optimizes how VARIANT data is stored internally, ensuring better compression and faster access. Semi-structured data can sit alongside relational data in the same table, and users can access it using basic extensions to standard SQL, achieving similar performance. Another compelling reason to use the VARIANT data type is its adaptability to change. If columns are added or removed from semi-structured data, there is no need to modify ELT (extract, load, and transform) pipelines. The VARIANT data type does not require schema changes, and read operations will not fail for an attribute that no longer exists.Getting Hands-On with Semi-Structured Data Let's delve into a practical example of working with semi-structured data in Snowflake. This example uses JSON data representing information about pirates, such as details about the crew, weapons, and their ship. All this information is stored in a single VARIANT data type column. In relational data, a row represents a single entity; in semi-structured data, a row can represent an entire file containing multiple entities. Creating a Table for Semi-Structured Data Here is a sample SQL script to create a table with semi-structured data:CREATE TABLE pirates_data ( id NUMBER AUTOINCREMENT PRIMARY KEY, load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, data VARIANT ); ``` In this example, the `AUTOINCREMENT` keyword generates a unique ID for each record inserted, and the `VARIANT` column stores the semi-structured JSON data.Loading Semi-Structured Data To load semi-structured data into Snowflake, you can use the `COPY INTO` command. Here’s an example of how to load JSON data from an external stage into the `pirates_data` table:COPY INTO pirates_data FROM @my_stage/pirates_data.json FILE_FORMAT = (TYPE = 'JSON'); ```Querying Semi-Structured Data Once the data is loaded, you can query it using standard SQL. For instance, to extract specific attributes from the JSON data, you can use the dot notation: SELECT data:id::NUMBER AS pirate_id, data:crew AS crew, data:weapons AS weapons FROM pirates_data; ```This query extracts the `id`, `crew`, and `weapons` fields from the JSON data stored in the `data` column.Converting Semi-Structured Data into Relational Data Although semi-structured data offers flexibility, converting it into a relational format can provide better performance for certain queries. Snowflake allows you to transform VARIANT data into relational columns using the `FLATTEN` function. Here's an example of how to flatten a JSON array into a relational table:SELECT value:id::NUMBER AS pirate_id, value:name::STRING AS name, value:rank::STRING AS rank FROM pirates_data, LATERAL FLATTEN(input => data:crew); ``` This query converts the `crew` array from the JSON data into individual rows in a relational format, making it easier to query and analyze.Schema-on-Read vs. Schema-on-Write One of the main advantages of using the VARIANT data type in Snowflake is the flexibility of schema-on-read. This approach allows you to ingest data without a predefined schema, and then define the schema at the time of reading the data. This contrasts with the traditional schema-on-write approach, where the schema must be defined before data ingestion.Benefits of Schema-on-ReadFlexibility: You can ingest data without worrying about its structure, which is particularly useful for unstructured or semi-structured data sources.Adaptability: Schema changes do not require re-ingestion of data, as the schema is applied at read time.Speed: Data can be loaded more quickly, as there is no need to enforce a schema during the ingestion process.Example: Using Schema-on-Read with VARIANT Data Here’s an example demonstrating schema-on-read with semi-structured data in Snowflake: SELECT data:id::NUMBER AS pirate_id, data:ship.name::STRING AS ship_name, data:ship.type::STRING AS ship_type FROM pirates_data; ```In this query, the schema is defined at read time, allowing you to extract specific attributes from the nested JSON data.Handling Nested and Repeated Data Snowflake’s support for semi-structured data also extends to handling nested and repeated data structures. The FLATTEN function is particularly useful for working with such data, enabling you to transform nested arrays into a more manageable relational format.Example: Flattening Nested Data Consider a JSON structure where each pirate has a nested array of previous voyages. To flatten this nested data, you can use the following query: SELECT data:id::NUMBER AS pirate_id, value:date::DATE AS voyage_date, value:destination::STRING AS voyage_destination FROM pirates_data, LATERAL FLATTEN(input => data:previous_voyages); ```This query extracts the nested `previous_voyages` array and converts it into individual rows in a relational format.Performance Considerations When working with semi-structured data in Snowflake, it’s important to consider performance implications. While the VARIANT data type offers flexibility, it can also introduce overhead if not managed properly.Tips for Optimizing PerformanceUse Caching: Take advantage of Snowflake’s caching mechanisms to reduce query times for frequently accessed data.Optimize Queries: Write efficient SQL queries, avoiding unnecessary complexity and ensuring that only the required data is processed.Monitor Usage: Regularly monitor your Snowflake usage and performance metrics to identify and address potential bottlenecks.ConclusionHandling semi-structured data in Snowflake using the VARIANT data type provides immense flexibility and performance benefits. Whether you are dealing with dynamically changing schemas or integrating semi-structured data with relational data, Snowflake’s capabilities can significantly enhance your data management and analytics workflows. By leveraging the techniques outlined in this article, you can efficiently model and transform semi-structured data, unlocking new insights and value for your organization.For more detailed guidance and advanced techniques, refer to the book "Data Modeling with Snowflake," which provides comprehensive insights into modern data modeling practices and Snowflake’s powerful features.Author BioSerge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.

0
0
19613

How-To Tutorials - Data Engineering

Mastering Slowly Changing Dimensions in Snowflake Data Warehouses

ACID Transactions in Lakehouse Architectures: MVCC, OCC, and Conflict Resolution with Iceberg, Hudi, and Delta Lake

Airflow Ops Best Practices: Observation and Monitoring

Essential SQL for Data Engineers

Connecting Cloud Object Storage with Databricks Unity Catalog

Mastering Semi-Structured Data in Snowflake

Trending Topics

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access