How to Prepare Data Using AWS Glue

Introduction

Preparing data for analytics can become challenging as organizations deal with growing data volumes, varied data sources, and increasingly complex transformation requirements. AWS Glue helps simplify this process by offering both visual and code-based approaches to data preparation. In this article, we explore how AWS Glue Studio enables users to build ETL workflows visually, apply transformations, configure data quality checks, and prepare datasets for downstream analytics without needing to manage infrastructure.

Data preparation using AWS Glue

Data in a typical organization grows continuously in both volume and complexity, driven by the large number of applications and devices that generate it. Ingesting and preparing this ever-growing data requires substantial resources, both in people and in compute.

AWS Glue makes it easy for individuals with varying levels of skill to collaborate on data preparation tasks. For instance, novice users with no programming skills can take advantage of AWS Glue Studio (https://docs.aws.amazon.com/glue/latest/dg/author-job-glue.html), a visual interface that lets them interact with and prepare data using a variety of pre-built transformations and filtering mechanisms without writing any code. AWS Glue Studio also allows advanced users to author custom transformations to achieve desired outcomes.

While AWS Glue Studio is a great tool for preparing data using a graphical user interface (GUI), there are use cases where the built-in transformations may not be flexible enough, or where the user may prefer a programmatic approach over the GUI-based approach. In such cases, AWS Glue enables users to prepare data using AWS Glue ETL. Users can leverage AWS Glue Studio to author, execute, and monitor ETL workloads. Although Glue Studio offers a GUI, users may still need working knowledge of AWS Glue’s transformation extensions and APIs to implement data preparation workloads, especially when implementing custom transformations using SQL or source code.

Now that we know about the different data preparation options that are available in AWS Glue, let’s dive deep into each of them while looking at practical examples to understand them.

Visual data preparation using AWS Glue Studio

AWS Glue makes it possible to prepare data using a visual interface through AWS Glue Studio. Previously, the preferred approach was to use AWS Glue DataBrew for visual data preparation, as highlighted in the previous edition of this book. However, the features available in AWS Glue DataBrew have since been implemented in AWS Glue Studio, so all users, regardless of their skill level, can use AWS Glue Studio to build their ETL workflows through a unified interface. AWS Glue Studio allows us to author recipes similar to those in AWS Glue DataBrew and even to import recipes that were built using AWS Glue DataBrew.

Getting started with AWS Glue Studio is quite simple. To author a new job using visual ETL, you can use the Visual ETL option in the AWS Glue Studio UI to open the visual job editor. In this interface, you can start designing the ETL workflow based on your requirements by adding required data sources, transformations, and target nodes with a simple drag-and-drop.

As mentioned in the previous chapters, we can also ingest data from a wide range of external Software-as-a-Service (SaaS) providers via Amazon AppFlow. Several external SaaS providers are supported through native connectors, including Adobe Analytics, Asana, Datadog, Google Analytics, Dynatrace, Marketo, Salesforce, ServiceNow, Slack, and Zendesk. This data can be further integrated with datasets from other data stores or SaaS applications, which helps users take a holistic approach to analyzing and gathering insights from datasets that are spread across different data stores or SaaS platforms. The following screenshot shows the visual editor interface available in AWS Glue Studio, which allows users to drag and drop different components of an ETL job:

Figure 4.1: AWS Glue Studio Visual Job Editor Interface

In the visual editor interface (Figure 4.1), we start by dragging and dropping the data source(s) we wish to read from, then add transformations and configure each node as per our requirements. Adding transformations is optional if we are only moving data from one location to another. Once we are happy with the transformations, we can add data target(s) to our job to write the output. Finally, we provide a name for the job and select an IAM role for execution under the job details tab, after which the job is ready to be saved. If our job requires an AWS Glue connection, for instance to a relational database, an on-premises data source, or a SaaS data source, we can specify the connections to be included under the job details tab.

Let’s build an ETL job using the visual ETL editor that reads from an Amazon Redshift data warehouse, applies simple transformations, and writes the data to an Amazon S3 location. Before we begin authoring the job, let’s set up an AWS Glue Connection to connect to our data warehouse cluster. To create an AWS Glue Connection, use the Connections option under Data Catalog in the sidebar and click Create connection. On the data source selection page, search for Amazon Redshift and click Next. If your Amazon Redshift cluster is in a different AWS account, use a JDBC data source instead. Configure the connection by specifying the cluster details and IAM role. Review the connection details and save the connection. The following screenshot (Figure 4.2) shows a sample AWS Glue connection configuration used to connect to an Amazon Redshift cluster:

Figure 4.2: AWS Glue Connection Configuration page
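
For readers who prefer to script this step, the following is a minimal sketch of how a JDBC-style connection definition for Amazon Redshift might be assembled in Python. The helper name, connection name, host, database, and credentials are all hypothetical placeholders, and a real connection to a cluster inside a VPC would typically also need network settings (subnet and security groups), which are omitted here.

```python
import json


def build_redshift_connection_input(name, host, port, database, username, password):
    """Build a ConnectionInput payload for a JDBC-style AWS Glue connection.

    Every value passed in here is a placeholder chosen for illustration.
    """
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Amazon Redshift JDBC URLs follow jdbc:redshift://host:port/database
            "JDBC_CONNECTION_URL": f"jdbc:redshift://{host}:{port}/{database}",
            "USERNAME": username,
            "PASSWORD": password,
        },
    }


connection_input = build_redshift_connection_input(
    name="example-redshift-connection",  # hypothetical connection name
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    database="dev",
    username="example_user",
    password="example-password",  # placeholder; avoid hard-coding real secrets
)
print(json.dumps(connection_input, indent=2))

# With boto3 installed and AWS credentials configured, the payload could be
# submitted along these lines:
#   import boto3
#   boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

Building the payload in a small function keeps the connection details in one place, so the same definition can be reviewed before it is submitted.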

Once the connection is saved, let’s head back to the visual editor and add an Amazon Redshift Source node from the Node picker. Click on the added node and select the connection that you created in the previous step. You can choose to select a single table or enter a custom query. In the Data preview section, you can select an IAM role and start a session to preview your data. The following screenshot (Figure 4.3) shows how data can be previewed at each node in the graph to get an idea of the resultant dataset:

Figure 4.3: Data Preview of a node in visual editor

Use the node picker to add the necessary transformations. To transform data using an AWS Glue DataBrew-style editor, click the plus button, add the Data Preparation Recipe transform from the node picker, and click on the added node to display the node properties. Click Author Recipe to open the familiar grid interface and begin creating the data preparation recipe.

Figure 4.4: Author Recipe button available in Data Preparation Recipe transform

In our example, we are adding a filter on quantity of items sold to fetch sales records with a quantity of 3 or higher. The following screenshot (Figure 4.5) shows the filter transformation in the data preparation recipe authoring window:

Figure 4.5: Applying transformations in Data Preparation Recipe interface

Once all the necessary transformations are applied, you can use the Done authoring recipe button to exit the grid interface and return to the AWS Glue Studio visual editor. We can then add target node(s) to the graph so that the output is saved to the target data store. In our example, we configure the target node to write the output in Parquet format by selecting Parquet in the Format field of the target node properties, save it to an Amazon S3 location, and create a table in the AWS Glue Data Catalog. The following screenshot (Figure 4.6) shows this configuration:

Figure 4.6: Amazon S3 target data store configuration

Now that we have defined the data source, transformations, and a data target, we can save the ETL job, and it is ready to be executed.

Note: The Data Preparation Recipe transformation can be used only with certain versions of AWS Glue. As of writing this book, we can use AWS Glue version 4.0 or higher to execute ETL jobs with such transformations.

While the Data Preparation Recipe transformation offers a familiar grid interface, if you are not particularly interested in that experience, you can choose to add transformations directly from the node picker. For example, to achieve the same outcome as the job we created, we can directly add the Filter transformation after we add the Amazon Redshift data source in the graph instead.

The following screenshot (Figure 4.7) shows how we can configure the Filter transformation to achieve the same outcome as the data preparation recipe:

Figure 4.7: Visual ETL without using Data preparation recipe transformation
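
The filter condition used in this example (a quantity of 3 or higher) can also be written as a small predicate function. The sketch below runs on plain Python dictionaries so it can be tried anywhere; the `salesid` field, the sample rows, and the `sales_dyf` name in the closing comment are assumptions made for illustration, while `qtysold` is the sales quantity column used in this example.

```python
def has_min_quantity(record, minimum=3):
    """Return True when the record's qtysold value is at least `minimum`."""
    quantity = record["qtysold"]
    # Treat missing quantities as not matching instead of raising an error.
    return quantity is not None and quantity >= minimum


# Hypothetical sample rows standing in for the sales table.
sample_sales = [
    {"salesid": 1, "qtysold": 4},
    {"salesid": 2, "qtysold": 2},
    {"salesid": 3, "qtysold": 3},
    {"salesid": 4, "qtysold": None},
]

filtered_sales = [row for row in sample_sales if has_min_quantity(row)]
print(filtered_sales)

# In an AWS Glue ETL script, the same predicate could be handed to the
# built-in Filter transform, assuming sales_dyf is an existing DynamicFrame:
#   from awsglue.transforms import Filter
#   filtered_dyf = Filter.apply(frame=sales_dyf, f=has_min_quantity)
```

Keeping the condition in a named function makes it easy to test locally before it is used inside a job.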

While the ETL job we designed will work as expected, it is always a good idea to protect our workload from noisy data. For example, if the sales quantity column doesn’t exist in our source data, if the column is not an integer as we expected, or if it contains null values, we should be able to detect such anomalies and take the necessary actions. AWS Glue enables users to use Evaluate Data Quality to create data quality rules that evaluate the output. During this process, it emits events to Amazon EventBridge, which can be captured and acted upon. The configuration is quite flexible. You can enrich the existing dataset with data quality information or choose to write the quality information to a separate destination. You can also emit Amazon CloudWatch metrics for data quality checks to track how a job is trending with regard to data quality. Additionally, you can decide what to do with the job if the evaluation fails. For example, you can choose to fail the job before or after loading the data into the target data store, or continue with the job without failing it.
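
As a rough illustration of capturing these events, the sketch below defines an event pattern for failed data quality evaluations and checks it against a sample event. The `source`, `detail-type`, and `state` values are written as recalled from the AWS Glue Data Quality documentation and should be verified there before use. The `matches` helper is a hypothetical, heavily simplified stand-in for EventBridge's own pattern matching, included only so the example can run locally.

```python
import json

# Event pattern for failed data quality evaluations (values are assumptions
# to be checked against the AWS documentation).
dq_failure_pattern = {
    "source": ["aws.glue-dataquality"],
    "detail-type": ["Data Quality Evaluation Results Available"],
    "detail": {"state": ["FAILED"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must be present in the event,
    and each leaf value must be one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, trimmed-down event used only to exercise the matcher.
sample_event = {
    "source": "aws.glue-dataquality",
    "detail-type": "Data Quality Evaluation Results Available",
    "detail": {"state": "FAILED"},
}

print(json.dumps(dq_failure_pattern, indent=2))
print(matches(dq_failure_pattern, sample_event))
```

A pattern like this would normally be attached to an EventBridge rule whose target (for example, a notification topic) reacts to the failure.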

To implement data quality rules for the sales quantity scenario described earlier, you can add the Evaluate Data Quality transform from the node picker (or use the Edit Data Quality Configuration button in the target node).

Figure 4.8: Evaluate Data Quality transform available in the node picker

Data Quality rules can be defined using Data Quality Definition Language (DQDL) syntax. Detailed documentation on DQDL can be found at https://docs.aws.amazon.com/glue/latest/dg/dqdl.html. The rule we described earlier will look like this:

Rules = [
    # check if col exists
    ColumnExists "qtysold",
    # check for null values
    IsComplete "qtysold",
    # Make sure the values are integers
    ColumnDataType "qtysold" = "Integer"
]

The following screenshot (Figure 4.9) shows how we can configure the Evaluate Data Quality transform to implement the above data quality rules:

Figure 4.9: Evaluate Data Quality Transformation
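
The DQDL ruleset shown earlier can also be assembled programmatically, which may help when the same checks need to be applied to several columns. The following is a minimal sketch; the `build_quantity_ruleset` helper and the ruleset name in the closing comment are hypothetical, and the three rule types are the ones used in this example.

```python
def build_quantity_ruleset(column):
    """Return a DQDL ruleset string with existence, completeness, and
    integer-type checks for the given column."""
    rules = [
        f'ColumnExists "{column}"',
        f'IsComplete "{column}"',
        f'ColumnDataType "{column}" = "Integer"',
    ]
    # DQDL rules are comma-separated inside "Rules = [ ... ]".
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


ruleset = build_quantity_ruleset("qtysold")
print(ruleset)

# With boto3 installed and AWS credentials configured, the ruleset could be
# registered along these lines (the ruleset name is a placeholder):
#   import boto3
#   boto3.client("glue").create_data_quality_ruleset(
#       Name="example_sales_ruleset", Ruleset=ruleset
#   )
```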

Now that we have seen how we can use the visual editor to build an ETL job from scratch, in the next section we will be exploring how we can use AWS Glue to build ETL jobs using source code, and we will also see some of the built-in transformations available in AWS Glue with examples.

Source code-based approach to data preparation using AWS Glue

While AWS Glue Studio primarily offers a visual, interface-based approach to data preparation tasks in a data integration workflow, it can also be used to author complex ETL workflows using advanced (and even custom) transformations. AWS Glue ETL generally requires some Glue/Spark programming knowledge to implement ETL jobs, but in return it offers much more flexibility than the grid interface of the Data Preparation Recipe. With the data preparation recipe approach, we can only use pre-built transformations to prepare data. Since there are no such restrictions in AWS Glue ETL, we can design and develop custom transformations based on our requirements using existing Glue/Spark ETL APIs and extensions.
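
To give a flavor of what a custom transformation can look like, the sketch below defines a per-record function that derives a new column. It runs on plain Python dictionaries here; the `sale_size` column, the threshold of 10, the `salesid` field, and the `sales_dyf` name in the closing comment are all assumptions made for illustration and are not part of the example job.

```python
def add_sale_size(record, bulk_threshold=10):
    """Add a sale_size field labelling the sale as 'bulk' or 'standard'."""
    is_bulk = record["qtysold"] >= bulk_threshold
    record["sale_size"] = "bulk" if is_bulk else "standard"
    return record


# Hypothetical sample rows standing in for the sales table.
sample_sales = [
    {"salesid": 1, "qtysold": 12},
    {"salesid": 2, "qtysold": 3},
]

# Copy each row so the sample data itself is left unchanged.
labelled_sales = [add_sale_size(dict(row)) for row in sample_sales]
print(labelled_sales)

# In an AWS Glue ETL script, a record-level function like this could be
# applied with the built-in Map transform, assuming sales_dyf is an existing
# DynamicFrame:
#   from awsglue.transforms import Map
#   labelled_dyf = Map.apply(frame=sales_dyf, f=add_sale_size)
```

Logic of this kind has no direct equivalent among the pre-built recipe steps, which is where the source code-based approach becomes useful.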

Conclusion

AWS Glue provides a flexible and scalable way to prepare data, whether users prefer a visual interface through AWS Glue Studio or a more programmatic approach using AWS Glue ETL. With support for drag-and-drop job creation, data preparation recipes, built-in transformations, data quality rules, and integration with services such as Amazon S3 and Amazon Redshift, AWS Glue helps teams streamline data preparation across different skill levels and use cases.

This article is an excerpt from the book Serverless ETL and Analytics with AWS Glue, Second Edition, which offers a deeper look at building, managing, and optimizing data integration workflows using AWS Glue.

Author Bio

Subramanya Vajiraya is a Senior Cloud Engineer at AWS Sydney specialized in AWS Glue. He obtained his Bachelor of Engineering degree focused on Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India in 2015 and obtained his Master of Information Technology degree focused on Internetworking from University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implement scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama is an experienced big data engineer working at a Data and AI company. He is responsible for building a scalable data platform with unified governance on the cloud. He is passionate about software engineering, cloud computing, big data technologies, distributed systems, data platforms, system monitoring, and automation.

Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He specializes in data infrastructure with hands-on customer engagement experience including migrations, performance tuning, and production troubleshooting. His areas of expertise include Apache Spark, Apache Iceberg, and AWS Analytics services such as AWS Glue, Amazon EMR and Amazon Athena. He actively contributes to the Apache Iceberg open-source project and speaks at community events and conferences to help customers adopt Iceberg in practice.

Ishan Gaur is a Principal Big Data Cloud Engineer at Amazon Web Services (AWS) with over 16 years of experience architecting and building distributed systems and scalable data integration pipelines. As a subject matter expert in AWS Glue and Apache Spark, he specializes in helping enterprise customers design and implement large-scale data processing solutions across the AWS ecosystem, including Amazon EMR, AWS Glue, and Amazon Athena.

Throughout his career, Ishan has worked extensively with distributed computing frameworks and ETL technologies including Apache Spark, Scala, Ab Initio, and DataStage. His expertise spans the full lifecycle of data engineering—from architecture design and pipeline development to performance optimization and troubleshooting at scale. At AWS, he partners with customers to modernize their data platforms, optimize workloads, and leverage cloud-native services to achieve operational excellence and cost efficiency in their data processing environments.

Albert Quiroga is a Senior Solutions Architect at Amazon, where he creates solutions and architectural designs for one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR, Athena, Glue and SageMaker. His 11 years of experience in the industry have empowered him to work with several Fortune 500 companies to overcome large-scale data and analytics challenges, and he has helped launch and develop features for several AWS services.

Akira Ajisaka is a software engineer and has more than 10 years of engineering experience in big data. He likes troubleshooting and contributing to OSS.