Home Data Mastering Tableau 2023 - Fourth Edition

Mastering Tableau 2023 - Fourth Edition

By Marleen Meier
ai-assist-svg-icon Book + AI Assistant
eBook + AI Assistant $39.99 $27.98
Print $49.99
Subscription $15.99 $10 p/m for three months
ai-assist-svg-icon NEW: AI Assistant (beta) Available with eBook, Print, and Subscription.
ai-assist-svg-icon NEW: AI Assistant (beta) Available with eBook, Print, and Subscription. $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime! ai-assist-svg-icon NEW: AI Assistant (beta) Available with eBook, Print, and Subscription.
What do you get with a Packt Subscription?
Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
ai-assist-svg-icon NEW: AI Assistant (beta) Available with eBook, Print, and Subscription. ai-assist-svg-icon NEW: AI Assistant (beta) Available with eBook, Print, and Subscription. BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime! ai-assist-svg-icon NEW: AI Assistant (beta) Available with eBook, Print, and Subscription.
eBook + AI Assistant $39.99 $27.98
Print $49.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    Getting Your Data Ready
About this book
This edition of the bestselling Tableau guide will teach you how to leverage Tableau's newest features and offerings in various paradigms of the BI domain. Updated with fresh topics, including the newest features in Tableau Server, Prep, and Desktop, as well as up-to-date examples, this book will take you from mastering essential Tableau concepts to advance functionalities. A chapter on data governance has also been added. Throughout this book, you'll learn how to use Tableau Hyper files and Prep Builder to easily perform data preparation and handling, as well as complex joins, spatial joins, unions, and data blending tasks using practical examples. You'll also get to grips with executing data densification and explore other expert-level examples to help you with calculations, mapping, and visual design using Tableau extensions. Later chapters will teach you all about improving dashboard performance, connecting to Tableau Server, and understanding data visualization with examples. Finally, you'll cover advanced use cases, such as self-service analysis, time series analysis, geo-spatial analysis, and how to connect Tableau to Python and R to implement programming functionalities within Tableau. By the end of this book, you'll have mastered Tableau 2023 and be able to tackle common and advanced challenges in the BI domain.
Publication date:
August 2023
Publisher
Packt
Pages
684
ISBN
9781803233765

 

Getting Your Data Ready

Have you ever asked yourself whether your data is clean enough to be analyzed? It’s likely that everyone who works with data has, which is why this chapter is dedicated to getting your data ready for analysis, otherwise known as data cleaning.

The first part of this chapter is theory-oriented and does not include exercises. A careful reading of this information is encouraged since it provides a foundation for greater insight. The latter portion of the chapter provides various exercises specifically focused on data preparation.

Now let’s dive into this fascinating topic with the goal of enriching our understanding and becoming ever-better data stewards.

In this chapter, we will discuss the following topics:

  • Understanding Hyper
  • Focusing on data preparation
  • Surveying data
  • Cleaning messy data

Since Tableau Desktop 10.5 has been on the market for some time, you may already have heard of Hyper. Regardless of whether you have or not, continue reading for a primer on this feature!

 

Understanding Hyper

In this section, we will explore Tableau’s data-handling engine, and how it enables structured yet organic data mining processes in enterprises. Since the release of Tableau 10.5, we can make use of Hyper, a high-performing database, allowing us to query source data faster than ever before. Hyper is usually not well understood, even by advanced developers, because it’s not an overt part of day-to-day activities; however, if you want to truly grasp how to prepare data for Tableau, this understanding is crucial.

Hyper originally started as a research project at the University of Munich in 2008. In 2016, it was acquired by Tableau and appointed as the dedicated data engine group of Tableau, maintaining its base and employees in Munich. Initially in Tableau 10.5, Hyper replaced the earlier data-handling engine only for extracts. It is still true that live connections are not touched by Hyper, but Tableau Prep Builder now runs on the Hyper engine too, with more use cases to follow. As stated on tableau.com, “Hyper can slice and dice massive volumes of data in seconds, you will see up to 5X faster query speed and up to 3X faster extract creation speed.” And if you still can’t get enough, there is always the option to use Hyper through API calls in your preferred programming language: https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_reference.html.

But what makes Hyper so fast? Let’s have a look under the hood!

The Tableau data-handling engine

The vision shared by the founders of Hyper was to create a high-performing, next-generation database—one system, one state, no trade-offs, and no delays. And it worked—today, Hyper can serve general database purposes, data ingestion, and analytics at the same time.

Memory prices have decreased exponentially. The same goes for CPUs; transistor counts increased according to Moore’s law, while other features stagnated. Memory is cheap but processing still needs to be improved.

Moore’s Law is the observation made by Intel co-founder Gordon Moore that the number of transistors on a chip doubles every two years while the costs are halved. Information on Moore’s Law can be found on Investopedia at https://www.investopedia.com/terms/m/mooreslaw.asp.

While experimenting with Hyper, the founders measured that handwritten C code is faster than any existing database engine, so they came up with the idea to transform Tableau queries into C code and optimize it simultaneously, all behind the scenes, so the Tableau user won’t notice it. This translation and optimization come at a cost; traditional database engines can start executing code immediately. Tableau needs to first translate queries into code, optimize that code, then compile it into machine code, after which it can be executed. The big question is, is it still faster? As proven by many tests on Tableau Public and other workbooks, the answer is yes!

Furthermore, if there is a query estimated to be faster if executed without the compilation to machine code, Tableau has its own virtual machine (VM) on which the query will be executed right away. And next to this, Hyper can utilize 99% of available CPU computing power, whereas other parallel processes can only utilize 29% of available CPU compute. This is due to the unique and innovative technique of morsel-driven parallelization.

For those of you that want to know more about morsel-driven parallelization, a paper, which later on served as a baseline for the Hyper engine, can be found at https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf.

If you want to know more about the Hyper engine, I highly recommend the following video at https://youtu.be/h2av4CX0k6s.

Hyper parallelizes three steps of traditional data warehousing operations:

  • Transactions and Continuous Data Ingestion (Online Transaction Processing, or OLTP)
  • Analytics (Online Analytical Processing, or OLAP)
  • Beyond Relational (Online Beyond Relational Processing, or OBRP)

Executing those steps simultaneously makes Hyper more efficient and more performant, as opposed to traditional systems where those three steps are separated and executed one after the other.

To sum up, Hyper is a highly specialized database engine that allows us as users to get the best out of our queries. If you recall, in Chapter 1, Reviewing the Basics, we already saw that every change on a sheet or dashboard, including drag and drop pills, filters, and calculated fields, among others, is translated into a query. Those queries are pretty much SQL lookalikes; however, in Tableau we call the querying engine VizQL.

VizQL, another hidden gem on your Tableau Desktop, is responsible for visualizing data in a chart format and is fully executed in memory. The advantage is that no additional space on the database side is required here. VizQL is generated when a user places a field on a shelf. VizQL is then translated into SQL, MDX, or Tableau Query Language (TQL) and passed to the backend data source with a driver.

Hyper takeaways

This overview of the Tableau data-handling engine demonstrates a flexible approach to interfacing with data. Knowledge of the data-handling engine can reduce data preparation and data modeling efforts, thus helping us streamline the overall data mining life cycle. Don’t worry too much about data types and data that can be calculated based on the fields you have in your database. Tableau can do all the work for you in this respect. In the next section, we will discuss what you should consider from a data source perspective.

 

Focusing on data preparation

Tableau can be used effectively with various data preparation phases. Unfortunately, a single chapter is not sufficient to thoroughly explore how Tableau can be used in each phase. Indeed, such a thorough exploration may be worthy of an entire book! Our focus, therefore, will be directed to ward data preparation, since that phase has historically accounted for up to 60% of the data mining effort. Our goal will be to learn how Tableau can be used to streamline that effort.

Surveying data

Tableau can be a very effective tool for simply surveying data. Sometimes in the survey process, you may discover ways to clean the data or populate incomplete data based on existing fields. Sometimes, regretfully, there are simply not enough pieces of the puzzle to put together an entire dataset. In such cases, Tableau can be useful to communicate exactly what the gaps are, and this, in turn, may incentivize the organization to more fully populate the underlying data.

In this exercise, we will explore how to use Tableau to quickly discover the percentage of null values for each field in a dataset. Next, we’ll explore how data might be extrapolated from existing fields to fill in the gaps.

Establishing null values

The following are the steps to survey the data:

  1. If you haven’t done so just yet, navigate to https://public.tableau.com/profile/marleen.meier to locate and download the workbook associated with this chapter.
  2. Navigate to the worksheet entitled Surveying & Exploring Data and select Happiness Report data source.
  3. Drag Region and Country to the Rows shelf. Observe that in some cases the Region field has Null values for some countries:

Figure 2.1: Null regions

  1. Right-click and Edit the parameter entitled Select Field. Note that the Data Type is set to Integer and we can observe a list that contains an entry for each field name in the dataset:
Graphical user interface

Description automatically generated

Figure 2.2: Editing a parameter

  1. In the Data pane, right-click on the parameter we just created and select Show Parameter Control.
  2. Create a calculated field entitled % Populated and write the following calculation:
    SUM([Number of Records]) / TOTAL(SUM([Number of Records]))
    
  3. In the Data pane, right-click on % Populated and select Default Properties | Number Format…:
Graphical user interface, application

Description automatically generated

Figure 2.3: Adjusting default properties

  1. In the resulting dialog box, choose Percentage.
  2. Create a calculated field entitled Null & Populated and add the following code. Note that the complete case statement is fairly lengthy but also repetitive.

    In cases requiring a lengthy but repetitive calculation, consider using Excel to more quickly and accurately write the code. By using Excel’s CONCATENATE function, you may be able to save time and avoid typos.

    In the following code block, the code lines represent only a percentage of the total but should be sufficient to enable you to produce the whole block:

    CASE [Select Field]
    WHEN 1 THEN IF ISNULL ([Country]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 2 THEN IF ISNULL ([Region]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 3 THEN IF ISNULL ([Economy (GDP per Capita)]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 4 THEN IF ISNULL ([Family]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 5 THEN IF ISNULL ([Freedom]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 6 THEN IF ISNULL ([Happiness Rank]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 7 THEN IF ISNULL ([Happiness Score]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 8 THEN IF ISNULL ([Health (Life Expectancy)]) THEN 'Null Values' ELSE
    'Populated Values' END
    WHEN 9 THEN IF ISNULL ([Standard Error]) THEN 'Null Values' ELSE
    'Populated Values' END
    END
    
  1. Remove Region and Country from the Rows shelf.
  2. Place Null & Populated on the Rows and Color shelves and % Populated on the Columns and Label shelves:

Figure 2.4: Populated values

  1. Change the colors to red for Null Values and green for Populated Values if desired. You can do so by clicking on Color in the Marks card and Edit Colors.
  2. Click on the arrow in the upper-right corner of the Select Field parameter on your sheet and select Single Value List.
  3. Select various choices in the Select Field parameter and note that some fields have a high percentage of null values. For example, in the following diagram, 32.98% of records do not have a value for Region:
A picture containing graphical user interface

Description automatically generated

Figure 2.5: Comparing null and populated values

Building on this exercise, let’s explore how we might clean and extrapolate data from existing data using the same dataset.

Extrapolating data

This exercise will expand on the previous exercise by cleaning existing data and populating some of the missing data from known information. We will assume that we know which country belongs to which region. We’ll use that knowledge to fix errors in the Region field and also to fill in the gaps using Tableau:

  1. Starting from where the previous exercise ended, create a calculated field entitled Region Extrapolated with the following code block:
    CASE [Country]
    WHEN 'Afghanistan' THEN 'Southern Asia'
    WHEN 'Albania' THEN 'Central and Eastern Europe'
    WHEN 'Algeria' THEN 'Middle East and Northern Africa'
    WHEN 'Angola' THEN 'Sub-Saharan Africa'
    WHEN 'Argentina' THEN 'Latin America and Caribbean'
    WHEN 'Armenia' THEN 'Central and Eastern Europe'
    WHEN 'Australia' THEN 'Australia and New Zealand'
    WHEN 'Austria' THEN 'Western Europe'
    //complete the case statement with the remaining fields in the data set
    END
    

    As an alternative to a CASE statement, you could use an IF statement like:

    If [Country] = 'Afghanistan' then 'Southern Asia' 
    ELSEIF [Country] = 'Albania' thenEND
    

    To speed up the tedious creation of a long calculated field, you could download the data to an Excel file and create the calculated field by concatenating the separate parts, as shown here:

    Graphical user interface, text, application

Description automatically generated

    Figure 2.6: Compiling a calculation in Excel

    You can then copy them from Excel into Tableau. However, for this exercise, I have created a backup field called Backup, which can be found in the Tableau workbook associated with this chapter, which contains the full calculation needed for the Region Extrapolated field. Use this at your convenience. The Solutions dashboard also contains all of the countries. You can therefore copy the Region Extrapolated field from that file too.

  1. Add a Region Extrapolated option to the Select Field parameter:

Figure 2.7: Adding Region Extrapolated to the parameter

  1. Add the following code to the Null & Populated calculated field:
    WHEN 10 THEN IF ISNULL ([Region Extrapolated]) THEN 'Null Values' ELSE 'Populated Values' END
    
  2. Note that the Region Extrapolated field is now fully populated:

Figure 2.8: Fully populated Region Extrapolated field

Now let’s consider some of the specifics from the previous exercises.

Let’s look at the following code block.

Note that the complete CASE statement is several lines long. The following is a representative portion:

CASE [% Populated]
WHEN 1 THEN IF ISNULL ([Country]) THEN 'Null Values' ELSE
'Populated Values' END
...

This case statement is a row-level calculation that considers each field in the dataset and determines which rows are populated and which are not. For example, in the representative line of the preceding code, every row of the Country field is evaluated for nulls. The reason for this is that a calculated field will add a new column to the existing data—only in Tableau, not in the data source itself—and every row will get a value. These values can be N/A or null values.

The following code is the equivalent of the quick table calculation Percent of Total:

SUM([Number of Records]) / TOTAL(SUM([Number of Records]))

In conjunction with the Null & Populated calculated field, it allows us to see what percentage of our fields are actually populated with values.

It’s a good idea to get into the habit of writing table calculations from scratch, even if an equivalent quick table calculation is available. This will help you more clearly understand the table calculations.

The following CASE statement is an example of how you might use one or more fields to extrapolate what another field should be:

CASE [Country]
WHEN 'Afghanistan' THEN 'Southern Asia'
... END

For example, the Region field in the dataset had a large percentage of null values, and even the existing data had errors. Based on our knowledge of the business (that is, which country belongs to which region), we were able to use the Country field to achieve 100% population of the dataset with accurate information.

Nulls are a part of almost every extensive real dataset. Understanding how many nulls are present in each field can be vital to ensuring that you provide accurate business intelligence. It may be acceptable to tolerate some null values when the final results will not be substantially impacted, but too many nulls may invalidate results. However, as demonstrated here, in some cases, one or more fields can be used to extrapolate the values that should be entered into an underpopulated or erroneously populated field.

As demonstrated in this section, Tableau gives you the ability to effectively communicate to your data team which values are missing, which are erroneous, and how possible workarounds can be invaluable to the overall data mining effort. Next, we will look into data that is a bit messier and not in a nice column format. Don’t worry, Tableau has us covered.

Cleaning messy data

The United States government provides helpful documentation for various bureaucratic processes. For example, the Department of Health and Human Services (HSS) provides lists of ICD-9 codes, otherwise known as International Statistical Classification of Diseases and Related Health Problems codes. Unfortunately, these codes are not always in easily accessible formats.

As an example, let’s consider an actual HHS document known as R756OTN, which can be found at https://www.cms.gov/Regulations-and-Guidance/Guidance/Transmittals/downloads/R756OTN.pdf.

Cleaning the data

Navigate to the Cleaning the Data worksheet in the workbook accompanying this chapter and execute the following steps:

  1. Within the Data pane, select the R756OTN Raw data source:
Graphical user interface, text, application

Description automatically generated

Figure 2.9: Selecting the raw file

  1. Drag Diagnosis to the Rows shelf and choose Add all members. Click on the AZ sign to sort the Diagnosis column. Note the junk data that occurs in some rows:

Figure 2.10: Adding Diagnosis to Rows

  1. Create a calculated field named DX with the following code:
    SPLIT([Diagnosis], " ", 1)
    
  2. Create a calculated field named Null Hunting with the following code:
    INT(MID([DX],2,1))
    
  3. In the Data pane, drag Null Hunting from Measures to Dimensions.
  4. Drag Diagnosis, DX, and Null Hunting to the Rows shelf. Observe that Null is returned when the second character in the Diagnosis field is not numeric:

Figure 2.11: Ordering fields on Rows

  1. Create a calculated field named Exclude from ICD codes containing the following code:
    ISNULL([Null Hunting])
    
  2. Clear the sheet of all fields, as demonstrated in Chapter 1, Reviewing the Basics, and set the Marks card to Shape.
  3. Place Exclude from ICD Codes on the Color, and Shape shelves, and then place DX on the Rows shelf. Observe the rows labeled as True:

Figure 2.12: Excluding junk data

  1. In order to exclude the junk data (that is, those rows where Exclude from ICD Codes equates to True), place Exclude from ICD Codes on the Filter shelf and deselect True.
  2. Create a calculated field named Diagnosis Text containing the following code:
    REPLACE([Diagnosis],[DX] + "","")
    
  3. Place Diagnosis Text on the Rows shelf after DX. Also, remove Exclude from ICD Codes from the Rows shelf and the Marks card, and set the mark type to Automatic:

Figure 2.13: Observing the cleaned data

Now that we’ve completed the exercise, let’s take a moment to consider the code we have used:

  • The SPLIT function was introduced in Tableau 9.0:
    SPLIT([Diagnosis], " ", 1)
    
  • As described in Tableau’s help documentation about the function, the function does the following:

Returns a substring from a string, as determined by the delimiter extracting the characters from the beginning or end of the string.

  • This function can also be called directly in the Data Source tab when clicking on a column header and selecting Split. To extract characters from the end of the string, the token number (that is, the number at the end of the function) must be negative.
  • Consider the following code, which we used to create the Null Hunting field:
    INT(MID([DX],2,1))
    
  • The use of MID is quite straightforward and much the same as the corresponding function in Excel. The use of INT in this case, however, may be confusing. Casting an alpha character with an INT function will result in Tableau returning Null. This satisfactorily fulfills our purpose, since we simply need to discover those rows not starting with an integer by locating the nulls.
  1. ISNULL is a Boolean function that simply returns TRUE in the case of Null:
    ISNULL([Null Hunting])
    
  2. The REPLACE function was used while creating the Diagnosis Text field:
    REPLACE([Diagnosis],[DX] + "","")
    
  3. This calculated field uses the ICD-9 codes isolated in DX to remove those same codes from the Diagnosis field and thus provides a fairly clean description. Note the phrase fairly clean. The rows that were removed were initially associated with longer descriptions that thus included a carriage return. The resulting additional rows are what we removed in this exercise. Therefore, the longer descriptions are truncated in this solution using the replace calculation.

The final output for this exercise could be to export the data from Tableau as an additional source of data. This data could then be used by Tableau and other tools for future reporting needs. For example, the DX field could be useful in data blending.

Does Tableau offer a better approach that might solve the issue of truncated data associated with the preceding solution? Yes! Let’s turn our attention to the next exercise, where we will consider regular expression functions.

Extracting data

Although, as shown in the previous exercise, Cleaning the data, the SPLIT function can be useful for cleaning clean data, regular expression functions are far more powerful, representing a broadening of the scope from Tableau’s traditional focus on visualization and analytics to also include data cleaning capabilities.

Let’s look at an example that requires us to deal with some pretty messy data in Tableau. Our objective will be to extract phone numbers.

The following are the steps:

  1. If you have not already done so, please download the workbook from https://public.tableau.com/profile/marleen.meier and open it in Tableau.
  2. Select the Extracting the Data tab.
  3. In the Data pane, select the String of Data data source and drag the String of Data field to the Rows shelf. Observe the challenges associated with extracting the phone numbers:

Figure 2.14: Extracting data from a messy data format

  1. Access the underlying data by clicking the View data button and copying several rows:

Figure 2.15: Accessing underlying data

  1. Navigate to http://regexpal.com/ and paste the data into the pane labeled Test String—that is, the second pane:
Graphical user interface, text, application, email

Description automatically generated

Figure 2.16: Regexpal

  1. In the first pane (the one labeled Regular Expression), type the following:
    \([0-9]{3}\)-[0-9]{3}-[0-9]{4}
    
  2. Return to Tableau and create a calculated field called Phone Number with the following code block. Note the regular expression nested in the calculated field:
    REGEXP_EXTRACT([String of Data (String of Data)],'(\([0-9]{3}\)-[0-9]{3}-[0-9]{4})')
    
  3. Place Phone Number on the Rows shelf, and observe the result:
Text

Description automatically generated with medium confidence

Figure 2.17: Extracting data final view

Now let’s consider some of the specifics from the preceding exercise in more detail:

  • Consider the following code block:
    REGEXP_EXTRACT([String of Data],'()')
    
  • The expression pattern is purposely excluded here as it will be covered in detail later. The ‘()' code acts as a placeholder for the expression pattern. The REGEXP_EXTRACT function used in this example is described in Tableau’s help documentation as follows:

Returns a substring of the given string that matches the capturing group within the regular expression pattern.

  • Note that as of the time of writing, the Tableau documentation does not communicate how to ensure that the pattern input section of the function is properly delimited. For this example, be sure to include ‘()' around the pattern input section to avoid a null output.
  • Nesting within a calculated field that is itself nested within a VizQL query can affect performance (if there are too many levels of nesting/aggregation).
  • There are numerous regular expression websites that allow you to enter your own code and help you out, so to speak, by providing immediate feedback based on sample data that you provide. http://regexpal.com/ is only one of those sites, so search as desired to find one that meets your needs!
  • Now, consider the expression:
    \([0-9]{3}\)-[0-9]{3}-[0-9]{4}
    

In this context, the \ indicates that the next character should not be treated as special but as literal. For our example, we are literally looking for an open parenthesis. [0-9] simply declares that we are looking for one or more digits. Alternatively, consider \d to achieve the same results. The {3} designates that we are looking for three consecutive digits.

As with the opening parenthesis at the beginning of the pattern, the \ character designates the closing parentheses as a literal. The - is a literal that specifically looks for a hyphen. The rest of the expression pattern should be decipherable based on the preceding information.

After reviewing this exercise, you may be curious about how to return just the email address. According to http://www.regular-expressions.info/email.html, the regular expression for email addresses adhering to the RFC 5322 standard is as follows:

(?:[a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'{|}~-
]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-
\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-
]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-
z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-
5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-
\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-
\x7f])+)\])

Emails do not always adhere to RFC 5322 standards, so additional work may be required to truly clean email address data.

Although I won’t attempt a detailed explanation of this code, you can read all about it at http://www.regular-expressions.info/email.html, which is a great resource for learning more about regular expressions. Also, YouTube has several helpful regular expression tutorials.

The final output for this exercise should probably be used to enhance existing source data. Data dumps such as this example do not belong in data warehouses; however, even important and necessary data can be hidden in such dumps, and Tableau can be effectively used to extract it.

 

Summary

We began this chapter with a discussion of the Tableau data-handling engine. This illustrated the flexibility Tableau provides in working with data. The data-handling engine is important to understand in order to ensure that your data mining efforts are intelligently focused. Otherwise, your effort may be wasted on activities not relevant to Tableau.

Next, we focused on data preparation. We considered using Tableau to survey and also clean data. The data cleaning capabilities represented by the regular expression functions are particularly intriguing and are worth further investigation.

Having completed our first data-centric discussion, we’ll continue with Chapter 3, Using Tableau Prep Builder, looking at one of the newer features Tableau has brought to the market. Tableau Prep Builder is a dedicated data pre-processing interface that is able to greatly reduce the amount of time you need for pre-processing. We’ll take a look at cleaning, merging, filtering, joins, and the other functionality Tableau Prep Builder has to offer.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://packt.link/tableau

About the Author
  • Marleen Meier

    Marleen Meier is an accomplished analyst and author with a passion for statistics and data. By using traditional methodologies and approaches such as Machine Learning and AI, Marleen is dedicated to driving meaningful insights. Currently working as the APAC Data CoE Lead for ABN AMRO Clearing, Marleen is at the forefront of innovation and implementing data-driven strategies in a global financial environment. She has lived and worked in multiple countries, including Germany, the Netherlands, the USA, and Singapore, allowing her to bring a diverse and global perspective to her work. Through her writing and speaking engagements, she aims to empower individuals and organizations to unlock the full potential of their data assets.

    Browse publications by this author
Mastering Tableau 2023 - Fourth Edition
Unlock this book and the full library FREE for 7 days
Start now