Understanding the Core Data Terminologies
Welcome, dear reader!
This book has been prepared based on the knowledge that you need to pass the Azure DP-900 Data Platform Fundamentals exam. So, you will find detailed use cases, hand's-on exercises, as well as sample questions and answers to help you during the exam.
This book will not only prepare you for certification but also complement the knowledge needed for planning and working in a data organization. You can look forward to learning about transactional and analytical database concepts, SQL and NoSQL, when to use each option, and the most modern tools and techniques for implementation on Azure.
Data generation and data processing have been growing exponentially in recent years. Data is being generated and processed everywhere: in information systems, cell phones, smart watches, smart TVs, city buses, subways, and cars, among others. Knowing how to capture and process this data to generate intelligence provides today’s main competitive advantage in the market.
To start understanding how these technologies and solutions work, it is necessary to know the concepts of data storage and processing, which we will cover in this introductory chapter.
By the end of this chapter, you will be able to understand the following:
- The types of data and how to store it
- Relational and non-relational data
- Data Analytics
- How to differentiate the data workloads
Understanding the core data concepts
What is data?
Data is a record, also called a fact, which can be a number, text, or description used to make decisions. Data only generates intelligence when processed and then this data is called information or insights.
Data is classified into three basic formats: structured, semi-structured, and unstructured data. We will learn about them all in the following sections.
Structured data is formatted and typically stored in a table represented by columns and rows. This data is found in relational databases, which organize their table structures in a way that creates relationships between these tables.
The following figure shows an example of a simple table with structured data:
Figure 1.1 – Example of structured data in a database
In this example, the table called
CUSTOMER has seven columns and six records (rows) with different values.
## JSON FILE - Document 1 ##
"STREET": "54, rue Royale",
In this example, we can see that each JSON file contains a record, like the rows of the structured data table, but there are other formats of JSON and similar files that contain multiple records in the same file.
In addition to the JSON format, there is data in key-value and graph databases, which are considered semi-structured data, too.
The key-value database stores data in a related array format. These arrays have a unique identification key per record. Values written to a record can have a variety of formats, including numbers, text, and even full JSON files.
The following is an example of a key-value database:
Figure 1.2 – Example of a key-value database
As you can see in the preceding figure, each record can contain different attributes. They are stored in a single collection, with no predefined schema, tables, or columns, and no relationships between the entities; this differentiates the key-value database from the relational database.
The graph database is used to store data that requires complex relationships. A graph database contains nodes (object information) and edges (object relationship information). It means that the graph database predetermines what objects are and the relationships they will have with each other, but the records can contain different formats. The following is a representation of nodes and edges in a graph database of sales and deliveries:
Figure 1.3 – Example of a graph database
The diagram demonstrates how the relations around the ORDER entity are created in a graph database, considering the CUSTOMER, LOCATION, SUPPLIER, and PRODUCT nodes in the process. It represents an interesting acceleration in terms of query processing in the database because the graph is already structured to deliver the relations faster.
This data can also be processed to generate information, but the type of storage and processing for this is different from that of structured and semi-structured data. It is common, for example, for unstructured data such as audio to be transcribed using artificial intelligence, generating a mass of semi-structured data for processing.
Now that you understand the basics of data types, let’s look at how that data is stored in a cloud environment.
How is data stored in a modern cloud environment?
Depending on the data format, structured, semi-structured, and unstructured cloud platforms have different solutions. In Azure, we can count on Azure SQL Database, Azure SQL Database for PostgreSQL, Azure Database for MySQL, and database servers installed on virtual machines, such as SQL Server on a virtual machine in Azure, to store structured data. These are called relational databases.
Semi-structured data can be stored in Azure Cosmos DB and unstructured data (such as videos and images) can be stored in Azure Blob storage in a platform called Azure Data Lake Storage, optimized for queries and processing.
These services are delivered by Azure in the following formats:
- Infrastructure as a service (IaaS) – Databases deployed on virtual machines
- Platform as a service (PaaS) – Managed database services, where the responsibility for managing the virtual machine and the operating system lies with Azure
For these database services to be used, they must be provisioned and configured to receive the data properly.
One of the most important aspects after provisioning a service is the access control configuration. Azure allows you to create custom access role control, but in general, we maintain at least three profiles:
- Read-only – Users can read existing data on that service, but they cannot add new records or edit or delete them
- Read/Write – Users can read, create, delete, and edit records
- Owner – Higher access privilege, including the ability to manage permission for other users to use this data
With these configured profiles, you will be able to add users to the profiles to access the data storage/databases.
Let’s look at an example. You are an administrator of a
CUSTOMER database, and you have the
Owner profile. So, you configure access to this database for the leader of the commercial area to
Read/Write, and for salespeople to
In addition to the permissions configuration, it is important to review all network configurations, data retention, and backup patterns, among other administrative activities. These management tasks will be covered in Chapter 7, Provisioning and Configuring Relational Database Services in Azure.
In all database scenarios, we will have different access requirements, and it is important (as in the example) to accurately delimit the access level needs of each profile.
Describing a data solution
There are two types of database solutions: transactional solutions and analytical solutions. In the following sections, we will understand in detail what these solutions are and the requirements for choosing between them.
Transactional databases are used by systems for basic operations: creating, reading, updating, and deleting. Transactional systems are considered the core of the informatization of business processes. With these basic operations, we can create entities such as customers, products, stores, and sales transactions, among others, to store important data.
A transactional database is commonly known as online transaction processing (OLTP) considering that this type of database serves online transactional operations between the application and the database.
For an organization, transactional databases usually have their data segmented into entities, which can be tables (or not), with or without a relationship between these entities to facilitate the correlation between this data.
For example, an e-commerce database can be structured with a table called
Shopping_Cart, which represents the products that are being selected in the store during user navigation, and another called
Purchases with the completed transaction records.
The process of segmenting entities in a database is called normalization, which will be covered in Chapter 3, Working with Relational Data.
The format of a normalized transactional database is optimized for transactional operations, but it is not the best format for data exploration and analysis.
Figure 1.4 – Example of a relational transactional database
The preceding figure demonstrates a relational database of transactional workloads in a sales and delivery system. We can see the main entity, Orders, joined to Employees, Shippers, Customers, and Order Details, which then detail all products of this order in the relationship with the Products entity, which looks for information in the Categories and Suppliers entities.
When the data solution requires a good interface for queries, explorations, and data analysis, the data storage organization is different from transactional databases. To meet this requirement, we prioritize the data aggregations and relationships for data consumption and exploration; this specialized data storage is called an analytical database.
Analytical databases are constituted through a process of data ingestion, and they are responsible for processing and transforming the data into insights and information and then making this processed information available for consumption. The following steps describe this process:
- Data ingestion – The process responsible for connecting to transactional databases or other data sources to collect raw transaction information and include it in the analytical database
- Data processing – The process performed by the OLAP platform to create a data model, organize entities, perform indicator calculations, and define metrics for data consumption
- Data query – After the data model is loaded with the proper organization for querying, data manipulation and reporting tools can connect to the OLAP platform to perform your queries
Figure 1.5 – Example of an analytical relationship
Figure 1.6 – Data flow between OLTP and OLAP
The preceding figure demonstrates the traditional flow of data, which is sourced and stored in transactional OLTP databases and then moved to OLAP analytical databases for data intelligence generation.
There are modern data storage platforms that aim to unite OLTP and OLAP on the same platform, but these databases, often called NewSQL, still need to mature their structures to deliver the best of transactional and analytical worlds in the same database. The industry standard is to keep transactional and analytical data structures separate.
In this section, we defined what transactional and analytical data solutions are and the characteristics of each solution. In the next section, we will detail the recommended data types and storage for each of these types.
Defining the data type and proper storage
Categorizing the data to identify its types and best solutions for your storage is an important process for a data solution, and not just for evaluating whether it is structured, unstructured, or semi-structured. In this section, you will learn about the characteristics of different types of data.
Characteristics of relational and non-relational databases
Relational databases are the most traditional and used database format, as they have an easy-to-understand design and a simple tabular data model like other simple platforms such as Excel spreadsheets. Relational databases have predefined schemas, which are the structures of their tables, containing columns, the data type of each column, and other parameters such as primary and secondary keys used in relationships.
However, relational databases with these rigid schemas can pose challenges, as presented in the following example.
Your CRM system has a database structure with a
CUSTOMER table, where you intend to store customer data:
ZIP_CODE. To do this, you start by creating a
CUSTOMER table with five fields:
Figure 1.7 – Example of a CUSTOMER table in a relational database
However, after setting up this table, you realize that you have clients that have more than one address and zip code, and even more than one mobile phone number. How can you solve this issue?
To face problems like this one, we can use normalization one more time. Normalization is done when there is a need to split a table (
CUSTOMER, in this example) into more child tables that are correlated to the initial table.
Therefore, we can change the
CUSTOMER table as follows:
Figure 1.8 – A relationship model in a transactional database
Non-relational databases allow you to store data in its original format without having a predefined schema as in relational databases. The most common non-relational storage format is document storage, where each record in the database is an independent file. The benefit is that each file can have different and unique attributes.
Going back to our
CUSTOMER entity example in a relational database, when two or more customers live at one address, the database records that relationship, and the normalized database only keeps one address record. But in a non-relational database, if two customers live at the same address, this address will be presented in the records of the first customer and the second customer as well, independently.
Let’s now analyze how this storage could be structured in a relational database, using the concept of normalization:
Figure 1.9 – Example of data structured into tables
Now let’s analyze the same data in a
CUSTOMER table, but in the format of a non-relational database:
## JSON FILE - CUSTOMER ##
"FIRST_NAME": " MARK",
"LAST_NAME": " HUGGS"
"STREET": "1200, Harper Str"
## JSON FILE – CUSTOMER2 ##
"FIRST_NAME": " KRISTI",
"LAST_NAME": " LAMP"
"STREET": "1200, Harper Str"
Thus, we can observe that the same data can be stored in relational and non-relational structures.
Therefore, to decide between a relational or non-relational data storage solution, you must evaluate the behavior of the application or the user that will use that database, the relationships between the entities, and possible normalization processes.
Both relational and non-relational databases should be used primarily for transactional workloads. In the upcoming sections, we will understand the differences between these transactional workloads and analytical workloads.
A transactional workload
Relational and non-relational databases can be used as solutions for transactional workloads, which are the databases used to perform basic data storage operations: create, read, update, and delete (CRUD). Transactional operations must be done in sequence, with a transaction control that only confirms the conclusion of this transaction (a process called a commit) when the entire operation is successfully executed. If this does not occur, the transaction is canceled, and all processes are not performed, thus generating a process called rollback.
An important idea to help understand the difference between relational and non-relational databases is ACID, present in most database technologies. These properties are as follows:
- Atomicity: This is the property that controls the transaction and defines whether it was successfully performed completely to commit or must be canceled by performing a rollback. Database technology should ensure atomicity.
- Consistency: For a running transaction, it is important to evaluate consistency between the database state before receiving the data and the database state after receiving the data. For example, in a bank transfer, when funds are added to an account, those funds must have a source. Therefore, it is important to know this source and whether the fund’s source exit process has already been performed before confirming the inclusion in this new account.
- Isolation: This property evaluates whether there are multiple executions of transactions similar to the current one and if so, it keeps the database in the same state. It then evaluates whether the execution of transactions was sequential. In the bank transfer example, if multiple transactions are sent simultaneously, it checks whether the amounts have already left the source for all transactions, or you need to review one by one, transaction per transaction.
- Durability: This is responsible for evaluating whether a transaction remains in the committed database even if there is a failure during the process, such as a power outage or latency at the time of recording the record.
ACID properties are not unique to transactional databases; they are also found in analytic databases. At this point, the most important thing is to understand that these settings exist, and you can adjust them as per the requirements of your data solution use case.
Since we are talking about databases, let’s understand an acronym that is widely used to represent database software: DBMS.
Database management systems
Database management systems (DBMSs), which are database software, have ACID properties within their architecture, and in addition to performing these controls, they need to manage several complex situations. For example, if multiple users or systems try to access or modify database records, the database systems need to isolate transactions, perform all necessary validations quickly, and maintain the consistency of the data stored after the transaction is committed. For this, some DBMS technologies work with temporary transaction locks, so that actions are done sequentially. This lock is done during the process of an action executing in that record; for example, in an edit of a field in a table, the lock ends as soon as the commit is executed, confirming that transaction.
Some DBMSs are called distributed databases. These databases have their architecture distributed in different storage and processing locations, which can be on-premises in the company’s data center or a different data center in the cloud. Distributed database solutions are widely used to maintain consistency in databases that will serve applications in different geographic locations, but this consistency doesn’t need to be synchronous. For example, a mobile game can be played in the United States and Brazil, and the database of this game has some entities (categories, game modes, and so on) that must be shared among all players. But the transactions from the United States player do not necessarily need to appear to the player in Brazil in a real-time way; this transactional data will be synchronized from the United States to Brazil, but in an asynchronous process. Let’s understand this process next.
All transactions in distributed databases take longer to process than in undistributed databases because it is necessary to replicate the data across all nodes in this distributed system. So, to maintain an adequate replication speed, the distributed databases only synchronize the data that is needed. This is the concept of eventual consistency, which configures ACID to perform replication between the distributed nodes asynchronously, after the confirmation of the transaction on the main node of the database is created. This technique can lead to temporary inconsistencies between database nodes. Ideally, the application connected to a distributed database does not require a guarantee of data ordering. It means that the data relating to this eventual consistency may appear to users with an eventual delay as well. Distributed databases are widely used by social media platforms, for news feeds, likes, and shares, among other features.
Let’s use the following figure to understand the behavior of a database with eventual consistency:
Figure 1.10 – Diagram of an eventual consistency database
The preceding diagram shows behavior that we can observe when querying information in a database with eventual consistency. Instead of fetching the ball in a sequential way, the hero who retrieved it made the query of the ball in a future frame, generating a momentary duplication of the ball. In the end, only one ball was retrieved, after the sync was done.
This is an analogy for an eventual consistency database, where queries do not need to be made on entities that are already synchronized between all replicas of the database, and sometimes, this momentary duplication happens until the asynchronous process data update is complete.
In addition to transactional, relational, or non-relational databases, we also have another data workload, the analytical workload, which we will address in the next section.
An analytical workload
The second category of data solutions is the analytical workloads. These analytical solutions are based on high-volume data processing platforms, optimized for querying and exploring, and not for CRUD transactions or with ACID properties. In analytical databases, we aggregate various data sources, such as more than one transactional database, as well as logs, files, images, videos, and everything that can generate information for a business analyst.
This raw data is processed and aggregated, thus generating summaries, trends, and predictions that can support decision-making.
An analytical workload can be based on a specific time or a sequence of dated events. In these workloads, it’s common to evaluate only the data that is relevant to the analysis. For example, if you have a sales system with a transactional database (source) with several tables recording all sales, products, categories, and customers, among others, it is important to evaluate which of these tables can be used for the analytical database (destination) and then perform the data connections.
To create an analytical database, it is necessary to perform data ingestion, a process of copying data from sources to the analytical base. For this, a technique called extract, transform, and load (ETL), or the more recent extract, load, and transform (ELT), is used. The following figure demonstrates this process with an example of a transactional database as the data source and the analytical database as the destination:
Figure 1.11 – Data flow between a transactional database and an analytical database
In the preceding diagram, we can see that transactional databases are storages of information systems that automate business processes. Analytical databases act on simple and advanced data analysis, using, for example, statistical models with the application of machine learning, a branch of artificial intelligence. The data ingestion process is an important process for assembling an analytical database that meets the data solution. In the next section, we will understand what data ingestion is and the different types of this ingestion.
Understanding data ingestion
Data ingestion is the process of copying operational data from data sources to organize it in an analytical database. There are two different techniques for performing this copy: batching data and online data streaming.
It is important to identify latency requirements between the time when the data is generated in the source database and the data availability in the analytical database.
Understanding batch load
When batching the data, the operation is offline. You must define the periodicity for creating the data batch load, collecting data in the data source, and then inserting it into the analytical database.
The periodicity can be hourly, daily, or even monthly, if the requirement of analysis of this data is met. Events that can trigger a batch load can be a new record on a table entity in the database, an action triggered by a user in an application, a manual trigger, and more.
An example of batch processing might be the way we get vote counts in elections. The votes are not counted one by one the moment after the voter has voted, but they are inserted in lots that are processed during election day until the completion of all charges and the definition of the results.
Advantages of batch load
- It is the most used method by companies that have multiple transactional systems with large volumes of data. This is because due to scheduling loads, it can be made at the most convenient time, such as outside business hours when transactional servers are in lower demand.
- You can monitor the loads to verify where you need to optimize a script or a method independently, so if you need to prioritize one specific load performance, you can manipulate your computing resources to prioritize that load.
Constraints of batch load
- There is a delay between the time of data generation on the transactional database and the availability of this data on the analytical database, which sometimes makes it impossible to follow up and immediately make a decision based on the numbers
- The full batch of data must be completed to then begin copying, and if there is any data unavailability, inconsistent data, or network latency between transactional and analytical bases, among other situations, the batch load will fail
Batch loads can be our default data consumption for legacy databases, file repositories, and other types of data sources. But there are business requirements to consume some data in near real time, for monitoring and quick decision-making. And to meet these needs, we have another technique, called data streaming, which loads data online.
Understanding data streaming
In data-streaming-based data ingestion, there is an online connection between the data source and the analytical database, and the pieces of data are processed one by one, in events, right after their generation at this source. For example, for a sales tracking monitoring solution, sales managers need to track sales data in near real time on a dashboard for immediate decision-making. The sales transaction database is linked through a streaming load to the analytical database that receives this data, processes it, and demonstrates it on a monitoring dashboard.
Another example could be a stock exchange and its real-time stock tracking panels. These dashboards receive processed information from purchase and sale transaction data for stock papers in a data stream. See the following figure with the data flow in this scenario:
Figure 1.12 – Stock market example diagram
The load on data streaming is not always done online; it can also be done at intervals that load a portion of data. Data streaming is a continuous window of data ingestion between the source and the destination, while in the batch load, each batch opens and closes the connection to the process.
Let us now evaluate the advantages and disadvantages of the data streaming technique.
Advantages of data streaming
- The delay between data creation and analytical processing can be minimal
- The latency between the source and the target in the order of seconds or milliseconds
- Analytical solutions can demonstrate both past data and performance trends, which assists in immediate decision-making while events are happening
Constraints of streaming data load
- Most transactional database technologies do not have a native streaming data export technology, so you need to implement this technique through manual control of what has already been ingested and what has not yet been ingested. This generates great complexity.
- The size of each event is usually small to avoid having a very robust infrastructure to maintain this event's queue during the streaming. This makes it impossible to ingest large files, videos, audio, and photos, among others. These loads are often best implemented in batch loads.
To understand in practice how these concepts are applied, let’s now evaluate a case study of a complete data solution.
Webshoes is a fictitious sales company of shoes and accessories that is being created. The company’s business areas have defined that Webshoes will have an online store and that the store will need to have personalized experiences. The requirements that the business areas have passed to the project development team are as follows:
- Online store – The online store should have a simple catalog with the 12 different products of the brand
- Smart banner – If the customer clicks on a product, similar products should appear in a Recommended banner, with products that have the same characteristics as the one selected, but only products that the customer has not purchased yet
- Sales conversion messages – If the customer does not complete the sale and has logged into the portal, the online store should contact the customer via email and a message on their cell phone later, with the triggering of a few messages created for conversion of the sale
By analyzing these business requirements, we can do the following technical decomposition to select the appropriate data storage:
- Online store – A repository to store the product catalog, a repository to register the sales through the shopping cart, and a repository to store customer login
- Smart banner – Depending on the customer and product selected, a near real-time interaction of banner customization
- Sales conversion messages – Will be processed after the customer leaves the online store (closing their login session) and depends on their actions while browsing the website and purchase history
Come on, let’s go! Here are the solutions:
- Online store – Transactional workload. A SQL relational or NoSQL database can assist in this scenario very well, as it will have product entities, customers, login information, and shopping carts, among others, already related in the database.
- Smart banner – Analytical workload. For near real-time processing, data streaming is required, capturing the behavior of the client and crossing it with the other historical data. In this case, an analytical base can process the information and return the application/banner to the appropriate message for customization.
- Sales conversion messages – Analytical workload. In this case, the customer will have left the store, and we do not need to work with data streaming but rather a batch load of data. It is important to evaluate with the business area how long it is optimal to send messages to target customers, and the analytical base will process the information, generating the message list to be fired.
Therefore, each use case can define a different data workload type, which influences our database decision. In the next chapters, we will detail the Azure solutions for SQL transactional databases, NoSQL, and analytical databases, and the understanding of the different use cases will be simpler for sure.
In this chapter, we reviewed the core data concepts about data storage and processing, the different data types, and data solutions. We went through the explanation of relational, non-relatable, transactional, and analytical data, their particularities, and application cases.
Now you know how to differentiate a transactional database from an application and an analytical database. In the following chapters, we will go into the details of each of these workloads and of the Azure services that are implemented for this. But before we detail these structures, in the next chapter, we will understand the different roles and responsibilities in a data domain.
Sample questions and answers
Let’s evaluate some sample questions related to the content of this chapter:
- What type of workload is an OLAP model?
- Analytical workload
- Transactional workload
- Relational database
- How is data in a relational table organized?
- Rows and columns
- Header and footer
- Pages and paragraphs
- Connections and arrows
- Which of the following is an example of unstructured data?
- Audio and video files
- A table within a relational database
- A stored procedure in a database
- What type of cloud service is a database deployed in a virtual machine?
1-A 2-A 3-A 4-B