Machine Learning with BigQuery ML

By Alessandro Marrandino
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Chapter 1: Introduction to Google Cloud and BigQuery

About this book

BigQuery ML enables you to easily build machine learning (ML) models with SQL without much coding. This book will help you to accelerate the development and deployment of ML models with BigQuery ML.

The book starts with a quick overview of Google Cloud and BigQuery architecture. You'll then learn how to configure a Google Cloud project, understand the architectural components and capabilities of BigQuery, and find out how to build ML models with BigQuery ML. The book teaches you how to use ML using SQL on BigQuery. You'll analyze the key phases of a ML model's lifecycle and get to grips with the SQL statements used to train, evaluate, test, and use a model. As you advance, you'll build a series of use cases by applying different ML techniques such as linear regression, binary and multiclass logistic regression, k-means, ARIMA time series, deep neural networks, and XGBoost using practical use cases. Moving on, you'll cover matrix factorization and deep neural networks using BigQuery ML's capabilities. Finally, you'll explore the integration of BigQuery ML with other Google Cloud Platform components such as AI Platform Notebooks and TensorFlow along with discovering best practices and tips and tricks for hyperparameter tuning and performance enhancement.

By the end of this BigQuery book, you'll be able to build and evaluate your own ML models with BigQuery ML.

Publication date:
June 2021
Publisher
Packt
Pages
344
ISBN
9781800560307

 

Chapter 1: Introduction to Google Cloud and BigQuery

The adoption of the public cloud enables companies and users to access innovative and cost-effective technologies. This is particularly valuable in the big data and Artificial Intelligence (AI) areas, where new solutions are providing possibilities that seemed impossible to achieve with on-premises systems only a few years ago. In order to be effective in the day-to-day business of a company, the new AI capabilities need to be shared between different roles and not concentrated only with technicians. Most cloud providers are currently addressing the challenge of democratizing AI across different departments and employees with different skills.

In this context, Google Cloud provides several services to accelerate the processing of large amounts of data and build Machine Learning (ML) applications that can make better decisions.

In this chapter, we'll gradually introduce the main concepts that will be useful in the upcoming hands-on activities. Using an incremental approach, we'll go through the following topics:

  • Introducing Google Cloud Platform
  • Exploring AI and ML services on GCP
  • Introducing BigQuery
  • Discovering BigQuery ML
  • Understanding BigQuery pricing
 

Introducing Google Cloud Platform

Starting from 1998 with the launch of Google Search, Google has developed one of the largest and most powerful IT infrastructures in the world. Today, this infrastructure is used by billions of users to use services such as Gmail, YouTube, Google Photo, and Maps. After 10 years, in 2008, Google decided to open its network and IT infrastructure to business customers, taking an infrastructure that was initially developed for consumer applications to public service and launching Google Cloud Platform (GCP).

The 90+ services that Google currently provides to large enterprises and small- and medium-sized businesses cover the following categories:

  • Compute: Used to support workloads or applications with virtual machines such as Google Compute Engine, containers with Google Kubernetes Engine, or platforms such as AppEngine.
  • Storage and databases: Used to store datasets and objects in an easy and convenient way. Some examples are Google Cloud Storage, Cloud SQL, and Spanner.
  • Networking: Used to easily connect different locations and data centers across the globe with Virtual Private Clouds (VPCs), firewalls, and fully managed global routers.
  • Big data: Used to store and process large amounts of information in a structured, semi-structured, or unstructured format. Among these services are Google DataProc, the Hadoop services offered by GCP, and BigQuery, which is the main focus of this book.
  • AI and machine learning: This product area provides various tools for different kinds of users, enabling them to leverage AI and ML in their everyday business. Some examples are TensorFlow, AutoML, Vision APIs, and BigQuery ML, the main focus of this book.
  • Identity, security, and management tools: This area includes all the services that are necessary to prevent unauthorized access, ensure security, and monitor all other cloud infrastructure. Identity Access Management, Key Management Service, Cloud Logging, and Cloud Audit Logs are just some of these tools.
  • Internet of Things (IoT): Used to connect plants, vehicles, or any other objects to the GCP infrastructure, enabling the development of modern IoT use cases. The core component of this area is Google IoT Core.
  • API management: Tools to expose services to customers and partners through REST APIs, providing the ability to fully leverage the benefits of interconnectivity. In this pillar, Google Apigee is one of the most famous products and is recognized as the leader of this market segment.
  • Productivity: Used to improve productivity and collaboration for all companies that want to start working with Google and embracing its way of doing business through the powerful tools of Google Workplace (previously GSuite).

Interacting with GCP

All the services just mentioned can be accessed through four different interfaces:

  • Google Cloud Console: The web-based user interface of GCP, easily accessible from compatible web browsers such as Google Chrome, Edge, or Firefox. For the hands-on exercises in this book, we'll mainly use Google Cloud Console:
Figure 1.1 – Screenshot of Google Cloud Console

Figure 1.1 – Screenshot of Google Cloud Console

  • Google Cloud SDK: The client SDK can be installed in order to interact with GCP services through the command line. It can be very useful to automate tasks and operations by scheduling them into scripts.
  • Client libraries: The SDK also includes some client libraries to interact with GCP using the most common programming languages, such as Python, Java, and Node.js.
  • REST APIs: Any task or operation performed on GCP can be executed by invoking a specific REST API from any compatible software.

Now that we've learned how to interact with GCP, let's discover how GCP is different from other cloud providers.

Discovering GCP's key differentiators

GCP is not the only public cloud provider on the market. Other companies have embarked on this kind of business, for example, with Amazon Web Services (AWS), Microsoft Azure, IBM, and Oracle. For this reason, before we get too deep into this book, it could be valuable to understand how GCP is different from the other offerings in the cloud market.

Each cloud provider has its own mission, strategy, history, and strengths. Let's take a look at why Google Cloud can be considered different from all the other cloud providers.

Security

Google provides an end-to-end security model for its data centers across the globe, using customized hardware developed and used by Google, and application encryption is enabled by default. The security best practices adopted by Google for GCP are the same as those developed to run applications with more than 1 billion users, such as Gmail and Google Maps.

Global network and infrastructure

At the time of writing, Google's infrastructure is available in 24 different regions, 74 availability zones, and 144 network edge locations, enabling customers to connect to Google's network and ensuring the best experience in terms of bandwidth, network latency, and security. This network allows GCP users to move data across different regions without leaving Google's proprietary network, minimizing the risk of sending information across the public internet. As of today, it is estimated that about 40% of internet traffic goes through Google's proprietary network.

In the following figure, we can see how GCP regions are distributed across the globe:

Figure 1.2 – A map of Google's global availability

Figure 1.2 – A map of Google's global availability

The latest version of the map can be seen at the following URL: https://cloud.google.com/about/locations.

Serverless and fully managed approach

Google provides a lot of fully managed and serverless services to allow its customers to focus on high-value activities rather than maintenance operations. A great example is BigQuery, the serverless data warehouse that will be introduced in the next section of this chapter.

Environmental sustainability

100% of the energy used for Google's data centers comes from renewable energy sources. Furthermore, Google has committed to being the first major company to operate carbon-free for all its operations, such as its data centers and campuses, by 2030.

Pervasive AI

Google is a pioneer of the AI industry and is leveraging AI and ML to improve its consumer products, such as Google Photos, but also to improve the performance and efficiency of its data centers. All of Google's expertise in terms of AI and ML can be leveraged by customers through adopting GCP services such as AutoML and BigQuery ML. That will be the main focus of this book.

Now that we have discussed some of the key elements of GCP as a service, let's look at AI and ML more specifically.

 

Exploring AI and ML services on GCP

Before we get too deep into our look at all of the AI and ML tools of GCP, it is very important to remember that Google is an AI company and embeds AI and ML features within many of its other products, providing the best user experience to its customers. Simply looking at Google's products, we can easily perceive how AI can be a key asset for a business. Some examples follow:

  • Gmail Smart Reply allows users to quickly reply to emails, providing meaningful suggestions according to the context of the conversation.
  • Google Maps is able to precisely predict our time of arrival when we move from one place to another by combining different data sources.
  • Google Translate provides translation services for more than one hundred languages.
  • YouTube and the Google Play Store are able to recommend the best video to watch or the most useful mobile application to install according to user preferences.
  • Google Photos recognizes people, animals, and places in our pictures, simplifying the job of archiving and organizing our photos.

Google proves that leveraging AI and ML capabilities in our business opens new opportunities for us, increases our revenue, saves money and time, and provides better experiences to our customers.

To better understand the richness of the GCP portfolio in terms of AI and ML services, it is important to emphasize that GCP services are able to address all the needs that emerge in a typical life cycle of an ML model:

  1. Ingestion and preparation of the datasets
  2. Building and training of the model
  3. Evaluation and validation
  4. Deployment
  5. Maintenance and further improvements of the model

In the following figure, you can see the entire AI and ML GCP portfolio:

Figure 1.3 – GCP AI and ML services represented by their icons

Figure 1.3 – GCP AI and ML services represented by their icons

Each one of the previously mentioned five stages can be fully managed by the user or delegated to the automation capabilities of GCP, according to the customer's needs and skills. For this reason, it is possible to divide the AI and ML services provided by GCP into three subcategories:

  • Core platform services
  • AI Applications
  • Solutions

For each of these subcategories, we'll go through the most important services currently available and some typical users that could benefit from them.

Core platform services

The core AI and ML services are the most granular items that a customer can use on GCP to develop AI and ML use cases. They provide the most control and flexibility to their users in exchange for less automation; users will also need to have more expertise in ML.

Processing units (CPU, GPU, and TPU)

With a traditional Infrastructure-as-a-Service (IaaS) approach, developers can equip their Google Compute Engine instances with powerful processing units to accelerate the training phases of ML models that might otherwise take a long time to run, particularly if complex contexts or large amounts of data need to be processed. Beyond the Central Processing Units (CPUs) that are available on our laptops, GCP offers the use of high-performance Graphical Processing Units (GPUs) made by Nvidia and available in the cloud to speed up computationally heavy jobs. Beyond that, there are Tensor Processing Units (TPUs), which are specifically designed to support ML workloads and perform matrix calculations.

Deep Learning VM Image

One of the biggest challenges for data scientists is quickly provisioning environments to develop their ML models. For this reason, Google Cloud provides pre-configured Google Compute Engine (GCE) images that can be easily provisioned with a pre-built set of components and libraries dedicated to ML.

In the following screenshot, you can see how these Virtual Machines (VMs) are presented in the GCP marketplace:

Figure 1.4 – Deep Learning VM in the GCP marketplace

Figure 1.4 – Deep Learning VM in the GCP marketplace

Deep Learning VM Image is also optimized for ML workloads and is already pre-configured to use GPUs. When a GCE image is provisioned from the GCP marketplace, it is already configured with the most common ML frameworks and programming languages, such as Python, TensorFlow, scikit-learn, and others. This allows data scientists to focus on the development of the model rather than on the provisioning and configuration of the development environment.

TensorFlow

TensorFlow is an open source framework for math, statistics, and ML. It was launched by Google Brain for internal use at Google and then released under the Apache License 2.0. It is still the core of the most successful Google products. The framework natively supports Python but can be used also with other programming languages such as Java, C++, and Go. It requires ML expertise, but it allows users to achieve great results in terms of customization and flexibility to develop the best ML model.

AI Platform

AI Platform is an integrated service of GCP that provides serverless tools to train, evaluate, deploy, and maintain ML models. With this service, data scientists are able to focus only on their code, simplifying all the side activities of ML development, such as provisioning, maintenance, and scalability.

AI Platform Notebooks

AI Platform Notebooks is a fully managed service that provides data scientists with a JupyterLab environment already integrated and connected with all other GCP resources. Similar to Deep Learning VM Image, AI Platform Notebooks instances come pre-configured with the latest versions of the AI and ML frameworks and allow you to develop an ML model with diagrams and written explanations.

All the services described so far require good knowledge of ML and proven experience in hand-coding with the most common programming languages. The core platform services address the needs of data scientists and ML engineers who need full control over and flexibility with the solutions that they're building and who already have strong technical skills.

Building blocks

On top of the core platform services, Google Cloud provides pre-built components that can be used to accelerate the development of new ML use cases. This category encompasses the following aspects:

AutoML

Unlike the services outlined in the previous section, AutoML offers the ability to build ML models even if you have limited expertise in the field. It leverages Google's ML capabilities and allows users to provide their data to train customized versions of algorithms already developed by Google. AutoML currently provides the ability to train models for images (AutoML Vision), video (AutoML Video Intelligence), free text (AutoML Natural Language), translation (AutoML Translation), and structured data (AutoML Tables). When the ML model is trained and ready to use, it is automatically deployed and made available through a REST endpoint.

Pre-built APIs

Google Cloud provides pre-built APIs that leverage ML technology under the surface but are already trained and ready to use. The APIs are exposed through a standard REST interface that can be easily integrated into applications to work with images (Vision API), videos (Video API), free text (Natural Language API), translations (Translation API), e-commerce data (Recommendations AI), and conversational scenarios (Speech-to-Text API, Text-to-Speech API, and Dialogflow). Using a pre-built ML API is the best choice for general-purpose applications where generic training datasets can be used.

BigQuery ML

As BigQuery ML will be discussed in detail in the following sections of this chapter, for the moment you just need to know that this component enables users to build ML models with SQL language, using structured data stored in BigQuery and a list of supported algorithms.

None of the building blocks described here requires any specific knowledge of ML or any proven coding experience with programming languages. In fact, these services are intended for developers or business analysts who are not very familiar with ML but want to start using it quickly and with little effort. On the other hand, a data scientist with ML expertise can also leverage the building blocks to accelerate the development of a model, reducing the time to market of a solution.

To see a summary of the building blocks, their usage, and their target users, let's take a look at the following table:

Figure 1.5 – Building blocks summary table

Figure 1.5 – Building blocks summary table

Now that we've learned the basics of building blocks, let's take a look at the solutions offered by GCP.

Solutions

Following the incremental approach, building blocks and core platform services are also bundled to provide out-of-the-box solutions. These pre-built modules can be adopted by companies and immediately used to improve their business. These solutions are covered in this section.

AI Hub

Google Cloud's AI Hub acts as a marketplace for AI components. It can be used in public mode to share and use assets developed by the community, which actively works on GCP, or it can be used privately to share ML assets inside your company. The goal of this service is to simplify the sharing of valuable assets across different users, favoring re-use and accelerating the deployment of new use cases.

In the following screenshot, you can see AI Hub's home page:

Figure 1.6 – Screenshot of AI Hub on GCP

Figure 1.6 – Screenshot of AI Hub on GCP

Now that we've understood the role of AI Hub, let's look at Cloud Talent Solution.

Cloud Talent Solution

Cloud Talent Solution is basically a solution for HR offices that improves the candidate discovery and hiring processes using AI. We will not go any further with the description of this solution, but there will be a link in the Further resources section at the end of this chapter.

Contact Center AI

Contact Center AI is a solution that can be used to improve the effectiveness of the customer experience with a contact center powered by AI and automation. The solution is based on Dialogflow and the Text-to-Speech and Speech-to-Text APIs.

Document AI

This solution is focused on document processing to extract relevant information and streamline business processes that usually require manual effort. The solution is able to parse PDF files, images, and handwritten text to convert this information into a digitally structured format, making them accessible and researchable.

As can be easily seen from their descriptions, the AI solutions provided by Google are more business-oriented and designed to solve specific challenges. They can be configured and customized but are basically dedicated to business users.

Before going on, let's take a look at the following table, which summarizes the concepts explained in this section and provides a clear overview of the different AI and ML service categories:

Figure 1.7 – Summary of GCP AI and ML services

Figure 1.7 – Summary of GCP AI and ML services

Tip

When you need to develop a new use case, we recommend using pre-built solutions and building blocks before trying to reinvent the wheel. If a building block already satisfies all the requirements of your use case, it can be extremely valuable to use it. It will save time and effort during the development and maintenance phases. Start considering the use of core services only if the use case is complex or so particular that it cannot be addressed with building blocks or solutions.

As we've seen in this section, GCP's AI and ML services are extensive. Now, let's take a closer look at the main topic of this book: Google BigQuery.

 

Introducing BigQuery

Google BigQuery is a highly scalable, serverless, distributed data warehouse technology built internally by Google in 2006 and then released for public use on GCP in 2010. Thanks to its architecture, it can store petabytes of data and query them with high performance and on-demand scaling. Due to its serverless nature, users who store and query data on BigQuery don't have to manage the underlying infrastructure and can focus on implementing the logic that brings the business value, saving time and resources.

BigQuery is currently used by many large enterprises that leverage it to make data-driven decisions, including Twitter, The Home Depot, and Dow Jones.

BigQuery architecture

BigQuery has a distributed architecture running on thousands of nodes across Google's data centers. Your datasets are not stored in a unique server but are chunked and replicated across different regions to guarantee maximum performance and availability.

The storage and compute layers are fully decoupled in BigQuery. This means that the query engine runs on different servers from the servers where the data is stored. This feature enables BigQuery to provide great scalability both in terms of data volume and query execution. This decoupled paradigm is only possible thanks to Google's Petabit network, which moves data very quickly from one server to another, leveraging Google's proprietary fiber cables across the globe.

Now let's look deeper into how BigQuery manages storage and the compute engine.

Storage layer

Unlike traditional data warehouses, BigQuery stores data in columnar format rather than in row format. This approach enables you to do the following:

  • Achieve a better compression ratio for each column, because the data in a column is typically homogeneous and simpler to compress.
  • Reduce the amount of data to read and get the best possible performance for data warehouse use cases that are usually based on a small selection of columns in a table and aggregating operations such as sums, average, and maximum.

All the data is stored in Google's proprietary distributed filesystem named Google File System (codename Colossus). The distribution of the data allows it to guarantee faster I/O performance and better availability of data in the case of failures. Google File System is based on two different server types:

  • Master servers: Nodes that don't store data but are responsible for managing the metadata of each file, such as the location and available number of replicas of each chunk that compose a file.
  • Chunk servers: Nodes that actually store the chunks of files that are replicated across different servers.

In the following diagram, you can see how Google File System manages data:

Figure 1.8 – Google File System (Colossus) storage strategy

Figure 1.8 – Google File System (Colossus) storage strategy

Now that we've learned how BigQuery handles large volumes of data, let's see how this data can be accessed by the compute layer.

Compute (query) layer

Fully decoupled from storage, the compute layer is responsible for receiving query statements from BigQuery users and executing them in the fastest way. The query engine is based on Dremel, a technology developed by Google and then published in a paper in 2010. This engine leverages a multi-level tree architecture:

  1. The root node of the tree receives the query to execute.
  2. The root node splits and distributes the query to other intermediate nodes named mixers.
  3. Mixer nodes have the task of rewriting queries before passing them to the leaf nodes or to other intermediate mixer nodes.
  4. Leaf nodes are responsible for parallelizing the reading of the chunks of data from Google File System.
  5. When the right chunks of data are extracted from the filesystem, leaf nodes perform computations on the data and eventually shuffle them across other leaf nodes.
  6. At the end of the computation, each leaf node produces a result that is returned to the parent node.
  7. When all the results are returned to the root node, the outcome of the query is sent to the user or application that requested the execution.

The execution process of a query on BigQuery based on the multi-level tree is represented in the following diagram:

Figure 1.9 – The BigQuery engine is a multi-level tree

Figure 1.9 – The BigQuery engine is a multi-level tree

Each node provides a number of processing units called BigQuery slots to execute the business logic of the query. A BigQuery slot can be considered a virtual CPU on a Dremel node. The calculation of the slots needed to perform a specific query is automatically managed by BigQuery depending on the complexity of the query and the impacted data volumes.

BigQuery's advantages over traditional data warehouses

Now that we've learned about the technical architecture underneath BigQuery, let's take a look at how this architecture translates into benefits for the enterprises that use it to become data-driven companies compared to other traditional on-premises data warehouses.

Serverless

As we have mentioned, BigQuery has a serverless architecture that saves users from having to manage not only the provisioning and maintenance of the servers but also all the maintenance operations related to the upgrading and patching of the operative system and of the database software that supports the functioning of the data warehouse. Thanks to the serverless approach, the user can easily start ingesting data into BigQuery and using it without having to perform capacity planning or any hardware and software provisioning upfront. This is particularly important for prototyping and to enable a fail-fast approach that favors a culture of innovation and experimentation.

Scalability

It doesn't matter if you need to store megabytes of data or petabytes; BigQuery can provide you with maximum flexibility and scalability in terms of both data storage and processing. Thanks to its multi-tenant architecture, a small or medium-sized business can leverage the same innovative capabilities as the largest enterprises, or they can start with a small use case to scale later, according to business needs. Traditional data warehouse technologies leverage the same servers to store and compute. For this reason, they are not particularly suitable for unbalanced use cases, such as when large volumes of data storage are needed but high computing performance is not required, or vice versa. Thanks to its decoupled architecture, as we've seen in the previous section, BigQuery is designed to independently scale storage and compute power according to the user's actual requirements, reducing the total cost of ownership of the solution.

Availability

Thanks to its resilient, distributed architecture, BigQuery is able to offer a Service Level Agreement (SLA) of monthly uptime percentage greater than 99.99%. This very high availability standard is granted by Google without any extra effort from the BigQuery users, who don't need to take care of high availability or disaster recovery strategies.

Performances

The BigQuery engine offers the ability to query terabytes of data in seconds and petabytes in minutes. This kind of performance is very hard to achieve with a traditional on-premises data warehouse. Higher performance means getting insights faster, as well as processing large volumes of data that very often would be impossible to manage on-premises without huge hardware and software investments. To further improve performance, BigQuery offers the possibility of enabling BigQuery BI Engine. BigQuery BI Engine is an in-memory analytics layer that can be activated on top of BigQuery to execute queries faster, with sub-second latencies.

Real-time

Traditional data warehouses are designed for long batch operations and are often unable to manage real-time workloads. BigQuery, however, provides a specific interface to ingest data in real time, making it immediately available for analysis. This feature opens up new possibilities to companies that want to accelerate their analytics and overcome the typical approach of data warehouses, which usually involves dealing with the business situation of the day before.

Format flexibility

BigQuery stores files in a compressed and optimized format in Google File System but provides the option to load data in various formats that are typically used in data lake technologies. Users can load data into BigQuery using Avro, ORC, CSV, and JSON formats.

Innovative capabilities

BigQuery offers two SQL extensions that are not available in traditional data warehouses:

  • BigQuery GIS: Provides the ability to easily manage geospatial data with SQL statements to execute geographic calculations and operations
  • BigQuery ML: Allows users to train, evaluate, and run ML models leveraging SQL language with no programming experience, accelerating the development of innovative use cases with ML

Security

By default, BigQuery automatically encrypts and decrypts customer's data before storing it in Google File System. BigQuery is also responsible for managing and rotating the encryption and decryption keys. To further improve security, BigQuery provides the option to use Customer-Managed Encryption Keys (CMEKs). In this case, the keys are managed directly by the customer in Google Cloud Key Management System.

Unauthorized access and use of data can be prevented by setting the right roles in Google Identity Access Management System (IAM).

Integration with other GCP services

One of the great benefits of using BigQuery is the native integration with a lot of other GCP services:

Figure 1.10 – BigQuery integration with other GCP services

Figure 1.10 – BigQuery integration with other GCP services

As we can see in the preceding screenshot, Google BigQuery can read data from Google Bigtable, Google Cloud Storage, Google Cloud SQL, and Google Drive using external tables. This feature can simplify data ingestion into BigQuery from other databases, which can also be performed using ETL/ELT tools such as Google Dataflow and Data Fusion. When the datasets are stored in BigQuery, they can be accessed from other GCP components, such as Google DataProc, DataPrep for data processing and preparation, Data Studio, Looker, and Google Sheets for data visualization. BigQuery is integrated with AI Platform Notebooks to allow data scientists and engineers to easily access data from their Jupyter environments.

To summarize, bringing data into BigQuery opens a vast plethora of options that can be used according to the user's needs.

Rich ecosystem of partners

Beyond the integration with other GCP services, Google's partners provide connectors and integrations with BigQuery, creating a rich data management ecosystem. Some examples include Informatica, Tableau, Fivetran, Talend, and Confluent.

Public datasets

If you want to start from scratch using BigQuery, you can leverage existing public tables that are available from the BigQuery Public Datasets Program. The program contains interesting datasets coming from different industries and different countries about different topics. Some examples that we'll use in the next chapter to train our ML models include transactions from bike sharing services, open data on New York City, and records of taxi trips.

Interacting with BigQuery

There are different ways to interact with BigQuery. They are:

  • The BigQuery web UI in the Google Cloud Console, the graphical user interface accessible from web browsers, represents the easiest way to interact with BigQuery.
  • The bq command line, available with the installation of the Google Cloud SDK. It can be used to automate jobs and commands by including them in scripts.
  • BigQuery REST APIs. The API layer natively provided by BigQuery can be used to integrate this service with other applications.
  • Client libraries to favor the use of the most common programming languages, such as C#, Go, Java, Node.js, PHP, Python, and Ruby.
  • JDBC/ODBC drivers, developed by Google's partner, Magnitude Simba Driver, are available for Windows, macOS, and Linux systems.
  • Third parties and Google's partners have developed BigQuery connectors for their applications, such as Tableau for business intelligence, Informatica, and Talend for data ingestion and integration.

For our purposes, we'll take a look at the BigQuery web UI available in Google Cloud Console that will be used in the coming chapters to develop with BigQuery ML.

In the following screenshot, you can see how the BigQuery UI appears in the GCP console:

Figure 1.11 – BigQuery web and graphical user interface available in the GCP console

Figure 1.11 – BigQuery web and graphical user interface available in the GCP console

In the left column, the main BigQuery features are available and the datasets are listed and visible to the user. In this case, there is only one dataset.

The remaining part of the screen is occupied by the development canvas with the results and outcomes at the bottom. We'll learn how to use the BigQuery web UI in Chapter 2, Setting Up Your GCP and BigQuery Environment, when we'll create a GCP project and start using BigQuery.

BigQuery data structures

BigQuery structures, such as tables, views, and ML models, are organized in datasets. Each dataset is a container for different structures and can be used to control access to underlying data structures. A dataset is directly linked to the following:

  • A GCP project that hosts the dataset itself and is usually linked to the billing account where the storage cost is billed
  • A geographic location (regional or multi-regional) that is defined at creation time and cannot be changed later
  • A specific name assigned to the dataset that should be unique in the GCP project

In the following diagram, you can see an example of a hierarchy composed of projects, datasets, tables, and BigQuery ML models:

Figure 1.12 – BigQuery hierarchy: datasets, tables, views, and ML models

Figure 1.12 – BigQuery hierarchy: datasets, tables, views, and ML models

A dataset hosted in GCP, Project A, can also be queried by users linked to another GCP project, Project B, if the right permissions are set. In this case, the storage costs are charged to Project A, which hosts the dataset structures, while the compute costs are billed on the billing account related to Project B. This is exactly what will happen when we use BigQuery public datasets in future chapters for hands-on exercises.

Tip

Remember, your queries can only include tables that reside in the same region. In BigQuery, you cannot query tables that are stored in different geographic locations. If you want to execute queries on tables located in different regions, you need to export and import the data into a dataset in the same region, passing through Google Cloud Storage.

Now that we've learned the main characteristics of BigQuery, let's focus more specifically on the core of this book: BigQuery ML.

 

Discovering BigQuery ML

Developing a new ML model can require a lot of effort and can be a time-consuming activity. It usually requires different skills and is a complex activity, especially in large enterprises. The typical journey of an ML model can be summarized with the following flow:

Figure 1.13 – An ML model's typical development life cycle

Figure 1.13 – An ML model's typical development life cycle

The first two steps involve preliminary raw data analyses and operations:

  1. In the Data Exploration and Understanding phase, the data engineer or data scientist takes a first look at the data, tries to understand the meaning of all the columns in the dataset, and then selects the fields to take into consideration for the new use case.
  2. During Data Preparation, the data engineer filters, aggregates, and cleans up the datasets, making them available and ready to use for the subsequent training phase.

After these two first stages, the actual ML developing process starts:

  1. Leveraging ML frameworks such as TensorFlow and programming languages such as Python, the data scientist will engage in the Design the ML model step, experimenting with different algorithms on the training dataset.
  2. When the right ML algorithm is selected, the data scientist performs the Tuning of the ML model step, applying feature engineering techniques and hyperparameter tuning to get better performance out of the ML model.
  3. When the model is ready, a final Evaluation step is executed on the evaluation dataset. This phase proves the effectiveness of the ML model on a new dataset that's different from the training one and eventually leads to further refinements of the asset.
  4. After the development process, the ML model is generally deployed and used in a production environment with scalability and robustness requirements.
  5. The ML model is also eventually updated in a subsequent stage due to different incoming data or to apply further improvements.

All of these steps require different skills and are based on the collaboration of different stakeholders, such as business analysts for data exploration and understanding, data engineers for data preparation, data scientists for the development of the ML model, and finally the IT department to make the model usable in a safe, robust, and scalable production environment.

BigQuery ML simplifies and accelerates the entire development process of a new ML model, allowing you to do the following:

  • Design, train, evaluate, and serve the ML model, leveraging SQL and the existing skills in your company.
  • Automate most of the tuning activities that are usually highly time-consuming to get an effective model.
  • Ensure that you have a robust, scalable, and easy-to-use ML model, leveraging all the native features of BigQuery that we've already discussed in the BigQuery's advantages over traditional data warehouses section of this chapter.

In the following diagram, you can see the life cycle of an ML model that uses BigQuery ML:

Figure 1.14 – An ML model's development life cycle with BigQuery ML

Figure 1.14 – An ML model's development life cycle with BigQuery ML

Now that we've learned the basics of BigQuery ML, let's take a look at the main benefits that it can bring.

BigQuery ML benefits

BigQuery ML can bring both business and technical benefits during the life cycle of an ML model:

  • Business users and data analysts can evolve from a traditional descriptive and reporting approach to a new predictive approach to take better decisions using their existing SQL skills. 
  • Technical users can benefit from the automation of BigQuery ML during the tuning phase of the model, using a unique, centralized tool that can accelerate the entire development process of an ML model.
  • The development process is further sped up because the datasets required to build the ML model are already available to the right users and don't need to be moved from one data repository to another, which carries compliance and data duplication risks.
  • The IT department does not need to manage the infrastructure to serve and use the ML model in a production environment because the BigQuery serverless architecture natively supports the model in a scalable, safe, and robust manner.

After our analysis of the benefits that BigQuery ML can bring, let's now see what the supported ML algorithms are.

BigQuery ML algorithms

The list of ML algorithms supported by BigQuery ML is growing quickly. Currently, the following supervised ML techniques are currently supported:

  • Linear regression: To forecast numerical values with a linear model
  • Binary logistic regression: For classification use cases when the choice is between only two different options (Yes or No, 1 or 0, True or False)
  • Multiclass logistic regression: For classification scenarios when the choice is between multiple options
  • Matrix factorization: For developing recommendation engines based on past information
  • Time series: To forecast business KPIs leveraging timeseries data from the past
  • Boosted tree: For classification and regression use cases with XGBoost
  • AutoML table: To leverage AutoML capabilities from the BigQuery SQL interface
  • Deep Neural Network (DNN): For developing TensorFlow models for classification or regression scenarios, avoiding any lines of code

When the training dataset doesn't contain labeled data, the learning is defined as unsupervised. BigQuery ML currently supports the following:

  • K-means clustering: For data segmentation of similar objects (people, facts, events)

In addition to what is listed, BigQuery ML allows you to import and use pre-trained TensorFlow models using SQL statements.

 

Understanding BigQuery pricing

In this section, the pricing model for BigQuery and BigQuery ML is explained. Since the pricing of GCP services is ever-evolving, we suggest that you consult https://cloud.google.com/bigquery/pricing to get the latest updates.

Let's look at the models for BigQuery.

BigQuery pricing

BigQuery pricing is scalable according to the use of this technology. There are three main cost drivers:

  • Storage: Price calculated on the volumes of data stored in BigQuery.
  • Compute: Resources used to query, transform, and process the data or to train, evaluate, and use ML models.
  • Streaming: Price calculated on the number of records that are ingested through the BigQuery streaming API.

Storage

BigQuery storage costs are calculated based on the uncompressed size of your datasets. BigQuery offers two layers of storage:

  • Active: All data stored in tables that have been inserted or updated in the last 90 days is considered active. At the time of writing, active storage is charged at $20.00 per terabyte per month.
  • Long-term: All data stored in tables that have not been modified in the last 90 days is considered long-term storage. At the time of writing, long-term storage is charged at $10.00 per terabyte per month. The change to long-term storage is automatically applied by Google and does not require any action by the BigQuery user.

    Tip

    Thanks to BigQuery long-term storage, it is no longer necessary to transfer archived data to Google Cloud Storage to save money. You can keep your data online and accessible at a very low cost.

Compute

BigQuery compute costs are calculated based on the volumes of data that are processed by the executed queries. The compute cost can vary according to the model that the customer has chosen:

  • On-demand: This is the default option. In this case, the user is charged only for the resources that are actually consumed.
  • Flat rate: This option can be enabled by users or companies that want to have a precise estimation of BigQuery costs and want to keep them stable over time. In this case, a reservation for a specific timeframe is needed and a fixed number of BigQuery slots are assigned to one or multiple GCP projects. As of October 2020, the minimum number of slots that can be reserved is 100 and the minimum time of commitment is 60 seconds. Currently, Google allows you to choose different flat rate options according to the actual requirements. This option can be enabled with a monthly, annual, or flex commitment. The flex option allows you to purchase BigQuery slots for short durations (a minimum of 60 seconds).

    Tip

    Keep in mind that you're not charged to store BigQuery public datasets: you pay only to access and query them. For this reason, using BigQuery public datasets can be a cost-effective way to perform tests on BigQuery, paying only for compute and not for storage.

Streaming

Loading data into BigQuery is usually free, apart from the ingestion processes that happen through the BigQuery streaming API. As of October 2020, you will be charged $0.010 for every 200 MB ingested with this interface. Each row is treated as a minimum of 1 KB.

Tip

If your use case doesn't require you to ingest data in real time, we suggest you use the bulk loading mechanism to ingest data into BigQuery, which is always free of charge.

BigQuery ML pricing

The pricing model of BigQuery ML is similar to that for BigQuery compute costs. As we saw in the previous section, customers can choose between the following options:

  • On-demand (pay-as-you-go) pricing model
  • Flat rate pricing model

If the customer has already chosen to activate flat rate mode with a fixed number of BigQuery slots available, BigQuery slots are also leveraged to train, evaluate, and run the BigQuery ML models.

If the customer is using the on-demand pricing model, it is necessary to split the BigQuery ML algorithms into two different categories:

  • External models: This category includes boosted trees, DNNs, TensorFlow, and AutoML. These models are trained on other GCP services integrated with BigQuery and are charged differently.
  • Internal models: This category includes all the remaining algorithms mentioned before that are trained directly on BigQuery.

At the time of writing, the pricing of internal models is based on the volumes of data processed during the main stages of the ML life cycle (training, evaluation, and prediction):

Figure 1.15 – BigQuery ML pricing for internal ML models

Figure 1.15 – BigQuery ML pricing for internal ML models

The pricing of external models is based on the cost of the external AI Platform resources used for the training of the model plus an additional BigQuery ML fee applied on top:

Figure 1.16 – BigQuery ML pricing for external ML models

Figure 1.16 – BigQuery ML pricing for external ML models

Prices are always under review and subject to change on GCP. For this reason, we suggest consulting https://cloud.google.com/bigquery-ml/pricing.

Free operations and free tiers

BigQuery offers a wide variety of operations free of charge, as well as free tiers to experiment with this technology at no cost.

The following operations are always free in BigQuery:

  • Loading data
  • Copying data (apart from the additional storage requested for the copy)
  • Exporting data
  • Deleting datasets, tables, views, partitions, or functions
  • Any metadata operations
  • Reading metadata tables and columns
  • Creating or replacing User-Defined Functions (UDFs)

To encourage experimentation with BigQuery, every month a user has the ability to leverage a free budget of operations under a certain threshold, as seen in the following table:

Figure 1.17 –  BigQuery ML free tiers

Figure 1.17 – BigQuery ML free tiers

Now that we've seen the BigQuery free tiers that we can use, let's take a look at the pricing calculator.

Pricing calculator

If you want to have a good estimation of the cost of using BigQuery with on-demand pricing, you can use the Google Cloud pricing calculator: https://cloud.google.com/products/calculator. The following screenshot shows the monthly cost of storing, ingesting through streaming, and processing the following data volumes:

  • Uncompressed storage volume: 10 TB
  • Volume of streaming inserts: 1 GB
  • Data processed by queries: 150 TB:
Figure 1.18 – BigQuery pricing calculator

Figure 1.18 – BigQuery pricing calculator

You can use the pricing calculator to estimate the consumption of all the other GCP services to get a better understanding of your GCP costs.

 

Summary

Throughout this first chapter, we've taken the first steps into learning what GCP offers, how it is different from other public cloud providers, and how Google is building on its ubiquitous applications such as Gmail and Google Maps to provide great services to companies via GCP.

We've also discovered that Google's proven experience in AI and ML, developed through the making of products such as Google Photos, also forms part of the services of GCP. Each AI and ML service can address various use cases and different types of users according to their skills and background. For example, most technical users, such as data scientists, can leverage TensorFlow to have great flexibility and control over their developed ML models, while business users can use Google's solutions to solve specific challenges with Document AI and Contact Center AI. The intermediate category is composed of AI and ML building blocks; these services can accelerate the development of new ML use cases or spread the usage of innovative techniques through a company.

One of these building blocks is BigQuery: its extension, BigQueryML, enables the development of ML models by leveraging existing SQL skills. The use of BigQuery ML can bring great benefits to companies that want to democratize ML, enabling a large segment of employees to participate by simplifying the heaviest and most time-consuming activities that usually require the involvement of different stakeholders, skills, and tools.

In the next chapter, we will get hands-on by creating a new Google Cloud project and accessing BigQuery for the first time.

 

About the Author

  • Alessandro Marrandino

    Alessandro Marrandino is a Google Cloud customer engineer. He helps various enterprises in the digital transformation journey through the adoption of cloud technologies. He is actively focused on and experienced in data management and smart analytics solutions. He has spent his entire career on data and artificial intelligence projects for global companies in different industries.

    Browse publications by this author
Book Title
Unlock this book and the full library for FREE
Start free trial