The goal of this introductory chapter is to give you an overview of the Google Cloud Platform (GCP). We start by explaining why machine learning (ML) and cloud computing go hand in hand as today's ML applications demand ever more computing resources. We then proceed with a 360° presentation of the platform's data-related services. Account and project creation as well as role allocation close the chapter.
A data science project follows a regular set of steps: extracting the data, exploring and cleaning it, extracting information, training and assessing models, and finally building machine-learning-enabled applications. For each step of the data science flow, the GCP offers one or several suitable services.
But, before we present the overall mapping of the GCP data-related services, it is important to understand why ML and cloud computing are truly made for each other.
In this chapter, we will cover the following topics:
- ML and the cloud
- Introducing the GCP
- Data services of the Google platform
In short, artificial intelligence (AI) requires a lot of computing resources. Cloud computing addresses that need.
ML is a new type of microscope and telescope, allowing each of us to push the boundaries of human knowledge and human activities. With ever more powerful ML platforms and open tools, we are able to conquer new realms of knowledge and grow new types of businesses. From the comfort of our laptops, at home, or at the office, we can better understand and predict human behavior in a wide range of domains. Think health care, transportation, energy, financial markets, human communication, human-machine interaction, social network dynamics, economic behavior, and nature (astronomy, global warming, or seismic activity). The list of domains affected by the explosion of AI is truly unlimited. The impact on society? Astounding.
With so many resources available to anyone with an online connection, the barrier to joining the AI revolution has never been lower than it is now. Books, tutorials, MOOCs, and meet-ups, as well as open source libraries in a myriad of languages, are freely available to both the seasoned and the beginner data scientist.
As veteran data scientists know well, data science is always hungry for more computational resources. Classification on the Iris or the MNIST image datasets or predictive modeling on Titanic passengers does not reflect real-world data. Real-world data is by nature dirty, incomplete, noisy, multi-sourced, and more often than not, in large volumes. Exploiting these large datasets requires computational power, storage, CPUs, GPUs, and fast I/O.
However, more powerful machines are not sufficient to build meaningful ML applications. Grounded in science, data science requires a scientific mindset with concepts such as reproducibility and reviewing. Both aspects are made easier by working with online accessible resources. Sharing datasets and models and exposing results is always more difficult when the data lives on one person's computer. Reproducing results and maintaining models with new data also requires easy accessibility to assets. And as we work on ever more personalized and critical data (for instance in healthcare), privacy and security concerns become all the more important to the project stakeholders.
This is where the cloud comes in, by offering scalability and accessibility while providing an adequate level of security.
Before diving into GCP, let's learn a bit more about the cloud.
ML projects are resource intensive. From storage to computational power, training models sometimes require resources that cannot be found on a simple standalone computer. Physical limitations in terms of storage have shrunk in recent years. As we now enjoy reliable terabyte storage accessible at reduced prices, storage is no longer an issue for most data projects that are not in the realm of big data. Computing power has also increased so much that what required expensive workstations a few years ago can now run on laptops.
However, despite all this amazingly rapid evolution, the power of the standalone PC is finite. There is an upper limit to the volume of data you can store on your machine and to the time you're willing to wait to get your model trained. New frontiers in AI, with speech-to-text, video captioning in real time, self-driving cars, music generation, or chatbots that can fool a human being and pass the Turing test, require ever larger resources. This is especially true of deep learning models, which are too slow on standard CPUs and require GPU-based machines to train in a reasonable amount of time.
ML in the cloud does not face these limitations. What you get with cloud computing is direct access to high-performance computing (HPC). Before the cloud (roughly before AWS launched its Elastic Compute Cloud (EC2) service in 2006), HPC was only available via supercomputers, such as the Cray computers. Cray is a US company that has built some of the most powerful supercomputers since the 1970s. China's Tianhe-2 is now the most powerful supercomputer in the world, with a capacity of 100 petaflops (that's 10^2 x 10^15, or 10 to the power of 17 floating-point operations per second!).
A supercomputer not only costs millions of US dollars but also requires its own physical infrastructure and has huge maintenance costs. It is also out of reach for individuals and for most companies. Engineers and researchers, hungry for HPC, now turn to on-demand cloud infrastructures. Cloud service offers are democratizing access to HPC.
Computing in the cloud is built on a distributed architecture. The processors are distributed across different servers instead of being aggregated in one single machine. With a few clicks or command lines, anyone can set up massively complex banks of servers in a matter of minutes. The amount of power at your command can be mind-blowing.
Cloud computing can not only handle the most demanding optimization tasks but also carry out a simple regression on a tiny dataset. Cloud computing is extremely flexible.
To recap, cloud computing offers:
- Instantaneity: Resources can be made available in a matter of minutes.
- On-demand: Instances can be put on standby or decommissioned when no longer needed.
- Diversity: The wide range of operating systems, storage, and database solutions allows the architect to create project-focused architectures, from simple mobile applications to ML APIs.
- Unlimited resources: If not infinite yet, the volume of storage, computing, and networking resources you can assemble is mind-blowing.
- GPUs: Most PCs are based on CPUs (with the exception of machines optimized for gaming). Deep learning requires GPUs to achieve human-compatible speeds for training models. Cloud computing makes GPUs available at a fraction of the cost needed to buy GPU machines.
- Controlled accessibility and security: With granular role definitions, service compartmentalization, encrypted connections, and user-based access control, cloud platforms greatly reduce the risk of intrusion and data loss.
Apart from these, there are several other types of cloud platforms and offers on the market.
There are two main types of cloud models depending on the needs of the customers: public versus private and multi-tenant versus single-tenant. These different cloud types offer different levels of management, security, and pricing.
A public cloud consists of resources that are located off-site and accessed over the internet. In a public cloud, the infrastructure is typically multi-tenant. Multiple customers can share the same underlying hardware or server. Resources such as networking, storage, power, cooling, and computing are all shared. The customer usually has no visibility of where this infrastructure is hosted except for choosing a geographic region. The pricing model of a public cloud service is based on the volume of data, the computing power that is used, and other infrastructure-management-related services (more precisely, a mix of RAM, vCPUs, disk, and bandwidth).
In a private cloud, the resources are dedicated to a single customer; the architecture is single-tenant instead of multi-tenant. The servers are located on premises or in a remote data center. Customers own (or rent) the infrastructure and are responsible for maintaining it. Private cloud infrastructures are more expensive to operate as they require dedicated hardware to be secured for a single tenant. Customers of the private cloud have more control over their infrastructure, which makes it easier for them to meet their compliance and security requirements.
Hybrid clouds are composed of a mix of public clouds and private ones.
The GCP is a public multi-tenant cloud platform. You share the servers you use with other customers and let Google handle the support, the data centers, and the infrastructure.
The cloud market has also diversified into two large segments—managed cloud versus unmanaged cloud.
In an unmanaged cloud platform, the infrastructure is self-served. In case of failure, it is the responsibility of the customer to have some mechanisms in place to restore the operations. Unmanaged cloud requires the customer to have the qualified expertise and resources to build, manage, and maintain cloud instances and infrastructures. Focused on self-serving applications, unmanaged cloud offers do not include support with their basic tiers.
In a managed cloud platform, the provider will support the underlying infrastructure by offering monitoring, troubleshooting, and around-the-clock customer service. Managed cloud brings along qualified expertise and resources to the team right away. For many companies, having a service provider to handle their public cloud can be easier and more cost-effective than hiring their own staff to operate their clouds.
The GCP is a public, multi-tenant, and unmanaged cloud service. So are AWS and Azure. Rackspace, on the other hand, is an example of a managed cloud service company. For instance, Rackspace started offering managed services for GCP in March 2017.
Another important distinction is to be made with respect to the amount of work done by the user or by the cloud platform provider. Let us take a look at this distinction with the help of the following service levels:
- Infrastructure as a Service (IaaS): At the minimum level, IaaS, the cloud provider handles the machines, their virtualization, and the required networking. The user is responsible for everything else: OS, middleware, data, and application software. The provider is the host of the resources on which the user builds the infrastructure. Google Compute Engine, SQL, DNS, or load balancing are examples of IaaS services within the GCP.
- Platform as a Service (PaaS): In a PaaS offering, the user is only responsible for the software and the data. Everything else is handled by the cloud provider. The provider builds the infrastructure while the user deploys the software. The main advantage of PaaS over IaaS, besides the reduced workload and need for sysadmin resources, is the automatic scaling for web applications. The appropriate amount of resources is automatically allocated as demand fluctuates. Examples of PaaS services include Heroku or the Google App Engine.
- Software as a Service (SaaS): In SaaS, the provider is a software company offering services online while the user consumes the services that are provided. Think Uber, Facebook, or Gmail.
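The split of responsibilities across the three service levels can be summarized as a small lookup table. The following is only an illustrative sketch: the layer names and the `responsibilities` helper are assumptions made for this example, and real offerings draw the line between provider and user less cleanly.

```python
# Layers of a typical stack, from the network up to the application.
# The layer names are illustrative labels, not official GCP terminology.
LAYERS = [
    "networking", "machines", "virtualization",
    "os", "middleware", "data", "application",
]

# Layers the *user* is responsible for at each service level,
# following the descriptions in the list above.
USER_MANAGED = {
    "IaaS": {"os", "middleware", "data", "application"},
    "PaaS": {"data", "application"},
    "SaaS": set(),  # simplification: the user only consumes the service
}


def responsibilities(level):
    """Return (provider_layers, user_layers) for a service level, in stack order."""
    user = USER_MANAGED[level]
    provider_layers = [layer for layer in LAYERS if layer not in user]
    user_layers = [layer for layer in LAYERS if layer in user]
    return provider_layers, user_layers


for level in ("IaaS", "PaaS", "SaaS"):
    provider_layers, user_layers = responsibilities(level)
    print(f"{level}: provider handles {provider_layers}; user handles {user_layers}")
```

Reading the output from IaaS to SaaS shows the user's share of the stack shrinking as the provider takes on more of the work.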
While being mostly an IaaS provider, the GCP also has some PaaS offerings such as the Google App Engine. And its ML APIs (text, speech, video, and image) can be considered as SaaS.
Pricing of cloud services is complicated and varies across vendors. The basic cost structure of a cloud service can be broken down into:
- Computing costs: The duration of running VMs per number of vCPUs, per GB of RAM
- Storage costs: Disks, files, and databases per GB
- Networking costs: Internal and external, inbound and outbound traffic
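To make this breakdown concrete, here is a minimal sketch of a monthly cost estimate. All unit prices below are hypothetical placeholders chosen for the arithmetic, not actual GCP rates, and the `Rates` and `monthly_cost` names are invented for this example. Only outbound traffic is charged here, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices in USD; these are NOT actual GCP prices."""
    per_vcpu_hour: float
    per_gb_ram_hour: float
    per_gb_storage_month: float
    per_gb_egress: float


def monthly_cost(rates, vcpus, ram_gb, hours, storage_gb, egress_gb):
    """Split a monthly bill into compute, storage, and network components."""
    # Compute: running time multiplied by the per-vCPU and per-GB-of-RAM rates.
    compute = hours * (vcpus * rates.per_vcpu_hour + ram_gb * rates.per_gb_ram_hour)
    # Storage: billed per GB stored over the month.
    storage = storage_gb * rates.per_gb_storage_month
    # Networking: only outbound traffic is charged in this simplified model.
    network = egress_gb * rates.per_gb_egress
    return {
        "compute": compute,
        "storage": storage,
        "network": network,
        "total": compute + storage + network,
    }


# Placeholder rates and usage, for illustration only.
example_rates = Rates(per_vcpu_hour=0.03, per_gb_ram_hour=0.004,
                      per_gb_storage_month=0.04, per_gb_egress=0.10)
bill = monthly_cost(example_rates, vcpus=4, ram_gb=16, hours=100,
                    storage_gb=200, egress_gb=50)
for item, amount in bill.items():
    print(f"{item}: ${amount:.2f}")
```

Real bills include further line items (licenses, load balancers, managed services), so this is best read as a way to reason about the three main cost drivers rather than as a pricing calculator.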
Google's preemptible VMs (comparable to AWS Spot Instances) are VMs that are built on leftover, unused capacity and priced three to four times lower than normal on-demand VMs. However, Compute Engine may terminate (preempt) these instances if it requires access to those resources for other tasks. Preemptible instances are suited to batch processing jobs or workflows that can withstand sudden interruptions. They may also not always be available. In the next chapter, we learn how to launch preemptible instances from the command line.
Google Cloud also recently introduced discounts for committed use. You get a discount when you reserve instances for a long period of time, typically committing to a usage term of 1 year or 3 years.
The argument of cost cutting when moving to the cloud holds when your infrastructure is evolving quickly and requires scalability and rapid modifications. If your applications are very static with stable load, the cloud may not result in lower costs. In the end, as the cloud offers much more flexibility and opens the way to implementing new projects quickly, the overall cost may well be higher than with a fixed infrastructure. But this flexibility is the true benefit of cloud computing.
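The trade-off between a lower price and possible interruptions can be sketched with a little arithmetic. The function below is a rough model, not Google's pricing formula: it takes the "three to four times lower" figure from the text at face value, and `rerun_overhead` is a hypothetical parameter standing for the extra fraction of hours spent redoing work lost to preemption.

```python
def preemptible_cost_range(on_demand_hourly, hours, rerun_overhead=0.0):
    """Estimate the (low, high) cost of a job run on preemptible VMs.

    Assumes the preemptible price is the on-demand price divided by 4 (low)
    or by 3 (high). `rerun_overhead` is a hypothetical fraction of additional
    hours needed to redo work lost when instances are preempted.
    """
    if rerun_overhead < 0:
        raise ValueError("rerun_overhead must be non-negative")
    billed_hours = hours * (1 + rerun_overhead)
    low = on_demand_hourly / 4 * billed_hours
    high = on_demand_hourly / 3 * billed_hours
    return low, high


# Placeholder numbers: a 10-hour job at a made-up $0.20/hour on-demand rate,
# with 20% of the work redone because of interruptions.
on_demand_total = 0.20 * 10
low, high = preemptible_cost_range(0.20, 10, rerun_overhead=0.2)
print(f"on-demand: ${on_demand_total:.2f}, preemptible: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, the preemptible option stays cheaper than on-demand as long as the rerun overhead remains below 200% of the original running time, which is why interruption-tolerant batch jobs are a natural fit.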
See https://cloud.google.com/compute/pricing for the current Google Compute Engine pricing.
The costs of cloud services have dwindled in the past several years. The three major public cloud actors have gone through successive phases of price reduction since 2012, when AWS drastically reduced its storage prices to undermine the competition. The three main cloud providers reduced their prices 22 times in 2012 and 26 times in 2013. Reductions ranged from 6% to 30% and touched all types of services: computing, storage, bandwidth, and databases. As of January 2014, Amazon had reduced the price of its offerings over 40 times. These reductions have been matched or exceeded by the other main cloud service providers. Recently, the three main actors have further reduced their prices on storage, possibly reigniting the price war. According to a recent study of cloud computing prices by 451 Research, there isn't much data suggesting that cloud is anywhere near a commodity yet. The same study predicts that relational databases are likely to be the next price war battleground.
So, near-instant availability, low cost, flexible architecture, and near-unlimited resources are the advantages of cloud computing, at the expense of extra overhead and recurring costs.
In the global landscape of cloud computing, the GCP is a public unmanaged IaaS cloud offering, with some PaaS and SaaS services. Although AWS, Azure, and GCP are directly comparable for standard cloud services such as computing (EC2, Compute Engine, and so on), databases (BigQuery, Redshift, and so on), networking, and so forth, the Google Cloud approach to ML is quite different from Amazon's or Azure's.
In short, AWS offers either all-in-one services for very specific applications (face recognition and Alexa-related applications) or a predictive analytics platform based on classic (not deep learning) models called Amazon ML. Microsoft's offer is more PaaS centered, with its Cortana Intelligence Suite. Microsoft's ML service is quite similar to AWS's, with more available models.
The GCP ML offer is based on TensorFlow, Google's deep learning library. Google offers a wide range of ML APIs based on pre-trained TensorFlow models for NLP, speech-to-text, translation, image, and video processing. It also offers a platform where you can train your own TensorFlow models and evaluate them (TensorBoard).
The first cloud computing service dates back to July 2002, when Amazon launched the AWS platform to expose technology and product data from Amazon and its affiliates, enabling developers to build innovative and entrepreneurial applications on their own. In 2006, AWS was relaunched with new services, including EC2.
The early start of AWS gave Amazon a lead in cloud computing, one that has never faltered since. Competitors were slow to respond and launch their own offers. The first alternative to the AWS cloud services from a major company came with the Google App Engine, launched in April 2008 as a PaaS service for developing and hosting web applications. The GCP was thus born. IBM and Microsoft followed, with LotusLive launched in January 2009 and the Windows Azure platform in February 2010.
Google didn’t enter the IaaS market until much later. In 2013, Google released the Compute Engine to the general public with enterprise service-level agreements (SLA).
With over 40 different IaaS, PaaS, and SaaS services, the GCP ecosystem is rich and complex. These services can be grouped into several categories, including:
- Hosting and computation
- Storage and databases
- Identity and security
- Resource management and monitoring
In the next chapter, we learn how to set up and manage a single VM instance on Google Compute Engine. But, before that, we need to create our account.
Getting started on the GCP is pretty straightforward. All you really need is a Google account. Go to https://cloud.google.com/, log in with your Google account, and follow the instructions. Add your billing information as needed. This gives you access to the web-based UI of the GCP. We'll cover command line and shell accessibility and related SSH key creation in the next chapter.
At the time of writing this, Google has a pretty generous free trial offer with a 12-month period and a credit of $300 for new accounts. There are, however, limitations on some services. For instance, you cannot launch Google Compute Engine VM instances with more than eight CPUs and you are limited in the number of projects you can create, though you can request more than your allocated quota. There is no SLA. Using Google Cloud services for activities such as bitcoin mining is not allowed. Once you upgrade your account, these limitations no longer apply and whatever remains of the initial $300 is credited to your account. More information on the free trial offer is available at https://cloud.google.com/free/docs/frequently-asked-questions.
One key aspect of the GCP is its project-centered organization. All billing, permissions, resources, and settings are grouped within a user-defined project, which basically acts as a global namespace. It is simply not possible to launch a resource without specifying the project it belongs to first.
Each one of these projects has:
- A project name, which you choose.
- A project ID, suggested by GCP but editable. The project ID is used by API calls and resources within the project.
- A project number, which is provided by the GCP.
Both the project ID and the project number are unique across all GCP projects. The project organization has several straightforward benefits:
- As resources are dedicated to a single project, budget allocation and billing are simplified
- As the resources allocated to a project are subject to the same regions-and-zones rules and share the same metadata, operations and communications between them work seamlessly
- Similarly, access management is coherent across a single project, limiting the overall complexity of access control
Project-based organization greatly simplifies the management of your resources and is a key aspect of what makes the GCP quite easy to work with.
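The three identifiers that every project carries can be modeled with a small data structure. The sketch below is illustrative only: the `Project` class and the `is_valid_project_id` helper are invented for this example, the project number is a made-up placeholder, and the ID format rule (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen) is stated here as an assumption to be checked against the current GCP documentation.

```python
import re
from dataclasses import dataclass

# Assumed format rule for project IDs: 6 to 30 characters, lowercase letters,
# digits, and hyphens only, starting with a letter, no trailing hyphen.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


@dataclass(frozen=True)
class Project:
    name: str            # chosen by the user
    project_id: str      # suggested by GCP but editable; used in API calls
    project_number: int  # assigned by GCP


def is_valid_project_id(project_id):
    """Check a candidate ID against the assumed format rule (not uniqueness)."""
    return PROJECT_ID_PATTERN.fullmatch(project_id) is not None


# The project number below is a hypothetical placeholder.
project = Project(name="packt-gcp", project_id="packt-gcp", project_number=123456789012)
print(project.project_id, is_valid_project_id(project.project_id))
for candidate in ("Packt-GCP", "gcp", "packt-gcp-"):
    print(candidate, is_valid_project_id(candidate))
```

A local check like this only covers the format. Uniqueness across all GCP projects can only be confirmed by the platform itself when the project is created.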
To create a new project:
- Go to the resource management page, https://console.cloud.google.com/cloud-resource-manager.
- Click on CREATE PROJECT.
- Enter your project name and notice how Google generates a project ID on the fly. Edit it as needed.
- Click on Create.
- You are redirected to the Role section of the IAM service.
By default, when you create a new project, your Google account is set as the owner of the project with full permissions and access across all the project's resources and billing. In the roles section of the IAM page, https://console.cloud.google.com/iam-admin/roles/, you can add people to your project and define the role for that person. You can also create new custom roles on a service-by-service basis or allocate predefined roles organized by the services.
- Go to the IAM page and select the project you just created, if it's not already selected: https://console.cloud.google.com/iam-admin/iam/project. You should see your Google account email as the owner of the project.
- To add a new person to the project:
- Click on + ADD.
- Input the person's Google account email (it has to correspond to an active Google account).
- Select all the roles for that person, as shown in the following screenshot:
The role menu is organized by services and administrative domain (billing, logging, and monitoring), and for each service, by level of access. Although this differs depending on the service, you can roughly choose between four types of roles:
- Admin: Full control over the resources
- Client: Connectivity access
- Editor/creator: Full control except for user management, SSL certificates, and deleting instances
- Viewer: Read-only access
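Behind the web UI, granting someone a role amounts to adding a member to a binding in the project's IAM policy. The following sketch manipulates such a policy as a plain Python dictionary, without calling any Google API. The `add_member` helper and the email addresses are hypothetical; the `roles/owner` and `roles/viewer` identifiers and the `user:` member prefix follow the IAM policy format as I understand it.

```python
import copy


def add_member(policy, role, member):
    """Return a copy of `policy` in which `member` is granted `role`."""
    updated = copy.deepcopy(policy)  # leave the caller's policy untouched
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            # The role already has a binding: add the member if missing.
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    # No binding for this role yet: create one.
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical starting point: the project creator is the sole owner.
policy = {
    "bindings": [
        {"role": "roles/owner", "members": ["user:owner@example.com"]},
    ]
}

# Grant read-only access to a (hypothetical) new team member.
policy = add_member(policy, "roles/viewer", "user:analyst@example.com")
for binding in policy["bindings"]:
    print(binding["role"], binding["members"])
```

Working on a copy rather than mutating the policy in place keeps the original available for comparison, which is convenient when reviewing access changes before applying them.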
You can also create new custom roles from the roles IAM page at https://console.cloud.google.com/iam-admin/roles/project?project=packt-gcp.
As you allocate new resources to your project, the platform creates the adequate and required roles and permissions between the services. You can view and manage these access permissions and associated roles from the info panel on the right of the manage resource page or the IAM page for the given project. Google does a great job of generating the right access levels, which makes the platform-user's life easier.
For this book, I created the packt-gcp project. Since the name was unique across all other GCP projects, the project ID is also packt-gcp. All the resources are created in the us-central1 region.
Throughout the book, I will conclude each chapter with a list of online resources that recap or go beyond what was discussed in the chapter:
- Many excellent articles on using the GCP for big data can be found on the Google big data blog at https://cloud.google.com/blog/big-data/.
- What are the GCP services? Reto Meier, software engineer at Google, describes the different Google Cloud services in a simple way (for more information, see https://hackernoon.com/what-are-the-google-cloud-platform-gcp-services-285f1988957a). This is very useful for grasping the diversity of the GCP services.
- An Annotated History of Google’s Cloud Platform is another post by Reto Meier on the history of the GCP. You can find it at: https://medium.com/@retomeier/an-annotated-history-of-googles-cloud-platform-90b90f948920. It starts with the bullet point: Pre 2008 — Computers invented. Google Founded.... A much more detailed timeline of the GCP is available on Crunchbase at https://www.crunchbase.com/organization/google-cloud-platform/timeline#/timeline/index.
- A chart of the evolution of computing power, as described by Moore's law, is available at http://www.cs.columbia.edu/~sedwards/classes/2012/3827-spring/advanced-arch-2011.pdf, and a more recent version where the seven most recent data points are all NVIDIA GPUs is available at https://en.wikipedia.org/wiki/Moore%27s_law#/media/File:Moore%27s_Law_over_120_Years.png.
- For more on the pricing war of the three main cloud platforms, see this article: Cloud Pricing Trends: Get the White Paper, Rightscale, 2013, at https://www.rightscale.com/lp/cloud-pricing-trends-white-paper.
- A good article on Supercomputing vs. Cloud Computing by David Stepania can be found at https://www.linkedin.com/pulse/supercomputing-vs-cloud-computing-david-stepania/.
In this introductory chapter, we looked at the nature of the GCP and explored its services architecture. We created a new project and learned how roles are created and allocated. Although a relatively new entrant on the cloud computing market, the GCP offers a complete set of services for a wide range of applications. We study these services in depth in the rest of this book.
We are now ready to get started with data science on the Google platform. In the next chapter, we'll create a VM instance on Google Compute Engine and install a data science Python stack with the Anaconda distribution. We'll explore the web UI and learn how to manage instances through the command line and the Google Cloud Shell.