Reader small image

You're reading from  Learning Google BigQuery

Product typeBook
Published inDec 2017
Reading LevelBeginner
PublisherPackt
ISBN-139781787288591
Edition1st Edition
Languages
Right arrow
Authors (3):
Thirukkumaran Haridass
Thirukkumaran Haridass
author image
Thirukkumaran Haridass

Thirukkumaran Haridass currently works as a lead software engineer at Builder Homesite Inc. in Austin, Texas, USA. He has over 15 years of experience in the IT industry. He has been working on the Google Cloud Platform for more than 3 years. Haridass is responsible for the big data initiatives in his organization that help the company and its customers realize the value of their data. He has played various roles in the IT industry and worked for Fortune 500 companies in various verticals, such as retail, e-commerce, banking, automotive, and presently, real estate online marketing.
Read more about Thirukkumaran Haridass

Eric Brown
Eric Brown
author image
Eric Brown

Eric Brown currently works as an analytics manager for PMG advertising in Austin, Texas. Eric has over 11 years of experience in the data analytics field. He has been working on the Google Cloud Platform for over 3 years. He oversees client web analytics implementations and implements big data integrations in both Google BigQuery and Amazon Redshift. Eric has a passion for analytics, and especially for visualization and data manipulation through open source tools such as R. He has worked in various roles in various verticals, such as web analytics service providers, media companies, real-estate online marketing, and advertising.
Read more about Eric Brown

View More author details
Right arrow

BigQuery SQL Advanced

This chapter explores advanced options in BigQuery SQL. It explains partition tables in BigQuery and how to query data from partition tables. Sharding of tables is explained as an option to store data across multiple tables to save on billing. Built-in functions are explained for various categories such as datetime, strings, numbers, and so on.

Partition tables

Partition tables are special tables that store data at a daily level in separate internal tables. This helps to improve the query performance and also reduces billing by querying data using a specified date range. The following steps outline how to create the partition table for your projects using a GUI and Google Cloud SDK.

Creating a partition table using a GUI

Querying external data sources using BigQuery

BigQuery supports querying and joining of data from external data sources on the Google Cloud Platform. The following are the data sources you can query from BigQuery:

  • Google Cloud Storage files
  • Google Bigtable database
  • Google Drive files

The following demo shows how to query a CSV file in the Google Cloud Storage bucket using the BigQuery engine. The performance of the queries against external data sources is not as good as BigQuery data, and so it should be used with caution. The following are the steps to integrate Google Cloud Storage bucket files for querying:

  1. Create a table definition file for the file in the Google Cloud Storage bucket
  2. Link the data source as a table in the BigQuery dataset
  3. Query the table in the BigQuery dataset
...

Wildcard tables

Wildcard is a way of performing a union on tables whose names are similar and have compatible schemas. The following queries show how to perform wildcard operations on tables in the public dataset bigquery-public-data:new_york provided by Google.

The following query gets the number of trips per year made by a yellow taxi in New York. The query uses UNION ALL on all tables that start with the name tlc_yellow_trips_. If a new table is added for 2017, this query has to be modified to include that table as well. To automatically include tables having similar names in the query, wildcard table syntax can be used. This query uses standard SQL:

#standardSQL
SELECT MAX(EXTRACT(YEAR from pickup_datetime)) as TripYear, count(1) as TripCount FROM `bigquery-public-data.new_york.tlc_yellow_trips_2009`
UNION ALL
SELECT MAX(EXTRACT(YEAR from pickup_datetime)) as TripYear...

User-defined functions

User-defined functions can be written in JavaScript or SQL in BigQuery. These functions can be called in queries to obtain results. The following are the supported datatypes that can be passed to and returned by the functions:

  • ARRAY
  • BOOL
  • BYTES
  • DATE
  • FLOAT64
  • STRING
  • STRUCT
  • TIMESTAMP

The following is a simple function written in JavaScript to return the sum of two numbers, and it is used in the query. This query passes the tip_amount and tolls_amount values for each row from the table to the function and gets the sum:

#standardSQL
CREATE TEMPORARY FUNCTION GetOtherCharges(tipamount FLOAT64, tollsamount FLOAT64)
RETURNS INT64
LANGUAGE js AS """
return tipamount + tollsamount;
""";

SELECT vendor_id, GetOtherCharges( tip_amount, tolls_amount )
FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
Custom external JavaScript libraries...

Views

BigQuery supports creating views, but they are not materialized views and the underlying query for a view is executed each time someone runs a query on the view. A view can be defined using legacy SQL or standard SQL, but the limitation is that if a view is defined in legacy SQL, then the queries executed using that view must also be in legacy SQL. The same applies to views that are defined using standard SQL; they can be used only in standard SQL statements. User-defined functions cannot be used in the query to define the views.

The BigQuery web console provides an option to save a query as a view, as shown in the following screenshot. Click on the Save View button as shown in this screenshot and choose the dataset under which the view has to be saved; provide a view name and save it:

To change the view definition, navigate to the view in the BigQuery web console and open...

Querying nested and repeated records

Google BigQuery supports loading of JSON files into BigQuery tables. JSON format data can contain nested datatypes and repeated datatypes. The example table shown in the following screenshot has an Employee_Names column as RECORD datatype. Each record in that column has two columns, one to store the first name and one to store the last name. Create the table as shown in this screenshot:

Download the following file to load to this new table. The file is a JSON file that contains the records to be loaded into this table from JSON format: https://github.com/hthirukkumaran/Learning-Google-BigQuery/blob/master/chapter1/employeedetails.json.

Upload the file to your Google Cloud Storage bucket using the gsutil command as shown here:

gsutil cp employeedetails.json gs://myfirstprojectbucket201706/employeedetails.json

Run the following command...

Summary

This chapter covered the practical use of partition tables, wildcard tables, nested and repeated records, and views. This chapter also covered how to define and use user-defined functions in JavaScript and SQL. Then we covered how to connect to external data sources and query them using the BigQuery engine. More and more federated data sources will be added to this list and you will learn how to connect to Bigtable and Google Drive files using the documents provided in the further reading section.

Further reading

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Learning Google BigQuery
Published in: Dec 2017Publisher: PacktISBN-13: 9781787288591
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (3)

author image
Thirukkumaran Haridass

Thirukkumaran Haridass currently works as a lead software engineer at Builder Homesite Inc. in Austin, Texas, USA. He has over 15 years of experience in the IT industry. He has been working on the Google Cloud Platform for more than 3 years. Haridass is responsible for the big data initiatives in his organization that help the company and its customers realize the value of their data. He has played various roles in the IT industry and worked for Fortune 500 companies in various verticals, such as retail, e-commerce, banking, automotive, and presently, real estate online marketing.
Read more about Thirukkumaran Haridass

author image
Eric Brown

Eric Brown currently works as an analytics manager for PMG advertising in Austin, Texas. Eric has over 11 years of experience in the data analytics field. He has been working on the Google Cloud Platform for over 3 years. He oversees client web analytics implementations and implements big data integrations in both Google BigQuery and Amazon Redshift. Eric has a passion for analytics, and especially for visualization and data manipulation through open source tools such as R. He has worked in various roles in various verticals, such as web analytics service providers, media companies, real-estate online marketing, and advertising.
Read more about Eric Brown