This chapter explores advanced options in BigQuery SQL. It explains partition tables in BigQuery and how to query data from partition tables. Sharding of tables is explained as an option to store data across multiple tables to save on billing. Built-in functions are explained for various categories such as datetime, strings, numbers, and so on.
You're reading from Learning Google BigQuery
Partition tables
Partition tables are special tables that store data at a daily level in separate internal tables. This helps to improve the query performance and also reduces billing by querying data using a specified date range. The following steps outline how to create the partition table for your projects using a GUI and Google Cloud SDK.
Creating a partition table using a GUI
Download the sample file from this URL and upload it to a Google Cloud Storage bucket: https://github.com/hthirukkumaran/Learning-Google-BigQuery/blob/master/chapter1/employeedetails.csv. And note down the bucket name.
- Click on the Create new table option under the Dataset menu.
- To create a partition table, enable the partition option by choosing...
Querying external data sources using BigQuery
BigQuery supports querying and joining of data from external data sources on the Google Cloud Platform. The following are the data sources you can query from BigQuery:
- Google Cloud Storage files
- Google Bigtable database
- Google Drive files
The following demo shows how to query a CSV file in the Google Cloud Storage bucket using the BigQuery engine. The performance of the queries against external data sources is not as good as BigQuery data, and so it should be used with caution. The following are the steps to integrate Google Cloud Storage bucket files for querying:
- Create a table definition file for the file in the Google Cloud Storage bucket
- Link the data source as a table in the BigQuery dataset
- Query the table in the BigQuery dataset
Wildcard tables
Wildcard is a way of performing a union on tables whose names are similar and have compatible schemas. The following queries show how to perform wildcard operations on tables in the public dataset bigquery-public-data:new_york provided by Google.
The following query gets the number of trips per year made by a yellow taxi in New York. The query uses UNION ALL on all tables that start with the name tlc_yellow_trips_. If a new table is added for 2017, this query has to be modified to include that table as well. To automatically include tables having similar names in the query, wildcard table syntax can be used. This query uses standard SQL:
#standardSQL
SELECT MAX(EXTRACT(YEAR from pickup_datetime)) as TripYear, count(1) as TripCount FROM `bigquery-public-data.new_york.tlc_yellow_trips_2009`
UNION ALL
SELECT MAX(EXTRACT(YEAR from pickup_datetime)) as TripYear...
User-defined functions
User-defined functions can be written in JavaScript or SQL in BigQuery. These functions can be called in queries to obtain results. The following are the supported datatypes that can be passed to and returned by the functions:
- ARRAY
- BOOL
- BYTES
- DATE
- FLOAT64
- STRING
- STRUCT
- TIMESTAMP
The following is a simple function written in JavaScript to return the sum of two numbers, and it is used in the query. This query passes the tip_amount and tolls_amount values for each row from the table to the function and gets the sum:
#standardSQL
CREATE TEMPORARY FUNCTION GetOtherCharges(tipamount FLOAT64, tollsamount FLOAT64)
RETURNS INT64
LANGUAGE js AS """
return tipamount + tollsamount;
""";
SELECT vendor_id, GetOtherCharges( tip_amount, tolls_amount )
FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
Views
BigQuery supports creating views, but they are not materialized views and the underlying query for a view is executed each time someone runs a query on the view. A view can be defined using legacy SQL or standard SQL, but the limitation is that if a view is defined in legacy SQL, then the queries executed using that view must also be in legacy SQL. The same applies to views that are defined using standard SQL; they can be used only in standard SQL statements. User-defined functions cannot be used in the query to define the views.
The BigQuery web console provides an option to save a query as a view, as shown in the following screenshot. Click on the Save View button as shown in this screenshot and choose the dataset under which the view has to be saved; provide a view name and save it:
To change the view definition, navigate to the view in the BigQuery web console and open...
Querying nested and repeated records
Google BigQuery supports loading of JSON files into BigQuery tables. JSON format data can contain nested datatypes and repeated datatypes. The example table shown in the following screenshot has an Employee_Names column as RECORD datatype. Each record in that column has two columns, one to store the first name and one to store the last name. Create the table as shown in this screenshot:
Download the following file to load to this new table. The file is a JSON file that contains the records to be loaded into this table from JSON format: https://github.com/hthirukkumaran/Learning-Google-BigQuery/blob/master/chapter1/employeedetails.json.
Upload the file to your Google Cloud Storage bucket using the gsutil command as shown here:
gsutil cp employeedetails.json gs://myfirstprojectbucket201706/employeedetails.json
Run the following command...
Summary
This chapter covered the practical use of partition tables, wildcard tables, nested and repeated records, and views. This chapter also covered how to define and use user-defined functions in JavaScript and SQL. Then we covered how to connect to external data sources and query them using the BigQuery engine. More and more federated data sources will be added to this list and you will learn how to connect to Bigtable and Google Drive files using the documents provided in the further reading section.
Further reading
- This document outlines the steps to create authorized views in your projects: https://cloud.google.com/bigquery/docs/views
- This document outlines how to specify nested and repeated fields when creating the tables: https://cloud.google.com/bigquery/docs/nested-repeated
- An overview of partition tables and best practices are outlined here: https://cloud.google.com/bigquery/docs/partitioned-tables
- This document outlines how to query wildcard tables using standard SQL: https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference
- How to query nested and repeated fields using legacy SQL: https://cloud.google.com/bigquery/docs/legacy-nested-repeated
- How to migrate from legacy SQL to standard SQL: https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
- How to connect to external data sources: https://cloud.google...