Getting started with BigQuery
What is BigQuery?
BigQuery is Google's fully managed solution of storing and analyzing big data at the scale of petabytes. One of the most remarkable features of BigQuery is that the underlying infrastructure is managed entirely by Google. BigQuery can be thought of as a scalable SQL server - we can fetch and manipulate data using SQL queries.
The goal of this guide is to get our hands dirty and familiarise ourselves with the BigQuery environment. We will create a dummy table and perform some simple queries against it in BigQuery.
Setting up GCP project and enabling the BigQuery API
We first need to go through the following steps before we can start using BigQuery:
create a GCP (Google Cloud Platform) project, which can be done here. You need to enable billing to start using BigQuery.
enable the BigQuery API, which can be done here.
BigQuery requires your GCP project to have enabled billing, and so you would need to supply credit card details. If you're a new user, then you get $300 USD worth of free credit that expires in 90 days. BigQuery does come with a generous free-tier so the operations that you perform in this guide should be covered by the free-tier.
GCP also provides the BigQuery sandbox, which is an easy way to try out BigQuery without supplying credit card details. However, the sandbox does not allow for Data manipulation language (DML) statements, that is, we cannot invoke statements such as
DELETE. For this guide, we will be using these DML statements, and so the sandbox does not meet our needs.
What are projects, datasets and tables in BigQuery?
BigQuery organises database objects as:
Projects > Datasets > Tables
Under a single GCP project, multiple datasets can be created, under which multiple tables can be created. Datasets can be thought of as schemas, and can be created using
CREATE SCEHMA. Also, tables belonging to different datasets can be still be joined.
Running our first BigQuery SQL script
Once we have set up our GCP project and enabled the BigQuery API, we will have access to the BigQuery SQL editor:
In this editor, we will be able to run SQL statements against our BigQuery database.
Let's try computing
SELECT 1 + 1;
Click on the RUN button and we should see the results panel pop up at the bottom:
🎉 Congratulations - you've just written your first BigQuery SQL script!
Creating a table using SQL in BigQuery
To create a table, we first need to create a dataset:
CREATE SCHEMA forum;
Here, we've created a dataset called
We also need to know our GCP project ID, which can be obtained by clicking on our project in the header:
We should see a list of projects and their IDs:
Copy the project ID because we need this information when creating tables.
To create an empty
CREATE TABLE IF NOT EXISTS `demobigquery-353007.forum.users` (name STRING NOT NULL,age INT,);
Here, note the following
demobigquery-353007is our project ID,
forumis the name of the dataset and
usersis the name of the table.
we must use
`instead of apostrophes to refer to table names that include the project ID.
Unlike traditional SQL systems like PostgreSQL, BigQuery has no notion of primary keys and indexes, and instead uses
columnnar as the storage method to greatly compress our data. However, similar to traditional SQL systems, the queries that we execute will be optimized automatically even without an index by BigQuery's internal query planner.
Inserting rows into a BigQuery table
To insert rows into the
INSERT INTO forum.users(name, age)VALUES('Alex', 20),('Bob', 30),('Cathy', 40);
Notice how we need to include the dataset name (
forum) when referring to the table.
Selecting rows from a BigQuery table
Now that we've inserted some rows into the
users table, let's use the
SELECT statement to print the rows of our table:
SELECT * FROM `forum.users`;Row name age1 Alex 202 Bob 303 Cathy 40
Notice how simple queries like this take a few seconds to complete. This is because BigQuery was designed to handle big data efficiently. While the processing time for small amounts of data may be slower in BigQuery than it is in standard SQL servers (e.g. PostgreSQL), BigQuery can process large datasets much faster.
Performing a simple query in BigQuery
To get all users whose
age is strictly greater than
SELECT * FROM forum.users WHERE age > 25;Row name age1 Cathy 402 Bob 30
Updating values in BigQuery
Updating values based on condition in BigQuery
By default, BigQuery only allows updating values with a
UPDATE forum.users SET name = upper(name) WHERE name = 'Alex';SELECT * FROM forum.users;Row name age1 ALEX 202 Cathy 403 Bob 30
The idea is that by having to add the
WHERE clause, we can avoid inadvertently updating all values.
Updating all values in BigQuery
To update all values, use
UPDATE forum.users SET name = upper(name) WHERE TRUE;SELECT * FROM forum.users;Row name age1 ALEX 202 CATHY 403 BOB 30
Deleting rows in BigQuery
Deleting rows based on condition in BigQuery table
To delete rows from a table, use the
DELETE forum.users WHERE age < 25;SELECT * FROM forum.users;Row name age1 Bob 302 Cathy 40
Note that the
DELETE statement requires the
WHERE clause just like for
Deleting all rows from BigQuery table
To delete all rows in a table, use the
TRUNCATE TABLE forum.users;
Note that the
forum.users table still exist and is an empty table with no zeros - meta information such as the column labels and types are not removed.
Dropping BigQuery table
To drop an entire table:
DROP TABLE forum.users;
DROP TABLE removes the table as well as its meta-information.
Deleting the GCP project
We highly recommend that you delete the GCP project once you're done experimenting to ensure that you will not be charged by the BigQuery service. To delete this GCP project, click on Cloud overview in the left side bar, and then click on Dashboard:
Next click on Go to project settings:
Finally, click on SHUT DOWN and follow the displayed steps: