BigQuery Best Practices: All in one place.

3 min readJan 15, 2022

Curated best practices for big query

I am sharing my notes on Bigquery across the following dimensions.

Cost Control :

It has 2 factors to consider while optimizing cost. Data storage cost and retrieval cost. ( applies to standard and Partition tables)
Partitions with cold data would go to cold storage if not accessed >90 days ( almost 50% storage cost reduction ). So, Plan your partitions accordingly.
When querying make the partition field part of the query filter so that you don’t end up scanning all data.
When you partition by time — after the expiration date of a partition is passed, BigQuery deletes all data in this particular partition. The ideal use case for log analytics. Additionally, when creating a table you can also set expiration time at a table level too.
Choose the best cost model — if you are planning to use it extensively may be a flat rate model would work ( for that you need $10,000 min monthly cost. Note, with annual commitments it would come down.). You will get 500 slots. You may have to compromise on performance as the queries will be queued.
The on-demand model, demands more caution with the queries we write. Hence make sure we retrieve only the data we need.
Creating materialized views will help reduce costs. Create a daily aggregate table. Read from that aggregate table instead of the original table. This will reduce data retrieval costs.
Try a — dry-run feature to estimate the cost when in doubt.
Use the maximum bytes billed setting to limit query costs
Use streaming inserts only if that data is needed in real-time.

Security :

Execution Permissions and query permission can be provided fine-grained: You can give hierarchical permission too.
You can create rowAccessPoliy to create row-level access.
For column-level access, you need to create taxonomy and policy tags appropriately.
Supports both google-managed keys and customer keys for encryption
Use principle of lease privilege — provide access to only the data that is required to be shared.
Create service accounts for each application that wants to access the data.
Provide group-level access instead of individual access.

Performance:

Use clustered tables to improve query performance for specific filters ( note it does not support cost guarantees )
Use materialized view for better performance
When joining tables start with the largest tables.
UDF would impact performance
External data source impacts performance

Extensibility :

BigQuery supports querying Avro, Parquet, ORC, JSON, and CSV partitioned data that are hosted on Google Cloud Storage using a default hive partitioning layout.
Also supports — Bigtable, GCS, and cloud SQL
Service accounts that need to access external repository — Gdrive you need to enable domain-wide delegation.
Use external data for the following
If the data is frequently changing in the external source.
Use Temp external source when loading data into DW. It cannot be shared.
With permanent external tables BQ can be used to control access

Partition in Big Query :

3 ways — Ingestion time, Time, integer
If we want to partition by ingestion time we must also provide the partitioning ingestion time granularity (hour, day, month, year).
If we choose to partition the table based on an integer range and column (e.g., customer_id), we must also provide the start value, end value, and the interval.
if we want to partition the table based on the date_registered field, we must also provide the respective granularity (day, month, year).
You can use partitioning and clustering together.
Pruning is a process when executing a query in a petitioned table BQ will scan the partitions that match the filter clause.

Design time considerations:

Choose the dataset location close to or in the same region
Column data types cannot be changed after the creation
When using jobs to load data they can be on different projects but should be in the same location.
Note that there are several constraints with views like

Underlaying tables should be in the same region

UDFs are not allowed in Views

No wildcard table reference

Create UDFs when needed to embed business rules and data consistency
Create views specific to the use case and provide appropriate read permission on tables behind them.

Written by Perumal Babu