BigQuery Best Practices: All in one place.

Perumal Babu
3 min read · Jan 15, 2022

Curated best practices for BigQuery.


I am sharing my notes on BigQuery across the following dimensions.

  • Cost control
  • Security
  • Performance
  • Extensibility
  • Design time considerations
  • Partitioning

Cost Control:

  • There are two factors to consider when optimizing cost: storage cost and query (data retrieval) cost. (This applies to both standard and partitioned tables.)
  • A partition that is not modified for 90 consecutive days automatically moves to long-term storage (roughly a 50% storage cost reduction), so plan your partitions accordingly.
  • When querying, make the partition column part of the query filter so that you don't end up scanning all the data.
  • When you partition by time, you can set a partition expiration: once a partition's expiration passes, BigQuery deletes all data in that partition. This is ideal for log analytics. Additionally, you can set an expiration at the table level when creating a table.
  • Choose the best cost model: if you plan to use BigQuery extensively, the flat-rate model may work (it requires roughly a $10,000 minimum monthly commitment, which buys 500 slots; annual commitments bring the price down). You may have to compromise on performance, as queries are queued when all slots are busy.
  • The on-demand model demands more caution with the queries you write, so make sure you retrieve only the data you need.
  • Materialized views help reduce cost. For example, create a daily aggregate table and read from it instead of the original table; this reduces the bytes scanned per query.
  • Use the dry-run feature to estimate the bytes a query will scan when in doubt.
  • Use the maximum bytes billed setting to cap query costs; a query that would exceed the cap fails without incurring charges.
  • Use streaming inserts only if the data is needed in real time; batch loading is free, while streaming is billed.
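Several of the points above can be sketched in BigQuery DDL/SQL. The dataset, table, and column names below are hypothetical:

```sql
-- Hypothetical log table: time-partitioned with a partition expiration,
-- so old partitions are deleted automatically (the log-analytics use case).
CREATE TABLE my_dataset.app_logs (
  event_time TIMESTAMP,
  message STRING
)
PARTITION BY DATE(event_time)
OPTIONS (
  partition_expiration_days = 90,  -- partitions older than 90 days are dropped
  require_partition_filter = TRUE  -- forces queries to filter on the partition column
);

-- Filtering on the partition column prunes partitions and limits bytes scanned.
SELECT COUNT(*)
FROM my_dataset.app_logs
WHERE DATE(event_time) = '2022-01-14';
```

At query time, the `bq` CLI's `--dry_run` and `--maximum_bytes_billed` flags complement this: the first reports the bytes a query would scan without running it, the second aborts a query that would exceed the cap.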

Security:

  • Execution and query permissions can be granted at a fine-grained level, and permissions are also inherited hierarchically (organization, project, dataset, table).
  • You can create row access policies for row-level security.
  • For column-level access control, create a taxonomy and apply policy tags to the appropriate columns.
  • Supports both Google-managed and customer-managed keys for encryption.
  • Use the principle of least privilege: provide access only to the data that needs to be shared.
  • Create a service account for each application that needs to access the data.
  • Provide group-level access instead of individual access.
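Row-level security can be sketched with a row access policy. The table, policy name, group, and filter column here are hypothetical:

```sql
-- Hypothetical row-level security: members of the US sales group
-- only see rows where region = 'US' when querying this table.
CREATE ROW ACCESS POLICY us_sales_only
ON my_dataset.orders
GRANT TO ('group:us-sales@example.com')
FILTER USING (region = 'US');
```

Granting the policy to a group rather than individual users also follows the group-level access recommendation above.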

Performance:

  • Use clustered tables to improve query performance for specific filters (note that, unlike partitioning, clustering does not provide an exact cost estimate before the query runs).
  • Use materialized views for better performance.
  • When joining tables, start with the largest table.
  • UDFs can impact performance.
  • External data sources impact performance; native tables are generally faster.
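A clustered table can be combined with partitioning, as in this hypothetical sketch (table and column names assumed):

```sql
-- Hypothetical partitioned + clustered table: queries that filter on
-- customer_id read fewer storage blocks, though BigQuery cannot
-- guarantee the scanned-bytes estimate up front for clustering.
CREATE TABLE my_dataset.orders_clustered
PARTITION BY DATE(order_date)
CLUSTER BY customer_id
AS
SELECT *
FROM my_dataset.orders;
```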

Extensibility:

  • BigQuery supports querying Avro, Parquet, ORC, JSON, and CSV data hosted on Google Cloud Storage, including data laid out in the default hive partitioning scheme.
  • It also supports external sources such as Bigtable, GCS, and Cloud SQL.
  • For service accounts that need to access an external repository such as Google Drive, you need to enable domain-wide delegation.
  • Use external data sources when the data changes frequently at the source.
  • Use a temporary external table when loading data into the warehouse; temporary external tables cannot be shared.
  • With permanent external tables, BigQuery access controls can be used to manage who can query the data.
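A permanent external table over hive-partitioned files in GCS can be sketched as follows (bucket path and table name are hypothetical):

```sql
-- Hypothetical permanent external table over hive-partitioned
-- Parquet files in GCS, e.g. gs://my-bucket/events/dt=2022-01-14/...
CREATE EXTERNAL TABLE my_dataset.events_ext
WITH PARTITION COLUMNS  -- infer partition columns from the hive layout
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/events'
);
```

Because this table is permanent, access to it can be managed with the usual BigQuery dataset/table permissions.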

Partitioning in BigQuery:

  • There are three ways to partition: ingestion time, a time/date column, or an integer range.
  • If we partition by ingestion time, we must also provide the partitioning granularity (hour, day, month, or year).
  • If we partition based on an integer-range column (e.g., customer_id), we must also provide the start value, end value, and interval.
  • If we partition based on a date column such as date_registered, we must also provide the granularity (day, month, or year).
  • You can use partitioning and clustering together.
  • Partition pruning: when executing a query on a partitioned table, BigQuery scans only the partitions that match the filter clause.
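The integer-range and date-column variants can be sketched in DDL; the ranges, table, and column names below are hypothetical:

```sql
-- Hypothetical integer-range partitioning on customer_id:
-- start value, end value, and interval are all required.
CREATE TABLE my_dataset.customers (
  customer_id INT64,
  date_registered DATE
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000));

-- Hypothetical date-column partitioning at monthly granularity,
-- combined with clustering on the same table.
CREATE TABLE my_dataset.signups (
  customer_id INT64,
  date_registered DATE
)
PARTITION BY DATE_TRUNC(date_registered, MONTH)
CLUSTER BY customer_id;
```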

Design time considerations:

  • Choose a dataset location close to, or in the same region as, the services and users that read the data.
  • Column data types cannot be changed after table creation.
  • Load jobs can run from a different project, but the source and destination must be in the same location.
  • Note that there are several constraints with views, for example:

Underlying tables should be in the same region as the view.

UDFs are not allowed in views.

No wildcard table references.

  • Create UDFs when needed to embed business rules and enforce data consistency.
  • Create views specific to each use case, and grant appropriate read permissions on the tables behind them.
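The last two points can be sketched as follows; the function, view, and table names are hypothetical:

```sql
-- Hypothetical SQL UDF embedding a business rule (email validity),
-- so the rule is defined once and reused across queries.
CREATE FUNCTION my_dataset.is_valid_email(email STRING)
RETURNS BOOL
AS (REGEXP_CONTAINS(email, r'^[^@]+@[^@]+\.[^@]+$'));

-- Hypothetical use-case-specific view; consumers can be granted read
-- access to the view rather than to the full base table.
CREATE VIEW my_dataset.active_customers AS
SELECT customer_id, email
FROM my_dataset.customers
WHERE status = 'active';
```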
