BigQuery Best Practices: All in one place.

Perumal Babu
3 min read · Jan 15, 2022


Curated best practices for BigQuery.


I am sharing my notes on BigQuery across the following dimensions:

  • Cost control
  • Security
  • Performance
  • Extensibility
  • Design time considerations
  • Partitioning

Cost Control:

  • There are two cost factors to consider: data storage cost and data retrieval (query) cost. This applies to both standard and partitioned tables.
  • Partitions that have not been modified for 90 consecutive days move to long-term storage pricing (roughly a 50% storage cost reduction), so plan your partitions accordingly.
  • When querying, make the partitioning column part of the query filter so that you don’t end up scanning all partitions.
  • When you partition by time, BigQuery deletes all data in a partition once that partition’s expiration time has passed. This is the ideal behavior for log analytics. You can also set an expiration time at the table level when creating a table.
  • Choose the best cost model: if you plan to use BigQuery extensively, the flat-rate model may work better. It requires a minimum of $10,000 per month, which buys 500 slots (annual commitments bring the price down). You may have to compromise on performance, since queries queue when all slots are busy.
  • The on-demand model demands more caution with the queries we write, so make sure each query retrieves only the data it needs.
  • Materialized views help reduce costs. For example, create a daily aggregate table and read from it instead of the original table; this reduces the bytes scanned per query.
  • Use the dry-run feature (for example, the bq CLI’s --dry_run flag) to estimate a query’s cost when in doubt.
  • Use the maximum bytes billed setting to limit query costs.
  • Use streaming inserts only if the data is needed in real time.
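As a sketch of the partition-expiration and partition-filter points above, the DDL and query might look like this (the dataset, table, and column names are hypothetical):

```sql
-- Time-partitioned log table where each partition is deleted after 90 days.
CREATE TABLE mydataset.app_logs (
  event_ts TIMESTAMP,
  message STRING
)
PARTITION BY DATE(event_ts)
OPTIONS (
  partition_expiration_days = 90  -- partitions older than 90 days are dropped
);

-- Filter on the partitioning column so only one partition is scanned,
-- not the whole table:
SELECT message
FROM mydataset.app_logs
WHERE DATE(event_ts) = '2022-01-14';
```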

Security:

  • Execution and query permissions can be granted at a fine-grained level, and permissions can also be inherited hierarchically (organization, project, dataset).
  • You can create row access policies for row-level security.
  • For column-level access, you need to create a taxonomy and apply policy tags appropriately.
  • Supports both Google-managed keys and customer-managed keys for encryption.
  • Use the principle of least privilege: provide access to only the data that is required to be shared.
  • Create a service account for each application that needs to access the data.
  • Provide group-level access instead of individual access.
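A row access policy like the one mentioned above could be sketched as follows (the table, policy, group, and region values are all illustrative):

```sql
-- Restrict mydataset.sales so EMEA analysts see only EMEA rows.
CREATE ROW ACCESS POLICY emea_only
ON mydataset.sales
GRANT TO ('group:emea-analysts@example.com')  -- group-level, not individual
FILTER USING (region = 'EMEA');
```

Note that this also follows the group-level access recommendation: the policy is granted to a group rather than to individual users.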

Performance:

  • Use clustered tables to improve performance for queries that filter on the clustering columns (note that, unlike partitioning alone, BigQuery cannot guarantee the cost estimate up front for clustered tables).
  • Use materialized views for better performance.
  • When joining tables, start with the largest table.
  • UDFs can impact performance.
  • Querying an external data source is slower than querying a native table.
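Combining clustering with partitioning, as suggested above, might look like this (names are hypothetical; cluster on the columns you filter by most often):

```sql
-- Partition by day, then cluster within each partition by the
-- columns most commonly used in filters.
CREATE TABLE mydataset.orders (
  order_ts    TIMESTAMP,
  customer_id INT64,
  status      STRING,
  amount      NUMERIC
)
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id, status;
```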

Extensibility:

  • BigQuery supports querying Avro, Parquet, ORC, JSON, and CSV partitioned data hosted on Google Cloud Storage using the default hive partitioning layout.
  • It also supports external sources such as Bigtable, Google Cloud Storage, and Cloud SQL.
  • For service accounts that need to access an external repository such as Google Drive, you need to enable domain-wide delegation.
  • Prefer external data sources when the data changes frequently in the external source.
  • Use a temporary external table for one-off loads into the warehouse; note that a temporary external table cannot be shared.
  • With permanent external tables, BigQuery can be used to control access.
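A permanent external table over hive-partitioned files in Cloud Storage, as described above, could be declared like this (the bucket path and table name are assumptions for illustration):

```sql
-- External table over hive-partitioned Parquet files, e.g. laid out as
-- gs://my-bucket/events/dt=2022-01-14/part-000.parquet
CREATE EXTERNAL TABLE mydataset.events_ext
WITH PARTITION COLUMNS  -- infer partition columns (dt) from the layout
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/events'
);
```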

Partitioning in BigQuery:

  • There are three ways to partition: by ingestion time, by a time-unit column, or by an integer range.
  • If we partition by ingestion time, we must also provide the partitioning granularity (hour, day, month, or year).
  • If we partition the table by an integer range on a column (e.g., customer_id), we must also provide the start value, end value, and interval.
  • If we partition the table by a date column such as date_registered, we must also provide the granularity (day, month, or year).
  • You can use partitioning and clustering together.
  • Pruning: when executing a query on a partitioned table, BigQuery scans only the partitions that match the filter clause.
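The three partitioning styles above can be sketched with one DDL statement each (dataset, table, and column names are hypothetical):

```sql
-- 1) Ingestion time, daily granularity:
CREATE TABLE mydataset.t_ingest (payload STRING)
PARTITION BY _PARTITIONDATE;

-- 2) Time-unit column (date_registered), monthly granularity:
CREATE TABLE mydataset.t_time (date_registered DATE, payload STRING)
PARTITION BY DATE_TRUNC(date_registered, MONTH);

-- 3) Integer range on customer_id: start, end, and interval are required:
CREATE TABLE mydataset.t_range (customer_id INT64, payload STRING)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100000, 1000));
```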

Design time considerations:

  • Choose a dataset location close to, or in the same region as, your data and its consumers.
  • Column data types cannot be changed after table creation.
  • When using jobs to load data, the source and destination can be in different projects but must be in the same location.
  • Note that views have several constraints:

      • Underlying tables must be in the same region.
      • UDFs are not allowed in views.
      • Wildcard table references are not allowed.

  • Create UDFs when needed to embed business rules and enforce data consistency.
  • Create views specific to each use case and grant appropriate read permissions on the tables behind them.
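Following the advice above, a use-case-specific view might look like this (names are illustrative; per the view constraints listed earlier, keep it in the same region as its underlying tables and avoid UDFs and wildcard references inside it):

```sql
-- Expose only the columns this use case needs, backed by read
-- permission on the underlying table rather than broad access.
CREATE VIEW mydataset.active_customers AS
SELECT customer_id, signup_date
FROM mydataset.customers
WHERE status = 'active';
```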
