BigQuery Best Practices: All in one place.
3 min readJan 15, 2022
Curated best practices for big query
I am sharing my notes on Bigquery across the following dimensions.
- Cost control
- Security
- Performance
- Extensibility
- Design time considerations
- Partitioning
Cost Control :
- It has 2 factors to consider while optimizing cost. Data storage cost and retrieval cost. ( applies to standard and Partition tables)
- Partitions with cold data would go to cold storage if not accessed >90 days ( almost 50% storage cost reduction ). So, Plan your partitions accordingly.
- When querying make the partition field part of the query filter so that you don’t end up scanning all data.
- When you partition by time — after the expiration date of a partition is passed, BigQuery deletes all data in this particular partition. The ideal use case for log analytics. Additionally, when creating a table you can also set expiration time at a table level too.
- Choose the best cost model — if you are planning to use it extensively may be a flat rate model would work ( for that you need $10,000 min monthly cost. Note, with annual commitments it would come down.). You will get 500 slots. You may have to compromise on performance as the queries will be queued.
- The on-demand model, demands more caution with the queries we write. Hence make sure we retrieve only the data we need.
- Creating materialized views will help reduce costs. Create a daily aggregate table. Read from that aggregate table instead of the original table. This will reduce data retrieval costs.
- Try a — dry-run feature to estimate the cost when in doubt.
- Use the maximum bytes billed setting to limit query costs
- Use streaming inserts only if that data is needed in real-time.
Security :
- Execution Permissions and query permission can be provided fine-grained: You can give hierarchical permission too.
- You can create rowAccessPoliy to create row-level access.
- For column-level access, you need to create taxonomy and policy tags appropriately.
- Supports both google-managed keys and customer keys for encryption
- Use principle of lease privilege — provide access to only the data that is required to be shared.
- Create service accounts for each application that wants to access the data.
- Provide group-level access instead of individual access.
Performance:
- Use clustered tables to improve query performance for specific filters ( note it does not support cost guarantees )
- Use materialized view for better performance
- When joining tables start with the largest tables.
- UDF would impact performance
- External data source impacts performance
Extensibility :
- BigQuery supports querying Avro, Parquet, ORC, JSON, and CSV partitioned data that are hosted on Google Cloud Storage using a default hive partitioning layout.
- Also supports — Bigtable, GCS, and cloud SQL
- Service accounts that need to access external repository — Gdrive you need to enable domain-wide delegation.
- Use external data for the following
- If the data is frequently changing in the external source.
- Use Temp external source when loading data into DW. It cannot be shared.
- With permanent external tables BQ can be used to control access
Partition in Big Query :
- 3 ways — Ingestion time, Time, integer
- If we want to partition by ingestion time we must also provide the partitioning ingestion time granularity (hour, day, month, year).
- If we choose to partition the table based on an integer range and column (e.g., customer_id), we must also provide the start value, end value, and the interval.
- if we want to partition the table based on the date_registered field, we must also provide the respective granularity (day, month, year).
- You can use partitioning and clustering together.
- Pruning is a process when executing a query in a petitioned table BQ will scan the partitions that match the filter clause.
Design time considerations:
- Choose the dataset location close to or in the same region
- Column data types cannot be changed after the creation
- When using jobs to load data they can be on different projects but should be in the same location.
- Note that there are several constraints with views like
Underlaying tables should be in the same region
UDFs are not allowed in Views
No wildcard table reference
- Create UDFs when needed to embed business rules and data consistency
- Create views specific to the use case and provide appropriate read permission on tables behind them.