Google’s BigQuery is a great fit as Marketing Analytics Data Warehouse. Google Analytics 360 uses BigQuery. If you are using free Google Analytics and Google Ads, you can also import their data into BigQuery via Google App Script or Ads Script. External data can be imported to enhance this data and this sets no limits in building own machine learning models. Models include forecasting, regression, classification, clustering, recommendation systems and more.
Google ML enables you to apply machine learning to BigQuery data by using standard SQL syntax. This is groundbreaking for Marketing Analytics, since it increases model development speed by:
- less programming need => no more Python code,
- enabling to broader ML practitioners => standard SQL is enough to apply ML,
- quick model building and availability => no more ETL jobs to move data.
Operationalization of the models with “Scheduled queries”: Since September 2018 Google allows you to schedule queries. Now it is possible to schedule model deployment (let’s say monthly) and schedule predictions (let’s say daily) save the prediction results automatically in a table. You have to use partitioned tables if you want to use Google BigQuery Machine Learning (ML) by scheduling queries and append them into an existing table.
How BigQuery ML works
1. Create Model
Create a simple model with existing customers. As an example, let’s use some simple values to predict the sale of a new customer.
Schedule the query to run every month to update the model regularly.
2. Predict the future event
Predict the sale of new customers. New customers can be from yesterday, last hour, etc.
Schedule the query and save results in a destination table.
3. Save predictions by scheduling the predictions query
You can schedule the prediction query, daily, hourly, weekly, monthly, custom and on demand.
Destination table can be a partitioned table, partitioned by field DATE.
Machine Learning with BigQuery in the past
If you were using Google Cloud for Machine Learning in the past, you needed to write Python models in Google Functions and trigger the functions by new files in Google Cloud Storage. The file transfers from BigQuery to Cloud Storage were steered by Cloud Scheduler and Cloud Pub/Sub and the daily predictions transfers back to BigQuery were done with additional Python Code, since BigQuery transfers were not appending data into existing tables.
This overhead on programmatic architecture is now obsolete. Historical data to build the models are in BigQuery, ML is done within BigQuery and the predictions are available in BigQuery without the need of additional engineering work.
Comparison partitioned tables vs. fragmented tables
BigQuery was append-only by design. Deleting single rows was not possible in the past. Recommendation was to use daily fragmented tables. Most of data import in business are on daily basis, so this action solved the majority of the problems. By doing daily import to fragmented tables, you could easily replace one day as a standalone table by deleting the wrong day. This was handy to solve errors in daily import jobs. You did not have to delete everything from huge datasets and recreate again.
Fragmented Tables save daily data into single tables by naming them correspondingly.
[PREFIX]_YYYYMMDD
Google allows deleting rows from tables now. This enables the deletion of specific days of partitioned tables. This makes fragmented tables obsolete. Partitioned tables have more advantages.
Partitioned Tables | Fragmented Tables |
Tables are partitioned based on DATE, TIMESTAMP, or DATETIME as column Ingestion time as load or arrival time Integer range as column | Tables are partitioned based on Table name [PREFIX]_YYYYMMDD |
Fast in query performance | Slower performance Each table has a copy of the schema and meta data BigQuery has to confirm the access rights to each table within the given data range |
Partition by number, ingestion time, day, week, month, year | Fragmentation by day only |
4.000 partitons | Unlimited fragments but query max 1.000 “dates” |
Recommended best practice from Google | – |
Cheat Sheet
- BigQuery ML enables fast development for Marketing Analytics.
- Use partitioned tables to build BigQuery ML to schedule the predictions query.
- You can delete data of partitions if errors occur.