Unified GA4 analytics pipeline for 8 brands

GA4 · BigQuery · Dataform · GTM · Looker Studio · Power BI
Direct Wines · Andre Arias

Eight ecommerce brands. Each with its own GA4 property, its own BigQuery export, its own quirks in how traffic sources and campaigns are tagged. The business wanted unified reporting: one place to see sessions, transactions, and channel performance across all brands, with numbers that hold up when someone asks where they came from.

The core challenge was not just getting the data into one place. It was making sure the same event, filtered the same way, classified the same way, showed up consistently regardless of which brand it came from. Any inconsistency in how filters or channel rules were applied would surface immediately in cross-brand comparisons.

I designed and built this pipeline as a layered Dataform project and have owned it end to end since.

sources
GA4Dataform exports · one BigQuery dataset per brand
+ Google Ads, Search Console, order data
staging
int_ga4_events · union across all 8 brands
int_ga4_sessions_filters · canonical eligibility gate
int_ga4_sessions / transactions · + intraday variants
processing_watermark · tracks latest date per brand
unified
fact_ga4_sessions · full channel + campaign classification
fact_ga4_events / transactions
dim_fiscal_dates · dim_campaign_details · dim_order_details
fact_facebook_ads · fact_gads · fact_searchdata
reporting
25+ rpt_ tables consumed by Looker Studio and Power BI
sessions · transactions · ecommerce · paid media · SEO

Raw GA4 event exports arrive via GA4Dataform by Superform Labs, one BigQuery dataset per brand. The staging layer unions them, applies filters, and produces session and transaction intermediates. The unified layer enriches those with channel classification, campaign mapping, and dimension joins. The reporting layer materializes the tables that dashboards read directly.

-- config-driven brand pattern

Every brand-specific detail lives in a single JS config object: GA4 property ID, BigQuery dataset, valid hostnames, GSC dataset, Google Ads customer ID. The staging union query is generated from that config at compile time. Adding a ninth brand is one entry in one file; every downstream model rebuilds automatically. See brands.js.

-- single canonical eligibility table

int_ga4_sessions_filters is the one place where all filter logic lives: hostname validation against the brand config, country exclusions for both bot-proxy geographies and internal developer traffic, internal traffic type filtering, and a list of 50-plus referral sources to exclude. Every fact table joins to this table rather than re-implementing its own filters. Adding or changing a rule takes one edit and applies everywhere.

-- schema-drift safeguard via assertion gate

A ga4_yesterday_gate assertion runs before the staging union. It compares the latest event date in each brand's raw GA4 export against a processing watermark. If all brands already match the watermark, the assertion fails and blocks the downstream run, preventing double-processing. If any brand has new data, the assertion passes and the pipeline runs. The gate also surfaces when a brand's export has stopped updating, which would otherwise produce a silent gap.

-- bot detection

Traffic quality filtering runs at multiple levels. Country exclusions in the eligibility table serve two purposes: some geographies are excluded because they are known sources of bot-proxy traffic, while others, such as India, are excluded because that is where the development team is based and their sessions would otherwise pollute production data. The internal traffic type field removes additional employee and QA traffic. The referral exclusion list removes tool and internal-system referrals. At the event level, server-side tagging via Stape surfaces is_bot_stape and bot_score_stape fields for further downstream filtering.

-- centralized channel classification

A JS helper function maps every combination of source, medium, campaign, and click ID to one of 18 channel categories, then to one of 34 marketing group IDs that join to a channel map dimension. All 8 brands run through the same function. When a channel rule needs updating, it changes in one place and applies across every brand and every reporting table on the next run.

-- reclassification via campaign response code lookup

The channel classification CASE statement can only work with what UTM parameters and click IDs provide. Offline channels, such as direct mail, often arrive with no UTMs at all, so the first pass returns Unmapped. A second reclassification step rescues those sessions using a proprietary campaign response code embedded in the landing URL. The code is looked up in a three-table chain: dim_campaign_details maps the code to a marketing activity and media type, dim_marketing_group maps that pair to a marketing group ID, and dim_channel_map resolves it to the final channel and category labels. Sessions with no response code and a google/none or none/none source are resolved to Direct. A dedicated assertion gates a specific high-volume response code to catch upstream data issues before they affect classification.

-- incremental pre-ops with backfill support

All incremental models use a pre-operations pattern that calculates the checkpoint date, deletes from that date forward, then reprocesses. A BACKFILL_DATE compilation variable in workflow_settings.yaml defaults to empty for normal scheduled runs. Passing it via CLI rewrites the checkpoint for a targeted date range without touching any SQL. See pre_ops.js and the note on backfilling incremental models in Dataform.

The pipeline processes GA4 data for all 8 brands daily, feeding 25-plus reporting tables that power dashboards in Looker Studio and Power BI. Cross-brand comparisons are consistent because every brand runs through the same filters, the same channel rules, and the same dimension joins.

8 brands · 1 pipeline · one config entry to add a new brand

Backfills that used to require hardcoding dates and reverting SQL are now a single CLI flag. Channel classification that used to live in individual reports is now tested and versioned in one place. Bot and quality filters that used to be inconsistent across reports are now applied at a single layer and inherited by everything downstream.

The pipeline is under active development. Current work includes expanding intraday coverage, refining the channel taxonomy, and integrating additional paid media sources.

← back to work