Unified GA4 analytics pipeline for 8 brands
Eight ecommerce brands. Each with its own GA4 property, its own BigQuery export, its own quirks in how traffic sources and campaigns are tagged. The business wanted unified reporting: one place to see sessions, transactions, and channel performance across all brands, with numbers that hold up when someone asks where they came from.
The core challenge was not just getting the data into one place. It was making sure the same event, filtered the same way, classified the same way, showed up consistently regardless of which brand it came from. Any inconsistency in how filters or channel rules were applied would surface immediately in cross-brand comparisons.
I designed and built this pipeline as a layered Dataform project and have owned it end to end since.
+ Google Ads, Search Console, order data
int_ga4_sessions_filters · canonical eligibility gate
int_ga4_sessions / transactions · + intraday variants
processing_watermark · tracks latest date per brand
fact_ga4_events / transactions
dim_fiscal_dates · dim_campaign_details · dim_order_details
fact_facebook_ads · fact_gads · fact_searchdata
sessions · transactions · ecommerce · paid media · SEO
Raw GA4 event exports arrive via GA4Dataform by Superform Labs, one BigQuery dataset per brand. The staging layer unions them, applies filters, and produces session and transaction intermediates. The unified layer enriches those with channel classification, campaign mapping, and dimension joins. The reporting layer materializes the tables that dashboards read directly.
-- config-driven brand pattern
Every brand-specific detail lives in a single JS config object: GA4 property ID, BigQuery dataset, valid hostnames, GSC dataset, Google Ads customer ID. The staging union query is generated from that config at compile time. Adding a ninth brand is one entry in one file; every downstream model rebuilds automatically. See brands.js.
-- single canonical eligibility table
int_ga4_sessions_filters is the one place where all filter
logic lives: hostname validation against the brand config, country exclusions
for both bot-proxy geographies and internal developer traffic, internal traffic
type filtering, and a list of 50-plus referral sources to exclude. Every fact
table joins to this table rather than re-implementing its own filters. Adding
or changing a rule takes one edit and applies everywhere.
-- schema-drift safeguard via assertion gate
A ga4_yesterday_gate assertion runs before the staging union.
It compares the latest event date in each brand's raw GA4 export against a
processing watermark. If all brands already match the watermark, the assertion
fails and blocks the downstream run, preventing double-processing. If any
brand has new data, the assertion passes and the pipeline runs. The gate also
surfaces when a brand's export has stopped updating, which would otherwise
produce a silent gap.
-- bot detection
Traffic quality filtering runs at multiple levels. Country exclusions in
the eligibility table serve two purposes: some geographies are excluded because
they are known sources of bot-proxy traffic, while others, such as India, are
excluded because that is where the development team is based and their sessions
would otherwise pollute production data. The internal traffic type field removes
additional employee and QA traffic. The referral exclusion list removes tool
and internal-system referrals. At the event level, server-side tagging via
Stape surfaces is_bot_stape and bot_score_stape
fields for further downstream filtering.
-- centralized channel classification
A JS helper function maps every combination of source, medium, campaign, and click ID to one of 18 channel categories, then to one of 34 marketing group IDs that join to a channel map dimension. All 8 brands run through the same function. When a channel rule needs updating, it changes in one place and applies across every brand and every reporting table on the next run.
-- reclassification via campaign response code lookup
The channel classification CASE statement can only work with what UTM
parameters and click IDs provide. Offline channels, such as direct mail, often
arrive with no UTMs at all, so the first pass returns Unmapped. A second
reclassification step rescues those sessions using a proprietary campaign
response code embedded in the landing URL. The code is looked up in a
three-table chain: dim_campaign_details maps the code to a
marketing activity and media type, dim_marketing_group maps that
pair to a marketing group ID, and dim_channel_map resolves it to
the final channel and category labels. Sessions with no response code and a
google/none or none/none source are resolved to Direct. A dedicated assertion
gates a specific high-volume response code to catch upstream data issues before
they affect classification.
-- incremental pre-ops with backfill support
All incremental models use a pre-operations pattern that calculates the
checkpoint date, deletes from that date forward, then reprocesses. A
BACKFILL_DATE compilation variable in workflow_settings.yaml
defaults to empty for normal scheduled runs. Passing it via CLI rewrites the
checkpoint for a targeted date range without touching any SQL. See
pre_ops.js
and the note on
backfilling incremental models in Dataform.
The pipeline processes GA4 data for all 8 brands daily, feeding 25-plus reporting tables that power dashboards in Looker Studio and Power BI. Cross-brand comparisons are consistent because every brand runs through the same filters, the same channel rules, and the same dimension joins.
Backfills that used to require hardcoding dates and reverting SQL are now a single CLI flag. Channel classification that used to live in individual reports is now tested and versioned in one place. Bot and quality filters that used to be inconsistent across reports are now applied at a single layer and inherited by everything downstream.
The pipeline is under active development. Current work includes expanding intraday coverage, refining the channel taxonomy, and integrating additional paid media sources.
-- code
github.com/andre683/dataform-ga4-patterns: sanitized patterns from this pipeline: brand config, session eligibility, incremental pre-ops, idempotency gate, channel classification