Follow me on LinkedIn - AI, GA4, BigQuery
Your Google Analytics 4 data in BigQuery is unlikely to be accurate unless you follow these four rules for data aggregation.

Rule-1: Aggregate event-level data at the session level when you are calculating dimensions and metrics with a session scope.

A session may have multiple events (like page_view, scroll, click, etc.). Each event could have a different source or no source at all.

Suppose a session has 3 page_view events (all with the same source) and 2 scroll events (with a missing source).

If you count these as “sessions” without aggregation, you’ll see 5 sessions for one user-session combination.

So you will inflate your session count.


You can aggregate event-level data at the session level by using a unique combination of ‘user_pseudo_id’ (which represents the user) and ‘ga_session_id’ (which represents the session).

Rule-2: Aggregate event-level data or session level data at the user level when you are calculating dimensions and metrics with a user scope.

When a user interacts with a website, they can have multiple sessions, and each session may have multiple events (like page_view, scroll, click, etc.). Each session can have different traffic sources, different revenue, and different engagement behaviors.

Suppose a user has 3 sessions:

Session 1 has a page_view event (with source = “google”),

Session 2 has a page_view event (with source = “facebook”),

Session 3 has a page_view event (with source = “direct”).


If you do not aggregate the session data at the user level, you may count the same user 3 times instead of 1 time.

If you count the users directly at the session level, you’ll see 3 users for the same user_pseudo_id, when in reality, it should be counted as 1 user.


You can aggregate the session-level data at the user level by using ‘user_pseudo_id’.

Rule-3: First aggregate event-level data at the session level and then aggregate session-level data at the user level. Avoid directly aggregating event-level data at the user level.

First aggregate event level data at the session level 1

Here’s a simple analogy:

>> Imagine you have 1 million events that belong to only 100,000 sessions and 10,000 users.

>> If you directly aggregate at the user level, you’re working with 1 million rows.

>> If you first aggregate at the session level, you reduce the dataset to 100,000 rows, and then when you aggregate at the user level, you’re working with only 10,000 rows.


When you aggregate at the session level first, you get faster query execution and lower cost, and the probability of getting inaccurate results decreases dramatically.

Rule-4: Use event-centric logic for creating SQL queries only when dealing with small datasets (less than 1 million rows). Otherwise, use session-level aggregation.

Event-centric logic works fine for small datasets (less than 1M rows) because query time is fast, and computational costs are low.

For large datasets (more than 1M rows), scanning all event rows repeatedly becomes inefficient.

For larger datasets, use session-level aggregation to reduce event data to 1 row per session, which drastically reduces the total number of rows processed.


I teach data aggregation in my GA4 BigQuery course, where you can learn to automate SQL generation via text prompts in ChatGPT.