Unnest 和 totals.timeOnSite(BigQuery 和 Google Analytics 数据) [英] Unnest and totals.timeOnSite (BigQuery and Google Analytics data)
问题描述
我想计算网站所有访问者的总 timeOnSite(并将其除以 3600,因为它在原始数据中存储为秒),然后我想在 content_group 和一个名为的自定义变量上分解它内容级别.
I want to calculate the total timeOnSite for all visitors to a website (and divide it by 3600 because it's stored as seconds in the raw data), and then I want to break it down on content_group and a custom variable that is called content_level.
问题出现是因为 content_group 和 content_level 都嵌套在数组中,而 timeOnSite 是一个 totals.-stored 变量,如果在包含和取消嵌套的查询中使用时会膨胀.(content_group 是一个普通的 hits.-nested 变量,而 content_level 嵌套在 customDimensions 中,而 customDimensions 嵌套在 hits 中(二级嵌套变量)(Will 和 Thomas C 很好地解释了为什么这个问题会出现在这个问题 使用 BigQuery 提取命中级别数据时,Google Analytics Metrics 被夸大了,但我无法将他们的建议应用于 totals.timeOnSite 指标)
Problem arises because content_group and content_level are both nested in arrays, and timeOnSite is a totals.-stored variable that gets inflated if when used in a query that include and unnesting. (content_group is a normal hits.-nested variable, while content_level is nested in customDimensions that is nested in hits (a second level nested variable) (Will and Thomas C explain well why this problem emerges in this question Google Analytics Metrics are inflated when extracting hit level data using BigQuery , but I was unable to apply their advice to the totals.timeOnSite metric)
#StandardSQL
SELECT
date,
content_group,
content_level,
SUM(sessions) AS sessions,
SUM(sessions2) AS sessions2,
SUM(time_on_site) AS time_on_site
FROM (
SELECT
date AS date,
hits.contentGroup.contentGroup1 AS content_group,
(SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
SUM(totals.visits) AS sessions,
COUNT(DISTINCT CONCAT(cast(visitId AS STRING), fullVisitorId)) AS sessions2,
SUM(totals.timeOnSite)/3600 AS time_on_site
FROM `projectname.123456789.ga_sessions_20170101`,
unnest(hits) AS hits
GROUP BY
iso_date, content_group, content_level
ORDER BY
iso_date, content_group, content_level
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
(我使用子查询是因为我计划使用 UNION_ALL 从多个表中提取数据,但我省略了该语法,因为我认为它与问题无关.)
(I use a subquery because I'm planning on pulling data from several tables using UNION_ALL, but I omitted that syntax because I deemed it not relevant for the question.)
问题:
*是否可以为两个点击进行本地取消嵌套".和 hits.customDimensions 以便可以在我的查询中使用 totals.timeOnSite 而不会被夸大?
*Is it possible to make "local unnestings" for both hits. and hits.customDimensions so that it would be possible to use totals.timeOnSite in my query without it being inflated?
*是否可以像我在 session 和 session2 中所做的那样针对现场时间制定解决方法?
*Is it possible to make a workaround for time on site like I've made with sessions and sessions2?
*这个问题有第三个隐藏的解决方案吗?
*Is there a third, hidden solution to this problem?
推荐答案
我像这样回答自己的问题似乎很奇怪,但是我在 Stack Overflow 之外的一个联系人帮助我解决了这个问题,所以这实际上是他的答案而不是我的.
session_duration 的问题可以通过使用窗口函数来解决(你可以在 BigQuery 文档中阅读更多关于窗口函数的信息:https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)
The problem with session_duration can be solved by using a window function (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)
#StandardSQL
SELECT
iso_date,
content_group,
content_level,
COUNT(DISTINCT SessionId) AS sessions,
SUM(session_duration) AS session_duration
FROM (
SELECT
date AS iso_date,
hits.contentGroup.contentGroup1 AS content_group,
(SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId,
(LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration
FROM `projectname.123456789.ga_sessions_20170101`,
unnest(hits) AS hits
WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131"
AND (SELECT
MAX(IF(index=51, value, NULL))
FROM
UNNEST(hits.customDimensions)
WHERE
value IN ("web", "phone", "tablet")
) IS NOT NULL
GROUP BY
iso_date, content_group, content_level
ORDER BY
iso_date, content_group, content_level
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
子选择中的 LEAD - OVER - PARTITION 和 WHERE 子句中的子子选择都是窗口函数正常工作所必需的.
Both LEAD - OVER - PARTITION in the subselect and the subsubselect in the WHERE-clause are required for the window function to work properly.
还提供了一种更准确的会话计算方法.
A more accurate way of calculating sessions is also provided.
这篇关于Unnest 和 totals.timeOnSite(BigQuery 和 Google Analytics 数据)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!