Unnest 和 totals.timeOnSite(BigQuery 和 Google Analytics 数据) [英] Unnest and totals.timeOnSite (BigQuery and Google Analytics data)

查看:24
本文介绍了Unnest 和 totals.timeOnSite(BigQuery 和 Google Analytics 数据)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想计算网站所有访问者的总 timeOnSite(并将其除以 3600,因为它在原始数据中存储为秒),然后我想在 content_group 和一个名为的自定义变量上分解它内容级别.

I want to calculate the total timeOnSite for all visitors to a website (and divide it by 3600 because it's stored as seconds in the raw data), and then I want to break it down on content_group and a custom variable that is called content_level.

问题出现是因为 content_group 和 content_level 都嵌套在数组中,而 timeOnSite 是一个 totals.-stored 变量,如果在包含和取消嵌套的查询中使用时会膨胀.(content_group 是一个普通的 hits.-nested 变量,而 content_level 嵌套在 customDimensions 中,而 customDimensions 嵌套在 hits 中(二级嵌套变量)(Will 和 Thomas C 很好地解释了为什么这个问题会出现在这个问题 使用 BigQuery 提取命中级别数据时,Google Analytics Metrics 被夸大了,但我无法将他们的建议应用于 totals.timeOnSite 指标)

Problem arises because content_group and content_level are both nested in arrays, and timeOnSite is a totals.-stored variable that gets inflated if when used in a query that include and unnesting. (content_group is a normal hits.-nested variable, while content_level is nested in customDimensions that is nested in hits (a second level nested variable) (Will and Thomas C explain well why this problem emerges in this question Google Analytics Metrics are inflated when extracting hit level data using BigQuery , but I was unable to apply their advice to the totals.timeOnSite metric)

#StandardSQL
SELECT   
 date,   
 content_group,   
 content_level,  
 SUM(sessions) AS sessions,   
 SUM(sessions2) AS sessions2, 
 SUM(time_on_site) AS time_on_site   
FROM (   
     SELECT   
       date AS date,   
       hits.contentGroup.contentGroup1 AS content_group,   
       (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,   
       SUM(totals.visits) AS sessions,   
       COUNT(DISTINCT CONCAT(cast(visitId AS STRING), fullVisitorId)) AS sessions2,   
       SUM(totals.timeOnSite)/3600 AS time_on_site   
     FROM `projectname.123456789.ga_sessions_20170101`,   
       unnest(hits) AS hits
     GROUP BY   
       iso_date, content_group, content_level
     ORDER BY 
       iso_date, content_group, content_level
    )   
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level 

(我使用子查询是因为我计划使用 UNION_ALL 从多个表中提取数据,但我省略了该语法,因为我认为它与问题无关.)

(I use a subquery because I'm planning on pulling data from several tables using UNION_ALL, but I omitted that syntax because I deemed it not relevant for the question.)

问题:

*是否可以为两个点击进行本地取消嵌套".和 hits.customDimensions 以便可以在我的查询中使用 totals.timeOnSite 而不会被夸大?

*Is it possible to make "local unnestings" for both hits. and hits.customDimensions so that it would be possible to use totals.timeOnSite in my query without it being inflated?

*是否可以像我在 session 和 session2 中所做的那样针对现场时间制定解决方法?

*Is it possible to make a workaround for time on site like I've made with sessions and sessions2?

*这个问题有第三个隐藏的解决方案吗?

*Is there a third, hidden solution to this problem?

推荐答案

我像这样回答自己的问题似乎很奇怪,但是我在 Stack Overflow 之外的一个联系人帮助我解决了这个问题,所以这实际上是他的答案而不是我的.

session_duration 的问题可以通过使用窗口函数来解决(你可以在 BigQuery 文档中阅读更多关于窗口函数的信息:https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)

The problem with session_duration can be solved by using a window function (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)

#StandardSQL
SELECT   
 iso_date,   
 content_group,   
 content_level,  
 COUNT(DISTINCT SessionId) AS sessions, 
 SUM(session_duration) AS session_duration 
FROM (   
     SELECT   
       date AS iso_date,   
       hits.contentGroup.contentGroup1 AS content_group,   
       (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,  
       CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId, 
       (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration 
     FROM `projectname.123456789.ga_sessions_20170101`,   
       unnest(hits) AS hits
     WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131" 
       AND (SELECT 
              MAX(IF(index=51, value, NULL)) 
            FROM 
              UNNEST(hits.customDimensions) 
            WHERE 
              value IN ("web", "phone", "tablet")
            ) IS NOT NULL 
     GROUP BY   
       iso_date, content_group, content_level
     ORDER BY 
       iso_date, content_group, content_level
    )   
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level 

子选择中的 LEAD - OVER - PARTITION 和 WHERE 子句中的子子选择都是窗口函数正常工作所必需的.

Both LEAD - OVER - PARTITION in the subselect and the subsubselect in the WHERE-clause are required for the window function to work properly.

还提供了一种更准确的会话计算方法.

A more accurate way of calculating sessions is also provided.

这篇关于Unnest 和 totals.timeOnSite(BigQuery 和 Google Analytics 数据)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆