Using a CASE statement to change the value of a new BigQuery column based on finding one specific entry inside a PARTITION


Problem description

I am trying to write some CASE statements that change the value of all entries in a partition if a particular condition is satisfied inside that partition. Here is the specific context. Imagine that I have a data set that was created using the following SQL query:

SELECT
  date,
  CONCAT(fullVisitorId, STRING(visitId)) AS unique_visit_id,
  visitId, visitNumber, fullVisitorId,
  totals.pageviews, totals.bounces,
  LAG(hits.page.pagePath, 1) OVER (PARTITION BY unique_visit_id ORDER BY hits.time ASC) AS lagged,
  hits.page.pagePath, hits.page.pageTitle,
  device.deviceCategory, device.browser, device.browserVersion,
  hits.customVariables.index, hits.customVariables.customVarName, hits.customVariables.customVarValue,
  hits.time
FROM (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
WHERE hits.customVariables.index = 4
LIMIT 1000;

The resulting data set looks similar to the following (shown in Excel):

Note that unique_visit_id is the same for every row of a given visit. In many cases I want to scan hits_page_pagePath within a visit. I would like to construct a CASE statement such that, when the lagged URL (matched with REGEXP_MATCH()) equals a particular value, and hits_page_pagePath equals a certain value when hits_time = 0, a new column labels the entire partition with a certain value. For example, suppose I find an error page in hits_page_pagePath and the lagged value is a certain page. I would then label the entire partition "Booking error". If the lagged value before the error was a different page, I would give the partition a different label, such as "Payment error". The table would then look like the one below:

This would repeat for all the unique_visit_id partitions. I would then be able to group together counts of total bounces, hits, events, etc., for each partition. Any insight would be greatly appreciated!

Solution

If you want to avoid joins, you can use an aggregate function with OVER. Something like:

MAX(IF((your condition here), your value here, NULL)) OVER (PARTITION BY your_partition)

Window functions used to have some performance issues, which should have improved recently. My experience with BQ leads me to prefer Jordan's join suggestion. But hey, it's a fun riddle...
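
To make the windowed-aggregate idea concrete, here is a minimal sketch that runs the same pattern against an in-memory SQLite table from Python. It is not BigQuery: SQLite stands in only so the query can be executed locally, and CASE WHEN ... END replaces IF(..., NULL). The table name hits, the sample paths (/booking, /payment, /error) and the column visit_label are all made up for illustration, and the extra hits_time = 0 condition from the question is left out to keep the example short. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Hypothetical miniature of the flattened hits table: one row per hit.
# Columns: (unique_visit_id, hits_time, page_path)
SAMPLE_HITS = [
    ("A", 0, "/home"),
    ("A", 10, "/booking"),
    ("A", 20, "/error"),
    ("B", 0, "/home"),
    ("B", 15, "/payment"),
    ("B", 30, "/error"),
    ("C", 0, "/home"),
    ("C", 5, "/booking"),
]

# Step 1 (CTE): compute the previous page of each hit within its visit.
# Step 2: a CASE yields a label only on the row where the error occurs;
#         MAX(...) OVER (PARTITION BY ...) then copies that label to every
#         row of the same visit, because aggregates skip NULLs.
QUERY = """
WITH lagged AS (
    SELECT unique_visit_id, hits_time, page_path,
           LAG(page_path) OVER (
               PARTITION BY unique_visit_id ORDER BY hits_time
           ) AS lagged_path
    FROM hits
)
SELECT unique_visit_id, hits_time, page_path,
       MAX(CASE
               WHEN page_path = '/error' AND lagged_path = '/booking'
                   THEN 'Booking error'
               WHEN page_path = '/error' AND lagged_path = '/payment'
                   THEN 'Payment error'
           END) OVER (PARTITION BY unique_visit_id) AS visit_label
FROM lagged
ORDER BY unique_visit_id, hits_time
"""


def label_visits(rows):
    """Load the rows into an in-memory table and return the labelled hits."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (unique_visit_id TEXT, hits_time INTEGER, page_path TEXT)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


labelled = label_visits(SAMPLE_HITS)
for row in labelled:
    print(row)
```

Visits with no matching error row keep a NULL label (None in Python), so they can be filtered out or grouped separately. Once every row carries visit_label, grouping by it to count bounces, hits, or events is an ordinary GROUP BY.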
