BigQuery:根据匹配和产品范围维度过滤匹配 [英] BigQuery: filter out hits based on hit and product scope dimensions

查看:56
本文介绍了BigQuery:根据匹配和产品范围维度过滤匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在BigQuery中,有如下所述的基于Google Analytics(分析)的查询,此查询可以正常运行.

In BigQuery, there is the Google Analytics based query as is stated below and this works correctly.

#standard sql   
SELECT
      Date,
      SUM(totals.visits) AS Sessions,
      SUM(totals.transactions) AS Transactions
    FROM
      `[projectID].[DatasetID].ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20181217'
      AND '20181217'
      AND totals.visits > 0
    GROUP BY
      Date

在此查询中,我需要排除匹配内 ...

In this query, I need to exclude all hits where within a hit...

  • .. GA自定义维度#23(匹配范围)包含值编辑器"或
  • .. GA自定义维度#6(产品范围)与正则表达式值'^ 63 ..... $'相匹配或
  • .. GA hits.page.pagePath匹配正则表达式值'gebak | cake'

注意:不是打算在会话级应用上述3个条件(如

Note: it is not the intention to apply the 3 conditions as stated above on session-level (as in this screenshot) but on hit-level, since I'd like to reproduce numbers from another GA view than the view from which the data is loaded to BigQuery. In this other GA view the 3 conditions as are stated above are set as view filters.

到目前为止,最佳"查询是以下查询(基于以下Martin Weitzmann的帖子).但是,此查询未过滤数据集(换句话说,条件不起作用).

The 'best' query thus far is the one below (based on the post of Martin Weitzmann below). However, the dataset is not filtered in this query (in other words, the conditions do not work).

SELECT Date, 
-- hits,
SUM(totals.transactions), 
SUM(totals.visits) 

FROM (

(
  SELECT date, totals,
    -- create own hits array
    ARRAY(
      SELECT AS STRUCT 
        hitnumber, 
        page,
        -- create own product array
        ARRAY(
          SELECT AS STRUCT productSku, productQuantity 
          FROM h.product AS p
          WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%') 
        ) AS product
      FROM t.hits as h
      WHERE 
        NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
        AND
        (SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
    ) AS hits
  FROM
    `[projectID].[DatasetID].ga_sessions_*` t
  WHERE 
  _TABLE_SUFFIX BETWEEN '20181217'
  AND '20181217'
  AND totals.visits > 0
  ))
  GROUP BY Date

有人知道如何实现所需的输出吗?

Does anyone know how to achieve the desired output?

非常感谢!

注意:出于隐私方面的考虑,两个查询中的projectID和datasetID都已被屏蔽.

Note: the projectID and datasetID have been masked in both queries because of privacy concerns.

推荐答案

拥有数组的方法

您可以通过使用原始文件上的子查询并将其输出反馈到数组函数中来创建自己的匹配和产品数组.在这些子查询中,您可以过滤出匹配数据和产品:

Own arrays approach

You can create your own hits and product arrays by using sub-queries on the original and feeding their output back into array functions. In those subqueries you can filter out your hits and products:

#standardsql
SELECT
  date,
  hits
  --SUM(totals.visits) AS Sessions,
  --SUM(totals.transactions) AS Transactions
FROM
  (
  SELECT 
    date, totals,
    -- create own hits array
    ARRAY(
      SELECT AS STRUCT 
        hitnumber, 
        page,
        -- create own product array
        ARRAY(
          SELECT AS STRUCT productSku, productQuantity 
          FROM h.product AS p
          WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%') 
        ) AS product
      FROM t.hits as h
      WHERE 
        NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
        AND
        (SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
    ) AS hits
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
  )
--GROUP BY 1
LIMIT 100

我将此示例保留为未分组的状态,但是您可以通过注释 hits 并相应地分组来轻松地对其进行调整...

I left this example in an ungrouped state, but you can easily adjust it by commenting out the hits and group accordingly ...

我认为您只需要在 WHERE 语句中使用正确的子查询即可:

I think you just need the right sub-query in your WHERE statement:

#standardsql
SELECT
  date,
  SUM(totals.visits) AS Sessions,
  SUM(totals.transactions) AS Transactions
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*` t
WHERE
  (SELECT COUNT(1)=0 FROM t.hits h
    WHERE 
      (SELECT count(1)>0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
      OR
      (SELECT count(1)>0 from h.product p, p.customdimensions cd WHERE index=6 AND value like '63%')
      OR
      REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
  )
GROUP BY date

由于所有组都处于会话级别,因此您不需要在主表上进行任何拼合(使用数组进行交叉交叉连接),这非常昂贵.在最外层的 WHERE 中,使用子查询输入 hits 数组-就像逐行的for-each.在这里您已经可以计算 REGEXP_CONTAINS(page.pagePath,r'gebak | cake')的场合了.

Since all your groups are on session level, you don't need any flattening (resp. cross joins with arrays) on the main table, which is costly. In your outermost WHERE you enter the hits array with a subquery - it's like a for-each on rows. Here you can already count occasions of REGEXP_CONTAINS(page.pagePath,r'gebak|cake').

在其他情况下,您再次编写子查询以输入相应的数组-第一种情况是在 hits 中的 customDimensions .这就像另一个子查询(子查询中的子查询)中的嵌套for-each.

For the other cases, you write a subquery again to enter the respective array - in the first case, customDimensions within hits. This is like a nested for-each inside the other one (subquery in a subquery).

在第二种情况下,我只是弄平-但仅在子查询中: product 及其 customDimensions .所以这也是一次嵌套的for-each,因为我很懒惰并且交叉连接.我本可以编写另一个子查询而不是交叉联接,所以基本上是三层嵌套的for-each(子查询中子查询中的子查询).

In the second case, I'm simply flattening - but within the subquery only: product with its customDimensions. So this is a one-time nested for-each as well because I was lazy and cross-joined. I could've written another Subquery instead of the cross-join, so basically a triple-nested for-each (subquery in a subquery in a subquery).

由于我要计算要排除的案件,因此我的外部条件是 COUNT(1)= 0 .

Since I'm counting cases that I want to exclude, my outer condition is COUNT(1)=0.

我只能用ga样本数据进行测试..因此,这是未经测试的.但是我想你知道这个主意.

I could only test it with ga sample data .. so it's kind of untested. But I guess you get the idea.

这篇关于BigQuery:根据匹配和产品范围维度过滤匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆