BigQuery GA与重复行导出 [英] BigQuery GA Exported with Duplicated Rows

查看:205
本文介绍了BigQuery GA与重复行导出的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们一直试图解释为什么会发生在我们所有的数据集中,但迄今为止我们没有成功。我们观察到,从4月18日开始,我们的 > ga_sessions 数据集的大部分重复条目(如99%的行)。作为一个例子,我测试了这个查询:

  SELECT 
fullvisitorid fv,
visitid v,
ARRAY(
SELECT
AS STRUCT命中。*
FROM
UNNEST(命中)命中
ORDER BY
hits.hitnumber)h
FROM
`dafiti-analytics.40663402.ga_sessions *`
WHERE
1 = 1
AND REGEXP_EXTRACT(_table_suffix,r'。* _(。*)')BETWEEN FORMAT_DATE (%Y%m%d,DATE_SUB(CURRENT_DATE(),INTERVAL 3 DAY))和FORMAT_DATE(%Y%m%d,DATE_SUB(CURRENT_DATE(),INTERVAL 3 DAY))
ORDER BY
fv,
v
LIMIT
100

结果是:





我们试图调查当这种情况发生时,所以我运行了这个查询:

  SELECT 
日期,
f,
COUNT(f)从(
SELEC T
日期,
fullvisitorid fv,
visitid v,
COUNT(CONCAT(fullvisitorid,CAST(visitid AS字符串)))f
FROM
` dafiti-analytics.40663402.ga_sessions *`
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d',REGEXP_EXTRACT(_table_suffix,r'。* _(。*) '))BETWEEN TIMESTAMP('2017-04-01')
AND TIMESTAMP('2017-04-30')
GROUP BY
fv,
v,
日期)
GROUP BY
f,
日期
ORDER BY
日期,
freq DESC

我们发现我们的3个项目从4月18日开始,但在与LATAM数据相关的账户中,我们最近也开始看到重复的行。



我们还检查了在我们的GCP控制台中是否记录了某些内容,但找不到任何内容。



可能导致了ga_sessions导出中的重复?我们检查了我们的分析跟踪,但似乎工作得很好。此外,我们也没有修改这些日子来解释它。



如果您需要更多信息,请告诉我们, 确保只匹配盘中或非盘中表。对于盘中交易:

 `dafiti-analytics.40663402.ga_sessions_intraday *`

非日内:

 `dafiti-analytics。 40663402.ga_sessions_2017 *`

重要的部分是包含足够的前缀以匹配所需的表格。

We have been trying to explain why this happened in all of our datasets but so far we had no success.

We observed that starting on 18 April our ga_sessions dataset had for the most part duplicated entries (like 99% of rows). As an example, I tested this query:

SELECT
  fullvisitorid fv,
  visitid v,
  ARRAY(
  SELECT
    AS STRUCT hits.*
  FROM
    UNNEST(hits) hits
  ORDER BY
    hits.hitnumber) h
FROM
  `dafiti-analytics.40663402.ga_sessions*`
WHERE
  1 = 1
  AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
  fv,
  v
LIMIT
  100

And the result was:

We tried to investigate when this began to happen, so I ran this query:

SELECT
  date,
  f,
  COUNT(f) freq from(
  SELECT
    date,
    fullvisitorid fv,
    visitid v,
    COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
  FROM
    `dafiti-analytics.40663402.ga_sessions*`
  WHERE
    1 = 1
    AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
    AND TIMESTAMP('2017-04-30')
  GROUP BY
    fv,
    v,
    date )
GROUP BY
  f,
  date
ORDER BY
  date,
  freq DESC

And we found that for 3 of our projects it started on day 18 April but in accounts related to LATAM data we started seeing duplicated rows just recently as well.

We also checked if in our GCP Console something was logged but couldn't find anything.

Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking but it seems to be working just fine. Also there's no modification we did these days that explain it as well.

If you need more info please let me know,

解决方案

Make sure to match only the intraday or non-intraday tables. For intraday:

`dafiti-analytics.40663402.ga_sessions_intraday*`

For non-intraday:

`dafiti-analytics.40663402.ga_sessions_2017*`

The important part is to include enough of the prefix to match the desired tables.

这篇关于BigQuery GA与重复行导出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆