BigQuery GA与重复行导出 [英] BigQuery GA Exported with Duplicated Rows
问题描述
SELECT
fullvisitorid fv,
visitid v,
ARRAY(
SELECT
AS STRUCT命中。*
FROM
UNNEST(命中)命中
ORDER BY
hits.hitnumber)h
FROM
`dafiti-analytics.40663402.ga_sessions *`
WHERE
1 = 1
AND REGEXP_EXTRACT(_table_suffix,r'。* _(。*)')BETWEEN FORMAT_DATE (%Y%m%d,DATE_SUB(CURRENT_DATE(),INTERVAL 3 DAY))和FORMAT_DATE(%Y%m%d,DATE_SUB(CURRENT_DATE(),INTERVAL 3 DAY))
ORDER BY
fv,
v
LIMIT
100
结果是:
我们试图调查当这种情况发生时,所以我运行了这个查询:
SELECT
日期,
f,
COUNT(f)从(
SELEC T
日期,
fullvisitorid fv,
visitid v,
COUNT(CONCAT(fullvisitorid,CAST(visitid AS字符串)))f
FROM
` dafiti-analytics.40663402.ga_sessions *`
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d',REGEXP_EXTRACT(_table_suffix,r'。* _(。*) '))BETWEEN TIMESTAMP('2017-04-01')
AND TIMESTAMP('2017-04-30')
GROUP BY
fv,
v,
日期)
GROUP BY
f,
日期
ORDER BY
日期,
freq DESC
我们发现我们的3个项目从4月18日开始,但在与LATAM数据相关的账户中,我们最近也开始看到重复的行。
我们还检查了在我们的GCP控制台中是否记录了某些内容,但找不到任何内容。
可能导致了ga_sessions导出中的重复?我们检查了我们的分析跟踪,但似乎工作得很好。此外,我们也没有修改这些日子来解释它。
如果您需要更多信息,请告诉我们, 确保只匹配盘中或非盘中表。对于盘中交易:
`dafiti-analytics.40663402.ga_sessions_intraday *`
非日内:
`dafiti-analytics。 40663402.ga_sessions_2017 *`
重要的部分是包含足够的前缀以匹配所需的表格。
We have been trying to explain why this happened in all of our datasets but so far we had no success.
We observed that starting on 18 April our ga_sessions dataset had for the most part duplicated entries (like 99% of rows). As an example, I tested this query:
SELECT
fullvisitorid fv,
visitid v,
ARRAY(
SELECT
AS STRUCT hits.*
FROM
UNNEST(hits) hits
ORDER BY
hits.hitnumber) h
FROM
`dafiti-analytics.40663402.ga_sessions*`
WHERE
1 = 1
AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
fv,
v
LIMIT
100
And the result was:
We tried to investigate when this began to happen, so I ran this query:
SELECT
date,
f,
COUNT(f) freq from(
SELECT
date,
fullvisitorid fv,
visitid v,
COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
FROM
`dafiti-analytics.40663402.ga_sessions*`
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
AND TIMESTAMP('2017-04-30')
GROUP BY
fv,
v,
date )
GROUP BY
f,
date
ORDER BY
date,
freq DESC
And we found that for 3 of our projects it started on day 18 April but in accounts related to LATAM data we started seeing duplicated rows just recently as well.
We also checked if in our GCP Console something was logged but couldn't find anything.
Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking but it seems to be working just fine. Also there's no modification we did these days that explain it as well.
If you need more info please let me know,
Make sure to match only the intraday or non-intraday tables. For intraday:
`dafiti-analytics.40663402.ga_sessions_intraday*`
For non-intraday:
`dafiti-analytics.40663402.ga_sessions_2017*`
The important part is to include enough of the prefix to match the desired tables.
这篇关于BigQuery GA与重复行导出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!