BigQuery GA Exported with Duplicated Rows


Question

We have been trying to work out why this happened in all of our datasets, but so far without success.

We observed that, starting on 18 April, our ga_sessions dataset consisted mostly of duplicated entries (about 99% of rows). As an example, I tested this query:

SELECT
  fullvisitorid fv,
  visitid v,
  ARRAY(
  SELECT
    AS STRUCT hits.*
  FROM
    UNNEST(hits) hits
  ORDER BY
    hits.hitnumber) h
FROM
  `dafiti-analytics.40663402.ga_sessions*`
WHERE
  1 = 1
  AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
  AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
  fv,
  v
LIMIT
  100

The result is:

[image: query result]

We tried to find out when this began to happen, so I ran this query:

SELECT
  date,
  f,
  COUNT(f) freq
FROM (
  SELECT
    date,
    fullvisitorid fv,
    visitid v,
    COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
  FROM
    `dafiti-analytics.40663402.ga_sessions*`
  WHERE
    1 = 1
    AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
    AND TIMESTAMP('2017-04-30')
  GROUP BY
    fv,
    v,
    date )
GROUP BY
  f,
  date
ORDER BY
  date,
  freq DESC

We found that for 3 of our projects it started on 18 April, but in the accounts related to LATAM data we have only recently started seeing duplicated rows as well.

We also checked whether anything had been logged in our GCP Console, but couldn't find anything.

Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking and it seems to be working fine. We also have not made any recent changes that would explain it.

If you need more info, please let me know.

Recommended answer

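Make sure the wildcard matches only the intraday tables or only the non-intraday (daily) tables. The pattern `ga_sessions*` matches both the `ga_sessions_` and the `ga_sessions_intraday_` tables, so a session that appears in both is returned twice. For intraday: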

`dafiti-analytics.40663402.ga_sessions_intraday*`

For non-intraday:

`dafiti-analytics.40663402.ga_sessions_2017*`

The important part is to include enough of the prefix to match only the desired tables.
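
As a rough sketch of how that advice could be applied to the duplicate check from the question, the query below uses the longer prefix `ga_sessions_2017*`. It assumes the daily tables follow the usual `ga_sessions_YYYYMMDD` naming, so that `_TABLE_SUFFIX` holds only the remaining `MMDD` part; the date range is the one from the question, and the query has not been run against the asker's data.

```sql
-- Sketch only: the duplicate check from the question, restricted to daily tables.
-- Assumes daily tables are named ga_sessions_YYYYMMDD.
SELECT
  date,
  f,
  COUNT(*) AS freq
FROM (
  SELECT
    date,
    fullvisitorid,
    visitid,
    COUNT(*) AS f  -- number of rows per (date, fullvisitorid, visitid)
  FROM
    `dafiti-analytics.40663402.ga_sessions_2017*`
  WHERE
    -- With the prefix ga_sessions_2017, _TABLE_SUFFIX is the MMDD remainder.
    _TABLE_SUFFIX BETWEEN '0401' AND '0430'
  GROUP BY
    date,
    fullvisitorid,
    visitid )
GROUP BY
  date,
  f
ORDER BY
  date,
  freq DESC
```

If the duplicates come from the intraday tables being matched as well, most rows should then fall under `f = 1`. Because the prefix no longer matches `ga_sessions_intraday_` tables, the `REGEXP_EXTRACT` on the suffix is not needed here.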
