Hive表中的Google Analytics报告和BigQuery数据的统计差异 [英] Difference in statistics from Google Analytics Report and BigQuery Data in Hive table

查看:171
本文介绍了Hive表中的Google Analytics报告和BigQuery数据的统计差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Google Analytics高级帐户,用于监控网站和移动应用程序的用户活动。



来自GA的原始数据存储在BigQuery表中。

然而,我注意到统计数据在GA报告中看到的统计信息与我在查询BigQuery表格时看到的统计信息非常不同。



据我了解,GA报告显示汇总数据和可能的采样数据。并且BigQuery表中的原始数据是会话/命中级数据。



但我仍然不确定是否理解统计数据可能会有所不同的原因。



感谢它,如果有人为我澄清这一点。



预先感谢。



更新1: / strong>



我将来自Bigquery的原始数据导出到我的Hadoop集群中。数据存储在配置单元表中。我在导出之前展开了所有嵌套和重复的字段。



以下是我在Hive表格中的原始数据上运行的配置单元查询:

 选择
日期作为VisitDate,
计数(distinct fullvisitorid)作为CountVisitors,
SUM(totals_visits)作为SumVisits,
SUM(totals_pageviews)AS PVs
FROM
bigquerydata
WHERE
fullvisitorid IS NOT NULL
GROUP BY
日期
ORDER BY $ b)b VisitDate DESC



以2月9日为VisitDate,我得到以下结果查询:

  i)CountVisitors = 1,074,323 
ii)SumVisits = 48,990,198
iii)PVs = 1,122,841,424

Vs b b
采取相同的VisitDate并从GA报告中获得相同的统计数据:

  i)用户数= 1,549,757 
ii)浏览量= 11,604,449(与A相比巨大差异(iii) )

在上面的hive查询中,我是否使用了任何错误的字段或以错误的方式处理字段?试图找出为什么我在数字上有这种差异。



更新2(遵循@Felipe Hoffa的建议):



这就是我在将结果导出到GCS然后导出到Hadoop集群之前将我的Python代码中的表展平的情况:



<$ (flatten(flatten(flatten(['+ TABLE_NAME +'],hits),hits.product),hits.promotion),点击.customVariables),hits.customDimensions),hits.customMetrics)'

我明白你在说什么扁平化导致重复的浏览量和每个重复进入最后的错误添加。

我在Bigquery表上尝试了相同的查询(从Update1开始),而不是我的Hive表。这些数字与Google Analytics(分析)信息中心上的数据相匹配。

然而,假设Hive表是我拥有的,并且由于展平而具有这些重复字段。但是Is无论如何,我仍然可以修复我的配置单元查询以匹配来自Google Analytics仪表板的统计信息?从逻辑上讲,如果重复字段由于变扁而出现,我不能在我的Hive表中反转同样的东西吗?如果你认为我可以扭转局面,你有什么建议可以继续吗?



非常感谢你提前!

解决方案

您可以在BigQuery中运行相同的查询,而不是在导出到Hive的数据中?

我的猜测:数据存储在配置单元表中,我在导出之前将所有嵌套和重复的字段放平。当扁平化 - 你重复几次浏览量,每次重复进入最后的错误添加?

请注意压扁行时数据可以重复的方式:

  SELECT col,x FROM(
SELECTwrongcol,SUM(totals.pageviews)x
FROM(FLATTEN( [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910],点击))
),(
SELECTcorrectcol,SUM(totals.pageviews)x
FROM [google.com :analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]



col x
错误2262
正确249





给出更新2给问题:

由于BigQuery工作正常,并且这是一个Hive问题,因此您应该添加该标记以获取相关答案。



然而,这就是我正确地使用BigQuery去重复先前重复的行:

  SELECT SUM(pv) 
FROM(
SELECT visitId,MAX(totals.pageviews)pv
FROM(FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910],hits))
GROUP EACH BY 1


I have a Google Analytics premium account set up to monitor the user activity of a website and mobile application.

Raw data from GA is being stored in BigQuery tables.

However, I noticed that the statistics that I see in a GA report are quite different the statistics that I see when querying the BigQuery tables.

I understand that GA reports show aggregated data and possibly, sampled data. And that the raw data in Bigquery tables is session/hit-level data.

But I am still not sure if I understand the reason why the statistics could be different.

Would really appreciate it if someone clarified this for me.

Thanks in advance.

UPDATE 1:

I exported the raw data from Bigquery into my Hadoop cluster. The data is stored in a hive table. I flattened all the nested and repeated fields before exporting.

Here is the hive query that I ran on the raw data in the Hive table:

SELECT
   date as VisitDate,
   count(distinct fullvisitorid) as CountVisitors,
   SUM(totals_visits) as SumVisits,
   SUM(totals_pageviews) AS PVs
FROM
   bigquerydata 
WHERE 
   fullvisitorid IS NOT NULL
GROUP BY 
   date
ORDER BY
   VisitDate DESC

A) Taking February 9th as the VisitDate, I get the following results from this query:

i) CountVisitors= 1,074,323
ii) SumVisits= 48,990,198
iii) PVs= 1,122,841,424 

Vs

B) Taking the same VisitDate and obtaining the same statistics from the GA report:

i) Users count = 1,549,757
ii) Number of pageviews = 11,604,449 (Huge difference when compared to A(iii))  

In the hive query above, am I using any wrong fields or processing the fields in a wrong way? Just trying to figure out why I have this difference in numbers.

UPDATE 2 (following @Felipe Hoffa 's suggestion):

This is how I am flattening the tables in my Python code before exporting the result to GCS and then to Hadoop cluster:

queryString = 'SELECT * FROM flatten(flatten(flatten(flatten(flatten(flatten([' + TABLE_NAME + '],hits),hits.product),hits.promotion),hits.customVariables), hits.customDimensions), hits.customMetrics)'

I understand what you are saying about flattening causing repeated pageviews and each repetition getting into the final wrong addition.

I tried the same query (from Update1) on Bigquery table instead of my Hive table. The numbers matched with those on the Google Analytics Dashboard.

However, assuming that the Hive table is all I have and it has those repeated fields due to flattening.. BUT Is there still anyway that I can fix my hive query to match the stats from Google Analytics dashboard?

Logically speaking, if the repeated fields came up due to flattening.. can't I reverse the same thing in my Hive table? If you think that I can reverse, do you have any suggestion as to how I can proceed on it?

Thank you so much in advance!

解决方案

Can you run the same query in BigQuery, instead of on the data exported to Hive?

My guess: "The data is stored in a hive table. I flattened all the nested and repeated fields before exporting." When flattening - are you repeating pageviews several times, with each repetition getting into the final wrong addition?

Note how data can get duplicated when flattening rows:

SELECT col, x FROM (
  SELECT "wrong" col, SUM(totals.pageviews) x
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
  SELECT "correct" col, SUM(totals.pageviews) x
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)


col     x    
wrong   2262     
correct 249 


Update given "update 2" to the question:

Since BigQuery is working correctly, and this is a Hive problem, you should add that tag to get relevant answers.

Nevertheless, this is how I would correctly de-duplicate previously duplicated rows with BigQuery:

SELECT SUM(pv)
FROM (
  SELECT visitId, MAX(totals.pageviews) pv
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
  GROUP EACH BY 1
)

这篇关于Hive表中的Google Analytics报告和BigQuery数据的统计差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆