Google BigQuery: How do I get a distinct row for a value in query results


Problem description

I am trying to use Google BigQuery on the GitHub Archive (http://www.githubarchive.org/) data to get statistics for repositories as of their latest event, restricted to the repositories with the most watchers. I realize this is a lot, but I feel like I'm really close to getting it in one query.

This is the query I have now:

SELECT repository_name, repository_owner, repository_organization, repository_size,  repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

The only problem is that I get every event from the most-watched repository (Twitter Bootstrap):

Result:

Row  repository_name  repository_owner  repository_organization  repository_size  watchers  forks  repository_language  time
1    bootstrap        twbs              twbs                     83875            61191     21602  JavaScript           1384991582000000
2    bootstrap        twbs              twbs                     83875            61190     21602  JavaScript           1384991337000000
3    bootstrap        twbs              twbs                     83875            61190     21603  JavaScript           1384989683000000

...

How can I get this to return a single row (the most recent, i.e. MAX(time)) for each repository_name?

I've tried:

SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

I'm not sure whether that would work, but it doesn't matter because I get this error message:

Error: Join attribute is not defined: PARSE_UTC_USEC

Any help would be great, thanks.

Solution

One issue with that query is that if two operations happen at the same time, your results can get confused. You can get what you want if you group by the repository name alone to get the latest event time (the max created_at) for each repository, and then join against that to get the other fields you want. For example:

SELECT
  a.repository_name as name,
  a.repository_owner as owner,
  a.repository_organization as organization,
  a.repository_size as size,
  a.repository_watchers AS watchers,
  a.repository_forks AS forks,
  a.repository_language as language,
  PARSE_UTC_USEC(a.created_at) AS time
FROM [githubarchive:github.timeline] a
JOIN EACH
  (
     SELECT MAX(created_at) as max_created, repository_name 
     FROM [githubarchive:github.timeline]
     GROUP EACH BY repository_name
  ) b
  ON 
  b.max_created = a.created_at and
  b.repository_name = a.repository_name
ORDER BY watchers desc
LIMIT 1000  
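
To see the group-then-join pattern from the query above on a small scale, here is a minimal sketch that runs the same shape of query against an in-memory SQLite table from Python. It is an illustration under assumptions, not BigQuery itself: the events table, its reduced column set, the latest_per_repo helper, and the "example-repo" rows are all hypothetical, and SQLite uses a plain JOIN and GROUP BY where the legacy BigQuery query uses JOIN EACH and GROUP EACH BY. The bootstrap rows reuse the values shown in the question.

```python
import sqlite3

# Made-up miniature of the timeline table, one row per event. The bootstrap
# rows reuse the values from the question; "example-repo" is hypothetical.
EVENTS = [
    # (repository_name, watchers, forks, created_at in microseconds)
    ("bootstrap", 61190, 21603, 1384989683000000),
    ("bootstrap", 61190, 21602, 1384991337000000),
    ("bootstrap", 61191, 21602, 1384991582000000),
    ("example-repo", 10, 2, 1384980000000000),
    ("example-repo", 12, 2, 1384990000000000),
]

LATEST_PER_REPO = """
SELECT a.repository_name, a.watchers, a.forks, a.created_at
FROM events AS a
JOIN (
    -- Step 1: newest event time per repository.
    SELECT repository_name, MAX(created_at) AS max_created
    FROM events
    GROUP BY repository_name
) AS b
  -- Step 2: keep only the event rows that carry that newest time.
  ON b.repository_name = a.repository_name
 AND b.max_created = a.created_at
ORDER BY a.watchers DESC
"""


def latest_per_repo(events):
    """Return the rows of the newest event for each repository."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events ("
            "repository_name TEXT, watchers INTEGER, "
            "forks INTEGER, created_at INTEGER)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
        return conn.execute(LATEST_PER_REPO).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_per_repo(EVENTS):
        print(row)
```

Grouping by repository_name alone is what collapses each repository to one timestamp; grouping by watchers and forks as well, as in the question, creates a separate group for every distinct combination of those values. Note that if two events for the same repository share the newest timestamp, the join returns both of them, so an extra tie-breaking step would be needed to guarantee exactly one row.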
