Google BigQuery: How do I get a distinct row for a value in query results
Problem description
I am trying to use Google BigQuery on the GitHub Archive data (http://www.githubarchive.org/) to get each repository's statistics as of its latest event, for the repositories with the most watchers. I realize this is a lot, but I feel like I'm really close to getting it in one query.
This is the query I have now:
SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000
The only problem is that I get every event from the most-watched repository (Twitter Bootstrap):
Result:
Row  repository_name  repository_owner  repository_organization  repository_size  watchers  forks  repository_language  time
1    bootstrap        twbs              twbs                     83875            61191     21602  JavaScript           1384991582000000
2    bootstrap        twbs              twbs                     83875            61190     21602  JavaScript           1384991337000000
3    bootstrap        twbs              twbs                     83875            61190     21603  JavaScript           1384989683000000
...
How can I get this to return a single row (the most recent, i.e. MAX(time)) for each repository_name?
I've tried:
SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000
I'm not sure whether that would work, but it doesn't matter, because I get this error message:
Error: Join attribute is not defined: PARSE_UTC_USEC
Any help would be great, thanks.
One issue with that query is that if two events happen at the same time, your results can get confused. You can get what you want if you group by the repository name alone to find the latest event time (the maximum created_at) for each repository, and then join against that to pick up the other fields you want. For example:
SELECT
  a.repository_name AS name,
  a.repository_owner AS owner,
  a.repository_organization AS organization,
  a.repository_size AS size,
  a.repository_watchers AS watchers,
  a.repository_forks AS forks,
  a.repository_language AS language,
  PARSE_UTC_USEC(a.created_at) AS time
FROM [githubarchive:github.timeline] a
JOIN EACH
(
  SELECT MAX(created_at) AS max_created, repository_name
  FROM [githubarchive:github.timeline]
  GROUP EACH BY repository_name
) b
ON
  b.max_created = a.created_at AND
  b.repository_name = a.repository_name
ORDER BY watchers DESC
LIMIT 1000
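If you want to try the group-then-join pattern without BigQuery access, here is a minimal sketch in Python using the standard-library sqlite3 module. Treat it as an illustration under assumptions rather than the query you would run in BigQuery. SQLite's dialect differs, so there is no GROUP EACH BY or JOIN EACH. The table name timeline, the helper latest_per_repository, and every row of data are made up for the example. Timestamps are assumed to be strings in one consistent, sortable format.

```python
import sqlite3

# Hypothetical toy rows standing in for the GitHub Archive timeline table.
# Columns: repository_name, repository_watchers, repository_forks, created_at
ROWS = [
    ("repo_a", 61190, 21603, "2013-11-20 23:21:23"),
    ("repo_a", 61190, 21602, "2013-11-20 23:48:57"),
    ("repo_a", 61191, 21602, "2013-11-20 23:53:02"),
    ("repo_b", 30000, 6000, "2013-11-19 10:00:00"),
    ("repo_b", 30005, 6001, "2013-11-20 09:30:00"),
]

# Same shape as the answer's query: the subquery finds the latest created_at
# per repository, and the join pulls the remaining columns from that event.
QUERY = """
SELECT a.repository_name,
       a.repository_watchers AS watchers,
       a.repository_forks AS forks,
       a.created_at
FROM timeline AS a
JOIN (
    SELECT repository_name, MAX(created_at) AS max_created
    FROM timeline
    GROUP BY repository_name
) AS b
  ON b.repository_name = a.repository_name
 AND b.max_created = a.created_at
ORDER BY watchers DESC
"""


def latest_per_repository(rows):
    """Load rows into an in-memory SQLite table and run the join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE timeline ("
        "repository_name TEXT, repository_watchers INTEGER, "
        "repository_forks INTEGER, created_at TEXT)"
    )
    conn.executemany("INSERT INTO timeline VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_per_repository(ROWS):
        print(row)
```

One thing to keep in mind with this pattern: if two events for the same repository share the same latest created_at, the join matches both, so that repository still comes back with more than one row.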