How to get total number of GitHub stars for a given repo in BigQuery?
Question
My goal is to track the popularity of my BigQuery repo over time.
I want to use publicly available BigQuery datasets, such as GitHub Archive or the GitHub dataset.
The sample_repos table in the GitHub dataset does not contain a snapshot of the star counts:
SELECT
watch_count
FROM
[bigquery-public-data:github_repos.sample_repos]
WHERE
repo_name == 'angular/angular'
This returns 5318.
GitHub Archive is a timeline of events. I can try to sum them all, but the result does not match the number shown in the GitHub UI, I guess because it does not account for unstar actions. Here is the query I used:
SELECT
COUNT(*)
FROM
[githubarchive:year.2011],
[githubarchive:year.2012],
[githubarchive:year.2013],
[githubarchive:year.2014],
[githubarchive:year.2015],
[githubarchive:year.2016],
TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2017-01-01'), TIMESTAMP('2017-03-30') )
WHERE
repo.name == 'angular/angular'
AND type = "WatchEvent"
This returns 24144.
The real value is 21,921.
Answer
#standardSQL
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login
FROM `githubarchive.month.*`
WHERE repo.name = 'angular/angular'
AND type = "WatchEvent"
Naive count: Some people star, un-star, and then star again, which creates duplicate WatchEvents.
Unique by actor id: Each person can only star a repo once, so we can count distinct actor ids. We still don't know whether they un-starred later, so the real star count will be lower than this number.
Unique by actor login: Some historical months are missing the 'actor.id' field. We can look at the 'actor.login' field instead (but some people change their logins).
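To see why the three numbers diverge, here is a minimal sketch that runs the same three aggregates on a tiny in-memory table. It uses Python's sqlite3 rather than BigQuery, and the rows, the flattened column names (actor_id, actor_login) and the helper count_stars are all made up for illustration; they are not taken from the GitHub Archive data.

```python
import sqlite3

# Hypothetical toy events (actor_id, actor_login, type); not real GitHub Archive rows.
EVENTS = [
    (1, "alice", "WatchEvent"),
    (1, "alice", "WatchEvent"),    # alice starred, un-starred, and starred again
    (2, "bob", "WatchEvent"),
    (2, "bobby", "WatchEvent"),    # bob renamed his login, then re-starred
    (None, "carol", "WatchEvent"), # old month where the actor id is missing
    (3, "dave", "ForkEvent"),      # not a star, filtered out by the WHERE clause
]


def count_stars(events):
    """Return the three counts from the answer's query for a list of event rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (actor_id INTEGER, actor_login TEXT, type TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    row = conn.execute(
        """
        SELECT COUNT(*),
               COUNT(DISTINCT actor_id),    -- NULL ids are ignored
               COUNT(DISTINCT actor_login)
        FROM events
        WHERE type = 'WatchEvent'
        """
    ).fetchone()
    conn.close()
    return {
        "naive_count": row[0],
        "unique_by_actor_id": row[1],
        "unique_by_actor_login": row[2],
    }


if __name__ == "__main__":
    print(count_stars(EVENTS))
```

In this toy data three distinct people starred the repo. The naive count reports 5, the count by id reports 2 because the row without an id is skipped, and the count by login reports 4 because the renamed account is counted twice. None of the three can see an un-star, since un-starring produces no event.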
Alternatively, thanks to the GHTorrent project:
#standardSQL
SELECT COUNT(*) stars
FROM `ghtorrent-bq.ght_2017_01_19.watchers` a
JOIN `ghtorrent-bq.ght_2017_01_19.projects` b
ON a.repo_id=b.id
WHERE url = 'https://api.github.com/repos/angular/angular'
LIMIT 10
20567, as of 2017/01/19.
Related:
- What happens when a project changes its name?
https://stackoverflow.com/a/42935592/132438
- How to get updated GHTorrent data, before they update it?
https://stackoverflow.com/a/42935662/132438