BigQuery SQL:平均值、几何平均值、删除异常值、中位数 [英] BigQuery SQL: Average, geometric mean, remove outliers, median
问题描述
我正在计算在 Stack Overflow 上获得回复的平均时间,但结果毫无意义.
#standardSQLWITH question_answers AS (选择 *, timestamp_diff(answers.first, creation_date, minute) 分钟从 (选择创建日期, (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c从`bigquery-public-data.stackoverflow.posts_answers` bWHERE a.id=b.parent_id) 答案, SPLIT(tags, '|') 标签从`bigquery-public-data.stackoverflow.posts_questions` aWHERE EXTRACT(从创建日期开始的年份)>2015年), UNNEST(tags) 标签WHERE 标签('java'、'javascript'、'google-bigquery'、'firebase'、'php')AND 答案.c >0)选择标签, COUNT(*) 个问题, ROUND(AVG(minutes), 2) first_reply_avg_minutesFROM question_answers按标签分组
我应该如何计算平均时间?
2019 年更新: 分享一些
正如您在此处看到的,在这种情况下,移除异常值会为我们提供类似于几何平均值的结果 - 而中位数报告的数字甚至更低.使用哪一种?您的选择.
WITH question_answers AS (选择 *, timestamp_diff(answers.first, creation_date, minute) 分钟从 (选择创建日期, (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c从`bigquery-public-data.stackoverflow.posts_answers` bWHERE a.id=b.parent_id) 答案, SPLIT(tags, '|') 标签从`bigquery-public-data.stackoverflow.posts_questions` aWHERE EXTRACT(从创建日期开始的年份)>2015年), UNNEST(tags) 标签WHERE 标签('java'、'javascript'、'google-bigquery'、'firebase'、'php'、'sql'、'elasticsearch'、'apache-kafka'、'tensorflow')AND 答案.c >0)SELECT * EXCEPT(qs, all_minutes), (SELECT ROUND(AVG(q),2) FROM (SELECT q FROM UNNEST(qs) q ORDER BY q LIMIT 80 OFFSET 2)) avg_no_outliers, all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64))] medium_minutes从 (选择标签, COUNT(*) 个问题, ROUND(AVG(minutes), 2) avg_minutes, ROUND(EXP(AVG(LOG(GREATEST(minutes,1)))),2) first_reply_avg_minutes_geom, APPROX_QUANTILES(minutes, 100) qs, ARRAY_AGG(minutes IGNORE NULLS ORDER BY minutes) all_minutesFROM question_answers按标签分组)按 2 个 DESC 订购
<小时>
奖金 MEDIAN()
来自 Elliott 的 UDF 函数.
CREATE TEMP FUNCTION MEDIAN(arr ANY TYPE) AS ((选择如果(MOD(ARRAY_LENGTH(arr), 2) = 0,(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))])/2,arr[偏移量(DIV(ARRAY_LENGTH(arr),2))])FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)));
I'm calculating the average time to get a reply on Stack Overflow, and the results make no sense.
#standardSQL
WITH question_answers AS (
SELECT *
, timestamp_diff(answers.first, creation_date, minute) minutes
FROM (
SELECT creation_date
, (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
FROM `bigquery-public-data.stackoverflow.posts_answers` b
WHERE a.id=b.parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
WHERE EXTRACT(year FROM creation_date) > 2015
), UNNEST(tags) tag
WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php')
AND answers.c > 0
)
SELECT tag
, COUNT(*) questions
, ROUND(AVG(minutes), 2) first_reply_avg_minutes
FROM question_answers
GROUP BY tag
How should I calculate the average time?
Update 2019: How about sharing some persisted public UDFs?
First one, median:
SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000])
3.0
Indeed - getting an average time to get answers on Stack Overflow of above 100 hours (>6000 minutes) seems wrong - and is largely driven by outliers.
Instead of doing a simple AVG()
you could get:
- The geometric mean:
EXP(AVG(LOG(GREATEST(minutes,1))))
- The mean after removing outliers:
AVG(q) FROM (SELECT q FROM QUANTILES(q, 100) LIMIT 80 OFFSET 2))
. - The median:
all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64))]
The results make way more sense if you use any of those alternatives:
As you can see here, in this case removing outliers give us results similar to the geometric mean - while the median reports even lower numbers. Which one to use? Your choice.
WITH question_answers AS (
SELECT *
, timestamp_diff(answers.first, creation_date, minute) minutes
FROM (
SELECT creation_date
, (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
FROM `bigquery-public-data.stackoverflow.posts_answers` b
WHERE a.id=b.parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
WHERE EXTRACT(year FROM creation_date) > 2015
), UNNEST(tags) tag
WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php', 'sql', 'elasticsearch', 'apache-kafka', 'tensorflow')
AND answers.c > 0
)
SELECT * EXCEPT(qs, all_minutes)
, (SELECT ROUND(AVG(q),2) FROM (SELECT q FROM UNNEST(qs) q ORDER BY q LIMIT 80 OFFSET 2)) avg_no_outliers
, all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64) )] median_minutes
FROM (
SELECT tag
, COUNT(*) questions
, ROUND(AVG(minutes), 2) avg_minutes
, ROUND(EXP(AVG(LOG(GREATEST(minutes,1)))),2) first_reply_avg_minutes_geom
, APPROX_QUANTILES(minutes, 100) qs
, ARRAY_AGG(minutes IGNORE NULLS ORDER BY minutes) all_minutes
FROM question_answers
GROUP BY tag
)
ORDER BY 2 DESC
Bonus MEDIAN()
UDF function from Elliott.
CREATE TEMP FUNCTION MEDIAN(arr ANY TYPE) AS ((
SELECT
IF(
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
这篇关于BigQuery SQL:平均值、几何平均值、删除异常值、中位数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!