在 BigQuery 中计算移动最大值 [英] Computing a moving maximum in BigQuery

查看:22
本文介绍了在 BigQuery 中计算移动最大值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一个带有一些排序和一些数字的 BigQuery 表,我想计算数字的移动最大值"——类似于移动平均线,但要计算最大值.来自 尝试使用 BigQuery 计算 EMA(指数移动平均线) 似乎最好的方法是使用 LEAD() 然后自己进行聚合.(Bigquery 移动平均线 本质上暗示了CROSS JOIN,但这似乎是考虑到数据的大小,会很慢.)

Given a BigQuery table with some ordering, and some numbers, I'd like to compute a "moving maximum" of the numbers -- similar to a moving average, but for a maximum instead. From Trying to calculate EMA (exponential moving average) using BigQuery it seems like the best way to do this is by using LEAD() and then doing the aggregation myself. (Bigquery moving average suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)

理想情况下,我可能只能从内部查询中返回一个重复的字段,而不是 20 个单独的字段,然后对重复的字段使用正常聚合,但我还没有想出办法做到这一点,所以我坚持滚动我自己的聚合.虽然这对于求和或平均值来说很容易,但计算最大内联非常棘手,我还没有想出一个好的方法.

Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.

(为了使用公共数据集,下面的示例当然有些人为设计.他们还对 3 个元素进行了最大滚动,而我想这样做大约 20 个.我已经在以编程方式生成查询,所以缩短它不是什么大问题.)

(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)

一种方法是执行以下操作:

One approach is to do the following:

SELECT word,
  (CASE
    WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
    WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
    ELSE word_count_2 END
    ) AS max_count
FROM (
  SELECT word, word_count,
    LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
    LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)

这是 O(n^2),但它至少有效.我也可以做一个嵌套的 IF 链,就像这样:

This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:

SELECT word,
  IF(word_count >= word_count_1,
    IF(word_count >= word_count_2, word_count, word_count_2),
    IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...

这是 O(n) 来评估的,但查询大小是 n 的指数,所以我认为这不是一个好的选择;当然,它会超过 n=20 的 BigQuery 查询大小限制.我还可以执行 n 个嵌套查询:

This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:

SELECT word,
  IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
  SELECT word,
    IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
  FROM ...
)

不过,似乎执行 20 个嵌套查询在性能方面可能不是一个好主意.

It seems like doing 20 nested queries might not be a great idea performance-wise, though.

有没有什么好办法来做这种查询?如果不是,我是否更正了 n 大约 20,第一个是最不坏的?

Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?

推荐答案

我用于滚动窗口的一个技巧:CROSS JOIN 与数字表.在这种情况下,为了有 3 年的移动窗口,我与数字 0、1、2 交叉连接.然后您可以为每个组创建一个 id (ending_at_year==year-i) 并按此分组.

A trick I'm using for rolling windows: CROSS JOIN with a table of numbers. In this case, to have a moving window of 3 years, I cross join with the numbers 0,1,2. Then you can create an id for each group (ending_at_year==year-i) and group by that.

SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM 
(
 SELECT mean_temp, year-i ending_at_year, year
 FROM [publicdata:samples.gsod] a
 CROSS JOIN 
  (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
 WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;

这篇关于在 BigQuery 中计算移动最大值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆