计算BigQuery中的移动最大值 [英] Computing a moving maximum in BigQuery

查看:85
本文介绍了计算BigQuery中的移动最大值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一个包含一些排序和一些数字的BigQuery表,我想计算一个数字的移动最大值 - 类似于移动平均值,但是取代最大值。从尝试使用BigQuery来计算EMA(指数移动平均数)看起来最好的方法是使用 LEAD(),然后自己进行聚合。 ( Bigquery移动平均数表明本质上是 CROSS JOIN ,但由于数据的大小,这看起来会很慢)。

理想情况下,我可能只能返回一个重复字段,而不是20个单独的字段,来自内部查询,然后在重复字段上使用正常聚合,但我还没有想出一种方法来实现这一点,所以我坚持滚动自己的聚合。虽然这对于总计或平均数来说很容易,但计算最大内联是相当棘手的,我还没有想出一个好方法来完成它。



(The下面的例子当然有些人为了使用公共数据集而做了一些设计,他们也做了滚动max超过3个元素,而我想这样做大约20个。我已经以编程方式生成了查询,因此使它变短)

一种方法是执行以下操作:

  SELECT word,
(CASE
when word_count> = word_count_1 AND word_count> = word_count_2 THEN word_count
when word_count_1> = word_count AND word_count_1> = word_count_2 THEN word_count_1
ELSE word_count_2 END
)AS max_count
FROM(
SELECT word,word_count,
LEAD(word_count,1)OVER(ORDER BY word)AS word_count_1,
LEAD(word_count,2)OVER(ORDER BY word)AS word_count_2,
FROM [publicdata:samples.shakes peare]
WHERE corpus ='macbeth'

这是O n ^ 2),但它至少起作用。我也可以做一个 IF s的嵌套链,就像这样:

  SELECT word,
IF(word_count> = word_count_1,
IF(word_count> = word_count_2,word_count,word_count_2),
IF(word_count_1> = word_count_2,word_count_1,word_count_2)) AS max_count
FROM ...

这是O(n)来评估,但是查询大小在n中是指数函数,所以我不认为这是一个好的选择;肯定会超过n = 20的BigQuery查询大小限制。我也可以做n个嵌套查询:

  SELECT word,
IF(word_count_2> = max_count,word_count_2,max_count )AS max_count
FROM(
SELECT word,
IF(word_count_1> = word_count,word_count_1,word_count)AS max_count
FROM ...
)$ b $但是,看起来20个嵌套查询可能并不是一个好主意,但是。

b
$ b

有没有一种很好的方法来做这种查询?如果没有,我是否正确地认为n在20左右,第一个是最不好的?

解决方案

用于滚动窗口:CROSS JOIN和一个数字表。在这种情况下,为了有一个3年的移动窗口,我交叉连接数字0,1,2。然后,您可以为每个组创建一个ID( ending_at_year == year-i )并按组进行分组。 (b)
$ b $ $ $ $ $ $ $ $ $ $ $ $ $ b SELECT mean_temp,year-i ending_at_year,year
FROM [publicdata:samples.gsod] a
CROSS JOIN
(SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i< 3 )b
WHERE station_number = 722860

GROUP BY ending_at_year
HAVING c = 3
ORDER BY ending_at_year;


Given a BigQuery table with some ordering, and some numbers, I'd like to compute a "moving maximum" of the numbers -- similar to a moving average, but for a maximum instead. From Trying to calculate EMA (exponential moving average) using BigQuery it seems like the best way to do this is by using LEAD() and then doing the aggregation myself. (Bigquery moving average suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)

Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.

(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)

One approach is to do the following:

SELECT word,
  (CASE
    WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
    WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
    ELSE word_count_2 END
    ) AS max_count
FROM (
  SELECT word, word_count,
    LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
    LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)

This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:

SELECT word,
  IF(word_count >= word_count_1,
    IF(word_count >= word_count_2, word_count, word_count_2),
    IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...

This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:

SELECT word,
  IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
  SELECT word,
    IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
  FROM ...
)

It seems like doing 20 nested queries might not be a great idea performance-wise, though.

Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?

解决方案

A trick I'm using for rolling windows: CROSS JOIN with a table of numbers. In this case, to have a moving window of 3 years, I cross join with the numbers 0,1,2. Then you can create an id for each group (ending_at_year==year-i) and group by that.

SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM 
(
 SELECT mean_temp, year-i ending_at_year, year
 FROM [publicdata:samples.gsod] a
 CROSS JOIN 
  (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
 WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;

这篇关于计算BigQuery中的移动最大值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆