BigQuery argmax: Is array order maintained when doing CROSS JOIN UNNEST

Question

Problem:

In BigQuery standard SQL, if I run:

SELECT *
FROM mytable
CROSS JOIN UNNEST(mytable.array)

Can I be certain that the resulting row order is the same as the array order?

Example:

Suppose I have the following table, mytable:

Row | id   | prediction
1   | abcd | [0.2, 0.5, 0.3]

If I run SELECT * FROM mytable CROSS JOIN UNNEST(mytable.prediction), can I be certain that the row order is the same as the array order? I.e. will the resulting table always be:

Row | id   | unnested_prediction
1   | abcd | 0.2
2   | abcd | 0.5
3   | abcd | 0.3

More use-case background (argmax):

I'm trying to find the array index with the largest value for the array in each row (argmax), i.e. the second element (0.5) in the array above. My target output is thus something like this:

Row | id   | argmax
1   | abcd | 2

Using CROSS JOIN, a DENSE_RANK window function ordered by the prediction value and a ROW_NUMBER window function to find the argmax, I am able to make this work with some test data. You can verify with this query:

WITH predictions AS (
  SELECT 'abcd' AS id, [0.2, 0.5, 0.3] AS prediction
  UNION ALL
  SELECT 'efgh' AS id, [0.7, 0.2, 0.1] AS prediction
),
ranked_predictions AS (
  SELECT 
    id,
    ROW_NUMBER() OVER (PARTITION BY id) AS rownum, -- This is the ordering I'm curious about
    DENSE_RANK() OVER (PARTITION BY id ORDER BY flattened_prediction DESC) AS array_rank
  FROM
     predictions P
  CROSS JOIN
    UNNEST(P.prediction) AS flattened_prediction
)
SELECT
  id,
  rownum AS argmax
FROM
  ranked_predictions
WHERE array_rank = 1

It could just be a coincidence that ROW_NUMBER behaves well in my tests (i.e. that it is ordered according to the unnested array), so it would be nice to be certain.

Recommended answer

Short answer: no, order is not guaranteed to be maintained.

Long answer: in practice, you'll most likely see that order is maintained, but you should not depend on it. The example that you provided is similar to this type of query:

SELECT *
FROM (
  SELECT 3 AS x UNION ALL
  SELECT 2 UNION ALL
  SELECT 1
  ORDER BY x
)

What is the expected order of the output? The ORDER BY is in the subquery, and the outer query doesn't impose any ordering, so BigQuery (or whatever engine you run this in) is free to reorder the rows in the output as it sees fit. You may end up getting back 1, 2, 3, or you may receive 3, 2, 1 or any other ordering. The more general principle is that projections are not order-preserving.

While arrays have a well-defined order of their elements, when you use the UNNEST function, you're converting the array into a relation, which doesn't have a well-defined order unless you use ORDER BY. For example, consider this query:

SELECT ARRAY(SELECT x + 1 FROM UNNEST(arr) AS x) AS new_arr
FROM (SELECT [1, 2, 3] AS arr)

The new_arr array isn't actually guaranteed to have the elements [2, 3, 4] in that order, since the query inside the ARRAY function doesn't use ORDER BY. You can address this non-determinism by ordering based on the element offsets, however:

SELECT ARRAY(SELECT x + 1 FROM UNNEST(arr) AS x WITH OFFSET ORDER BY OFFSET) AS new_arr
FROM (SELECT [1, 2, 3] AS arr)

The output is now guaranteed to be [2, 3, 4].

Going back to your original question, you can ensure that you get deterministic output by imposing an ordering in the subquery that computes the row numbers:

ranked_predictions AS (
  SELECT 
    id,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY OFFSET) AS rownum,
    DENSE_RANK() OVER (PARTITION BY id ORDER BY flattened_prediction DESC) AS array_rank
  FROM
     predictions P
  CROSS JOIN
    UNNEST(P.prediction) AS flattened_prediction WITH OFFSET
)

I added the WITH OFFSET after the UNNEST, and ORDER BY OFFSET inside the ROW_NUMBER window in order to ensure that the row numbers are computed based on the original ordering of the array elements.
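
As a side note, the same argmax can probably be written more compactly with a correlated subquery that sorts each array by value and then by offset. This is only a sketch, not part of the original answer: the aliases p and pos are illustrative names, and it assumes that ties should resolve to the earliest element, whereas the DENSE_RANK query above returns one row per tied maximum.

```sql
WITH predictions AS (
  SELECT 'abcd' AS id, [0.2, 0.5, 0.3] AS prediction
  UNION ALL
  SELECT 'efgh' AS id, [0.7, 0.2, 0.1] AS prediction
)
SELECT
  id,
  (
    -- pos is the zero-based array position, so add 1 for a one-based argmax.
    SELECT pos + 1
    FROM UNNEST(prediction) AS p WITH OFFSET AS pos
    -- Largest value first; on ties, the earliest array position wins.
    ORDER BY p DESC, pos
    LIMIT 1
  ) AS argmax
FROM predictions
```

With the sample data this gives argmax 2 for abcd and 1 for efgh. Because the ordering is stated explicitly inside the subquery, the result does not depend on the order in which UNNEST happens to emit rows.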
