BigQuery-在连接谓词中使用SQL UDF [英] BigQuery - using SQL UDF in join predicate

查看:43
本文介绍了BigQuery-在连接谓词中使用SQL UDF的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我尝试在运行左联接时使用SQL UDF,但出现以下错误:

I'm trying to use a SQL UDF when running a left join, but get the following error:

联接谓词中的子查询应仅依赖于一个联接端.

Subquery in join predicate should only depend on exactly one join side.

查询是:

CREATE TEMPORARY FUNCTION game_match(game1 STRING,game2 STRING) AS (
  strpos(game1,game2) >0
);

SELECT 
  t1.gameId 
  FROM `bigquery-public-data.baseball.games_post_wide` t1
  left join `bigquery-public-data.baseball.games_post_wide` t2 on t1.gameId=t2.gameId and game_match(t1. gameId, t2.gameId)

内联编写条件时,该查询有效,而不是函数调用(strpos(t1.gameId,t2.gameId)> 0).

When writing the condition inline, instead of the function call (strpos(t1. gameId, t2. gameId) >0), the query works.

此特定功能是否存在问题,还是联接谓词不支持一般的SQL UDF(由于某种原因)?

Is there something problematic with this specific function, or is it that in general SQL UDF aren't supported in join predicate (for some reason)?

推荐答案

您可以在问题跟踪器来完成这项工作.这是查询计划/优化的局限性;在某些背景下,BigQuery会转换函数调用,以使查询的逻辑表示像这样:

You could file a feature request on the issue tracker to make this work. It's a limitation of query planning/optimization; for some background, BigQuery converts the function call so that the query's logical representation is like this:

SELECT 
  t1.gameId 
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2
on t1.gameId=t2.gameId
  and (SELECT strpos(game1,game2) > 0 FROM (SELECT t1.gameId AS game1, t2.gameId AS game2))

BigQuery这样转换SQL UDF调用的原因是,它需要避免多次计算输入.虽然在这种特定情况下这不是问题,但是如果您在UDF主体中多次引用输入中的一个,则会有所不同,例如考虑这个UDF:

The reason that BigQuery transforms the SQL UDF call like this is that it needs to avoid computing the inputs more than once. While it's not an issue in this particular case, it makes a difference if you reference one of the inputs more than once in the UDF body, e.g. consider this UDF:

CREATE TEMP FUNCTION Foo(x FLOAT64) AS (x - x);
SELECT Foo(RAND());

如果BigQuery直接将表达式内联,您将得到以下结果:

If BigQuery were to inline the expression directly, you'd end up with this:

SELECT RAND() - RAND();

结果不会为零,这在给定UDF的情况下是意外的.

The result would not be zero, which is unexpected given the definition of the UDF.

在大多数情况下,BigQuery的逻辑优化将如上所示的更复杂的子选择转换为更简单的形式,假设这样做不会改变查询的语义.但是,在这种情况下并没有发生这种情况,因此是错误.

In most cases, BigQuery's logical optimizations transform the more complicated subselect as shown above into a simpler form, assuming that doing so doesn't change the semantics of the query. That didn't happen in this case, though, hence the error.

这篇关于BigQuery-在连接谓词中使用SQL UDF的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆