Why does using a UDF in a SQL query lead to a Cartesian product?

Views: 43 | Published: 2021/11/12 5:38:51 | Tags: sql, apache-spark, apache-spark-sql


Question

Why does using UDFs lead to a Cartesian product instead of a full outer join? Obviously, the Cartesian product would have a lot more rows than a full outer join (Joins is an example), which is a potential performance hit.
Is there any way to force an outer join over the Cartesian product in the example given in Databricks-Question?

I have a Spark Streaming application that uses SQLContext to execute SQL statements on streaming data. When I register a custom UDF in Scala, the performance of the streaming application degrades significantly. Details below:

Statement 1:

select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar

Statement 2:

select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo,t2.bar)

I register a custom UDF using SQLContext as follows:

sqlc.udf.register("equals", (s1: String, s2: String) => s1 == s2)

On the same input and Spark configuration, Statement 2 performs significantly worse (close to 100X) than Statement 1.
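
To make the gap between the two statements concrete, here is a minimal plain-Scala sketch. It does not use Spark and is not Spark's implementation; the names JoinSketch, joinWithOpaquePredicate and hashEquiJoin are hypothetical, and rows are modelled as simple (joinKey, payload) pairs. It rests on an assumption the question does not confirm: that the engine cannot look inside a registered UDF, so a condition like equals(t1.foo, t2.bar) is only a black-box Boolean function that has to be tested on every pair of rows, whereas t1.foo = t2.bar exposes the join keys and lets one side be indexed by key.

```scala
import scala.collection.mutable.ArrayBuffer

// Plain-Scala model of the two join styles (assumption: not Spark code).
object JoinSketch {
  // A row is (joinKey, payload); both names are illustrative only.
  type Row = (String, String)

  // Join on an opaque predicate: nothing is known about `pred`, so every
  // left/right pair is tested. Work grows with left.size * right.size.
  def joinWithOpaquePredicate(
      left: Seq[Row],
      right: Seq[Row],
      pred: (String, String) => Boolean
  ): (Seq[(String, String)], Long) = {
    var evaluations = 0L
    val out = ArrayBuffer.empty[(String, String)]
    for (l <- left; r <- right) {
      evaluations += 1
      if (pred(l._1, r._1)) out += ((l._2, r._2))
    }
    (out.toSeq, evaluations)
  }

  // Join on key equality: index the right side by key once, then do one
  // lookup per left row. Work grows with left.size + right.size.
  def hashEquiJoin(left: Seq[Row], right: Seq[Row]): (Seq[(String, String)], Long) = {
    val index: Map[String, Seq[Row]] = right.groupBy(_._1)
    var lookups = 0L
    val out = ArrayBuffer.empty[(String, String)]
    for (l <- left) {
      lookups += 1
      for (r <- index.getOrElse(l._1, Seq.empty)) out += ((l._2, r._2))
    }
    (out.toSeq, lookups)
  }

  def main(args: Array[String]): Unit = {
    val table1 = (1 to 100).map(i => (s"k$i", s"a$i"))
    val table2 = (1 to 100).map(i => (s"k$i", s"b$i"))

    // Same logic as the UDF registered in the question.
    val equalsUdf = (s1: String, s2: String) => s1 == s2

    val (rowsA, evalsA) = joinWithOpaquePredicate(table1, table2, equalsUdf)
    val (rowsB, lookupsB) = hashEquiJoin(table1, table2)

    println(s"opaque predicate: ${rowsA.size} rows, $evalsA predicate calls")
    println(s"hash equi-join:   ${rowsB.size} rows, $lookupsB key lookups")
  }
}
```

With 100 rows on each side, both functions return the same 100 matches, but the opaque-predicate version makes 10,000 predicate calls while the keyed version makes 100 lookups. Under the stated assumption, that difference in pair count is the kind of cost that would separate Statement 2 from Statement 1.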
