Why does using a UDF in a SQL query lead to a Cartesian product?


Question


I saw the Databricks-Question (https://forums.databricks.com/questions/1907/performance-degradation-when-using-a-custom-udfs-i.html) and don't understand:

  1. Why does using UDFs lead to a Cartesian product instead of a full outer join? Obviously the Cartesian product produces far more rows than a full outer join (the join above is an example), which is a potential performance hit.
  2. Is there any way to force an outer join over the Cartesian product in the example given in the Databricks-Question?

Quoting the Databricks-Question here:

I have a Spark Streaming application that uses SQLContext to execute SQL statements on streaming data. When I register a custom UDF in Scala, the performance of the streaming application degrades significantly. Details below:

Statement 1:

Select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar

Statement 2:

Select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo,t2.bar)

I register a custom UDF using SQLContext as follows:

sqlc.udf.register("equals", (s1: String, s2: String) => s1 == s2)

On the same input and Spark configuration, Statement 2 performs significantly worse (close to 100X) compared to Statement 1.

Solution

Why does using UDFs lead to a Cartesian product instead of a full outer join?

The reason why using UDFs requires a Cartesian product is quite simple. Since you pass an arbitrary function, possibly with an infinite domain and non-deterministic behavior, the only way to determine its value is to pass the arguments and evaluate it. That means you simply have to check all possible pairs.
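
You can see this in the physical plan. Below is a minimal sketch; the tiny in-memory tables, column values, and the SparkSession-based setup are illustrative assumptions (the original question used the older SQLContext API), and the exact operator names depend on your Spark version. A UDF predicate typically shows up as a CartesianProduct or BroadcastNestedLoopJoin rather than an equi-join:

    // Illustrative setup: tiny in-memory tables standing in for table1/table2.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("udf-join-plan")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    Seq(("a", 1), ("b", 2)).toDF("foo", "col1").createOrReplaceTempView("table1")
    Seq(("a", 10), ("c", 20)).toDF("bar", "col2").createOrReplaceTempView("table2")

    spark.udf.register("equals", (s1: String, s2: String) => s1 == s2)

    // With a UDF condition the optimizer cannot extract join keys, so the plan
    // has to evaluate the predicate for every pair of rows.
    spark.sql(
      """SELECT col1, col2
        |FROM table1 AS t1 JOIN table2 AS t2
        |ON equals(t1.foo, t2.bar)""".stripMargin).explain()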

Simple equality, on the other hand, has predictable behavior. If you use the t1.foo = t2.bar condition, you can simply shuffle the t1 and t2 rows by foo and bar respectively to get the expected result.
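
Continuing the sketch above, the same query with a plain equality predicate gives the planner a join key to shuffle on, so the physical plan typically shows a SortMergeJoin (or a broadcast hash join for small inputs) instead of a pairwise comparison:

    // Plain equality: t1.foo = t2.bar is recognised as an equi-join condition,
    // so both sides can be shuffled/sorted by the key and matched locally.
    spark.sql(
      """SELECT col1, col2
        |FROM table1 AS t1 JOIN table2 AS t2
        |ON t1.foo = t2.bar""".stripMargin).explain()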

And just to be precise: in relational algebra, an outer join is actually expressed using a natural join. Anything beyond that is simply an optimization.

Any way to force an outer join over the Cartesian product?

Not really, unless you want to modify the Spark SQL engine.
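
What usually helps in practice is not forcing a different join strategy over the UDF condition, but rewriting the condition so the planner sees an equi-join. Here is a hedged sketch, continuing the setup above: the normalize UDF is hypothetical and stands in for whatever per-side transformation the custom comparison really performs (for the literal equals from the question, plain = is the whole fix). This only works when the UDF's logic can be decomposed into deterministic transformations applied to each side independently:

    import org.apache.spark.sql.functions.udf

    // Hypothetical per-side transformation; derive a comparable key on each side.
    val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

    val left  = spark.table("table1").withColumn("join_key", normalize($"foo"))
    val right = spark.table("table2").withColumn("join_key", normalize($"bar"))

    // The planner now sees an ordinary equality condition on join_key and can
    // use a shuffle-based (even outer) join instead of evaluating a UDF per pair.
    left.join(right, left("join_key") === right("join_key"), "full_outer")
      .select("col1", "col2")
      .explain()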

