Cleanest, most efficient syntax to perform DataFrame self-join in Spark
Question
In standard SQL, when you join a table to itself, you can create aliases for the tables to keep track of which columns you are referring to:
SELECT a.column_name, b.column_name...
FROM table1 a, table1 b
WHERE a.common_field = b.common_field;
There are two ways I can think of to achieve the same thing using the Spark DataFrame API:
Solution #1: Rename the columns

There are a couple of different methods for this in answers to this question. This one just renames all the columns with a specific suffix:
df.toDF(df.columns.map(_ + "_R"):_*)
For example, you can do:
df.join(df.toDF(df.columns.map(_ + "_R"):_*), $"common_field" === $"common_field_R")
Solution #2: Copy the reference to the DataFrame
Another simple solution is to just do this:
val df: DataFrame = ....
val df_right = df
df.join(df_right, df("common_field") === df_right("common_field"))
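As a sketch of this approach end-to-end (the question gives no data, so the local SparkSession, the rows, and the `payload` column below are all hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and data, just for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-by-reference")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "x"), (2, "y"), (2, "z")).toDF("common_field", "payload")

// Solution #2: copy the reference and join the DataFrame with itself.
val df_right = df
val joined = df.join(df_right, df("common_field") === df_right("common_field"))

joined.show()
```

Note that the result carries two copies of every column (`common_field` and `payload` appear twice), which is exactly the ambiguity the question is about.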
Both of these solutions work, and I could see each being useful in certain situations. Are there any internal differences between the two that I should be aware of?
Answer
There are at least two different ways you can approach this, either by aliasing:
df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")
or using name-based equality joins:
// Note that it will result in ambiguous column names
// so using aliases here could be a good idea as well.
// df.as("df1").join(df.as("df2"), Seq("foo"))
df.join(df, Seq("foo"))
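A short sketch of how the aliases pay off after the join (the data and the `bar` column are hypothetical; only the `foo` column name comes from the answer): once both sides are aliased, you can select and rename the duplicated columns unambiguously.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and data for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-aliases")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (2, "c")).toDF("foo", "bar")

// Alias both sides, then use the aliases to pick columns unambiguously.
val joined = df.as("df1")
  .join(df.as("df2"), $"df1.foo" === $"df2.foo")
  .select($"df1.foo", $"df1.bar".as("bar_left"), $"df2.bar".as("bar_right"))

joined.show()
```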
In general, column renaming, while the ugliest, is the safest practice across all versions. There have been a few bugs related to column resolution (we found one on SO not so long ago), and some details may differ between parsers (HiveContext / standard SQLContext) if you use raw expressions.
Personally, I prefer using aliases because of their resemblance to idiomatic SQL and the ability to use them outside the scope of a specific DataFrame object.
Regarding performance, unless you're interested in close-to-real-time processing, there should be no performance difference whatsoever. All of these should generate the same execution plan.
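One way to check this yourself (a sketch with hypothetical data; the `foo` column mirrors the answer's example) is to print the physical plan for each variant with `explain()` and compare:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and data for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-plans")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (2, "c")).toDF("foo", "bar")

// Apart from internal expression IDs and the deduplicated "foo" column
// in the second variant, both plans should use the same join strategy.
df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo").explain()
df.join(df, Seq("foo")).explain()
```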