Cleanest, most efficient syntax to perform DataFrame self-join in Spark
Question
In standard SQL, when you join a table to itself, you can create aliases for the tables to keep track of which columns you are referring to:
SELECT a.column_name, b.column_name...
FROM table1 a, table1 b
WHERE a.common_field = b.common_field;
There are two ways I can think of to achieve the same thing using the Spark DataFrame API:
Solution #1: Rename the columns
There are a couple of different methods for this in answer to this question. This one just renames all of the columns with a specific suffix:
df.toDF(df.columns.map(_ + "_R"):_*)
For example, you can then do:
df.join(df.toDF(df.columns.map(_ + "_R"):_*), $"common_field" === $"common_field_R")
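As a minimal runnable sketch of this rename approach (the toy data and the `name`/`common_field` column names below are illustrative assumptions, not part of the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join-rename").getOrCreate()
import spark.implicits._

// Hypothetical toy data keyed by common_field.
val df = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("name", "common_field")

// Suffix every column so the two sides of the self-join are unambiguous.
val renamed = df.toDF(df.columns.map(_ + "_R"): _*)
val joined = df.join(renamed, $"common_field" === $"common_field_R")

joined.show()
```

Because every right-hand column carries the `_R` suffix, there is no ambiguity in later `select` or `filter` expressions.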
Solution #2: Copy the reference to the DataFrame
Another simple solution is to just do this:
val df: DataFrame = ....
val df_right = df
df.join(df_right, df("common_field") === df_right("common_field"))
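A sketch of the copy-reference approach with the same assumed toy data. Note that `df_right` is the same object as `df`, so Spark has to disambiguate the self-join itself, and the output carries duplicated column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join-ref").getOrCreate()
import spark.implicits._

// Hypothetical toy data keyed by common_field.
val df = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("name", "common_field")
val df_right = df

// Both references point at the same logical plan; Spark resolves the
// self-join, but both "name" and "common_field" appear twice in the output.
val joined = df.join(df_right, df("common_field") === df_right("common_field"))
joined.show()
```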
Both of these solutions work, and I could see each being useful in certain situations. Are there any internal differences between the two that I should be aware of?
Answer
There are at least two different ways you can approach this, either by aliasing:
df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")
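A runnable sketch of the alias approach, showing how the aliases then qualify columns downstream, much like the `a`/`b` aliases in the SQL version (the toy data and column names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join-alias").getOrCreate()
import spark.implicits._

// Hypothetical toy data keyed by common_field.
val df = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("name", "common_field")

// The aliases can be used to qualify columns in later expressions.
val joined = df.as("df1")
  .join(df.as("df2"), $"df1.common_field" === $"df2.common_field")
  .select($"df1.name".as("left_name"), $"df2.name".as("right_name"))

joined.show()
```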
or by using name-based equality joins:
// Note that it will result in ambiguous column names
// so using aliases here could be a good idea as well.
// df.as("df1").join(df.as("df2"), Seq("foo"))
df.join(df, Seq("foo"))
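A sketch of the name-based join with assumed toy data: the join column appears only once in the output, but the remaining columns are duplicated, which is why combining it with aliases (as the comment suggests) keeps things resolvable:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join-using").getOrCreate()
import spark.implicits._

// Hypothetical toy data keyed by common_field.
val df = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("name", "common_field")

// "common_field" appears once, but "name" is duplicated and ambiguous.
val joined = df.join(df, Seq("common_field"))
joined.show()

// Aliasing both sides first keeps the duplicated columns addressable.
val resolvable = df.as("d1")
  .join(df.as("d2"), Seq("common_field"))
  .select($"common_field", $"d1.name", $"d2.name".as("name_2"))
```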
In general, column renaming, while the ugliest, is the safest practice across all versions. There have been a few bugs related to column resolution (we found one on SO not so long ago), and some details may differ between parsers (HiveContext / standard SQLContext) if you use raw expressions.
Personally, I prefer using aliases because of their resemblance to idiomatic SQL and the ability to use them outside the scope of a specific DataFrame object.
Regarding performance: unless you're interested in near-real-time processing, there should be no performance difference whatsoever. All of these should generate the same execution plan.
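You can check the execution-plan claim yourself with `explain()`. A sketch with assumed toy data, comparing the alias and rename variants:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join-plans").getOrCreate()
import spark.implicits._

// Hypothetical toy data keyed by common_field.
val df = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("name", "common_field")

val viaAlias = df.as("df1")
  .join(df.as("df2"), $"df1.common_field" === $"df2.common_field")
val viaRename = df.join(
  df.toDF(df.columns.map(_ + "_R"): _*),
  $"common_field" === $"common_field_R")

// Both should optimize to the same join shape (modulo attribute names).
viaAlias.explain()
viaRename.explain()
```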