How to join nested columns in Spark with usingColumns
Question
I have two DataFrames that I would like to join.
DF1:
root
|-- myStruct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- region: long (nullable = true)
|-- first_name: string (nullable = true)
DF2:
root
|-- id: string (nullable = true)
|-- region: long (nullable = true)
|-- second_name: string (nullable = true)
My join statement is:
df1.join(df2, Seq("id", "region"), "leftouter")
But it fails with:
USING column `id` cannot be resolved on the left side of the join. The left-side columns: myStruct, first_name
I am running Spark 2.2 on Scala.
Recommended answer
You can use dot (.) notation to select an element from a struct column. So to select id from df1 you have to use myStruct.id, and to select region you have to use myStruct.region.
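As a minimal sketch of the dot notation described above, the snippet below builds a one-row DataFrame shaped like DF1 and selects the nested fields. The case classes, the sample values, the object name NestedFieldSelect and the helper selectNested are assumptions made for illustration; only the column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NestedFieldSelect {
  // Hypothetical row types mirroring the DF1 schema from the question.
  case class MyStruct(id: String, region: Long)
  case class LeftRow(myStruct: MyStruct, first_name: String)

  // Dot notation reaches into the struct column; the selected
  // columns come out named after the nested fields (id, region).
  def selectNested(df: DataFrame): DataFrame =
    df.select(col("myStruct.id"), col("myStruct.region"), col("first_name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-select")
      .getOrCreate()
    import spark.implicits._

    // Sample data is invented for this sketch.
    val df1 = Seq(LeftRow(MyStruct("a1", 1L), "Ann")).toDF()

    val flat = selectNested(df1)
    flat.printSchema()
    flat.show()

    spark.stop()
  }
}
```

Once the nested fields are top-level columns like this, a join on Seq("id", "region") should be able to resolve them on the left side. That is an alternative to consider, not the approach this answer takes.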
And since the column names to be used are not the same on both sides, you can use the === operator to compare them:
df1.join(df2, df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region"), "leftouter")
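To try the join expression above end to end, here is a small self-contained sketch. The case classes, the sample rows, the object name NestedJoinExample and the helper joinOnNested are hypothetical. The join condition itself is the one from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NestedJoinExample {
  // Hypothetical row types mirroring the DF1 and DF2 schemas from the question.
  case class MyStruct(id: String, region: Long)
  case class LeftRow(myStruct: MyStruct, first_name: String)
  case class RightRow(id: String, region: Long, second_name: String)

  // Left outer join that compares the nested fields of df1
  // with the top-level columns of df2.
  def joinOnNested(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(
      df2,
      df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region"),
      "leftouter")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-join")
      .getOrCreate()
    import spark.implicits._

    // Sample data is invented for this sketch.
    val df1 = Seq(
      LeftRow(MyStruct("a1", 1L), "Ann"),
      LeftRow(MyStruct("b2", 2L), "Bob") // no matching row in df2
    ).toDF()
    val df2 = Seq(RightRow("a1", 1L, "Smith")).toDF()

    val joined = joinOnNested(df1, df2)
    joined.printSchema()
    joined.show(false)

    spark.stop()
  }
}
```

Because this is a left outer join, the row without a match in df2 is kept, with nulls in the columns that come from df2.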
You should get a joined DataFrame with the following schema:
root
|-- myStruct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- region: long (nullable = false)
|-- first_name: string (nullable = true)
|-- id: string (nullable = true)
|-- region: integer (nullable = true)
|-- second_name: string (nullable = true)
You can drop the unnecessary columns after the join, or select only the columns you need.
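The two clean-up options mentioned above could look roughly like the following sketch. The object name JoinCleanup, the helper names dropRightKeys and selectNeeded, the case classes and the sample rows are all assumptions for illustration. Which columns count as unnecessary depends on your use case.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object JoinCleanup {
  // Hypothetical row types mirroring the DF1 and DF2 schemas from the question.
  case class MyStruct(id: String, region: Long)
  case class LeftRow(myStruct: MyStruct, first_name: String)
  case class RightRow(id: String, region: Long, second_name: String)

  // Same join condition as in the answer.
  private def condition(df1: DataFrame, df2: DataFrame): Column =
    df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region")

  // Option 1: join, then drop the key columns that came from df2.
  def dropRightKeys(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, condition(df1, df2), "leftouter")
      .drop(df2("id"))
      .drop(df2("region"))

  // Option 2: join, then select only the needed columns,
  // exposing the nested keys as top-level id and region.
  def selectNeeded(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, condition(df1, df2), "leftouter")
      .select(
        df1("myStruct.id").as("id"),
        df1("myStruct.region").as("region"),
        df1("first_name"),
        df2("second_name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-cleanup")
      .getOrCreate()
    import spark.implicits._

    // Sample data is invented for this sketch.
    val df1 = Seq(LeftRow(MyStruct("a1", 1L), "Ann")).toDF()
    val df2 = Seq(RightRow("a1", 1L, "Smith")).toDF()

    dropRightKeys(df1, df2).printSchema()
    selectNeeded(df1, df2).printSchema()

    spark.stop()
  }
}
```

Dropping by df2("id") rather than by the string "id" makes it explicit which side's column is removed, which helps when the same name could appear on both sides.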
I hope the answer is helpful.