Spark Dataframe 区分名称重复的列 [英] Spark Dataframe distinguish columns with duplicated name
问题描述
据我所知,在 Spark Dataframe 中,多列可以具有相同的名称,如下面的数据帧快照所示:
<预><代码>[行(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0,1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),行(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0,1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),行(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0,1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),行(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0,1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),行(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0,1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))]以上结果是通过将数据帧与自身连接创建的,您可以看到有 4
列,其中包含两个 a
和 f
.
问题是当我尝试使用 a
列进行更多计算时,我找不到选择 a
的方法,我尝试了 df[0]
和 df.select('a')
,都在错误消息下面返回了我:
AnalysisException:引用a"不明确,可能是:a#1333L、a#1335L.
无论如何在 Spark API 中我可以再次区分列和重复名称?或者有什么方法可以让我更改列名?
我建议您更改 join
的列名称.
df1.select(col("a") as "df1_a", col("f") as "df1_f").join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a" === col("df2_a"))
生成的 DataFrame
将具有 schema
(df1_a, df1_f, df2_a, df2_f)
So as I know in Spark Dataframe, that for multiple columns can have the same name as shown in below dataframe snapshot:
[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]
Above result is created by join with a dataframe to itself, you can see there are 4
columns with both two a
and f
.
The problem is is there when I try to do more calculation with the a
column, I cant find a way to select the a
, I have try df[0]
and df.select('a')
, both returned me below error mesaage:
AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.
Is there anyway in Spark API that I can distinguish the columns from the duplicated names again? or maybe some way to let me change the column names?
I would recommend that you change the column names for your join
.
df1.select(col("a") as "df1_a", col("f") as "df1_f")
.join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a" === col("df2_a"))
The resulting DataFrame
will have schema
(df1_a, df1_f, df2_a, df2_f)
这篇关于Spark Dataframe 区分名称重复的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!