Spark Dataframe distinguish columns with duplicated name

Views: 241 Posted: 2016/5/22 15:35:43 Tags: python apache-spark apache-spark-sql pyspark spark-dataframe


Question

As far as I know, a Spark DataFrame can have multiple columns with the same name, as shown in the DataFrame snapshot below:

[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]

The result above was created by joining a DataFrame to itself. As you can see, there are 4 columns: two named a and two named f.

The problem is that when I try to do further calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'); both return the error message below:

AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.

Is there any way in the Spark API to distinguish the columns with duplicated names? Or is there some way to rename the columns?

Recommended answer

Let's start with some data:
from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])

There are a few ways you can approach this problem. First of all, you can unambiguously reference the columns of each joined table through its parent DataFrame:
df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

You can also use table aliases:
from pyspark.sql.functions import col

df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")

df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

Finally, you can programmatically rename columns:
df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

## +--------------------+
## |               f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+

