Selecting or removing duplicate columns from a Spark DataFrame
Question
Given a Spark DataFrame with duplicate column names (e.g. A) whose upstream source I cannot modify, how do I select, remove, or rename one of the columns so that I can retrieve its values?
df.select('A') shows me an ambiguous column error, as do filter, drop, and withColumnRenamed. How do I select one of the columns?
Recommended answer
The only way I found after hours of research is to rename the whole set of columns, then create another DataFrame that uses the new names as its header.
For example, if you have:
>>> import pyspark
>>> from pyspark.sql import SQLContext
>>>
>>> sc = pyspark.SparkContext()
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'a'])
>>> df
DataFrame[a: bigint, b: bigint, a: bigint]
>>> df.columns
['a', 'b', 'a']
>>> df2 = df.toDF('a', 'b', 'c')
>>> df2.columns
['a', 'b', 'c']
You can get the list of columns with df.columns, then use a loop to rename any duplicates and build the new column list. Don't forget to pass *new_col_list rather than new_col_list to toDF: it expects each name as a separate argument, and passing the list itself fails with a column count mismatch error.
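The renaming loop described above could be sketched as follows. This is only one possible approach, not the original author's code: the helper name dedupe_column_names and the "_1", "_2" suffix scheme are assumptions made for illustration. The helper is plain Python so it can be tried without a Spark session, and the comment at the end shows how its result would be passed to toDF.

```python
from collections import Counter


def dedupe_column_names(columns):
    """Return a new list in which repeated names get a numeric suffix.

    Hypothetical helper: the first occurrence keeps its name, later
    occurrences become name_1, name_2, ... (skipping names already in use).
    """
    seen = Counter()
    taken = set(columns)  # names that must not be reused as a suffix result
    new_cols = []
    for name in columns:
        seen[name] += 1
        if seen[name] == 1:
            # First time this name appears: keep it unchanged.
            new_cols.append(name)
            continue
        # Repeated name: find the next suffix that does not collide.
        n = seen[name] - 1
        candidate = f"{name}_{n}"
        while candidate in taken:
            n += 1
            candidate = f"{name}_{n}"
        taken.add(candidate)
        new_cols.append(candidate)
    return new_cols


new_col_list = dedupe_column_names(['a', 'b', 'a'])
print(new_col_list)

# With a real DataFrame, unpack the list so each name is its own argument:
#     df2 = df.toDF(*dedupe_column_names(df.columns))
```

Note that this sketch compares names case-sensitively, while Spark resolves column names case-insensitively by default, so columns such as 'A' and 'a' may still need to be handled separately.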