Selecting or removing duplicate columns from spark dataframe
Given a Spark dataframe with duplicate column names (e.g. A) for which I cannot modify the upstream source, how do I select, remove, or rename one of the columns so that I can retrieve its values?
df.select('A') raises an ambiguous-column error, as do filter, drop, and withColumnRenamed. How do I select just one of the columns?
After hours of research, the only way I found is to rename the whole set of columns, then create another dataframe with the new set as the header. E.g., if you have:
>>> import pyspark
>>> from pyspark.sql import SQLContext
>>>
>>> sc = pyspark.SparkContext()
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'a'])
DataFrame[a: bigint, b: bigint, a: bigint]
>>> df.columns
['a', 'b', 'a']
>>> df2 = df.toDF('a', 'b', 'c')
>>> df2.columns
['a', 'b', 'c']
You can get the list of columns using df.columns, then use a loop to rename any duplicates and build the new column list (don't forget to pass *new_col_list rather than new_col_list to the toDF function, or it will throw an invalid count error, since toDF takes the names as separate arguments).
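A minimal sketch of that renaming loop (the helper name dedupe_columns and the suffix scheme are illustrative, not from the original answer — any scheme that makes the names unique works):

```python
def dedupe_columns(cols):
    """Return a new column list where repeated names get a numeric suffix."""
    seen = {}
    new_cols = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            new_cols.append(f"{c}_{seen[c]}")  # e.g. second 'a' becomes 'a_1'
        else:
            seen[c] = 0
            new_cols.append(c)
    return new_cols

new_col_list = dedupe_columns(['a', 'b', 'a'])
# new_col_list is now ['a', 'b', 'a_1']

# Then rebuild the dataframe with the unique names (note the *):
# df2 = df.toDF(*new_col_list)
# df2.select('a_1')   # no longer ambiguous
```

The Spark calls are left as comments since they need a live SparkContext; the loop itself is plain Python and can be tested on its own.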