How to Set PySpark DataFrame Headers to Another Row?

Problem description

I have a DataFrame that looks like this:

# +----+------+---------+
# |col1| col2 |  col3   |
# +----+------+---------+
# |  id| name |    val  |
# |  1 |  a01 |    X    |
# |  2 |  a02 |    Y    |
# +----+------+---------+

I need to create a new DataFrame from it that uses the values of the first data row (id, name, val) as the column headers and drops the original col1, col2, etc. header. The new table should look like this:

# +----+------+---------+
# | id | name |   val   |
# +----+------+---------+
# |  1 |  a01 |    X    |
# |  2 |  a02 |    Y    |
# +----+------+---------+

The set of columns can vary, so I can't hard-code the new names in the new DataFrame. This is a PySpark DataFrame, not a pandas one.

Recommended answer

Assuming that exactly one row has id in col1, name in col2 and val in col3, you can use the following logic (commented for clarity and explanation):

# Select the row that holds the real header names
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))

# Keep every row except the header row.
# Note: subtract() behaves like SQL EXCEPT DISTINCT, so duplicate data rows are collapsed as well.
restDF = df.subtract(header)

# Fetch the header row to the driver as a Row object
headerColumn = header.first()

# Rename each column to the value the header row holds for it
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])

restDF.show(truncate=False)

This should give you:

+---+----+---+
|id |name|val|
+---+----+---+
|1  |a01 |X  |
|2  |a02 |Y  |
+---+----+---+
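
The renaming loop above can also be written as a single rename of all columns with DataFrame.toDF. The following is only a sketch, not part of the original answer: the helper names clean_header_names and rename_from_header are hypothetical, and the whitespace trimming, the fallback for empty header cells and the numeric suffix for repeated names are assumptions about what a variable header row might need.

```python
def clean_header_names(header_values, fallback_names):
    """Build non-empty column names from a header row.

    header_values  -- values taken from the header row, in column order
    fallback_names -- the current column names, used when a header cell is empty
    Repeated names get a numeric suffix (name, name_1, name_2, ...).
    """
    seen = {}
    names = []
    for value, fallback in zip(header_values, fallback_names):
        # Header cells may be None or padded with spaces.
        name = str(value).strip() if value is not None else ""
        if not name:
            name = fallback
        # Suffix repeated names so later columns stay distinguishable.
        count = seen.get(name, 0)
        seen[name] = count + 1
        names.append(name if count == 0 else f"{name}_{count}")
    return names


def rename_from_header(rest_df, header_row):
    """Rename all columns of rest_df at once using the values of header_row."""
    # A pyspark Row is a tuple, so list() yields its values in column order.
    new_names = clean_header_names(list(header_row), rest_df.columns)
    # DataFrame.toDF(*cols) returns a new DataFrame with the given column names.
    return rest_df.toDF(*new_names)
```

With the variables from the answer, the call would be rename_from_header(restDF, headerColumn).show(truncate=False). Renaming everything in one step avoids the case where a header value happens to equal the name of a column the loop has not reached yet.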

However, the best option is to set the header option to true when reading the DataFrame from its source with sqlContext.
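
As a rough illustration of that option, the sketch below assumes the source is a CSV file whose first line is the id, name, val row; the question does not say what the source actually is. The function name read_csv_with_header and its path argument are hypothetical, and spark stands for an existing SparkSession (the sqlContext mentioned above exposes the same read attribute).

```python
def read_csv_with_header(spark, path):
    """Read a CSV file, taking the column names from its first line.

    spark -- an existing SparkSession or SQLContext (assumed to be created elsewhere)
    path  -- hypothetical location of the source file
    """
    # header=True tells the CSV reader to use the first line as column names
    # instead of loading it as a data row.
    return spark.read.csv(path, header=True)
```

A call such as df = read_csv_with_header(spark, "/path/to/data.csv") would then give a DataFrame whose columns are already named after the first line, so no header row needs to be filtered out afterwards.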
