How to Set Pyspark Dataframe Headers to another Row?
Problem description
I have a dataframe that looks like this:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |id  |name|val |
# |1   |a01 |X   |
# |2   |a02 |Y   |
# +----+----+----+
I need to create a new dataframe from it, using row[1] (the first row, which holds id, name, val) as the new column headers and dropping the original col1, col2, etc. header row. The new table should look like this:
# +---+----+---+
# |id |name|val|
# +---+----+---+
# |1  |a01 |X  |
# |2  |a02 |Y  |
# +---+----+---+
The set of columns can vary, so I can't hard-code the names when building the new dataframe. This does not use pandas dataframes.
Recommended answer
Assuming that there is only one row with id in col1, name in col2 and val in col3, you can use the following logic (commented for clarity and explanation):
# Select the row that holds the real header names
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))
# Keep every row except the header row
# (note: subtract behaves like EXCEPT DISTINCT, so duplicate data rows are collapsed too)
restDF = df.subtract(header)
# Convert the header row into a Row object
headerColumn = header.first()
# Rename each column to the matching value from the header row
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])
restDF.show(truncate=False)
This should give you:
+---+----+---+
|id |name|val|
+---+----+---+
|1 |a01 |X |
|2 |a02 |Y |
+---+----+---+
But the best option would be to set the header option to true when reading the dataframe from the source with sqlContext, so that the first line is used for the column names from the start.