Splitting a row in a PySpark DataFrame into multiple rows
Problem description
I currently have a DataFrame in which one column holds space-separated strings of the form "a b c d e ...". Call this column "col4".
I would like to split a single row into multiple rows by splitting the elements of col4, preserving the values of all the other columns.
So, for example, given a df with a single row:
col1[0] | col2[0] | col3[0] | a b c |
I would like the output to be:
col1[0] | col2[0] | col3[0] | a |
col1[0] | col2[0] | col3[0] | b |
col1[0] | col2[0] | col3[0] | c |
Using the split and explode functions, I have tried the following:
d = COMBINED_DF.select(col1, col2, col3, explode(split(my_fun(col4), " ")))
However, this results in the following output:
col1[0] | col2[0] | col3[0] | a b c |
col1[0] | col2[0] | col3[0] | a b c |
col1[0] | col2[0] | col3[0] | a b c |
which is not what I want.
Here's a reproducible example:
from pyspark.sql.functions import split, explode

# Create dummy data (`sc` is the SparkContext that the pyspark shell provides)
df = sc.parallelize([(1, 2, 3, 'a b c'),
                     (4, 5, 6, 'd e f'),
                     (7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3', 'col4'])

# Split col4 on spaces, then explode the resulting array into one row per element
df.withColumn('col4', explode(split('col4', ' '))).show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   2|   3|   a|
|   1|   2|   3|   b|
|   1|   2|   3|   c|
|   4|   5|   6|   d|
|   4|   5|   6|   e|
|   4|   5|   6|   f|
|   7|   8|   9|   g|
|   7|   8|   9|   h|
|   7|   8|   9|   i|
+----+----+----+----+
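To make the row-level effect of split followed by explode easier to see, here is a minimal plain-Python sketch of the same transformation on the dummy data. It is only an illustration of the semantics, not how Spark implements them. The helper name `split_and_explode` and its parameters are made up for this example, and it splits on a literal separator, whereas Spark's `split` treats its pattern as a regular expression.

```python
def split_and_explode(rows, index, sep=" "):
    """Yield one output row per token found in the column at `index`.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    for row in rows:
        for token in row[index].split(sep):
            # Copy the row, replacing only the split column with a single token.
            yield row[:index] + (token,) + row[index + 1:]


rows = [(1, 2, 3, "a b c"),
        (4, 5, 6, "d e f"),
        (7, 8, 9, "g h i")]

exploded = list(split_and_explode(rows, index=3))
for row in exploded:
    print(row)
```

Each input row produces one output row per token, with col1 to col3 repeated unchanged, which matches the table printed by `show()`.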