How to explode multiple columns of a dataframe in pyspark
Question
I have a dataframe whose columns contain lists, similar to the following. The lists do not have the same length in every column.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
I want to explode the dataframe in such a way that I get the following output:
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
How can I achieve this?
Recommended answer
This works:
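import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import SparkSession

# The original snippet relied on a pre-existing `sql` context; a SparkSession is created here instead
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])
df.show()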
+-----+----+--------------------+---------+
| Name| Age|            Subjects|   Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Use a udf with zip. The columns that need to be exploded have to be merged into a single column before exploding.
# Zip the two arrays element-wise into one array of (subs, grades) structs
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))
# Explode the zipped array, then split each struct back into two columns
df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+
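To make the zip step easier to follow, here is a minimal plain-Python sketch of what the udf does to a single row, with no Spark involved. The helper names combine and explode_row are made up for illustration. The zip_longest variant is an assumption about how you might want to handle lists of unequal length and is not part of the answer above.

```python
from itertools import zip_longest


def combine(subjects, grades):
    """Pair elements position by position, as the udf's lambda does."""
    return list(zip(subjects, grades))


def explode_row(name, age, subjects, grades):
    """Emit one output row per (subject, grade) pair, repeating name and age."""
    return [(name, age, s, g) for s, g in combine(subjects, grades)]


row = (['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])
rows = explode_row(*row)
for r in rows:
    print(r)

# zip stops at the shortest input, so a subject without a grade is silently dropped.
truncated = combine(['Maths', 'Physics'], ['A'])

# zip_longest keeps every element and pads the shorter list with None instead.
padded = list(zip_longest(['Maths', 'Physics'], ['A']))
```

Name and Age are passed through untouched, so they stay single-element lists, just as in the Spark output above. In Spark, indexing the column (for example F.col("Name")[0]) should unwrap them if plain values are wanted. On Spark 2.4 or later, the built-in F.arrays_zip may be able to replace the udf, though that variant is not part of the answer and is not shown here.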