如何在pyspark中爆炸数据框的多列 [英] How to explode multiple columns of a dataframe in pyspark

查看：131 发布时间：2020/10/16 22:02:18 python dataframe pyspark

本文介绍了如何在pyspark中爆炸数据框的多列的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个数据框，其中包含与以下内容类似的列中的列表。所有列中列表的长度都不相同。

I have a dataframe which consists lists in columns similar to the following. The length of the lists in all columns is not same.

Name  Age  Subjects                  Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]

我想爆炸数据框，使我得到以下输出-

I want to explode the dataframe in such a way that i get the following output-

Name Age Subjects Grades
Bob  16   Maths     A
Bob  16  Physics    B
Bob  16  Chemistry  C

我该如何实现？

推荐答案

此方法有效，

import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sql.createDataFrame(
    [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
    ['Name','Age','Subjects', 'Grades'])
df.show()

+-----+----+--------------------+---------+
| Name| Age|            Subjects|   Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+

使用 udf 和 zip 。 爆炸所需的那些列必须在爆炸前合并。

Use udf with zip. Those columns needed to explode have to be merged before exploding.

combine = F.udf(lambda x, y: list(zip(x, y)),
              ArrayType(StructType([StructField("subs", StringType()),
                                    StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()


+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+

这篇关于如何在pyspark中爆炸数据框的多列的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何在pyspark中爆炸数据框的多列 [英] How to explode multiple columns of a dataframe in pyspark

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

如何在pyspark中爆炸数据框的多列 [英] How to explode multiple columns of a dataframe in pyspark

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭