How to split a column with comma separated values in PySpark's Dataframe?


Question


I have a PySpark dataframe with a column that contains comma separated values. The number of values that the column contains is fixed (say 4). Example:

+----+----------------------+
|col1|                  col2|
+----+----------------------+
|   1|val1, val2, val3, val4|
|   2|val1, val2, val3, val4|
|   3|val1, val2, val3, val4|
|   4|val1, val2, val3, val4|
+----+----------------------+


Here I want to split col2 into 4 separate columns as shown below:

+----+-------+-------+-------+-------+
|col1|  col21|  col22|  col23|  col24|
+----+-------+-------+-------+-------+
|   1|   val1|   val2|   val3|   val4|
|   2|   val1|   val2|   val3|   val4|
|   3|   val1|   val2|   val3|   val4|
|   4|   val1|   val2|   val3|   val4|
+----+-------+-------+-------+-------+

How can this be done?

Answer


I would split the column and make each element of the array a new column.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [['1', 'val1, val2, val3, val4'],
     ['2', 'val1, val2, val3, val4'],
     ['3', 'val1, val2, val3, val4'],
     ['4', 'val1, val2, val3, val4']],
    ["col1", "col2"],
)

# Split col2 on ", " into an array column.
df2 = df.select('col1', F.split('col2', ', ').alias('col2'))

# If you don't know the number of columns, derive it from the largest array:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]

# Turn each array element into its own column.
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
>>>
+----+-------+-------+-------+-------+
|col1|col2[0]|col2[1]|col2[2]|col2[3]|
+----+-------+-------+-------+-------+
|   1|   val1|   val2|   val3|   val4|
|   2|   val1|   val2|   val3|   val4|
|   3|   val1|   val2|   val3|   val4|
|   4|   val1|   val2|   val3|   val4|
+----+-------+-------+-------+-------+

