How to split a column with comma separated values in PySpark's Dataframe?
Question
I have a PySpark dataframe with a column that contains comma-separated values. The number of values in the column is fixed (say 4). Example:
+----+----------------------+
|col1| col2|
+----+----------------------+
| 1|val1, val2, val3, val4|
| 2|val1, val2, val3, val4|
| 3|val1, val2, val3, val4|
| 4|val1, val2, val3, val4|
+----+----------------------+
Here I want to split col2 into 4 separate columns as shown below:
+----+-------+-------+-------+-------+
|col1| col21| col22| col23| col24|
+----+-------+-------+-------+-------+
| 1| val1| val2| val3| val4|
| 2| val1| val2| val3| val4|
| 3| val1| val2| val3| val4|
| 4| val1| val2| val3| val4|
+----+-------+-------+-------+-------+
How can this be done?
Answer
I would split the column and make each element of the resulting array a new column.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('1', 'val1, val2, val3, val4'),
     ('2', 'val1, val2, val3, val4'),
     ('3', 'val1, val2, val3, val4'),
     ('4', 'val1, val2, val3, val4')],
    ["col1", "col2"])

# Split the string column into an array column
df2 = df.select('col1', F.split('col2', ', ').alias('col2'))

# If you don't know the number of elements in advance,
# take the maximum array size across all rows
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]

# Promote each array element to its own column
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
>>>
+----+-------+-------+-------+-------+
|col1|col2[0]|col2[1]|col2[2]|col2[3]|
+----+-------+-------+-------+-------+
| 1| val1| val2| val3| val4|
| 2| val1| val2| val3| val4|
| 3| val1| val2| val3| val4|
| 4| val1| val2| val3| val4|
+----+-------+-------+-------+-------+