将大数组列拆分为多列 - Pyspark [英] Split large array columns into multiple columns - Pyspark
本文介绍了将大数组列拆分为多列 - Pyspark的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我有:
+---+-------+-------+
| id| var1| var2|
+---+-------+-------+
| a|[1,2,3]|[1,2,3]|
| b|[2,3,4]|[2,3,4]|
+---+-------+-------+
我想要:
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
df1.select('id', df1.var1[0], df1.var1[1], ...).show()
有效,但我的一些数组很长(最多 332 个).
works, but some of my arrays are very long (max 332).
我该如何编写它才能考虑到所有长度的数组?
How can I write this so that it takes account of all length arrays?
推荐答案
无论初始列的数量和数组的大小如何,此解决方案都适用于您的问题.此外,如果一列具有不同的数组大小(例如 [1,2]、[3,4,5]),则会导致空值的最大列数填补空白.
This solution will work for your problem, no matter the number of initial columns and the size of your arrays. Moreover, if a column has different array sizes (eg [1,2], [3,4,5]), it will result in the maximum number of columns with null values filling the gap.
from pyspark.sql import functions as F
df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])
columns = df.drop('id').columns
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
>>>
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
这篇关于将大数组列拆分为多列 - Pyspark的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文