Efficient column processing in PySpark
Problem description
I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1 and 0 based on the first column like this:
from pyspark.sql.functions import array_contains, when

# For each target column, set 1 if its name appears in list_column, else 0.
for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However, this process takes a lot of time. Is there a way to do this more efficiently? Something tells me that column processing can be parallelized.
Sample input data
+----------------+-----+-----+-----+
| list_column | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] | | | |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo'] | | | |
+----------------+-----+-----+-----+
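Given the fill logic described above, the expected result would look like this:

+----------------+-----+-----+-----+
| list_column    | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] |  1  |  0  |  0  |
| ['Bar', 'Baz'] |  0  |  1  |  1  |
| ['Foo']        |  1  |  0  |  0  |
+----------------+-----+-----+-----+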
Recommended answer
You could do it like this:
import pyspark.sql.functions as F

# Build one indicator expression per target column, then apply them
# all in a single select instead of one withColumn call per column.
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]
df = df.select(['list_column'] + exprs)
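The gain comes from the shape of the query plan, not from extra parallelism: each withColumn call appends another projection to the logical plan, so with >30000 columns the repeated plan analysis itself becomes the bottleneck, while a single select produces all columns in one projection. Spark already evaluates the expressions in parallel across partitions either way. Below is a minimal, self-contained sketch of the single-select approach, assuming a local SparkSession and the three sample rows from the question; list_of_column_names is hardcoded here purely for illustration:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Local session and the sample data from the question.
spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [(['Foo', 'Bak'],), (['Bar', 'Baz'],), (['Foo'],)],
    ['list_column'])

# Hardcoded for this sketch; in the real case this list holds >30000 names.
list_of_column_names = ['Foo', 'Bar', 'Baz']

# One expression per target column, applied in a single projection.
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]
df = df.select(['list_column'] + exprs)

df.show(truncate=False)

Running this reproduces the expected output table shown in the question section.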