Efficient column processing in PySpark


Question

I have a dataframe with a very large number of columns (>30000).

I'm filling it with 1s and 0s based on the first column, like this:

from pyspark.sql.functions import when, array_contains

# Add one indicator column per name, one withColumn call at a time
for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))

However, this process takes a lot of time. Is there a way to do this more efficiently? Something tells me that column processing can be parallelized.

Sample input data

+----------------+-----+-----+-----+
|  list_column   | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] |     |     |     |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo']        |     |     |     |
+----------------+-----+-----+-----+

Answer

You can try something like this:

import pyspark.sql.functions as F

# Build all the indicator expressions up front, then apply them in a single select
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]

df = df.select(['list_column'] + exprs)
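
Each withColumn call adds another projection to the query plan, so looping over 30000 columns makes plan construction and analysis itself the bottleneck; building all the expressions first and applying them in one select keeps the plan to a single projection. Below is a minimal, self-contained sketch of that approach, assuming a local SparkSession and the small sample data from the question (the session setup and the column list are illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data and column list, matching the table in the question
df = spark.createDataFrame(
    [(['Foo', 'Bak'],), (['Bar', 'Baz'],), (['Foo'],)],
    ['list_column'],
)
list_of_column_names = ['Foo', 'Bar', 'Baz']

# One expression per target column: 1 if the name appears in list_column, else 0
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]

df.select(['list_column'] + exprs).show()
# list_column   Foo  Bar  Baz
# [Foo, Bak]      1    0    0
# [Bar, Baz]      0    1    1
# [Foo]           1    0    0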
