Using Python's reduce() to join multiple PySpark DataFrames


Question

Does anyone know why using Python3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    # Join two DataFrames on the shared list of join columns.
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns),
    list_of_dataframes,
)

while this does not:

joined_df = list_of_dataframes[0]
joined_df.cache()  # cache the starting DataFrame before the iterative joins
for right_df in list_of_dataframes[1:]:
    joined_df = joined_df.join(right_df, on=list_of_join_columns)
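
For reference, a minimal self-contained sketch of both variants, assuming a local SparkSession and a hypothetical list of three small DataFrames that share an "id" join column (the real DataFrames are much larger), looks like this:

import functools

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("reduce-join-demo").getOrCreate()

# Hypothetical example data: three DataFrames sharing the "id" join column.
list_of_join_columns = ["id"]
list_of_dataframes = [
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "col_a"]),
    spark.createDataFrame([(1, 10), (2, 20)], ["id", "col_b"]),
    spark.createDataFrame([(1, True), (2, False)], ["id", "col_c"]),
]

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

# reduce() variant
reduced_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns),
    list_of_dataframes,
)

# for-loop variant
looped_df = list_of_dataframes[0]
looped_df.cache()
for right_df in list_of_dataframes[1:]:
    looped_df = looped_df.join(right_df, on=list_of_join_columns)

reduced_df.show()
looped_df.show()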

Any ideas would be greatly appreciated. Thanks!

Answer

One reason is that a reduce or a fold is usually functionally pure: the result of each accumulation operation is not written to the same part of memory, but rather to a new block of memory.

In principle the garbage collector could free the previous block after each accumulation, but if it doesn't, you'll allocate memory for each updated version of the accumulator.
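
To make that accumulation pattern concrete, here is a simplified, hypothetical stand-in for what functools.reduce does internally (not the real CPython implementation, just a sketch): each step rebinds the accumulator name to a brand-new object instead of updating one in place.

def reduce_sketch(function, iterable):
    # Illustration only: a simplified stand-in for functools.reduce.
    iterator = iter(iterable)
    accumulator = next(iterator)  # start from the first element
    for element in iterator:
        # Each step produces a brand-new accumulator (in the PySpark case, a new
        # DataFrame wrapping an ever-larger join plan); the previous version is
        # only reclaimed once nothing references it anymore.
        accumulator = function(accumulator, element)
    return accumulator

If those intermediate versions are not collected promptly, memory usage grows with every join in the chain.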

