Using Python's reduce() to join multiple PySpark DataFrames
Question
Does anyone know why using Python3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:
import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns),
    list_of_dataframes,
)
whereas this does not:
joined_df = list_of_dataframes[0]
joined_df.cache()
for right_df in list_of_dataframes[1:]:
    joined_df = joined_df.join(right_df, on=list_of_join_columns)
Any ideas would be greatly appreciated. Thanks!
Answer
One reason is that a reduce or a fold is usually functionally pure: the result of each accumulation operation is not written to the same part of memory, but rather to a new block of memory.
In principle, the garbage collector could free the previous block after each accumulation, but if it doesn't, you'll allocate memory for each updated version of the accumulator.
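For illustration only (this is not part of the original answer): if the explicit cache() call in the for-loop version accounts for part of the difference, the same idea can be folded into the reduce-based approach by caching each intermediate join. The join_and_cache helper below is a hypothetical sketch; whether caching every intermediate result helps or simply shifts the memory pressure depends on the executor memory available.

import functools

def join_and_cache(list_of_join_columns, left_df, right_df):
    # Hypothetical variant of the poster's join function: join as before,
    # then cache the intermediate result so later steps can reuse it
    # instead of recomputing the growing lineage.
    joined = left_df.join(right_df, on=list_of_join_columns)
    return joined.cache()

joined_df = functools.reduce(
    functools.partial(join_and_cache, list_of_join_columns),
    list_of_dataframes,
)

If memory is tight, each intermediate DataFrame can also be released with unpersist() once the next join has been materialized.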