PySpark Dataframe Performance Tuning

Question

I am trying to consolidate some scripts so that we read the database once, rather than having every script read the same data from Hive. In other words, I'm moving to a read-once, process-many model.

I've persisted the dataframes and repartitioned the output after each aggregation, but I need it to be faster; if anything, those changes have slowed it down. We have 20 TB+ of data per day, so I had assumed that persisting the data, if it is going to be read many times, would make things faster, but it hasn't.

Also, I have lots of jobs that run off the same data, like below. Can we run them in parallel? Can the definition and output of DF2 happen at the same time as the definition of DF3 to help speed things up?

df = definedf....persist()
df2 = df.groupby....
df3 = df.groupby....
....
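
As an illustration of the kind of concurrency being asked about (not part of the original question's code): a minimal sketch, assuming df2 and df3 are independent aggregations of the persisted df and that the output paths below are made-up placeholders. Actions submitted from separate Python threads are scheduled as separate, concurrent Spark jobs, so the two writes can overlap on the cluster.

from concurrent.futures import ThreadPoolExecutor

def write_df2():
    # any action (write, count, ...) triggers the job; the path is a placeholder
    df2.write.mode("overwrite").parquet("/tmp/df2_out")

def write_df3():
    df3.write.mode("overwrite").parquet("/tmp/df3_out")

# Submitting each action from its own thread lets the Spark scheduler
# run both jobs at the same time against the shared, persisted df.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(write_df2), pool.submit(write_df3)]
    for f in futures:
        f.result()  # surface any exception raised inside a job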

Is it possible to define a globally cached dataframe that other scripts can call on?

Many thanks!

Answer

Persisting your DF doesn't guarantee that it actually gets persisted; that depends on the storage memory fraction you have on your worker nodes. If you just call .persist(), Spark by default uses the MEMORY_ONLY storage level, which caches only as much of the DataFrame as fits in your storage memory fraction; the rest is recomputed every time you use it (i.e. perform any action on it).
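
For illustration, a minimal sketch of an alternative to the default behaviour described above (not from the original answer; the table name is a placeholder): an explicit MEMORY_AND_DISK storage level spills partitions that don't fit in the storage fraction to local disk instead of recomputing them on every action.

from pyspark import StorageLevel

df = spark.table("some_hive_table")        # placeholder Hive source
# MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk,
# avoiding the recomputation MEMORY_ONLY does for uncached partitions.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                                 # run an action once to materialise the cache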

I would suggest increasing the memory on your worker nodes; if you don't perform any intensive computation, you can reduce the execution memory instead. The JVM also spends a lot of time serialising and deserialising, so with this much data you can use off-heap memory (disabled by default) by setting the spark.memory.offHeap.enabled property; off-heap storage uses the Spark Tungsten format to store data efficiently.
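
A sketch of the off-heap settings mentioned above, assuming they are applied when the session is built; the 16g size is an arbitrary example, and spark.memory.offHeap.size must be set to a positive value whenever off-heap memory is enabled.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-once-process-many")
    # enable Tungsten off-heap storage; the size is required and cluster-specific
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "16g")
    .getOrCreate()
)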
