How does a Spark DataFrame handle a Pandas DataFrame that is larger than memory?
Problem description
I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question that makes me unsure.
Currently I store Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool that lets me chunk a pandas DataFrame, so when I need to process a large DataFrame, I do it in chunks. But Pandas does not support distributed processing, and HDF5 is only for a single-PC environment.
Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and once it is loaded as a Spark DataFrame, Spark distributes it to the different workers to do the distributed processing.
Is my understanding correct? If so, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5?
Recommended answer
"the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing."
This is true only if you are trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format that can be read in parallel. This means your data:
- has to be accessible on each worker, for example via a distributed file system
- has to be in a file format that supports splitting (the simplest example is plain old CSV)
In a situation like this, each worker reads only its own part of the dataset, without any need to store the data in driver memory. All logic related to computing splits is handled transparently by the applicable Hadoop input format.
Regarding HDF5 files, you have two options:
- read chunks of the data on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement
- distribute the HDF5 files and read the data directly on the workers. Generally speaking this is harder to implement and requires a smart data-distribution strategy