How does a Spark DataFrame handle a Pandas DataFrame that is larger than memory?
Problem description
I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question that makes me unsure.
Currently I store Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool that lets me chunk a pandas DataFrame, so when I need to process a large DataFrame, I do it in chunks. But Pandas does not support distributed processing, and HDF5 is only for a single-PC environment.
Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and once it is loaded as a Spark DataFrame, Spark distributes it to the different workers to do the distributed processing.
Is my understanding correct? If so, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5?
Recommended answer
"the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing."
This is true only if you are trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format that can be read in parallel. This means your data:
- has to be accessible on each worker, for example via a distributed file system
- has to be in a file format that supports splitting (the simplest example is plain old CSV)
In a situation like this, each worker reads only its own part of the dataset, without any need to store the data in driver memory. All logic related to computing splits is handled transparently by the applicable Hadoop input format.
Regarding HDF5 files, you have two options:
- read chunks of the data on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement
- distribute the HDF5 files and read the data directly on the workers. Generally speaking this is harder to implement and requires a smart data-distribution strategy