How does a Spark DataFrame handle a Pandas DataFrame that is larger than memory

Problem description

I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question which makes me unsure.

Currently I am storing Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool which allows me to do chunking on the Pandas DataFrame. So when I need to do processing on a large Pandas DataFrame, I do it in chunks. But Pandas does not support distributed processing, and HDF5 is only for a single-PC environment.
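For context, a minimal sketch of the chunked-HDF5 workflow described above; the file name data.h5, the key df, and the per-chunk process() function are placeholders, and the store is assumed to have been written in table format, which is what pd.read_hdf needs for chunked reads:

```python
import pandas as pd

# Assumes the store was written in table format, e.g.:
#   df.to_hdf("data.h5", key="df", format="table")
# With chunksize, read_hdf returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_hdf("data.h5", key="df", chunksize=100_000):
    process(chunk)  # hypothetical per-chunk processing step
```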

Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and once loaded as a Spark DataFrame, Spark will distribute it to the different workers to do the distributed processing.

Is my understanding correct? If so, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5?

Recommended answer

"the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing."

This is true only if you're trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format which can be read in parallel. That means your data:

  • has to be accessible on each worker, for example using a distributed file system
  • the file format has to support splitting (the simplest example is plain old CSV)

In a situation like this, each worker reads only its own part of the dataset, without any need to store the data in the driver's memory. All the logic related to computing splits is handled transparently by the applicable Hadoop input format.
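For illustration, a minimal PySpark sketch of such a parallel read, assuming a splittable CSV on a distributed file system; the path, column name, and app name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-read").getOrCreate()

# Each executor reads its own split(s) of the file directly from the
# distributed file system; nothing is collected on the driver.
df = spark.read.csv("hdfs:///path/to/data.csv", header=True, inferSchema=True)

# Transformations run on the workers, against their local partitions.
df.groupBy("some_column").count().show()
```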

Regarding HDF5 files, you have two options:

  • read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement (see the sketch after this list)
  • distribute the HDF5 files and read the data directly on the workers. Generally speaking, this approach is harder to implement and requires a smart data-distribution strategy
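
A minimal sketch of the first option, assuming a table-format HDF5 store data.h5 with key df (both placeholders, as is the chunk size):

```python
from functools import reduce

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-chunks").getOrCreate()

# Read the HDF5 file in chunks on the driver and convert each chunk
# to a Spark DataFrame; every byte passes through the driver, which
# is why this approach is inefficient.
chunks = (spark.createDataFrame(chunk)
          for chunk in pd.read_hdf("data.h5", key="df", chunksize=100_000))

# Union the per-chunk DataFrames into a single Spark DataFrame.
df = reduce(lambda a, b: a.union(b), chunks)
```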
