How does Spark DataFrame handle a Pandas DataFrame that is larger than memory


Problem description

I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question that makes me unsure.

Currently I am storing Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool which allows me to chunk a DataFrame, so when I need to process a large Pandas DataFrame I do it in chunks. But Pandas does not support distributed processing, and HDF5 only works on a single machine.
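
For concreteness, a minimal sketch of this chunked workflow, assuming the store was written in table format (the file name store.h5, the key df, and the column value are placeholders):

```python
import pandas as pd

# Stream a table-format HDF5 store in chunks; each chunk is an ordinary
# in-memory DataFrame, so only one chunk is held in memory at a time.
total = 0.0
for chunk in pd.read_hdf("store.h5", key="df", chunksize=100_000):
    total += chunk["value"].sum()  # "value" is a hypothetical column
print(total)
```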

Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and once loaded as a Spark DataFrame, Spark will distribute it to the different workers to do the distributed processing.

Is my understanding correct? If so, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5 does?

Recommended answer

"the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing."

This is true only if you're trying to load your data on the driver and then parallelize. In a typical scenario you store the data in a format which can be read in parallel. This means your data:

  • must be accessible on every worker, for example via a distributed file system
  • must be in a file format that supports splitting (the simplest example is plain old CSV)

In a situation like this, each worker reads only its own part of the dataset, without any need to store data in driver memory. All the logic related to computing splits is handled transparently by the applicable Hadoop Input Format.
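
As an illustration, a minimal PySpark sketch of that pattern, assuming a CSV file on a distributed file system (the path and the column key are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-read").getOrCreate()

# CSV is splittable, so each worker reads only its own slice of the file;
# the data never has to pass through driver memory.
df = spark.read.csv("hdfs:///data/large.csv", header=True, inferSchema=True)

# Transformations run on the workers; only the small aggregate
# is brought back to the driver.
df.groupBy("key").count().show()
```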

Regarding HDF5 files, you have two options:

  • read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results; this is inefficient but easy to implement (a sketch follows this list)
  • distribute the HDF5 file(s) and read the data directly on the workers; this is generally harder to implement and requires a smart data distribution strategy
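
A minimal sketch of the first option, assuming the same hypothetical table-format store as above (store.h5 with key df); note that every chunk still passes through the driver, which is exactly why it is inefficient:

```python
import pandas as pd
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-to-spark").getOrCreate()

# Build one Spark DataFrame per chunk read on the driver...
chunks = [
    spark.createDataFrame(chunk)
    for chunk in pd.read_hdf("store.h5", key="df", chunksize=100_000)
]

# ...and union them into a single distributed DataFrame
# (all chunks share the same schema, since they come from one store).
df = reduce(lambda a, b: a.union(b), chunks)
print(df.count())
```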
