How does Spark DataFrame handle a Pandas DataFrame that is larger than memory


Problem description

I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question that makes me unsure.

Currently I am storing Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool which allows me to chunk a DataFrame, so when I need to process a large Pandas DataFrame I do it in chunks. But Pandas does not support distributed processing, and HDF5 only works on a single machine.
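
For concreteness, a minimal sketch of this chunked workflow, assuming the store was written in table format (the file name store.h5, the key df, and the column value are placeholders):

```python
import pandas as pd

# Stream a table-format HDF5 store in chunks; each chunk is an ordinary
# in-memory DataFrame, so only one chunk is held in memory at a time.
total = 0.0
for chunk in pd.read_hdf("store.h5", key="df", chunksize=100_000):
    total += chunk["value"].sum()  # "value" is a hypothetical column
print(total)
```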

Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and once loaded as a Spark DataFrame, Spark will distribute it to the different workers to do the distributed processing.

Is my understanding correct? If so, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5 does?

Recommended answer

"the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing."

This is true only if you're trying to load your data on the driver and then parallelize. In a typical scenario you store the data in a format which can be read in parallel. This means your data:

  • must be accessible on every worker, for example via a distributed file system
  • must be in a file format that supports splitting (the simplest example is plain old CSV)

In a situation like this, each worker reads only its own part of the dataset, without any need to store data in driver memory. All the logic related to computing splits is handled transparently by the applicable Hadoop Input Format.
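
As an illustration, a minimal PySpark sketch of that pattern, assuming a CSV file on a distributed file system (the path and the column key are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-read").getOrCreate()

# CSV is splittable, so each worker reads only its own slice of the file;
# the data never has to pass through driver memory.
df = spark.read.csv("hdfs:///data/large.csv", header=True, inferSchema=True)

# Transformations run on the workers; only the small aggregate
# is brought back to the driver.
df.groupBy("key").count().show()
```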

Regarding HDF5 files, you have two options:

  • read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results; this is inefficient but easy to implement (a sketch follows this list)
  • distribute the HDF5 file(s) and read the data directly on the workers; this is generally harder to implement and requires a smart data distribution strategy
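
A minimal sketch of the first option, assuming the same hypothetical table-format store as above (store.h5 with key df); note that every chunk still passes through the driver, which is exactly why it is inefficient:

```python
import pandas as pd
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-to-spark").getOrCreate()

# Build one Spark DataFrame per chunk read on the driver...
chunks = [
    spark.createDataFrame(chunk)
    for chunk in pd.read_hdf("store.h5", key="df", chunksize=100_000)
]

# ...and union them into a single distributed DataFrame
# (all chunks share the same schema, since they come from one store).
df = reduce(lambda a, b: a.union(b), chunks)
print(df.count())
```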
