Requirements for converting Spark dataframe to Pandas/R dataframe
Question
I'm running Spark on Hadoop's YARN. How does this conversion work? Does a collect() take place before the conversion?
Also, do I need to install Python and R on every slave node for the conversion to work? I'm struggling to find documentation on this.
Answer
toPandas (PySpark) / as.data.frame (SparkR)
Data has to be collected before the local data frame is created. For example, the toPandas method looks as follows:
def toPandas(self):
    import pandas as pd
    return pd.DataFrame.from_records(self.collect(), columns=self.columns)
You need Python, ideally with all the dependencies, installed on each node.
The SparkR counterpart (as.data.frame) is simply an alias for collect.
To summarize: in both cases data is collected to the driver node and converted to a local data structure (pandas.DataFrame and base::data.frame in Python and R respectively).
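For example, the collect-then-convert behaviour can be observed directly. This is a minimal sketch, assuming an active SparkSession bound to the name spark:

# Assumes an active SparkSession named `spark`
sdf = spark.range(0, 1000)   # distributed Spark DataFrame
pdf = sdf.toPandas()         # collect() to the driver, then build a pandas.DataFrame
print(type(pdf))             # <class 'pandas.core.frame.DataFrame'>
print(len(pdf))              # 1000 rows, all held in driver memory

If the collected data does not fit in driver memory, the call fails; that is the practical consequence of the collect step.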
Vectorized User Defined Functions
Since Spark 2.3.0, PySpark also provides a set of pandas_udf variants (SCALAR, GROUPED_MAP, GROUPED_AGG; sketched below) which operate in parallel on chunks of data defined by:
- Partitions in the case of the SCALAR variant
- Grouping expressions in the case of GROUPED_MAP and GROUPED_AGG
Each chunk is represented by:
- One or more pandas.core.series.Series in the case of the SCALAR and GROUPED_AGG variants.
- A single pandas.core.frame.DataFrame in the case of the GROUPED_MAP variant.
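For illustration, a SCALAR variant might look like the following. This is a minimal sketch using the Spark 2.3 PandasUDFType API; the DataFrame df and its numeric column x are assumptions, not taken from the question:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v is a pandas.Series holding one chunk of a partition;
    # the arithmetic runs on the executor that owns that chunk
    return v + 1

df.withColumn("x_plus_one", plus_one("x"))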
Similarly, since Spark 2.0.0, SparkR provides the dapply and gapply functions, which operate on data.frames defined by partitions and grouping expressions respectively.
The aforementioned functions:
- Don't collect to the driver. Unless the data contains only a single partition (i.e. with coalesce(1)) or the grouping expression is trivial (i.e. groupBy(lit(1))), there is no single-node bottleneck.
- Load the respective chunks into the memory of the corresponding executor. As a result, they are limited by the size of the individual chunks / the memory available on each executor.