Requirements for converting Spark dataframe to Pandas/R dataframe


Problem description

I'm running Spark on Hadoop's YARN. How does this conversion work? Does a collect() take place before the conversion?

Also, do I need to install Python and R on every slave node for the conversion to work? I'm struggling to find documentation on this.

Recommended answer

toPandas (PySpark) / as.data.frame (SparkR)

Data has to be collected before a local data frame can be created. For example, the toPandas method looks as follows:

def toPandas(self):
    import pandas as pd
    return pd.DataFrame.from_records(self.collect(), columns=self.columns)

You need Python installed on each node, ideally with all the dependencies.

The SparkR counterpart (as.data.frame) is simply an alias for collect.

To summarize: in both cases the data is collected to the driver node and converted to a local data structure (pandas.DataFrame in Python and base::data.frame in R, respectively).
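As a minimal sketch of what this means in practice (assuming an existing SparkSession named spark; the data is purely illustrative):

sdf = spark.range(1000)   # a small distributed Spark DataFrame
pdf = sdf.toPandas()      # collect()s all rows to the driver and builds a pandas.DataFrame
print(type(pdf))          # <class 'pandas.core.frame.DataFrame'>

The whole result has to fit in driver memory, so this is only practical for data that is already small after filtering or aggregation.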

Vectorized user-defined functions

Since Spark 2.3.0, PySpark also provides a set of pandas_udf variants (SCALAR, GROUPED_MAP, GROUPED_AGG) which operate in parallel on chunks of data defined by

  • Partitions in the case of the SCALAR variant
  • Grouping expressions in the case of GROUPED_MAP and GROUPED_AGG.

Each chunk is represented by

  • One or more pandas.core.series.Series in the case of the SCALAR and GROUPED_AGG variants.
  • A single pandas.core.frame.DataFrame in the case of the GROUPED_MAP variant.
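A hedged sketch of the SCALAR variant (the decorator-based syntax below is the Spark 2.3/2.4 style; the column name and return type are illustrative):

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def times_two(s):
    # s is a pandas.Series holding one chunk of the input column
    return s * 2

df.withColumn("doubled", times_two(df["value"]))  # assumes df has a numeric "value" column

Each executor applies times_two to its own chunks, so no collect to the driver is involved.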

Similarly, since Spark 2.0.0, SparkR provides the dapply and gapply functions, which operate on data.frames defined by partitions and grouping expressions respectively.

The functions described above:

  • Don't collect to the driver. Unless the data contains only a single partition (e.g. after coalesce(1)) or the grouping expression is trivial (e.g. groupBy(lit(1))), there is no single-node bottleneck (see the sketch after this list).
  • Load each chunk into the memory of the corresponding executor. As a result, they are limited by the size of individual chunks relative to the memory available on each executor.
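To make the difference concrete, here is a hedged sketch in the Spark 2.3/2.4 GROUPED_MAP style (the schema, column names, and function are illustrative):

from pyspark.sql.functions import pandas_udf, PandasUDFType, lit

@pandas_udf("key string, value double", PandasUDFType.GROUPED_MAP)
def center(pdf):
    # pdf is a pandas.DataFrame holding all rows of one group
    return pdf.assign(value=pdf["value"] - pdf["value"].mean())

df.groupBy("key").apply(center)    # parallel: one chunk per distinct key, spread across executors
df.groupBy(lit(1)).apply(center)   # degenerate: every row lands in a single chunk on one executor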
