Requirements for converting Spark dataframe to Pandas/R dataframe
Question
I'm running Spark on Hadoop's YARN. How does this conversion work? Does a collect() take place before the conversion?
Also, do I need to install Python and R on every slave node for the conversion to work? I'm struggling to find documentation on this.
Answer
toPandas (PySpark) / as.data.frame (SparkR)
Data has to be collected before the local data frame is created. For example, the toPandas method looks as follows:
def toPandas(self):
    import pandas as pd
    return pd.DataFrame.from_records(self.collect(), columns=self.columns)
You need Python, ideally with all the dependencies, installed on each node.
The SparkR counterpart (as.data.frame) is simply an alias for collect.
To summarize: in both cases data is collected to the driver node and converted to a local data structure (pandas.DataFrame and base::data.frame in Python and R respectively).
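For example, the collect-then-convert behaviour can be observed directly. This is a minimal sketch, assuming an active SparkSession bound to the name spark:

# Assumes an active SparkSession named `spark`
sdf = spark.range(0, 1000)   # distributed Spark DataFrame
pdf = sdf.toPandas()         # collect() to the driver, then build a pandas.DataFrame
print(type(pdf))             # <class 'pandas.core.frame.DataFrame'>
print(len(pdf))              # 1000 rows, all held in driver memory

If the collected data does not fit in driver memory, the call fails; that is the practical consequence of the collect step.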
Vectorized User Defined Functions
Since Spark 2.3.0, PySpark also provides a set of pandas_udf variants (SCALAR, GROUPED_MAP, GROUPED_AGG; sketched below) which operate in parallel on chunks of data defined by:
- Partitions in the case of the SCALAR variant
- Grouping expressions in the case of GROUPED_MAP and GROUPED_AGG
Each chunk is represented by:
- One or more pandas.core.series.Series in the case of the SCALAR and GROUPED_AGG variants.
- A single pandas.core.frame.DataFrame in the case of the GROUPED_MAP variant.
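For illustration, a SCALAR variant might look like the following. This is a minimal sketch using the Spark 2.3 PandasUDFType API; the DataFrame df and its numeric column x are assumptions, not taken from the question:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v is a pandas.Series holding one chunk of a partition;
    # the arithmetic runs on the executor that owns that chunk
    return v + 1

df.withColumn("x_plus_one", plus_one("x"))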
Similarly, since Spark 2.0.0, SparkR provides the dapply and gapply functions, which operate on data.frames defined by partitions and grouping expressions respectively.
The aforementioned functions:
- Don't collect to the driver. Unless the data contains only a single partition (i.e. with coalesce(1)) or the grouping expression is trivial (i.e. groupBy(lit(1))), there is no single-node bottleneck.
- Load the respective chunks into the memory of the corresponding executor. As a result, they are limited by the size of the individual chunks / the memory available on each executor.