Spark: how can I create a local DataFrame in each executor?
Question
In Spark with Scala, is there a way to create a local DataFrame inside executors, the way pandas can be used in PySpark? Inside a `mapPartitions` call I want to convert the iterator into a local DataFrame (like a pandas DataFrame in Python) so that DataFrame features can be used instead of hand-coding them over the iterator.
Answer
This is not possible.
A DataFrame is a distributed collection in Spark, and DataFrames can only be created on the driver node (i.e., outside of transformations/actions).
Additionally, in Spark you cannot execute operations on an RDD/DataFrame/Dataset inside another operation. For example, the following code will produce an error:
rdd.map(v => rdd1.filter(e => e == v))
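The usual fix for this pattern is to collect the small side to the driver (and optionally broadcast it) and close over the resulting local collection. A minimal sketch, using plain Scala `Seq`s to stand in for the two RDDs so it runs without a Spark session (in real Spark you would call `rdd1.collect()` and reference the result, or a `sc.broadcast` of it, inside the closure):

```scala
// Stand-ins for the two RDDs from the invalid snippet above.
val rdd  = Seq(1, 2, 3, 4)
val rdd1 = Seq(2, 4)

// Invalid in Spark:  rdd.map(v => rdd1.filter(e => e == v))
// Valid pattern: materialize the small side locally, then close over it.
val small: Set[Int] = rdd1.toSet           // in Spark: rdd1.collect().toSet
val matched = rdd.filter(small.contains)   // the closure only captures a local value

// matched == Seq(2, 4)
```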
DataFrames and Datasets are backed by RDDs underneath, so the same restriction applies to them.
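What the question is really after, pandas-style per-partition processing, can be approximated by materializing each partition's iterator into a local Scala collection inside `mapPartitions` and using the Scala collections API on it. A sketch under that assumption (the `processPartition` helper and the sample data are illustrative, not from the original answer; materializing the iterator assumes each partition fits in executor memory):

```scala
// Hypothetical per-partition logic: materialize the iterator into a local Seq
// so the full collections API (groupBy, sortBy, etc.) is available, then
// return an iterator again, as mapPartitions requires.
def processPartition(it: Iterator[(String, Int)]): Iterator[(String, Int)] = {
  val rows = it.toSeq                  // local "dataframe" for this partition
  rows.groupBy(_._1)                   // pandas-like groupby ...
      .map { case (k, vs) => (k, vs.map(_._2).sum) } // ... and aggregate
      .iterator
}

// In Spark this would be used as:
//   rdd.mapPartitions(processPartition)
// Applied here to a plain iterator to show the behavior:
val out = processPartition(Iterator(("a", 1), ("a", 2), ("b", 5))).toMap
// out == Map("a" -> 3, "b" -> 5)
```

Because `processPartition` is a pure `Iterator => Iterator` function, it can be unit-tested on the driver without a cluster and then passed to `mapPartitions` unchanged.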