Why is there no map function for DataFrame in PySpark while the Spark (Scala) equivalent has it?
Question
Currently working on PySpark. There is no map function on DataFrame, and one has to go to RDD for the map function. In Scala there is a map on DataFrame; is there any reason for this?
Answer
Dataset.map is not part of the DataFrame (Dataset[Row]) API. It transforms a strongly typed Dataset[T] into a strongly typed Dataset[U]:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDD, they have no Python-specific implementation) which depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computations.
In contrast, Python implements its own map-like mechanism with vectorized udfs, released in Spark 2.3. It is focused on a high-performance serde implementation coupled with the Pandas API.
That includes both typical udfs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants: GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.