Why is there no map function for DataFrame in PySpark while the Spark (Scala) equivalent has it?
Question
Currently working on PySpark. There is no map function on DataFrame, and one has to go to RDD for the map function. In Scala there is a map on DataFrame; is there any reason for this?
Answer
Dataset.map is not part of the DataFrame (Dataset[Row]) API. It transforms a strongly typed Dataset[T] into a strongly typed Dataset[U]:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDD, they have no Python-specific implementation) which depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computations.
In contrast, Python implements its own map-like mechanism with vectorized udfs, released in Spark 2.3. It is focused on a high-performance serde implementation coupled with the Pandas API.
That includes both typical udfs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants: GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.