Why does the Spark DataFrame conversion to RDD require a full re-mapping?

Question

From the Spark source code:

/**
   * Represents the content of the Dataset as an `RDD` of `T`.
   *
   * @group basic
   * @since 1.6.0
   */
  lazy val rdd: RDD[T] = {
    val objectType = exprEnc.deserializer.dataType
    rddQueryExecution.toRdd.mapPartitions { rows =>
      rows.map(_.get(0, objectType).asInstanceOf[T])
    }
  }

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2972

The mapPartitions can take as long as the time to compute the RDD in the first place. So this makes operations such as

df.rdd.getNumPartitions

very expensive. Given that a DataFrame is a DataSet[Row] and a DataSet is composed of RDDs, why is a re-mapping required? Any insights appreciated.

Answer

TL;DR That's because the internal RDD is not an RDD[Row].

Given that a DataFrame is a DataSet[Row] and a DataSet is composed of RDDs

That's a huge oversimplification. First of all, DataSet[T] doesn't mean that you interact with a container of T. It means that if you use the collection-like API (often referred to as strongly typed), the internal representation will be decoded into T.
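
As a minimal sketch of that distinction (the Record case class, the toy data, and the local SparkSession are all made up for illustration): the typed API decodes each internal row into a Record before user code sees it, while the untyped column API keeps working on the internal representation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

// Hypothetical record type, purely for illustration.
case class Record(id: Long, value: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Record(1L, "a"), Record(2L, "b")).toDS()

// Typed ("strongly typed") API: each internal row is decoded into a Record
// before the lambda is applied.
val typed = ds.map(r => r.value.toUpperCase)

// Untyped API: expressed over columns and evaluated on the internal
// representation, so no decoding into Record objects is needed.
val untyped = ds.select(upper($"value"))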

The internal representation is a binary format used internally by Tungsten. This representation is internal, subject to change, and far too low-level to be used in practice.

An intermediate representation, which exposes this data, is the InternalRow - rddQueryExecution.toRdd is in fact an RDD[InternalRow]. This representation (there are different implementations) still exposes the internal types, is considered "weakly" private, like all objects in o.a.s.sql.catalyst (access is not explicitly restricted, but the API is not documented), and is rather tricky to interact with.
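
As a hedged illustration (toy DataFrame and a local SparkSession assumed): Dataset.queryExecution.toRdd, which, like the rddQueryExecution.toRdd used inside Dataset.rdd, yields an RDD[InternalRow], exposes this intermediate representation directly, whereas df.rdd is the decoded, public-facing RDD[Row].

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy DataFrame, purely for illustration.
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// RDD[InternalRow]: Tungsten-backed rows, internal types, no decoding applied.
val internal: RDD[InternalRow] = df.queryExecution.toRdd

// RDD[Row]: the external representation, produced by the decoding
// mapPartitions shown in Dataset.rdd above.
val external: RDD[Row] = df.rdd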

This is where decoding comes into play, and why you need the full "re-mapping" - to convert the internal, often unsafe, objects into external types intended for public usage.

Finally, to reiterate my previous statement - the code in question won't be executed when getNumPartitions is called: mapPartitions is lazy, so it only records a transformation in the lineage, and the decoding runs only when an action materializes the data.
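
A small sketch of that last point (toy data and a local SparkSession assumed): getNumPartitions only needs the partition metadata of the lazily built RDD, so the decoding closure is not run over the data; it executes only once an action materializes rows.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.range(0, 1000000).toDF("id")

// Builds the RDD lineage, including the lazy mapPartitions from Dataset.rdd,
// but runs no job over the data, so no rows are decoded here.
val numPartitions = df.rdd.getNumPartitions

// An action forces a job; this is where each internal row is actually
// decoded into an external Row.
val rowCount = df.rdd.count()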
