Why is an encoder needed for creating a Dataset in Spark?


Problem Description

I wanted to write the output file in Parquet form. For that, I converted the RDD to a Dataset, since we cannot get Parquet output directly from an RDD. To create the Dataset, we need to use an implicit encoder; otherwise, it gives a compile-time error. I have a few questions in this regard. Following is my code:

    implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[ItemData]
    val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)

    ds.write
      .mode(SaveMode.Overwrite)
      .parquet(configuration.outputPath)

Following are my questions:

  1. Why is it important to use an encoder while creating the Dataset? And what does this encoder do?
  2. From the above code, when I get the output file in Parquet form, I see it in encoded form. How can I decode it? When I decode it using Base64, I get the following: com.........processor.spark.ItemDat"0156028263

So, basically, it is showing me an object.toString() kind of value.

Recommended Answer

From the Spark documentation for createDataset:

  createDataset requires an encoder to convert a JVM object of type T to and from the internal Spark SQL representation.
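To make that concrete, here is a minimal sketch of what the compiler is resolving; the createDataset signature is paraphrased from the Spark API, and the ItemData case class and other names are illustrative stand-ins for the question's code:

    // createDataset is declared roughly as
    //   def createDataset[T : Encoder](data: RDD[T]): Dataset[T]
    // so an implicit Encoder[T] must be in scope at the call site; otherwise
    // the code fails to compile with an "unable to find encoder" error.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

    case class ItemData(id: Long, name: String)           // illustrative stand-in

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val rdd: RDD[ItemData] = spark.sparkContext.parallelize(Seq(ItemData(1L, "a")))

    implicit val itemDataEncoder: Encoder[ItemData] = Encoders.kryo[ItemData]
    val ds: Dataset[ItemData] = spark.createDataset(rdd)  // picks up the implicit encoder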

From Heather Miller's course:

  Basically, encoders are what convert your data between JVM objects and Spark SQL's specialized internal (tabular) representation. They're required by all Datasets!

  Encoders are highly specialized and optimized code generators that generate custom bytecode for serialization and deserialization of your data.

I believe it is now clear what encoders are and what they do. Regarding your second question: the Kryo serializer leads to Spark storing every row in the Dataset as a flat binary object. Instead of using the Java or Kryo serializers, you can use Spark's internal encoders, which you get automatically via spark.implicits._; they also use less memory than Kryo/Java serialization.
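A minimal sketch of that change, reusing the names from the question's snippet (sparkSession, filteredRDD, configuration) and assuming ItemData is, or can be made, a case class:

    // Drop the explicit Kryo encoder and let Spark derive a product encoder
    // for the ItemData case class instead.
    import org.apache.spark.sql.{Dataset, SaveMode}
    import sparkSession.implicits._                       // Spark's built-in encoders

    val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)

    ds.write
      .mode(SaveMode.Overwrite)
      .parquet(configuration.outputPath)                  // one Parquet column per ItemData field

With the product encoder, the Parquet file carries the real column names and types of ItemData, so it can be read back with sparkSession.read.parquet(...) or inspected by any other Parquet tool.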

UPDATE I

Based on your comment, here are the things that set Spark encoders apart from regular Java and Kryo serialization (from Heather Miller's course):

  • Limited to and optimal for primitives and case classes, and Spark SQL data types.
  • They contain schema information, which makes these highly optimized code generators possible and enables optimization based on the shape of the data. Since Spark understands the structure of data in Datasets, it can create a more optimal layout in memory when caching Datasets.
  • >10x faster than Kryo serialization (Java serialization is orders of magnitude slower).
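To see why the Kryo-encoded output looks opaque, you can compare the schemas the two encoders produce. This is a sketch reusing the illustrative ItemData case class and the spark session from the earlier snippet; the printed schemas in the comments are abbreviated:

    import org.apache.spark.sql.{Encoder, Encoders}
    import spark.implicits._                              // `spark` from the earlier sketch

    // Kryo encoder: the whole object is one binary blob, which is why the
    // Parquet file shows a single opaque, base64-looking "value" column.
    println(Encoders.kryo[ItemData].schema)
    // StructType(StructField(value, BinaryType, true))

    // Product encoder derived by spark.implicits._ for the case class:
    // one typed column per field, readable by any Parquet tool.
    println(implicitly[Encoder[ItemData]].schema)
    // StructType(StructField(id, LongType, false), StructField(name, StringType, true))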

Hope this helps!
