Spark: Dataset Serialization
Question
If I have a dataset where each record is a case class, and I persist that dataset as shown below so that serialization is used:
myDS.persist(StorageLevel.MEMORY_ONLY_SER)
Does Spark use Java/Kryo serialization to serialize the dataset? Or, like a DataFrame, does Spark have its own way of storing the data in the dataset?
A Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into internal binary storage.
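For example, a Dataset of a case class gets its Encoder derived implicitly, so persisting it with MEMORY_ONLY_SER caches the encoder's compact binary rows rather than Java-serialized objects. A minimal sketch (the Person class and session setup are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative case class; any Product type gets an
// Encoder derived automatically via spark.implicits._
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("encoder-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// toDS() picks up the implicitly derived Encoder[Person]
val myDS = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

// Cached blocks hold the encoder's binary format,
// not Java/Kryo-serialized Person instances
myDS.persist(StorageLevel.MEMORY_ONLY_SER)
myDS.count() // materialize the cache
```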
The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case Spark will destructure the object representation and try to apply standard encoders (atomic encoders, the Product encoder, etc.). The only difference compared to Row is its Encoder - RowEncoder (in a sense, Encoders are similar to lenses).
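The opt-in nature of the serializer-backed encoders can be seen by requesting them explicitly and comparing schemas (a sketch, reusing a hypothetical Person case class):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Illustrative case class, not part of the original question
case class Person(name: String, age: Int)

// Structured encoder: each field becomes a typed column
// in the internal binary layout
val productEnc: Encoder[Person] = Encoders.product[Person]

// Opt-in serializer-backed encoders: the whole object is
// stored as a single opaque binary blob
val kryoEnc: Encoder[Person] = Encoders.kryo[Person]
val javaEnc: Encoder[Person] = Encoders.javaSerialization[Person]

// productEnc.schema describes the fields (name, age), while
// kryoEnc.schema and javaEnc.schema expose only one binary column
println(productEnc.schema)
println(kryoEnc.schema)
```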
Databricks explicitly contrasts Encoder/Dataset serialization with the Java and Kryo serializers in its Introducing Apache Spark Datasets post (see especially the Lightning-fast Serialization with Encoders section).
Source of the images: Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zaharia. Introducing Apache Spark Datasets, https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html