How can I cache a DataFrame with the Kryo Serializer in Spark?


Problem description


I am trying to use Spark with the Kryo serializer to store some data at a lower memory cost. Now I have run into a problem: I cannot save a DataFrame (whose type is Dataset[Row]) in memory with the Kryo serializer. I thought all I needed to do was add org.apache.spark.sql.Row to classesToRegister, but an error still occurs:

spark-shell --conf spark.kryo.classesToRegister=org.apache.spark.sql.Row --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

// Schema with a nullable string column and a non-nullable int column
val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val seq = Seq(("hello", 1), ("world", 2))
// Persist in serialized form, expecting Kryo to be used
val df = spark.createDataFrame(sc.emptyRDD[Row], schema).persist(StorageLevel.MEMORY_ONLY_SER)
df.count()  // force materialization of the cache

An error like this occurs (the original post showed a screenshot of the stack trace, a Kryo registration failure for byte[][]):


I don't think adding byte[][] to classesToRegister is a good idea. So what should I do to store a DataFrame in memory with Kryo?

Accepted answer


Datasets don't use standard serialization methods. They use a specialized columnar storage format with its own compression methods, so you don't need to store your Dataset with the Kryo serializer.
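A sketch of the distinction, assuming a running spark-shell session (where `spark` and `sc` are provided by the shell): caching the Dataset itself goes through Spark SQL's compressed columnar cache regardless of `spark.serializer`, while Kryo only comes into play if you drop down to the underlying RDD and persist that in serialized form.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

val schema = StructType(StructField("name", StringType, true) ::
             StructField("id", IntegerType, false) :: Nil)
val rows = sc.parallelize(Seq(Row("hello", 1), Row("world", 2)))
val df = spark.createDataFrame(rows, schema)

// Caching the Dataset: Spark SQL stores it in its own compressed
// columnar format, so no Kryo class registration is needed here.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()

// Kryo matters for RDD serialization, e.g. if you persist the
// underlying RDD[Row] in serialized form instead:
df.rdd.persist(StorageLevel.MEMORY_ONLY_SER)
```

Note that `MEMORY_ONLY` vs `MEMORY_ONLY_SER` is largely irrelevant for the Dataset path, since the columnar cache is already a compact binary representation.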
