How can I cache DataFrame with Kryo Serializer in Spark?


Question

I am trying to use Spark with the Kryo serializer to store some data at a lower memory cost. Now I have run into a problem: I cannot save a DataFrame (whose type is Dataset[Row]) in memory with the Kryo serializer. I thought all I needed to do was add org.apache.spark.sql.Row to classesToRegister, but the error still occurs:

spark-shell --conf spark.kryo.classesToRegister=org.apache.spark.sql.Row --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val seq = Seq(("hello", 1), ("world", 2))
val df = spark.createDataFrame(sc.emptyRDD[Row], schema).persist(StorageLevel.MEMORY_ONLY_SER)
df.count()

The error is as follows:

I don't think adding byte[][] to classesToRegister is a good idea. So what should I do to store a DataFrame in memory with Kryo?

Answer

Datasets don't use standard serialization methods. When you persist a Dataset or DataFrame, Spark SQL encodes the rows into its own compact columnar binary format with its own compression, so spark.serializer and Kryo registration do not apply, and you don't need to store your Dataset with the Kryo serializer.
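In other words, a DataFrame can be cached in serialized form without any Kryo flags at all. A minimal sketch for a plain spark-shell session (the column names and sample values are illustrative, mirroring the question's seq):

```scala
import org.apache.spark.storage.StorageLevel
import spark.implicits._

// Build a small DataFrame; Spark SQL's own encoders handle the rows.
val df = Seq(("hello", 1), ("world", 2)).toDF("name", "id")

// MEMORY_ONLY_SER keeps the encoded, compressed binary form in memory.
// No spark.kryo.classesToRegister entry is needed for this.
df.persist(StorageLevel.MEMORY_ONLY_SER)
df.count()  // an action materializes the cache
```

Kryo settings still matter for plain RDDs and for shuffled non-SQL data; if you keep spark.kryo.registrationRequired=true for those paths, only the classes you actually send through spark.serializer need registering, not the internals of a cached DataFrame.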
