How can I cache a DataFrame with the Kryo Serializer in Spark?
Question
I am trying to use Spark with the Kryo serializer to store some data at a lower memory cost. Now I have run into a problem: I cannot persist a DataFrame (whose type is Dataset[Row]) in memory with the Kryo serializer. I thought all I needed to do was add org.apache.spark.sql.Row to classesToRegister, but an error still occurs:
spark-shell --conf spark.kryo.classesToRegister=org.apache.spark.sql.Row --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel
val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val seq = Seq(("hello", 1), ("world", 2))
val df = spark.createDataFrame(sc.emptyRDD[Row], schema).persist(StorageLevel.MEMORY_ONLY_SER)
df.count()
The error complains that the class byte[][] is not registered.
I don't think adding byte[][] to classesToRegister is a good idea. So what should I do to store a DataFrame in memory with Kryo?
Answer
Datasets don't use standard serialization methods. They use specialized columnar storage with its own compression methods, so you don't need to store your Dataset with the Kryo serializer.
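As a sketch (assuming a spark-shell session where `spark` and `sc` are predefined), caching the DataFrame with a plain MEMORY_ONLY storage level lets Spark SQL use its internal columnar format; the configured serializer still matters when you cache ordinary RDDs in serialized form:

```scala
import org.apache.spark.storage.StorageLevel
import spark.implicits._

// Caching a DataFrame goes through Spark SQL's internal columnar
// (Tungsten) format, not through the serializer set in spark.serializer.
val df = Seq(("hello", 1), ("world", 2)).toDF("name", "id")
df.persist(StorageLevel.MEMORY_ONLY) // columnar storage with built-in compression
df.count()                           // materializes the cache

// Kryo is still used when caching a plain RDD in serialized form:
val rdd = sc.parallelize(Seq(("hello", 1), ("world", 2)))
rdd.persist(StorageLevel.MEMORY_ONLY_SER) // this path uses the Kryo serializer
rdd.count()
```

Compression of the in-memory columnar cache is controlled by spark.sql.inMemoryColumnarStorage.compressed, which is enabled by default, so no extra configuration is needed for the DataFrame case.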