How to define schema for custom type in Spark SQL?

Question

The following example code tries to put some case objects into a DataFrame. The code includes the definition of a case object hierarchy and a case class using this trait:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

sealed trait Some
case object AType extends Some
case object BType extends Some

case class Data(name: String, t: Some)

object Example {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Example")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Data("a", AType), Data("b", BType)), 4).toDF()
    df.show()
  }
}
When executing the code, I unfortunately encounter the following exception:
java.lang.UnsupportedOperationException: Schema for type Some is not supported
Questions

- Is there a possibility to add or define a schema for certain types (here type Some)?
- Does another approach exist to represent this kind of enumeration? I tried to use Enumeration directly, but also without success (see below).
The code for the Enumeration:

object Some extends Enumeration {
  type Some = Value
  val AType, BType = Value
}
Thanks in advance. I hope the best approach is not to use strings instead.
Answer

Note: UserDefinedType has been made private in Spark 2.0.0, and as of now it has no Dataset-friendly replacement.
Is there a possibility to add or define a schema for certain types (here type Some)?
I guess the answer depends on how badly you need this. It looks like it is possible to create a UserDefinedType, but it requires access to DeveloperApi and is not exactly straightforward or well documented:

import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[SomeUDT])
sealed trait Some
case object AType extends Some
case object BType extends Some

class SomeUDT extends UserDefinedType[Some] {
  override def sqlType: DataType = IntegerType

  override def serialize(obj: Any) = obj match {
    case AType => 0
    case BType => 1
  }

  override def deserialize(datum: Any): Some = datum match {
    case 0 => AType
    case 1 => BType
  }

  override def userClass: Class[Some] = classOf[Some]
}
You should probably override hashCode and equals as well.
Its PySpark counterpart can look like this:
from enum import Enum, unique
from pyspark.sql.types import UserDefinedType, IntegerType

class SomeUDT(UserDefinedType):
    @classmethod
    def sqlType(self):
        return IntegerType()

    @classmethod
    def module(cls):
        return cls.__module__

    @classmethod
    def scalaUDT(cls):  # Required in Spark < 1.5
        return 'net.zero323.enum.SomeUDT'

    def serialize(self, obj):
        return obj.value

    def deserialize(self, datum):
        return {x.value: x for x in Some}[datum]

@unique
class Some(Enum):
    __UDT__ = SomeUDT()
    AType = 0
    BType = 1
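Whichever variant you use, serialize and deserialize must be exact inverses, or round-tripping values through a DataFrame will silently corrupt them. The mapping logic above can be sanity-checked with the standard library alone, no Spark required; the Some enum here mirrors the one above:

```python
from enum import Enum, unique

@unique
class Some(Enum):
    AType = 0
    BType = 1

# Same mapping the UDT uses: enum member -> integer code and back.
def serialize(obj):
    return obj.value

def deserialize(datum):
    return {x.value: x for x in Some}[datum]

# Every member must survive a round trip.
for member in Some:
    assert deserialize(serialize(member)) is member
```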
In Spark < 1.5 a Python UDT requires a paired Scala UDT, but it looks like that is no longer the case in 1.5.
For a simple UDT like yours you can use simple types (for example IntegerType instead of a whole struct).
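If depending on a DeveloperApi is unacceptable, that suggestion can be taken one step further: skip the UDT entirely and store the integer code in an ordinary column, converting at the boundaries of the Spark job. A minimal sketch of that idea, in plain Spark-free Python; the encode_row/decode_row helper names are made up for illustration:

```python
from enum import Enum, unique

@unique
class Some(Enum):
    AType = 0
    BType = 1

# Hypothetical helpers: encode before building the DataFrame,
# decode after collecting rows back on the driver.
def encode_row(name, t):
    return (name, t.value)  # Spark only ever sees (string, int)

def decode_row(row):
    name, code = row
    return (name, Some(code))  # Enum lookup by value

rows = [encode_row("a", Some.AType), encode_row("b", Some.BType)]
assert [decode_row(r) for r in rows] == [("a", Some.AType), ("b", Some.BType)]
```

The schema is then plain (StringType, IntegerType) and needs no special support from Spark SQL, at the cost of doing the conversion yourself at each boundary.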