How to define schema for custom type in Spark SQL?


Problem description


The following example code tries to put some case objects into a dataframe. The code includes the definition of a case object hierarchy and a case class using this trait:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

sealed trait Some
case object AType extends Some
case object BType extends Some

case class Data( name : String, t: Some)

object Example {
  def main(args: Array[String]) : Unit = {
    val conf = new SparkConf()
      .setAppName( "Example" )
      .setMaster( "local[*]")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    val df = sc.parallelize( Seq( Data( "a", AType), Data( "b", BType) ), 4).toDF()
    df.show()
  }
}    


When executing the code, I unfortunately encounter the following exception:

java.lang.UnsupportedOperationException: Schema for type Some is not supported


Questions

  • Is there a possibility to add or define a schema for certain types (here type Some)?
  • Does another approach exist to represent this kind of enumerations?
    • I tried to use Enumeration directly, but also without success. (see below)
Code for the Enumeration:

object Some extends Enumeration {
  type Some = Value
  val AType, BType = Value
}
          


Thanks in advance. I hope the best approach is not to use strings instead.
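For comparison, the string-based fallback the question hopes to avoid amounts to storing the object's name in a plain string column and looking the member up again on the way back. A minimal plain-Python sketch of the idea (no Spark involved; the helper names are illustrative):

```python
from enum import Enum

class Some(Enum):
    AType = 0
    BType = 1

def encode(t):
    # Store only the member's name, e.g. in a StringType column
    return t.name

def decode(name):
    # Look the member up again by its name
    return Some[name]

# Round trip: encode to a string, decode back to the same member
assert decode(encode(Some.AType)) is Some.AType
```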

Recommended answer

Note

UserDefinedType has been made private in Spark 2.0.0 and as of now it has no Dataset-friendly replacement.

Is there a possibility to add or define a schema for certain types (here type Some)?

I guess the answer depends on how badly you need this. It looks like it is possible to create a UserDefinedType, but it requires access to DeveloperApi and is not exactly straightforward or well documented.

import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[SomeUDT])
sealed trait Some
case object AType extends Some
case object BType extends Some

class SomeUDT extends UserDefinedType[Some] {
  // Underlying SQL representation: a single integer column
  override def sqlType: DataType = IntegerType

  override def serialize(obj: Any) = {
    obj match {
      case AType => 0
      case BType => 1
    }
  }

  override def deserialize(datum: Any): Some = {
    datum match {
      case 0 => AType
      case 1 => BType
    }
  }

  override def userClass: Class[Some] = classOf[Some]
}
          

You should probably override hashCode and equals as well.

Its PySpark counterpart can look like this:

from enum import Enum, unique
from pyspark.sql.types import UserDefinedType, IntegerType

class SomeUDT(UserDefinedType):
    @classmethod
    def sqlType(cls):
        # Underlying SQL representation: a single integer column
        return IntegerType()

    @classmethod
    def module(cls):
        return cls.__module__

    @classmethod
    def scalaUDT(cls):  # Required in Spark < 1.5
        return 'net.zero323.enum.SomeUDT'

    def serialize(self, obj):
        # Store the enum member as its integer value
        return obj.value

    def deserialize(self, datum):
        # Rebuild the enum member from the stored integer
        return {x.value: x for x in Some}[datum]

@unique
class Some(Enum):
    __UDT__ = SomeUDT()
    AType = 0
    BType = 1
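The serialize/deserialize round trip above relies only on the standard-library Enum, so the core logic can be sanity-checked without a Spark installation. A standalone sketch mirroring the two methods of SomeUDT:

```python
from enum import Enum, unique

@unique
class Some(Enum):
    AType = 0
    BType = 1

def serialize(obj):
    # Store the member as its integer value, as SomeUDT.serialize does
    return obj.value

def deserialize(datum):
    # Rebuild the member from the stored integer, as SomeUDT.deserialize does
    return {x.value: x for x in Some}[datum]

# The round trip must return the very same enum member
assert deserialize(serialize(Some.AType)) is Some.AType
```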
          

In Spark < 1.5 a Python UDT requires a paired Scala UDT, but it looks like this is no longer the case in 1.5.

For a simple UDT like yours you can use simple types (for example IntegerType instead of a whole StructType).

