How to create a custom Transformer from a UDF?

Problem description

I am trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame using a UDF, so I was wondering whether it is possible to convert a UDF, or a similar operation, into a Transformer.

My custom UDF looks like this, and I'd like to learn how to turn it into a custom Transformer.

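// Returns up to four lowercase suffixes (lengths 1 to 4) of the first
// space-delimited token of n.
def getFeatures(n: String): Seq[String] = {
  val NUMBER_FEATURES = 4
  val name = n.split(" +")(0).toLowerCase
  (1 to NUMBER_FEATURES)
    .filter(size => size <= name.length)
    .map(size => name.substring(name.length - size))
}

// Register the function as a SQL UDF named "tokenize".
val tokenizeUDF = sqlContext.udf.register("tokenize", (name: String) => getFeatures(name))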
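
To make the intended behavior concrete, the function can be exercised on its own without Spark. The sketch below repeats the same logic inside an object named `NGramDemo`; the object name and the sample inputs are illustrative assumptions, not part of the original question.

```scala
object NGramDemo {
  // Same logic as getFeatures in the question, with an explicit return type.
  def getFeatures(n: String): Seq[String] = {
    val numberFeatures = 4
    // Keep only the first space-delimited token, lowercased.
    val name = n.split(" +")(0).toLowerCase
    (1 to numberFeatures)
      .filter(size => size <= name.length)               // skip suffixes longer than the token
      .map(size => name.substring(name.length - size))   // take the last `size` characters
  }

  def main(args: Array[String]): Unit = {
    println(getFeatures("abcdef").mkString(", "))  // prints: f, ef, def, cdef
    println(getFeatures("Foo Bar").mkString(", ")) // prints: o, oo, foo
  }
}
```

Only the first token is used, so "Foo Bar" yields suffixes of "foo", and tokens shorter than four characters yield fewer than four features.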

Solution

This is not a fully featured solution, but you can start with something like this:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class NGramTokenizer(override val uid: String)
  extends UnaryTransformer[String, Seq[String], NGramTokenizer]  {

  def this() = this(Identifiable.randomUID("ngramtokenizer"))

  override protected def createTransformFunc: String => Seq[String] = {
    getFeatures _
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType)
  }

  override protected def outputDataType: DataType = {
    new ArrayType(StringType, true)
  }
}

Quick check:

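// toDF needs the SQL implicits; spark-shell imports them automatically,
// elsewhere run `import sqlContext.implicits._` first.
val df = Seq((1L, "abcdef"), (2L, "foobar")).toDF("k", "v")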
val transformer = new NGramTokenizer().setInputCol("v").setOutputCol("vs")

transformer.transform(df).show
// +---+------+------------------+
// |  k|     v|                vs|
// +---+------+------------------+
// |  1|abcdef|[f, ef, def, cdef]|
// |  2|foobar|[r, ar, bar, obar]|
// +---+------+------------------+
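
Since the question also mentions saving the pipeline, one possible extension is to mix in the default params writer. The following is only a sketch, assuming Spark 2.0 or later, where `DefaultParamsWritable` and `DefaultParamsReadable` are public; the `Features` holder object is a hypothetical name introduced here so that the class compiles outside the shell.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical holder for the function from the question.
object Features {
  def getFeatures(n: String): Seq[String] = {
    val name = n.split(" +")(0).toLowerCase
    (1 to 4)
      .filter(size => size <= name.length)
      .map(size => name.substring(name.length - size))
  }
}

// Same transformer as above, plus DefaultParamsWritable so that
// `write.save(path)` can persist the uid and the params.
class NGramTokenizer(override val uid: String)
  extends UnaryTransformer[String, Seq[String], NGramTokenizer]
  with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("ngramtokenizer"))

  override protected def createTransformFunc: String => Seq[String] =
    Features.getFeatures _

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = true)
}

// The companion object provides `NGramTokenizer.load(path)`.
object NGramTokenizer extends DefaultParamsReadable[NGramTokenizer]
```

With this in place, `transformer.write.overwrite().save(path)` and `NGramTokenizer.load(path)` should round-trip the stage, where `path` is any location you choose. Loading instantiates the class by name, so it generally needs to be compiled onto the classpath rather than defined in the shell.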

You can even try to generalize it to something like this:

class UnaryUDFTransformer[T, U](override val uid: String, f: T => U, 
    inType: DataType, outType: DataType)
  extends UnaryTransformer[T, U, UnaryUDFTransformer[T, U]]  {

  override protected def createTransformFunc: T => U = f

  override protected def validateInputType(inputType: DataType): Unit = 
    require(inputType == inType)

  override protected def outputDataType: DataType = outType
}

val transformer = new UnaryUDFTransformer("featurize", getFeatures, 
    StringType, new ArrayType(StringType, true))
  .setInputCol("v")
  .setOutputCol("vs")

If you want to use a UDF rather than the wrapped function, you'll have to extend Transformer directly and override the transform method. Unfortunately, most of the useful classes are private, so this can be rather tricky.
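
A minimal sketch of that approach is shown here, assuming Spark 2.0 or later, where `transform` takes a `Dataset[_]` (in Spark 1.x it takes a `DataFrame`). The class name `UDFTokenizer`, the `Features` holder object, and the hand-declared `inputCol` / `outputCol` params are all assumptions made for illustration, not something the answer prescribes.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical holder for the function from the question.
object Features {
  def getFeatures(n: String): Seq[String] = {
    val name = n.split(" +")(0).toLowerCase
    (1 to 4)
      .filter(size => size <= name.length)
      .map(size => name.substring(name.length - size))
  }
}

class UDFTokenizer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("udftokenizer"))

  // Column params declared by hand instead of relying on shared helper traits.
  final val inputCol = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // The UDF applied by transform.
  private val tokenize = udf(Features.getFeatures _)

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset.withColumn($(outputCol), tokenize(col($(inputCol))))
  }

  // Validate the input column and describe the appended output column.
  override def transformSchema(schema: StructType): StructType = {
    require(schema($(inputCol)).dataType == StringType,
      s"Column ${$(inputCol)} must be of type StringType")
    require(!schema.fieldNames.contains($(outputCol)),
      s"Output column ${$(outputCol)} already exists")
    StructType(schema.fields :+
      StructField($(outputCol), ArrayType(StringType, containsNull = true), nullable = true))
  }

  override def copy(extra: ParamMap): UDFTokenizer = defaultCopy(extra)
}
```

It would be used the same way as the tokenizer above, for example `new UDFTokenizer().setInputCol("v").setOutputCol("vs").transform(df).show`.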
