How to create a custom Transformer from a UDF?
I was trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame by using a UDF. Therefore, I was wondering whether it is possible to convert a UDF or a similar action into a Transformer.
My custom UDF looks like this, and I'd like to learn how to turn it into a custom Transformer.
def getFeatures(n: String) = {
  val NUMBER_FEATURES = 4
  val name = n.split(" +")(0).toLowerCase
  (1 to NUMBER_FEATURES)
    .filter(size => size <= name.length)
    .map(size => name.substring(name.length - size))
}

val tokenizeUDF = sqlContext.udf.register("tokenize", (name: String) => getFeatures(name))
It is not a fully featured solution, but you can start with something like this:
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class NGramTokenizer(override val uid: String)
  extends UnaryTransformer[String, Seq[String], NGramTokenizer] {

  def this() = this(Identifiable.randomUID("ngramtokenizer"))

  override protected def createTransformFunc: String => Seq[String] = {
    getFeatures _
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType)
  }

  override protected def outputDataType: DataType = {
    new ArrayType(StringType, true)
  }
}
Quick check:
val df = Seq((1L, "abcdef"), (2L, "foobar")).toDF("k", "v")
val transformer = new NGramTokenizer().setInputCol("v").setOutputCol("vs")
transformer.transform(df).show
// +---+------+------------------+
// | k| v| vs|
// +---+------+------------------+
// | 1|abcdef|[f, ef, def, cdef]|
// | 2|foobar|[r, ar, bar, obar]|
// +---+------+------------------+
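The expected vs values above can be cross-checked without a Spark session, because getFeatures is plain Scala. The following is a minimal sketch: the object name SuffixCheck, the explicit Seq[String] return type, and the extra "Jo Smith" input are illustrative assumptions and not part of the original answer.

```scala
object SuffixCheck {
  // Same logic as getFeatures in the question; the explicit return type
  // is an addition for clarity.
  def getFeatures(n: String): Seq[String] = {
    val NUMBER_FEATURES = 4
    // Keep only the first whitespace-separated token, lower-cased.
    val name = n.split(" +")(0).toLowerCase
    (1 to NUMBER_FEATURES)
      .filter(size => size <= name.length)               // skip suffixes longer than the token
      .map(size => name.substring(name.length - size))   // last `size` characters
  }

  def main(args: Array[String]): Unit = {
    println(getFeatures("abcdef").mkString(", "))   // f, ef, def, cdef
    println(getFeatures("foobar").mkString(", "))   // r, ar, bar, obar
    println(getFeatures("Jo Smith").mkString(", ")) // o, jo
  }
}
```

The last call shows the two edge cases the function handles: only the first token is used, and tokens shorter than four characters yield fewer suffixes.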
You can even try to generalize it to something like this:
class UnaryUDFTransformer[T, U](override val uid: String, f: T => U,
    inType: DataType, outType: DataType)
  extends UnaryTransformer[T, U, UnaryUDFTransformer[T, U]] {

  override protected def createTransformFunc: T => U = f

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == inType)

  override protected def outputDataType: DataType = outType
}

val transformer = new UnaryUDFTransformer("featurize", getFeatures,
    StringType, new ArrayType(StringType, true))
  .setInputCol("v")
  .setOutputCol("vs")
If you want to use a UDF rather than the wrapped function, you'll have to extend Transformer directly and override the transform method. Unfortunately, most of the useful classes are private, so it can be rather tricky.
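As a rough illustration of that last route, here is a minimal sketch of a class that extends Transformer directly and applies a UDF inside transform. It assumes Spark 2.x or later, where transform takes a Dataset[_] (in Spark 1.x the parameter is a DataFrame). The names SuffixFeatures and SuffixFeaturizer, and the hand-declared inputCol and outputCol params, are hypothetical and not taken from the original answer. It does not add persistence support.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical holder for the function from the question, so the UDF
// closure does not capture the transformer instance.
object SuffixFeatures {
  def getFeatures(n: String): Seq[String] = {
    val NUMBER_FEATURES = 4
    val name = n.split(" +")(0).toLowerCase
    (1 to NUMBER_FEATURES)
      .filter(size => size <= name.length)
      .map(size => name.substring(name.length - size))
  }
}

// Hypothetical transformer; assumes Spark 2.x+ method signatures.
class SuffixFeaturizer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("suffixfeaturizer"))

  // Params declared by hand, since this sketch avoids Spark's shared param traits.
  final val inputCol = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // The UDF is built once and applied as a column expression in transform.
  private val featurize = udf((n: String) => SuffixFeatures.getFeatures(n))

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset.withColumn($(outputCol), featurize(col($(inputCol))))
  }

  // Validate the input column and declare the appended output column.
  override def transformSchema(schema: StructType): StructType = {
    require(schema($(inputCol)).dataType == StringType,
      s"Column ${$(inputCol)} must be of type StringType")
    schema.add(StructField($(outputCol), ArrayType(StringType, true), true))
  }

  override def copy(extra: ParamMap): SuffixFeaturizer = defaultCopy(extra)
}
```

Usage would mirror the quick check above: new SuffixFeaturizer().setInputCol("v").setOutputCol("vs").transform(df). Note that a null value in the input column would make getFeatures throw, so real data may need a null guard in the UDF.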