Create a custom Transformer in PySpark ML


Problem Description


I am new to Spark SQL DataFrames and to ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from NLTK? Can I extend the default one?

Thanks.

Solution

Can I extend the default one?

Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like the other transformers and estimators from pyspark.ml.feature, it delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly.
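As a quick sanity check (a minimal sketch, assuming a working PySpark installation; not part of the original answer), you can confirm that class hierarchy directly:

from pyspark.ml.feature import Tokenizer
from pyspark.ml.wrapper import JavaTransformer

# The stock Tokenizer is only a thin Python wrapper around a Scala object,
# so its tokenization logic cannot be overridden from Python.
print(issubclass(Tokenizer, JavaTransformer))  # expected: True

With that in mind, a pure-Python implementation can look like this: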

import nltk
from pyspark.ml.pipeline import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param
from pyspark.ml.util import keyword_only

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

class NLTKWordPunctTokenizer(Transformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super(NLTKWordPunctTokenizer, self).__init__()
        # Declare the custom parameter; inputCol/outputCol come from the mixins.
        self.stopwords = Param(self, "stopwords", "words to be filtered out")
        self._setDefault(stopwords=set())
        # keyword_only captures the constructor arguments; recent PySpark
        # versions expose them as self._input_kwargs instead.
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, stopwords=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def setStopwords(self, value):
        self._paramMap[self.stopwords] = value
        return self

    def getStopwords(self):
        return self.getOrDefault(self.stopwords)

    def _transform(self, dataset):
        stopwords = self.getStopwords()

        def f(s):
            # Tokenize with NLTK's wordpunct tokenizer and drop stop words.
            tokens = nltk.tokenize.wordpunct_tokenize(s)
            return [t for t in tokens if t.lower() not in stopwords]

        t = ArrayType(StringType())
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        # Apply the function as a UDF that returns an array of strings.
        return dataset.withColumn(out_col, udf(f, t)(in_col))

Example usage (data from the Spark ML - Features guide):

sentenceDataFrame = sqlContext.createDataFrame([
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
], ["label", "sentence"])

tokenizer = NLTKWordPunctTokenizer(
    inputCol="sentence", outputCol="words",  
    stopwords=set(nltk.corpus.stopwords.words('english')))

tokenizer.transform(sentenceDataFrame).show()
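
Because the class implements _transform and exposes inputCol/outputCol params, it can also be used as a stage in a regular pyspark.ml Pipeline. The snippet below is an illustrative sketch, not part of the original answer; it assumes the NLTK stopwords corpus has already been downloaded (nltk.download('stopwords')) and chains the custom tokenizer with the built-in HashingTF stage:

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF

# Reuse the tokenizer defined above and feed its output into HashingTF.
hashing_tf = HashingTF(inputCol="words", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, hashing_tf])

# Neither stage needs fitting, so fit() simply assembles a PipelineModel.
model = pipeline.fit(sentenceDataFrame)
model.transform(sentenceDataFrame).select("words", "features").show()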
