Handle unseen categorical string Spark CountVectorizer


Problem description


I have seen that StringIndexer has problems with unseen labels (see here).

My questions are:

  1. Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?

  2. Moreover, is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?

  3. Last, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and yields some sort of default prediction?

Solution

Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?

It doesn't care about unseen values: tokens that are not in the fitted vocabulary simply don't contribute to the output vector.
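To see this behavior outside Spark, here is a minimal pure-Python sketch of the transform step; the function name and structure are illustrative, not Spark's API, but the logic mirrors what CountVectorizer does with out-of-vocabulary tokens:

```python
# Illustrative mimic of CountVectorizer.transform: tokens found in the
# fitted vocabulary are counted, unseen tokens are silently dropped.
def count_vectorize(tokens, vocabulary):
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = [0.0] * len(vocabulary)
    for t in tokens:
        if t in index:          # unseen tokens are simply ignored
            counts[index[t]] += 1.0
    return counts

vocab = ["foo", "bar"]
print(count_vectorize(["foo"], vocab))     # only the known token is counted
print(count_vectorize(["foobar"], vocab))  # unseen-only input -> all-zero vector
```

A document made up entirely of unseen tokens thus transforms to a zero vector rather than raising an error, exactly as the `(2,[],[])` row in the Spark example below shows.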

is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?

The size of the vector cannot exceed the vocabulary size parameter, and it is further limited by the number of distinct values in the input data.

shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and yields some sort of default prediction?

This is exactly what happens. The problem is slightly more complicated, though. StringIndexer is typically paired with OneHotEncoder, which by default encodes the base category as a vector of zeros to avoid the dummy variable trap. So using the same all-zeros encoding for an unseen category would be ambiguous.
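The ambiguity can be sketched in a few lines of plain Python (this is an illustration of drop-last one-hot encoding in general, not Spark's actual implementation): with the base category dropped, the all-zero vector already has a meaning, so it cannot double as an "unknown" marker.

```python
# Sketch of one-hot encoding with the last (base) category dropped,
# the scheme OneHotEncoder uses by default to avoid the dummy
# variable trap.
def one_hot_drop_last(index, num_categories):
    vec = [0.0] * (num_categories - 1)  # one slot fewer than categories
    if index < num_categories - 1:
        vec[index] = 1.0                # base category -> all zeros
    return vec

# With 3 categories, category 2 (the base) encodes to all zeros:
print(one_hot_drop_last(0, 3))  # [1.0, 0.0]
print(one_hot_drop_last(2, 3))  # [0.0, 0.0] -- indistinguishable from
                                # a hypothetical "unknown" zero row
```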

To illustrate all these points, consider the following example:

import org.apache.spark.ml.feature.CountVectorizer

val train = Seq(Seq("foo"), Seq("bar")).toDF("text")
val test = Seq(Seq("foo"), Seq("foobar")).toDF("text")

// Fit a CountVectorizer on the training data only
val vectorizer = new CountVectorizer().setInputCol("text")

vectorizer.setVocabSize(1000).fit(train).vocabulary
// Array[String] = Array(foo, bar)

/* Vocabulary size is truncated to the value 
provided by VocabSize Param */

vectorizer.setVocabSize(1).fit(train).vocabulary
// Array[String] = Array(bar)

/* Unseen values are ignored and if there are no known values
we get vector of zeros ((2,[],[])) */

vectorizer.setVocabSize(1000).fit(train).transform(test).show
// +--------+---------------------------+
// |    text|cntVec_0a49b1315206__output|
// +--------+---------------------------+
// |   [foo]|              (2,[1],[1.0])|
// |[foobar]|                  (2,[],[])|
// +--------+---------------------------+
