Handle unseen categorical string Spark CountVectorizer

Question
I have seen StringIndexer has problems with unseen labels (see here).

My questions are:

Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?
Moreover, is the vocabulary size affected by the input data, or is it fixed by the vocabulary size parameter?
Last, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and gets some sort of default prediction?
Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?
It doesn't care about unseen values; they are simply ignored at transform time.
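Concretely, a fitted model only counts tokens that made it into the vocabulary; everything else is dropped silently. A minimal plain-Python sketch of that transform semantics (an illustration of the behavior, not Spark's actual implementation):

```python
# Plain-Python sketch of CountVectorizerModel.transform semantics:
# tokens outside the fitted vocabulary are silently dropped.
def transform(vocabulary, tokens):
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = [0.0] * len(vocabulary)
    for t in tokens:
        if t in index:          # unseen tokens are simply skipped
            counts[index[t]] += 1.0
    return counts

vocab = ["foo", "bar"]               # fitted on the training data
print(transform(vocab, ["foo"]))     # [1.0, 0.0]
print(transform(vocab, ["foobar"]))  # [0.0, 0.0] -- zero vector, no error
```

Note that an all-unseen input yields a zero vector rather than an error, matching the `(2,[],[])` row in the Spark example below.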
is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?
The size of the vector cannot exceed the vocabulary size, and it is further limited by the number of distinct values in the training data.
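In other words, the fitted vocabulary has size min(vocabSize, number of distinct tokens), keeping the most frequent tokens. A hedged sketch of that fitting rule (plain Python, not Spark's implementation):

```python
from collections import Counter

# Sketch of the fitting rule: keep at most vocab_size of the most
# frequent distinct tokens seen in the training data.
def fit_vocabulary(docs, vocab_size):
    freq = Counter(t for doc in docs for t in doc)
    return [term for term, _ in freq.most_common(vocab_size)]

train = [["foo"], ["bar"]]
print(len(fit_vocabulary(train, 1000)))  # 2 -- limited by distinct values
print(len(fit_vocabulary(train, 1)))     # 1 -- truncated by vocab_size
```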
shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and gets some sort of default prediction?
This is exactly what happens. The problem is slightly more complicated, though. StringIndexer is typically paired with OneHotEncoder, which by default encodes the base category as a vector of zeros to avoid the dummy variable trap. So using the same zero-vector approach for unseen labels during indexing would be ambiguous.
To illustrate all the points, consider the following example:
import org.apache.spark.ml.feature.CountVectorizer
val train = Seq(Seq("foo"), Seq("bar")).toDF("text")
val test = Seq(Seq("foo"), Seq("foobar")).toDF("text")
val vectorizer = new CountVectorizer().setInputCol("text")
vectorizer.setVocabSize(1000).fit(train).vocabulary
// Array[String] = Array(foo, bar)
/* Vocabulary size is truncated to the value
provided by VocabSize Param */
vectorizer.setVocabSize(1).fit(train).vocabulary
// Array[String] = Array(bar)
/* Unseen values are ignored and, if there are no known values,
we get a vector of zeros ((2,[],[])) */
vectorizer.setVocabSize(1000).fit(train).transform(test).show
// +--------+---------------------------+
// | text|cntVec_0a49b1315206__output|
// +--------+---------------------------+
// | [foo]| (2,[1],[1.0])|
// |[foobar]| (2,[],[])|
// +--------+---------------------------+