Explanation of Spark ML CountVectorizer output
Question
Please help understand the output of the Spark ML CountVectorizer and suggest which documentation explains it.
val cv = new CountVectorizer()
.setInputCol("Tokens")
.setOutputCol("Frequencies")
.setVocabSize(5000)
.setMinTF(1)
.setMinDF(2)
val fittedCV = cv.fit(tokenDF.select("Tokens"))
fittedCV.transform(tokenDF.select("Tokens")).show(false)
2374 should be the number of terms (words) in the dictionary. What is the "[2,6,328,548,1234]"?
Are they indices of the words "[airline, bag, vintage, world, champion]" in the dictionary? If so, why does the same word "airline" have a different index "0" in the second line?
+------------------------------------------+----------------------------------------------------------------+
|Tokens |Frequencies |
+------------------------------------------+----------------------------------------------------------------+
...
|[airline, bag, vintage, world, champion] |(2374,[2,6,328,548,1234],[1.0,1.0,1.0,1.0,1.0]) |
|[airline, bag, vintage, jet, set, brown] |(2374,[0,2,6,328,405,620],[1.0,1.0,1.0,1.0,1.0,1.0]) |
+------------------------------------------+----------------------------------------------------------------+
[1]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer
Answer
There is some documentation explaining the basics, but it is pretty bare.
Yes, the numbers represent words in the vocabulary index. However, the order in the Frequencies vector does not correspond to the order in the Tokens vector: a SparseVector always lists its indices in ascending order.
airline, bag, vintage
appear in both rows, hence they correspond to indices [2, 6, 328] in both. You can't rely on the indices following the token order; in the second row, index 0 belongs to one of jet, set, or brown, not to airline.
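To make the ordering point concrete, here is a minimal plain-Python sketch (not Spark) of how a count-vectorizer turns a token list into the (size, indices, values) triple. The `vocab` list is a hypothetical stand-in for the fitted vocabulary; the position in the list is the vocabulary index.

```python
from collections import Counter

# Hypothetical vocabulary; list position == vocabulary index.
vocab = ["the", "a", "vintage", "of", "to", "in", "bag", "airline"]
index_of = {w: i for i, w in enumerate(vocab)}

def vectorize(tokens):
    # Count only in-vocabulary tokens (out-of-vocabulary words are dropped).
    counts = Counter(t for t in tokens if t in index_of)
    # Sort by vocabulary index, NOT by the token order in the row.
    pairs = sorted((index_of[t], float(n)) for t, n in counts.items())
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return len(vocab), indices, values

print(vectorize(["airline", "bag", "vintage", "champion"]))
# -> (8, [2, 6, 7], [1.0, 1.0, 1.0])  ("champion" is out of vocabulary here)
```

Note that reordering the input tokens produces exactly the same triple, which is why the same words always map to the same indices across rows.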
The row data type is a SparseVector. The first array shows the indices and the second the values.
For example, vector[328] => 1.0.
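The indices/values pairing can be sketched in plain Python (no Spark needed); absent indices are implicitly zero, which is the whole point of the sparse representation:

```python
def sparse_value(indices, values, i):
    """Return the count stored at vocabulary index i; absent indices are 0.0."""
    lookup = dict(zip(indices, values))
    return lookup.get(i, 0.0)

indices = [2, 6, 328, 548, 1234]
values = [1.0, 1.0, 1.0, 1.0, 1.0]
print(sparse_value(indices, values, 328))  # -> 1.0
print(sparse_value(indices, values, 3))    # -> 0.0 (index 3 is not stored)
```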
The mapping could be as follows:
vocabulary
airline 328
bag       6
vintage   2
Frequencies
(2374, [2, 6, 328], [99, 5, 7])
# counts
vintage x 99
bag x 5
airline x 7
In order to get the words back, you can do a lookup in the vocabulary. The vocabulary needs to be broadcast to the workers. You also most probably want to explode the counts per document into separate rows.
Here is a Python code snippet that uses a udf to extract the top 25 frequent words per document into separate rows and computes the mean count for each word:
import pyspark.sql.types as T
import pyspark.sql.functions as F
from pyspark.sql import Row

vocabulary = sc.broadcast(fittedCV.vocabulary)

def _top_scores(v):
    # create count tuples for each index (i) in a vector (v)
    # `.item()` is used because in Python the count value is a numpy datatype;
    # in Scala it would just be a double
    counts = [Row(i=i.item(), count=v[i.item()].item()) for i in v.indices]
    # => [Row(i=2, count=30), Row(i=362, count=40)]
    # return the 25 rows with the highest counts
    counts = sorted(counts, reverse=True, key=lambda x: x.count)
    return counts[:25]

top_scores = F.udf(_top_scores, T.ArrayType(T.StructType().add('i', T.IntegerType()).add('count', T.DoubleType())))

def _vecToWord(i):
    return vocabulary.value[i]

vec_to_word = F.udf(_vecToWord, T.StringType())

res = df.withColumn('word_count', F.explode(top_scores('Frequencies')))
=>
+-------+-----+-------------+
|doc_id | ... | word_count  |
|       |     | (i, count)  |
+-------+-----+-------------+
|4711   | ... | (2, 30.0)   |
|4711   | ... | (362, 40.0) |
+-------+-----+-------------+
res = res \
    .groupBy('word_count.i') \
    .agg(F.avg('word_count.count').alias('mean')) \
    .orderBy('mean', ascending=False)
res = res.withColumn('token', vec_to_word('i'))
=>
+-----+---------+------+
| i   | token   | mean |
+-----+---------+------+
| 2   | vintage | 15   |
| 328 | airline | 30   |
+-----+---------+------+
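The aggregation above can also be traced in plain Python to check the logic. The two documents and their counts below are hypothetical, chosen so the means come out as 15 for index 2 and 30 for index 328, matching the table:

```python
from collections import defaultdict

def top_scores(indices, values, k=25):
    # Keep the k (index, count) pairs with the highest counts.
    pairs = sorted(zip(indices, values), key=lambda p: p[1], reverse=True)
    return pairs[:k]

# Hypothetical (indices, values) rows standing in for two documents.
docs = [([2, 6, 328], [20.0, 5.0, 7.0]),
        ([2, 328], [10.0, 53.0])]

sums, cnt = defaultdict(float), defaultdict(int)
for idx, vals in docs:
    for i, c in top_scores(idx, vals):
        sums[i] += c
        cnt[i] += 1

means = {i: sums[i] / cnt[i] for i in sums}
print(means)
# -> {2: 15.0, 6: 5.0, 328: 30.0}
```

This mirrors the `groupBy('word_count.i').agg(F.avg(...))` step: one mean per vocabulary index, averaged only over the documents in which the index survived the top-k cut.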