Explanation of Spark ML CountVectorizer output


Question

Please help me understand the output of the Spark ML CountVectorizer, and suggest which documentation explains it.

import org.apache.spark.ml.feature.CountVectorizer

val cv = new CountVectorizer()
  .setInputCol("Tokens")
  .setOutputCol("Frequencies")
  .setVocabSize(5000)
  .setMinTF(1)
  .setMinDF(2)
val fittedCV = cv.fit(tokenDF.select("Tokens"))
fittedCV.transform(tokenDF.select("Tokens")).show(false)

2374 should be the number of terms (words) in the dictionary. What is the "[2,6,328,548,1234]"?

Are they the indices of the words "[airline, bag, vintage, world, champion]" in the dictionary? If so, why does the same word "airline" have a different index "0" in the second row?

+------------------------------------------+----------------------------------------------------------------+
|Tokens                                    |Frequencies                                                     |
+------------------------------------------+----------------------------------------------------------------+
...
|[airline, bag, vintage, world, champion]  |(2374,[2,6,328,548,1234],[1.0,1.0,1.0,1.0,1.0])                 |
|[airline, bag, vintage, jet, set, brown]  |(2374,[0,2,6,328,405,620],[1.0,1.0,1.0,1.0,1.0,1.0])            |
+------------------------------------------+----------------------------------------------------------------+


  [1]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer

Answer

There is some documentation explaining the basics (see the CountVectorizer scaladoc linked above [1]), but it is pretty bare.

Yes, the numbers represent words in a vocabulary index. However, the order in the frequencies vector does not correspond to the order in the tokens vector. airline, bag, and vintage appear in both rows, so they correspond to the indices [2, 6, 328], but you cannot rely on the tokens and the indices being in the same order.
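To see which term an index actually stands for, you can look it up in the fitted model's vocabulary. Here is a minimal sketch in PySpark, assuming fittedCV is a CountVectorizerModel fitted the same way as in the question (via pyspark.ml rather than the Scala API):

vocab = fittedCV.vocabulary            # list of terms; vector index i refers to vocab[i]
print(len(vocab))                      # 2374, the vector size shown above
print(vocab[2], vocab[6], vocab[328])  # the three terms shared by both rows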

The row data type is a SparseVector. The first array shows the indices and the second the values.

For example:

vector[328] 
   => 1.0
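You can reproduce this access pattern directly. A minimal sketch, assuming pyspark.ml.linalg and the values from the first row above:

from pyspark.ml.linalg import SparseVector

# size 2374, non-zero entries at the listed indices, all counts 1.0
v = SparseVector(2374, [2, 6, 328, 548, 1234], [1.0, 1.0, 1.0, 1.0, 1.0])
print(v[328])   # 1.0 -> the term at vocabulary index 328 occurs once
print(v[3])     # 0.0 -> any index not listed has a count of 0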

The mapping might look like this:

vocabulary
airline 328
bag 6
vintage 2

Frequencies
2374, [2, 6, 328], [99, 5, 7]

# counts
vintage x 99
bag x 5
airline x 7

To get the words back, you can do a lookup in the vocabulary. The vocabulary needs to be broadcast to the workers. You will also most likely want to explode the counts per document into separate rows.

Here is a Python snippet that uses a udf to extract the top 25 most frequent words per document into separate rows, and then computes the mean count for each word:

import pyspark.sql.types as T
import pyspark.sql.functions as F
from pyspark.sql import Row

vocabulary = sc.broadcast(fittedCV.vocabulary)

def _top_scores(v):
    # create a count Row for each non-zero index (i) in the vector (v)
    # `.item()` is used because in Python the count value is a numpy datatype;
    # in Scala it would just be a double
    counts = [Row(i=i.item(), count=v[i.item()].item()) for i in v.indices]
    # => [Row(i=2, count=30.0), Row(i=362, count=40.0)]

    # return the 25 rows with the highest counts
    counts = sorted(counts, reverse=True, key=lambda x: x.count)
    return counts[:25]

def _vecToWord(i):
    # look up the term behind a vocabulary index in the broadcast vocabulary
    return vocabulary.value[i]

top_scores = F.udf(_top_scores, T.ArrayType(T.StructType().add('i', T.IntegerType()).add('count', T.DoubleType())))
vec_to_word = F.udf(_vecToWord, T.StringType())

res = df.withColumn('word_count', F.explode(top_scores('Frequencies')))
=>
+------+-----+------------+
doc_id, ..., word_count
             (i, count)
+------+-----+------------+
4711,   ..., (2, 30.0)
4711,   ..., (362, 40.0)
+------+-----+------------+

res = res \
    .groupBy('word_count.i') \
    .agg(F.avg('word_count.count').alias('mean')) \
    .orderBy('mean', ascending=False)

res = res.withColumn('token', vec_to_word('i'))


=>
+----+---------+------+
  i,   token,   mean
+----+---------+------+
  2,   vintage,  15
 328,  airline,  30
+----+---------+------+
