apache spark MLLib: how to build labeled points for string features?

Question

I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents.

I'd like to use several things as features (e.g. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double,Double]]].

Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String,Double]]].

I could write my own conversion, but that seems odd. How am I supposed to handle this with MLLib?

I believe the answer lies in the HashingTF class (i.e. hashing features), but I don't understand how it works: it appears to take some sort of capacity value, yet my list of keywords and topics is effectively unbounded (or rather, unknown at the outset).

Answer

HashingTF uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. Feature collisions are possible, but their likelihood can be reduced by choosing a larger number of features in the constructor.
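
For example, here is a minimal sketch of that constructor argument in use (the token list and the choice of 2^18 features are illustrative assumptions, not from the original answer):

import org.apache.spark.mllib.feature.HashingTF

// Bound the feature space to 2^18 slots; every token hashes into one of them.
val tf = new HashingTF(numFeatures = 1 << 18)
val doc = Seq("cats", "are", "great")  // tokens from one document
val vector = tf.transform(doc)         // fixed-size term-frequency vector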

To create features based not only on the content of a feature but also on some metadata (e.g. having a tag 'cats' as opposed to having the word 'cats' in the document), you could feed the HashingTF class something like 'tag:cats', so that a tag containing a word hashes to a different slot than the word itself.
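
As a sketch of that prefixing idea (the Doc record and its field names are hypothetical, invented for illustration):

import org.apache.spark.mllib.feature.HashingTF

case class Doc(text: Seq[String], tags: Seq[String])  // hypothetical input record

// Prefix metadata tokens so "tag:cats" hashes to a different slot than plain "cats".
def tokens(d: Doc): Seq[String] = d.text ++ d.tags.map("tag:" + _)

val tf = new HashingTF(1 << 18)
val features = tf.transform(tokens(Doc(Seq("cats", "purr"), Seq("cats"))))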

If you've created feature count vectors with HashingTF, you can turn them into bag-of-words features by setting any count above zero to 1. You can also create TF-IDF vectors using the IDF class, like so:

import org.apache.spark.mllib.feature.IDF

val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)
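
Putting the pieces together for the original question, the whole pipeline might look like the sketch below (the input RDD docs of (label, tokens) pairs is an assumption):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

// docs: RDD[(Double, Seq[String])] of (label, tokens) pairs -- assumed input
val featureCounts = new HashingTF(1 << 18).transform(docs.map(_._2))
val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)

// Zip each label back with its TF-IDF vector to build labeled points.
val training = docs.map(_._1).zip(tfIdf).map { case (label, v) => LabeledPoint(label, v) }
val model = NaiveBayes.train(training, lambda = 1.0)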

In your case it looks like you've already computed the counts of words per document. This won't work with the HashingTF class since it's designed to do the counting for you.
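
If you do want to keep your precomputed counts, one possible workaround (a sketch that mimics HashingTF's indexing by hand; none of this is from the original answer) is to hash each term yourself and build the sparse vectors directly:

import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 1 << 18
val counts = Seq("cats" -> 3.0, "tag:cats" -> 1.0)  // assumed precomputed (term, count) pairs

// Map each term to a non-negative slot in [0, numFeatures), summing counts that collide.
val indexed = counts
  .map { case (term, n) => (((term.hashCode % numFeatures) + numFeatures) % numFeatures, n) }
  .groupBy(_._1).mapValues(_.map(_._2).sum).toSeq

val vector = Vectors.sparse(numFeatures, indexed)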

This paper has some arguments about why feature collisions aren't much of a problem in language applications. The essential reasons are that most words are uncommon (a property of language) and that collisions are independent of word frequency (a property of hashing), so it's unlikely that two words common enough to help a model will hash to the same slot.
