apache spark MLLib: how to build labeled points for string features?

Question

I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents.

I'd like to put some things in as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double, Double]]].

Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String, Double]]].

I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLLib?

I believe the answer lies in the HashingTF class (i.e. hashing features), but I don't understand how it works: it appears to take some sort of capacity value, but my list of keywords and topics is effectively unbounded (or rather, unknown at the start).

Answer

HashingTF uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. Feature collisions are possible, but their likelihood can be reduced by choosing a larger number of features in the constructor.
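As a rough illustration of the hashing trick itself, here is a plain-Scala sketch with no Spark dependency. It is not Spark's exact implementation, but MLlib's HashingTF works in the same spirit: each term is hashed to a non-negative index below numFeatures, and counts accumulate per bucket.

```scala
// Sketch of the hashing trick: map an unbounded vocabulary into a fixed
// number of buckets by hashing each term, then count tokens per bucket.
def termFrequencies(doc: Seq[String], numFeatures: Int): Map[Int, Double] = {
  def bucket(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw // force a non-negative index
  }
  doc.groupBy(bucket).map { case (idx, terms) => (idx, terms.size.toDouble) }
}

val tf = termFrequencies(Seq("cats", "purr", "cats"), numFeatures = 1 << 10)
// the total token count (3) is preserved across the buckets
```

Because the index space is fixed up front, the vocabulary never needs to be known in advance, which answers the "unbounded list of keywords" concern from the question.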

In order to create features based not only on the content of a feature but also on some metadata (e.g. having a tag of 'cats' as opposed to having the word 'cats' in the document), you could feed the HashingTF class something like 'tag:cats', so that a tag containing a word hashes to a different slot than the word itself.
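A sketch of that namespacing idea; featureTokens and the prefixes are illustrative names, not part of MLlib:

```scala
// Hypothetical helper: prefix metadata features so the tag "cats" and the
// document word "cats" become distinct tokens and hash independently.
def featureTokens(words: Seq[String],
                  tags: Seq[String],
                  authors: Seq[String]): Seq[String] =
  words ++ tags.map("tag:" + _) ++ authors.map("author:" + _)

val tokens = featureTokens(Seq("cats", "purr"), Seq("cats"), Seq("doe"))
// tokens: Seq("cats", "purr", "tag:cats", "author:doe")
```

The resulting token sequence is what you would pass to HashingTF's transform, which treats each string as an opaque token to hash.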

If you've created feature count vectors using HashingTF, you can use them to create bag-of-words features by setting any count above zero to 1. You can also create TF-IDF vectors using the IDF class like so:

import org.apache.spark.mllib.feature.IDF

val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)
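The bag-of-words binarization mentioned above (any count above zero becomes 1) can be sketched on a sparse count map in plain Scala, independent of Spark's vector types:

```scala
// Turn term-frequency counts into binary bag-of-words indicators.
def binarize(counts: Map[Int, Double]): Map[Int, Double] =
  counts.map { case (idx, c) => (idx, if (c > 0) 1.0 else 0.0) }

val bow = binarize(Map(3 -> 2.0, 7 -> 1.0, 9 -> 0.0))
// bow: Map(3 -> 1.0, 7 -> 1.0, 9 -> 0.0)
```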

In your case it looks like you've already computed the counts of words per document. This won't work with the HashingTF class since it's designed to do the counting for you.
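One hedged workaround, if the per-document counts really are already computed: hash the term keys yourself and sum the counts of any colliding terms, mirroring what HashingTF would have produced from the raw tokens. This is a sketch, not an MLlib API:

```scala
// Hash precomputed (term -> count) pairs into a fixed index space,
// summing the counts of any terms that collide in the same bucket.
def hashCounts(counts: Map[String, Double], numFeatures: Int): Map[Int, Double] = {
  def bucket(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }
  counts.toSeq
    .map { case (term, c) => (bucket(term), c) }
    .groupBy(_._1)
    .map { case (idx, pairs) => (idx, pairs.map(_._2).sum) }
}
```

The resulting (index, value) pairs could then be turned into MLlib sparse vectors (e.g. via Vectors.sparse) for use in a LabeledPoint.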

This paper has some arguments about why feature collisions aren't that much of a problem in language applications. The essential reasons are that most words are uncommon (due to properties of languages) and that collisions are independent of word frequencies (due to hashing properties) so that it's unlikely that words that are common enough to help with one's models will both hash to the same slot.
