How to use spark Naive Bayes classifier for text classification with IDF?

Question

I want to convert text documents into feature vectors using tf-idf, and then train a naive bayes algorithm to classify them.

I can easily load my text files without the labels and use HashingTF() to convert it into a vector, and then use IDF() to weight the words according to how important they are. But if I do that I get rid of the labels and it seems to be impossible to recombine the label with the vector even though the order is the same.
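
A minimal sketch of that corpus-level pipeline (the file path and the whitespace tokenization are just placeholders):

from pyspark.mllib.feature import HashingTF, IDF

# every document is hashed and IDF-weighted together, but the
# resulting tfidf RDD carries no labels any more
documents = sc.textFile("docs.txt").map(lambda line: line.split(" "))
tf = HashingTF().transform(documents)
tf.cache()                   # IDF().fit() makes an extra pass over the data
idf = IDF().fit(tf)
tfidf = idf.transform(tf)    # RDD of SparseVectors, no labels attached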

On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on it since it requires the whole corpus of documents (and the labels would get in the way).

The spark documentation for naive bayes only has one example where the points are already labeled and vectorized so that isn't much help.
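
For context, that documentation example starts from rows that are already LabeledPoints, roughly like this (the values are made up):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

# toy points that are already labeled and vectorized, no text involved
data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 1.0]))])
model = NaiveBayes.train(data, 1.0)  # second argument is the smoothing parameter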

I also had a look at this guide: http://help.mortardata.com/technologies/spark/train_a_machine_learning_model but here he only applies the hashing function on each document without idf.

So my question is whether there is a way to not only vectorize but also weight the words using idf for the naive Bayes classifier? The main problem seems to be Spark's insistence on only accepting RDDs of LabeledPoint as input to NaiveBayes. Here is the per-document mapping I have so far (tokenize and hashingTF are defined elsewhere):

def parseLine(line):
    label = line[1]     # the label is the 2nd element of each row
    features = line[3]  # the text is the 4th element of each row
    features = tokenize(features)
    features = hashingTF.transform(features)
    return LabeledPoint(label, features)

labeledData = data1.map(parseLine)

Answer

Standard PySpark approach (split -> transform -> zip) seems to work just fine:

from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes   

training_raw = sc.parallelize([
    {"text": "foo foo foo bar bar protein", "label": 1.0},
    {"text": "foo bar dna for bar", "label": 0.0},
    {"text": "foo bar foo dna foo", "label": 0.0},
    {"text": "bar foo protein foo ", "label": 1.0}])


# Split data into labels and features, transform
# preservesPartitioning is not really required
# since a map without a partitioner shouldn't trigger repartitioning
labels = training_raw.map(
    lambda doc: doc["label"],  # Standard Python dict access 
    preservesPartitioning=True  # not strictly required here (see note above)
)

tf = HashingTF(numFeatures=100).transform( ## use a much larger numFeatures in practice
    training_raw.map(lambda doc: doc["text"].split(), 
    preservesPartitioning=True))

idf = IDF().fit(tf)
tfidf = idf.transform(tf)

# Combine using zip
training = labels.zip(tfidf).map(lambda x: LabeledPoint(x[0], x[1]))

# Train and check
model = NaiveBayes.train(training)
labels_and_preds = labels.zip(model.predict(tfidf)).map(
    lambda x: {"actual": x[0], "predicted": float(x[1])})

To get some statistics you can use MulticlassMetrics:

from pyspark.mllib.evaluation import MulticlassMetrics
from operator import itemgetter

metrics = MulticlassMetrics(  # expects (prediction, label) pairs
    labels_and_preds.map(itemgetter("predicted", "actual")))

metrics.confusionMatrix().toArray()
## array([[ 2.,  0.],
##        [ 0.,  2.]])
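
Other aggregate figures are available on the same object; which attribute to use depends on the Spark version (a small sketch):

# overall accuracy of the predictions; in Spark releases before 2.0
# the same figure is available as metrics.precision()
metrics.accuracy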
