Text classification - how to approach


Question

I'll try to describe what I have in mind.

There is text content stored in an MS SQL database. Content comes in daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only one category: the content is either "valid" or not.

What I want is to create a model based on the already validated content, save it, and use this model to "pre-validate" or mark new incoming content. I also want to update the model once in a while based on newly validated content. Hopefully I have explained myself clearly.

I am thinking of using Spark Streaming for data classification based on the created model, with the Naive Bayes algorithm. But how would you approach creating, updating, and storing the model? There are ~200K+ validated results (texts) of various lengths. Do I need that many for the model? And how do I use this model in Spark Streaming?

Thanks in advance.

Answer

Wow, this question is very broad and more related to machine learning than to Apache Spark. However, I will try to give you some hints or steps to follow (I won't do the work for you).


  1. Import all the libraries you need

from pyspark.mllib.classification import LogisticRegressionWithSGD, LogisticRegressionModel
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
import re


  • Load your data into an RDD

    # toy labeled examples as (text, label) pairs
    msgs = [("I love Star Wars but I can't watch it today", 1.0),
            ("I don't love Star Wars and people want to watch it today", 0.0),
            ("I dislike not being able to watch Star Wars", 1.0),
            ("People who love Star Wars are my friends", 1.0),
            ("I prefer to watch Star Wars on Netflix", 0.0),
            ("George Lucas shouldn't have sold the franchise", 1.0),
            ("Disney makes better movies than everyone else", 0.0)]

    rdd = sc.parallelize(msgs)
    


  • Tokenize your data (if you use ML it might be easier)

    # lower-case and split each text on whitespace, keeping the label
    # (indexing instead of tuple-unpacking lambdas, which are Python 2 only)
    rdd = rdd.map(lambda tl: ([w.lower() for w in re.split(" +", tl[0])], tl[1]))
    


  • Remove all unnecessary words (widely known as stop words) and symbols, e.g. "," and "&"

    commons = ["and", "but", "to"]
    # drop the stop words from each token list
    rdd = rdd.map(lambda tl: ([t for t in tl[0] if t not in commons], tl[1]))
    


  • Create a dictionary with all the distinct words in your dataset. It sounds huge, but there are not as many as you would expect, and I bet they will fit in your master node (however, there are other ways to approach this, but for simplicity I will keep it this way).

    # collect the distinct words across the whole dataset
    words = rdd.flatMap(lambda tl: tl[0]).distinct().collect()
    diffwords = len(words)
    


  • Convert your features into a DenseVector or SparseVector. I would obviously recommend the second, because normally a SparseVector requires less space to be represented; however, it depends on the data. Note that there are better alternatives, like hashing (see the sketch after this step), but I am trying to stay loyal to my verbose approach. After that, transform the tuple into a LabeledPoint

    def sparsify(length, tokens):
        # position of each distinct token in the global dictionary,
        # paired with how many times it occurs in this text
        indices = [words.index(t) for t in set(tokens)]
        quantities = [tokens.count(words[i]) for i in indices]

        return SparseVector(length, [(indices[i], quantities[i]) for i in range(len(indices))])

    rdd = rdd.map(lambda tl: LabeledPoint(tl[1], sparsify(diffwords, tl[0])))
    

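  • On the hashing alternative mentioned above, here is a minimal sketch using MLlib's HashingTF, which avoids collecting a global dictionary on the driver. Note that tokens_rdd stands for the (tokens, label) RDD produced by the stop-word step, and numFeatures=1000 is an arbitrary choice; both are placeholders, not part of the steps above.

    from pyspark.mllib.feature import HashingTF

    # hash each token list straight into a fixed-size SparseVector,
    # so no global `words` list has to be collected on the driver
    htf = HashingTF(numFeatures=1000)  # arbitrary size for this sketch
    hashed = tokens_rdd.map(lambda tl: LabeledPoint(tl[1], htf.transform(tl[0])))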

  • Fit your favorite model; in this case I used LogisticRegressionWithSGD for ulterior motives.

    lrm = LogisticRegressionWithSGD.train(rdd)
    


  • Save the model.

    lrm.save(sc, "mylovelymodel.model")
    


  • Load your LogisticRegressionModel in another application.

    lrm = LogisticRegressionModel.load(sc, "mylovelymodel.model")
    


  • Predict the category.

    lrm.predict(SparseVector(37,[2,4,5,13,15,19,23,26,27,29],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))
    # outputs 0
    
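  • The question also asks how to use the model in Spark Streaming and how to update it once in a while; the steps above cover neither. For updates, the simplest route is to retrain on the enlarged validated set and overwrite the saved model (MLlib also offers StreamingLogisticRegressionWithSGD for incremental learning on streams). For scoring incoming texts, a rough sketch, assuming a socket text source (the localhost:9999 host and port are placeholders for wherever the texts actually arrive) and reusing the tokenize/stop-word/sparsify pipeline from above:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # run each incoming text through the same pipeline as the training
    # data, then score it with the loaded model; note that sparsify()
    # will fail on words unseen at training time
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    preds = (lines
             .map(lambda text: [w.lower() for w in re.split(" +", text)])
             .map(lambda tokens: [t for t in tokens if t not in commons])
             .map(lambda tokens: (tokens, lrm.predict(sparsify(diffwords, tokens)))))
    preds.pprint()

    ssc.start()
    ssc.awaitTermination()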


  • Note that I didn't evaluate the accuracy of the model; however, it looks very promising, doesn't it?
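  • If you do want a quick accuracy check, a minimal sketch of one way to do it: hold out part of the LabeledPoint RDD and compare predictions against the true labels. The 80/20 split and the seed are arbitrary choices here, and with only seven toy examples the number is illustrative, not meaningful.

    # split the LabeledPoint RDD from the sparsify step into train/test
    train, test = rdd.randomSplit([0.8, 0.2], seed=42)
    model = LogisticRegressionWithSGD.train(train)

    # fraction of held-out points whose prediction matches the label
    correct = test.map(lambda p: 1.0 if model.predict(p.features) == p.label else 0.0)
    print(correct.mean())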

