Text classification - how to approach

Question

I'll try to describe what I have in mind.

There is text content stored in an MS SQL database. The content comes in daily as a stream. Some people go through it every day and, if it fits certain criteria, mark it as validated. There is only one category: it is either "valid" or not.

What I want is to create a model based on the already validated content, save it, and use this model to "pre-validate" or mark new incoming content, and also, once in a while, to update the model based on newly validated content. Hopefully I have explained myself clearly.

I am thinking of using Spark Streaming for data classification based on the created model, together with the Naive Bayes algorithm. But how would you approach creating, updating and storing the model? There are ~200K+ validated results (texts) of various lengths. Do I need that many for the model? And how do I use this model in Spark Streaming?

Thanks in advance.

Answer

Wow, this question is very broad and more related to Machine Learning than to Apache Spark; however, I will try to give you some hints or steps to follow (I won't do the work for you).

  • Import all the libraries you need

    from pyspark.mllib.classification import LogisticRegressionWithSGD, LogisticRegressionModel
    from pyspark.mllib.linalg import SparseVector
    from pyspark.mllib.regression import LabeledPoint
    import re

  • Load your data into an RDD

    msgs = [("I love Star Wars but I can't watch it today", 1.0),
            ("I don't love Star Wars and people want to watch it today", 0.0),
            ("I dislike not being able to watch Star Wars", 1.0),
            ("People who love Star Wars are my friends", 1.0),
            ("I preffer to watch Star Wars on Netflix", 0.0),
            ("George Lucas shouldn't have sold the franchise", 1.0),
            ("Disney makes better movies than everyone else", 0.0)]
    
    rdd = sc.parallelize(msgs)
    

  • Tokenize your data (if you use ML it might be easier):

    # (text, label) pairs; Python 3 lambdas can't tuple-unpack, so index instead
    rdd = rdd.map(lambda tl: ([w.lower() for w in re.split(" +", tl[0])], tl[1]))
    

  • Remove all unnecessary words (the widely known stop words) and symbols, e.g. ,.&

    # a tiny stop-word list for the example
    commons = ["and", "but", "to"]
    rdd = rdd.map(lambda tl: ([token for token in tl[0] if token not in commons], tl[1]))
    

  • Create a dictionary with all the distinct words in your dataset. It sounds huge, but there are not as many as you would expect, and I bet they will fit in your master node (there are other ways to approach this, but for simplicity I will keep it this way).

    # collect every distinct word in the dataset to the driver
    words = rdd.flatMap(lambda tl: tl[0]).distinct().collect()
    diffwords = len(words)
    

  • Convert your features into a DenseVector or SparseVector. I would obviously recommend the second, because a SparseVector normally requires less space to be represented; however, it depends on the data. Note that there are better alternatives, like hashing, but I am trying to stay loyal to my verbose approach (a hashing sketch follows this step). After that, transform each tuple into a LabeledPoint:

    def sparsify(length, tokens):
        # position of each distinct token in the global word list
        indices = [words.index(t) for t in set(tokens)]
        # how many times each of those tokens occurs in this text
        quantities = [tokens.count(words[i]) for i in indices]

        return SparseVector(length, list(zip(indices, quantities)))

    rdd = rdd.map(lambda tl: LabeledPoint(tl[1], sparsify(diffwords, tl[0])))
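
    As an aside on the hashing alternative mentioned above: a minimal sketch using
    pyspark.mllib.feature.HashingTF, which skips the global word list entirely by
    hashing tokens into a fixed number of buckets (the numFeatures value is an
    arbitrary choice of mine, not part of the original answer). It would replace
    the sparsify mapping and applies to the (tokens, label) pairs from the
    stop-word step:

    from pyspark.mllib.feature import HashingTF

    # hash each token into one of 1000 buckets; no dictionary on the driver needed
    htf = HashingTF(numFeatures=1000)
    hashed = rdd.map(lambda tl: LabeledPoint(tl[1], htf.transform(tl[0])))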
    

  • Fit your favorite model; in this case I used LogisticRegressionWithSGD (with ulterior motives).

    lrm = LogisticRegressionWithSGD.train(rdd)
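
    Since the question mentions Naive Bayes: assuming you stay within
    pyspark.mllib, NaiveBayes is a drop-in alternative at this step (my
    suggestion, not part of the original answer), since the term-count features
    are non-negative:

    from pyspark.mllib.classification import NaiveBayes

    # trains on the same RDD of LabeledPoints as the logistic regression above
    nbm = NaiveBayes.train(rdd)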
    

  • Save your model.

    lrm.save(sc, "mylovelymodel.model")
    

  • Load your LogisticRegressionModel in another application.

    lrm = LogisticRegressionModel.load(sc, "mylovelymodel.model")
    

  • Predict the category.

    lrm.predict(SparseVector(37,[2,4,5,13,15,19,23,26,27,29],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))
    # outputs 0
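
    On the part of the question about Spark Streaming: a minimal sketch, assuming
    text arrives on a TCP socket and reusing the sparsify helper and commons list
    from above (the host, port, and batch interval are placeholders of mine, not
    from the original answer):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # apply the same preprocessing as in training, then score with the loaded model
    lines = ssc.socketTextStream("localhost", 9999)
    predictions = lines.map(lambda text: [w.lower() for w in re.split(" +", text)]) \
                       .map(lambda tokens: [t for t in tokens if t not in commons]) \
                       .map(lambda tokens: lrm.predict(sparsify(diffwords, tokens)))
    predictions.pprint()

    ssc.start()
    ssc.awaitTermination()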
    

  • Note that I didn't evaluate the accuracy of the model; however, it looks pretty good, doesn't it?
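
    If you want a quick accuracy number anyway, a minimal sketch, assuming a plain
    random train/test split on the LabeledPoint RDD from above (the 80/20 ratio
    and seed are arbitrary choices of mine):

    # hold out 20% of the labeled data for testing
    train, test = rdd.randomSplit([0.8, 0.2], seed=42)
    lrm = LogisticRegressionWithSGD.train(train)

    # fraction of held-out points whose predicted label matches the true label
    correct = test.map(lambda p: 1.0 if lrm.predict(p.features) == p.label else 0.0)
    print("accuracy: %f" % correct.mean())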
