How to work with n-grams for classification tasks?


Question


I'm going to train a classifier on a sample dataset using n-grams. I searched for related content and wrote the code below. As I'm a beginner in Python, I have two questions.


1- Why should the dictionary have this 'True' structure (marked with a comment)? Is this related to the Naive Bayes classifier's input?


2- Which classifier do you recommend for this task?


Any other suggestions to shorten the code are welcome :).

from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import ngrams
from nltk.classify import NaiveBayesClassifier
import nltk.classify.util


stoplist = set(stopwords.words("english"))


def stopword_removal(words):
    useful_words = [word for word in words if word not in stoplist]
    return useful_words


def create_ngram_features(words, n):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])  # HERE
    return my_dict


for n in [1, 2]:
    positive_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = stopword_removal(movie_reviews.words(fileid))
        positive_data.append((create_ngram_features(words, n), "positive"))
    print('\n\n---------- Positive Data Sample----------\n', positive_data[0])

    negative_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = stopword_removal(movie_reviews.words(fileid))
        negative_data.append((create_ngram_features(words, n), "negative"))
    print('\n\n---------- Negative Data Sample ----------\n', negative_data[0])

    train_set = positive_data[:100] + negative_data[:100]
    test_set = positive_data[100:] + negative_data[100:]

    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print('\n', str(n)+'-gram accuracy:', accuracy)

Answer


Before training, you need to transform your n-grams into a matrix of codes of size <number_of_documents, max_document_representation_length>. For example, a common document representation is a bag-of-words, where each word/n-gram of the corpus dictionary is mapped to its frequency in the document.


The Naive Bayes classifier is the simplest classifier, but it performs poorly on noisy data and needs a balanced class distribution for training. You can try a boosting classifier, for example a gradient boosting machine, or a support vector machine.
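A minimal sketch of trying those alternatives on such a matrix, assuming scikit-learn (`GradientBoostingClassifier` and `LinearSVC` are its gradient boosting and linear SVM implementations; the tiny corpus and its labels are made up for illustration, not taken from `movie_reviews`):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy corpus, made up for illustration; in practice, use the
# movie_reviews documents and their pos/neg labels instead.
docs = ["great great movie", "wonderful fun film",
        "dull boring mess", "terrible awful plot"] * 10
labels = ["positive", "positive", "negative", "negative"] * 10

# Bag-of-words matrix over unigrams and bigrams.
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

# Fit each classifier and report its accuracy on the held-out split.
scores = {}
for clf in (GradientBoostingClassifier(), LinearSVC()):
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)
print(scores)
```

Both classifiers accept the sparse matrix produced by `CountVectorizer` directly, so the same feature pipeline can be reused while swapping models.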


All of these classifiers and transformers are available in the scikit-learn library.
