Combining text stemming and removal of punctuation in NLTK and scikit-learn


Problem description

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

print('Vocabulary: %s' % list(vec.get_feature_names_out()))  # get_feature_names() in older scikit-learn
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

which prints:

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
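A side note on why no punctuation appears in this vocabulary: CountVectorizer lowercases the text and then tokenizes it with its default token_pattern, which only keeps runs of two or more word characters. A minimal stdlib sketch of that default tokenization:

```python
import re

# CountVectorizer's default token_pattern: runs of 2+ word characters,
# so punctuation never makes it into the vocabulary.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokens = token_pattern.findall('The swimmer likes swimming so he swims.'.lower())
print(sorted(set(tokens)))  # ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
```

This matches the vocabulary printed above, which is why the question below only becomes an issue once a custom tokenizer replaces the default.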

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

import nltk
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

print('Vocabulary: %s' % list(vect.get_feature_names_out()))  # get_feature_names() in older scikit-learn
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

Recommended answer

There are several options. One is to remove the punctuation before tokenization, but this would mean that don't -> dont:

import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
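To see that caveat concretely, here is a minimal stdlib-only sketch of the pre-tokenization stripping (the sample sentence is made up for illustration):

```python
import string

text = "Don't stop swimming."
# Drop every punctuation character before any tokenization happens;
# note the apostrophe in the contraction is lost too.
cleaned = "".join(ch for ch in text if ch not in string.punctuation)
print(cleaned)  # Dont stop swimming
```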

Or try removing punctuation after tokenization.

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
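A quick self-contained check of that filter, using a hand-written token list in place of nltk.word_tokenize output (an assumption for illustration; the real tokenizer may split contractions differently):

```python
import string

tokens = ["The", "swimmer", "swims", "."]  # roughly what word_tokenize returns
# Keep only tokens that are not pure punctuation marks.
tokens = [t for t in tokens if t not in string.punctuation]
print(tokens)  # ['The', 'swimmer', 'swims']
```

Because this runs after tokenization, contractions keep their pieces (e.g. a "n't" token survives), which the pre-tokenization approach above does not preserve.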

EDITED

The above code will work but it's rather slow because it's looping through the same text multiple times:

  • once to remove the punctuation
  • a second time to tokenize
  • a third time to stem.

If you have more steps, like removing digits, removing stopwords, or lowercasing, it would be better to lump the steps together as much as possible; that way each piece of text is traversed once rather than once per step.
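As a sketch of what lumping the steps together might look like, here is a single-pass tokenizer. The stop-word set and the suffix-stripping stand-in for PorterStemmer are illustrative assumptions so the example stays stdlib-only; in practice you would plug in PorterStemmer().stem and NLTK's stopword list.

```python
import string

def toy_stem(word):
    # Trivial suffix stripper standing in for nltk's PorterStemmer.
    for suffix in ("ming", "mer", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

STOPWORDS = {"the", "so", "he"}  # tiny illustrative subset

def tokenize(text):
    stems = []
    for token in text.split():  # one pass over the tokens
        token = token.strip(string.punctuation).lower()  # punctuation + case
        if not token or token in STOPWORDS:              # stopword removal
            continue
        stems.append(toy_stem(token))                    # stemming
    return stems

print(tokenize("The swimmer likes swimming so he swims."))
# ['swim', 'like', 'swim', 'swim']
```

Each token is cleaned, filtered, and stemmed in the same loop iteration, so the text is traversed only once no matter how many preprocessing steps are added.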
