Combining text stemming and removal of punctuation in NLTK and scikit-learn
Question
I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.
Below is an example of the plain usage of the CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which will print:
Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
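As a side note, no punctuation shows up in this vocabulary because, when no custom tokenizer is supplied, CountVectorizer extracts tokens with its default token_pattern, which only matches runs of two or more word characters. A quick check (assuming a reasonably recent scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer

# the default pattern matches only word characters, two or more at a time,
# so punctuation and one-letter tokens never reach the vocabulary
print(CountVectorizer().token_pattern)  # → (?u)\b\w\w+\b
```

Passing a tokenizer= callable (as below) bypasses token_pattern entirely, which is why '.' can appear in the vocabulary of the stemmed version. Note also that get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2 in favor of get_feature_names_out().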
Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:
import nltk
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
########
vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)
sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vect.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which prints:
Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]
But how would I best get rid of the punctuation characters in this second version?
Answer
There are several options. One is to remove the punctuation before tokenization, but this means that don't -> dont:
import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
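To see the don't -> dont caveat concretely, here is the punctuation-stripping step in isolation (plain Python, no NLTK needed; the sample sentence is just an illustration):

```python
import string

text = "The swimmer doesn't like swimming, so he swims."
# removes every character listed in string.punctuation,
# including the apostrophe inside the contraction
stripped = "".join(ch for ch in text if ch not in string.punctuation)
print(stripped)  # → The swimmer doesnt like swimming so he swims
```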
Or try removing punctuation after tokenization:
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
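With the post-tokenization variant, only tokens that are exactly one punctuation character get dropped. A minimal sketch, with a hand-written token list standing in for nltk.word_tokenize output:

```python
import string

tokens = ['The', 'swimmer', 'likes', 'swimming', '.']
# keep every token that is not a single punctuation character
filtered = [t for t in tokens if t not in string.punctuation]
print(filtered)  # → ['The', 'swimmer', 'likes', 'swimming']
```

One caveat: NLTK's tokenizer can emit multi-character punctuation tokens such as '...', and those are not substrings of string.punctuation, so they would survive this filter.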
EDITED

The above code works, but it is rather slow because it loops through the same text multiple times:
- once to remove the punctuation
- a second time to tokenize
- a third time to stem
If you have more steps, like removing digits, removing stopwords, lowercasing, etc., it would be better to lump the steps together as much as possible. Here are several better answers that are more efficient if your data requires additional pre-processing steps:
- Applying NLTK-based text pre-proccessing on a pandas dataframe
- Why is my NLTK function slow when processing the DataFrame?
- https://www.kaggle.com/alvations/basic-nlp-with-nltk
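The single-pass idea can be sketched like this (a minimal illustration, not the exact code from the linked answers: the regex tokenizer and the hand-picked stopword set are simplifying assumptions, and NLTK's PorterStemmer is used because it needs no corpus downloads):

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# tiny illustrative stopword set; in practice you might use
# nltk.corpus.stopwords, which requires a one-time download
STOPWORDS = {'the', 'so', 'he', 'a', 'an', 'and'}

def tokenize_once(text):
    # lowercase, split into alphabetic runs (drops punctuation and
    # digits), filter stopwords, and stem -- all in a single pass
    return [stemmer.stem(tok)
            for tok in re.findall(r'[a-z]+', text.lower())
            if tok not in STOPWORDS]

print(tokenize_once('The swimmer likes swimming so he swims.'))
# → ['swimmer', 'like', 'swim', 'swim']
```

The resulting function can then be passed directly as tokenizer=tokenize_once to CountVectorizer.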