NLTK-based stemming and lemmatization
Problem description
I am trying to preprocess a string using a lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any error, but the text is not preprocessed appropriately: only the stop words are removed, while the lemmatizing does not work and the punctuation and digits remain.
from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)
The final output I get is:
This beautiful day16~ . I ; working exercise45.^^^45 text34 .
And the expected output should look like:
This beautiful day I work exercise text
Recommended answer
No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer. Those functions expect a single word, so they treat your whole string as one token rather than interpreting it as a sentence.
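import re
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))
def clean(tweet):
    # Strip punctuation and digits from the lowercased tweet.
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize each remaining word as a verb ('v'), skipping stop words.
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])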
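To see what the regular expression in clean does on its own, here is a small sketch that applies it to the sample string from the question using only the standard library. The helper name strip_punct_and_digits is hypothetical and is not part of the answer; it only isolates the cleaning step so the tokens that would reach the lemmatizer are visible.

```python
import re

def strip_punct_and_digits(text):
    # Remove every run of characters that is either a digit or
    # neither a word character nor whitespace (i.e. punctuation).
    return re.sub(r'([^\w\s]|\d)+', '', text)

sample = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
cleaned = strip_punct_and_digits(sample)

# Splitting on whitespace afterwards yields one token per word,
# ready to be passed to a lemmatizer or stemmer one at a time.
tokens = cleaned.lower().split()
print(tokens)
```

Note that \w also matches the underscore, so underscores are kept by this pattern.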
Alternatively, instead of the lemmatizer you can use a PorterStemmer, which serves a similar purpose to lemmatisation but strips suffixes by rule, without a dictionary or part-of-speech context, so the result is not always a real word.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
Then call the stemmer on each word like this:
stemmer.stem(i)
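Separately from lemmatization, the punctuation and digits survive in the question's code because str.translate returns a new string and the result is never assigned back to corpus. A minimal sketch of that fix, using only the standard library, might look like the following. It starts from the output string shown in the question, and it assumes you want the characters deleted rather than replaced with spaces as the question's translation tables did.

```python
import string

corpus = "This beautiful day16~ . I ; working exercise45.^^^45 text34 ."

# The third argument to str.maketrans lists characters to delete.
table = str.maketrans('', '', string.punctuation + string.digits)

# Strings are immutable: translate() returns a new string,
# so the result has to be assigned.
corpus = corpus.translate(table)

# Collapse the leftover runs of whitespace.
corpus = ' '.join(corpus.split())
print(corpus)
```

This only addresses the punctuation and digits; the words still need to be lemmatized or stemmed one at a time as described above.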