NLTK-based stemming and lemmatization


Question

I am trying to preprocess a string with a lemmatizer and then remove the punctuation and digits, using the code below. I don't get any error, but the text is not preprocessed properly: only the stop words are removed, the lemmatization has no effect, and the punctuation and digits remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

The expected output should look like this:

This beautiful day I work exercise text

Recommended answer

No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer. Those functions expect a single word; given a whole sentence, they treat it as one unknown word and return it unchanged.

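import re

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    # Lowercase the text, then strip punctuation and digits in one pass
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize each remaining word as a verb ('v'), skipping stop words
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                for i in cleaned_tweet.split() if i not in __stop_words])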
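
To see what the regular expression in clean does on its own, here is a minimal sketch that uses only the standard library. The helper name strip_punct_digits is an assumption made for illustration; it is not part of NLTK or of the answer's code.

```python
import re

def strip_punct_digits(text):
    """Lowercase text and delete punctuation and digits (illustrative helper)."""
    # [^\w\s] matches anything that is not a word character or whitespace
    # (i.e. punctuation); \d is needed separately because digits count as \w.
    return re.sub(r'([^\w\s]|\d)+', '', text.lower())

sample = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
print(strip_punct_digits(sample).split())
```

Unlike the question's code, the result of re.sub is kept and returned here. In the question, corpus.translate(table) is called without assigning its return value; since Python strings are immutable, translate returns a new string and corpus itself is left unchanged, which is why the punctuation and digits survive.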

Alternatively, you can use a PorterStemmer, which does a similar job to lemmatisation but without context: it strips suffixes by rule instead of looking words up in a dictionary, so the result is not always a real word.

from nltk.stem.porter import PorterStemmer  
stemmer = PorterStemmer() 

Then call the stemmer like this:

stemmer.stem(i)
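
Because stemmer.stem and lemmatizer.lemmatize both take a single word, one way to switch between them is to pass the normalizer in as a callable. The sketch below is only an illustration under stated assumptions: clean_with, toy_stem, and TOY_STOP_WORDS are hypothetical names, and the toy stand-ins exist only so the example runs without NLTK or its corpus data.

```python
import re

# Hypothetical stand-in for the NLTK English stop-word set.
TOY_STOP_WORDS = {"this", "is", "a", "i", "am", "on", "an"}

def toy_stem(word):
    """Crude illustration only: drop a trailing 'ing'. Not the Porter algorithm."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

def clean_with(normalize, text, stop_words=TOY_STOP_WORDS):
    """Strip punctuation and digits, drop stop words, normalize word by word."""
    stripped = re.sub(r"([^\w\s]|\d)+", "", text.lower())
    # The normalizer is applied to each word individually, never to the sentence.
    return " ".join(normalize(w) for w in stripped.split() if w not in stop_words)

print(clean_with(toy_stem, "I am; working on an exercise45.^^^45 text34."))
```

With NLTK installed, the same function could be called as clean_with(stemmer.stem, text, stop_words) or with lambda w: lemmatizer.lemmatize(w, 'v') as the normalizer.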
