Get rid of stopwords and punctuation

Question

I'm struggling with NLTK stopwords.

Here's my bit of code. Could someone tell me what's wrong?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''

Recommended answer

Your problem is that iterating over a string yields each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

You need to iterate over the words and check each one. Fortunately, a split function already exists in the Python standard library as the string method str.split. However, because you are dealing with natural language that includes punctuation, a more robust approach uses the re module (the original answer linked to one).
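The difference between the two tokenizing approaches can be seen in a minimal sketch. The sample sentence and the variable names split_words and regex_words are illustrative assumptions, not part of the original answer.

```python
import re

text = "Buenos dias, amigo."

# str.split() breaks on whitespace only, so punctuation stays attached
# to the neighbouring word.
split_words = text.split()
print(split_words)   # ['Buenos', 'dias,', 'amigo.']

# re.findall(r"\w+", ...) keeps only runs of word characters, which
# drops the comma and the full stop.
regex_words = re.findall(r"\w+", text)
print(regex_words)   # ['Buenos', 'dias', 'amigo']
```

With split, a token such as 'dias,' would never match a stop word entry, which is why the regex route is the safer choice for ordinary prose.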

Once you have a list of words, you should lowercase them all before comparing, and then compare them in the manner you have already shown.

Buena suerte.

Okay, try this code; it should work for you. It shows two ways to do it. They are essentially identical, but the first is a bit clearer while the second is more Pythonic.

import re
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus: run nltk.download('stopwords') once.
sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words.
# In Python 3, \w is Unicode-aware by default for str patterns, and
# combining re.UNICODE with re.LOCALE raises an error, so no flags are passed.
words = re.findall(r'\w+', sentence)

# Load the stop word list once, as a set, instead of once per word
spanish_stopwords = set(stopwords.words('spanish'))

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in spanish_stopwords:
        important_words.append(word)

print(important_words)

# This is the more Pythonic way (filter returns an iterator in Python 3)
important_words = list(filter(lambda x: x not in spanish_stopwords, words))

print(important_words)
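The same steps can also be wrapped in a function, much like the removeStopwords function from the question. The following is only a sketch: the name remove_stopwords and the small SAMPLE_STOPWORDS set are assumptions made so that the example runs without the NLTK corpus. In real use you would presumably pass set(stopwords.words('spanish')) instead.

```python
import re

# Hypothetical stand-in for set(stopwords.words('spanish')); the real NLTK
# list is much longer. It is used here so the sketch runs without the corpus.
SAMPLE_STOPWORDS = {"el", "del", "es", "que", "se", "las", "de", "y", "a"}

def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stop_words]

result = remove_stopwords("El problema del matrimonio es que se acaba.")
print(result)   # ['problema', 'matrimonio', 'acaba']
```

Taking the text as a string and tokenizing inside the function avoids the original mistake of iterating over the string character by character.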

Hope this helps.
