Dropping specific words out of an NLTK distribution beyond stopwords


Question


I have a simple sentence like so. I want to drop the prepositions and words such as A and IT out of the list. I looked through the Natural Language Toolkit (NLTK) documentation, but I can't find anything. Can someone show me how? Here is my code:

import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)

Answer


Essentially, nltk.probability.FreqDist is a collections.Counter object (https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L61). Given a dictionary object, there are several ways to filter it:

1. Read the tokens into a FreqDist, then filter it with a lambda function

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> word_freq = nltk.FreqDist(tokenized_text)
>>> dict_filter = lambda word_freq, stopwords: dict( (word,word_freq[word]) for word in word_freq if word not in stopwords )
>>> filtered_word_freq = dict_filter(word_freq, stopwords)
>>> len(word_freq)
17
>>> len(filtered_word_freq)
8
>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

2. Read the tokens into a FreqDist, then filter it with a dict comprehension

>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq = dict((word, freq) for word, freq in word_freq.items() if word not in stopwords)
>>> filtered_word_freq 
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

3. Filter the words before reading them into a FreqDist

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> filtered_tokenized_text = [word for word in tokenized_text if word not in stopwords]
>>> filtered_word_freq = nltk.FreqDist(filtered_tokenized_text)
>>> filtered_word_freq
FreqDist({'sentence': 2, 'information': 1, ',': 1, 'It': 1, '.': 1, 'much': 1, 'basic': 1, 'Hello': 1})

