Extracting noun phrases from NLTK using python

Question
I am new to both Python and NLTK. I have adapted the code from https://gist.github.com/alexbowe/879414 into the version below so that it can run over many documents/text chunks, but I get the following error:
Traceback (most recent call last):
  File "E:/NLP/PythonProgrames/NPExtractor/AdvanceMain.py", line 16, in <module>
    result = np_extractor.extract()
  File "E:\NLP\PythonProgrames\NPExtractor\NPExtractorAdvanced.py", line 67, in extract
    for term in terms:
  File "E:\NLP\PythonProgrames\NPExtractor\NPExtractorAdvanced.py", line 60, in get_terms
    for leaf in self.leaves(tree):
TypeError: leaves() takes 1 positional argument but 2 were given
Can anyone help me fix this problem? I have to extract noun phrases from millions of product reviews. I used the Stanford NLP kit from Java, but it was extremely slow, so I thought using NLTK in Python would be better. Please also recommend a better solution if there is one.
import nltk
from nltk.corpus import stopwords

stopwords = stopwords.words('english')

grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}   # Nouns and Adjectives, terminated with Nouns
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}   # Above, connected with in/of/etc...
"""
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

class NounPhraseExtractor(object):

    def __init__(self, sentence):
        self.sentence = sentence

    def execute(self):
        # Taken from Su Nam Kim Paper...
        chunker = nltk.RegexpParser(grammar)
        #toks = nltk.regexp_tokenize(text, sentence_re)
        #postoks = nltk.tag.pos_tag(toks)
        toks = nltk.word_tokenize(self.sentence)
        postoks = nltk.tag.pos_tag(toks)
        tree = chunker.parse(postoks)
        return tree

    def leaves(tree):
        """Finds NP (nounphrase) leaf nodes of a chunk tree."""
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            yield subtree.leaves()

    def normalise(word):
        """Normalises words to lowercase and stems and lemmatizes it."""
        word = word.lower()
        word = stemmer.stem_word(word)
        word = lemmatizer.lemmatize(word)
        return word

    def acceptable_word(word):
        """Checks conditions for acceptable word: length, stopword."""
        accepted = bool(2 <= len(word) <= 40
                        and word.lower() not in stopwords)
        return accepted

    def get_terms(self, tree):
        for leaf in self.leaves(tree):
            term = [self.normalise(w) for w, t in leaf if self.acceptable_word(w)]
            yield term

    def extract(self):
        terms = self.get_terms(self.execute())
        matches = []
        for term in terms:
            for word in term:
                matches.append(word)
        return matches
Answer
You need to either:

- decorate each of normalize, acceptable_word, and leaves with @staticmethod, or
- add a self parameter as the first parameter of these methods.
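The error can be reproduced without NLTK at all; this minimal sketch (the class and method names here are illustrative, not from the original code) shows the same mismatch:

```python
class Demo:
    def leaves(tree):        # note: no `self` parameter
        return tree

    def run(self):
        # `self.leaves(...)` passes `self` implicitly, so `leaves`
        # receives two arguments while accepting only one.
        return self.leaves([1, 2])

try:
    Demo().run()
except TypeError as e:
    print(e)  # a "takes 1 positional argument but 2 were given" style message
```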
You're calling self.leaves, which passes self as an implicit first argument to the leaves method (but your method takes only a single parameter). Making these methods static, or adding a self parameter, will fix the issue.

(Your later calls to self.acceptable_word and self.normalize will have the same problem.)
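To illustrate, here is the @staticmethod variant applied to a trimmed-down version of the class (only the affected methods are shown; the grammar and chunking logic stay as in the question, and the body of acceptable_word here is simplified for the sake of a self-contained example):

```python
class NounPhraseExtractor(object):
    def __init__(self, sentence):
        self.sentence = sentence

    @staticmethod
    def acceptable_word(word):
        # A staticmethod receives no implicit `self`, so calling it as
        # self.acceptable_word(word) passes exactly one argument.
        return 2 <= len(word) <= 40

    def get_terms(self, words):
        # This call now resolves correctly instead of raising TypeError.
        return [w for w in words if self.acceptable_word(w)]

extractor = NounPhraseExtractor("example sentence")
print(extractor.get_terms(["a", "product", "review"]))  # ['product', 'review']
```

The alternative is equivalent: keep the methods non-static and write def leaves(self, tree), def normalise(self, word), and def acceptable_word(self, word).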
You could read about Python's static methods in their docs, or possibly from an external site that may be easier to digest.