Extracting noun phrases from NLTK using Python


Question

I am new to both Python and NLTK. I adapted the code from https://gist.github.com/alexbowe/879414 into the code given below so that it can run over many documents/text chunks, but I get the following error:

Traceback (most recent call last):
  File "E:/NLP/PythonProgrames/NPExtractor/AdvanceMain.py", line 16, in <module>
    result = np_extractor.extract()
  File "E:\NLP\PythonProgrames\NPExtractor\NPExtractorAdvanced.py", line 67, in extract
    for term in terms:
  File "E:\NLP\PythonProgrames\NPExtractor\NPExtractorAdvanced.py", line 60, in get_terms
    for leaf in self.leaves(tree):
TypeError: leaves() takes 1 positional argument but 2 were given

Can anyone help me fix this problem? I have to extract noun phrases from millions of product reviews. I used the Stanford NLP toolkit in Java, but it was extremely slow, so I thought using NLTK in Python would be better. Please also recommend a better solution if there is one.

import nltk
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
grammar = r"""
 NBAR:
    {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
 NP:
    {<NBAR>}
    {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

class NounPhraseExtractor(object):

    def __init__(self, sentence):
        self.sentence = sentence

    def execute(self):
        # Taken from Su Nam Kim Paper...
        chunker = nltk.RegexpParser(grammar)
        #toks = nltk.regexp_tokenize(text, sentence_re)
        # #postoks = nltk.tag.pos_tag(toks)
        toks = nltk.word_tokenize(self.sentence)
        postoks = nltk.tag.pos_tag(toks)
        tree = chunker.parse(postoks)
        return tree

    def leaves(tree):
        """Finds NP (nounphrase) leaf nodes of a chunk tree."""
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            yield subtree.leaves()

    def normalise(word):
        """Normalises words to lowercase and stems and lemmatizes it."""
        word = word.lower()
        word = stemmer.stem_word(word)
        word = lemmatizer.lemmatize(word)
        return word

    def acceptable_word(word):
        """Checks conditions for acceptable word: length, stopword."""
        accepted = bool(2 <= len(word) <= 40
                    and word.lower() not in stopwords)
        return accepted

    def get_terms(self,tree):
        for leaf in self.leaves(tree):
            term = [self.normalise(w) for w, t in leaf if self.acceptable_word(w)]
        yield term

    def extract(self):
        terms = self.get_terms(self.execute())
        matches = []
        for term in terms:
            for word in term:
                matches.append(word)
        return matches

Recommended answer

You need to either:

  • decorate each of normalise, acceptable_word, and leaves with @staticmethod, or
  • add a self parameter as the first parameter of these methods.

You're calling self.leaves, which passes self as an implicit first argument to the leaves method, but your method is declared with only a single parameter. Making these static methods, or adding a self parameter, will fix the issue.
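To make the implicit argument concrete, here is a minimal sketch that reproduces the error without NLTK and shows both fixes side by side. The class names `Broken`, `WithStatic`, and `WithSelf` are hypothetical and exist only for illustration; `leaves` here just copies its input into a list rather than walking a chunk tree.

```python
class Broken:
    def leaves(tree):              # no self: `tree` would receive the instance
        return list(tree)

    def run(self, tree):
        return self.leaves(tree)   # Python passes (self, tree): two arguments


class WithStatic:
    @staticmethod
    def leaves(tree):              # static method: no instance is passed in
        return list(tree)

    def run(self, tree):
        return self.leaves(tree)


class WithSelf:
    def leaves(self, tree):        # instance is received explicitly
        return list(tree)

    def run(self, tree):
        return self.leaves(tree)


# The broken version raises the same kind of TypeError as in the question.
try:
    Broken().run(["a", "b"])
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

static_result = WithStatic().run(["a", "b"])
self_result = WithSelf().run(["a", "b"])
```

Either fix works; @staticmethod is a reasonable choice when the method never touches instance state, which appears to be the case for the three helpers in the question.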

(Your later calls to self.acceptable_word and self.normalise will have the same issue.)
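Applied to the helpers from the question, the @staticmethod fix might look like the sketch below. It is trimmed down so that it runs without NLTK, and it rests on several assumptions: `STOPWORDS` is a tiny stand-in for the NLTK stopword list, `normalise` only lowercases (the original also stems and lemmatizes), the class name `NounPhraseTerms` is hypothetical, and `get_terms` takes the NP leaves directly instead of a parsed tree. The sketch also places `yield` inside the loop. In the question it sits after the loop, which would produce only the last phrase, so this is a guess at the intended behaviour rather than part of the answer.

```python
# Stand-in for nltk.corpus.stopwords.words('english'); an assumption so the
# sketch runs without downloading NLTK data.
STOPWORDS = {"the", "of", "a", "an", "in"}


class NounPhraseTerms:
    """Hypothetical, trimmed-down version of the question's class."""

    @staticmethod
    def normalise(word):
        # The original also stems and lemmatizes; only lowercasing is kept.
        return word.lower()

    @staticmethod
    def acceptable_word(word):
        # Same length and stopword checks as in the question.
        return 2 <= len(word) <= 40 and word.lower() not in STOPWORDS

    def get_terms(self, noun_phrases):
        # `noun_phrases` plays the role of self.leaves(tree): an iterable of
        # [(word, tag), ...] lists, one list per NP chunk.
        for leaf in noun_phrases:
            # Yield once per noun phrase (inside the loop, unlike the question).
            yield [self.normalise(w) for w, t in leaf if self.acceptable_word(w)]


# Hand-written (word, tag) chunks standing in for the chunker's NP subtrees.
chunks = [
    [("good", "JJ"), ("Battery", "NN"), ("life", "NN")],
    [("screen", "NN"), ("of", "IN"), ("phone", "NN")],
]
terms = list(NounPhraseTerms().get_terms(chunks))
```

Calling the static helpers through `self` still works, so the call sites in `get_terms` do not need to change.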

You can read about Python's static methods in the official docs, or on an external site that may be easier to digest.
