Quick NLTK parse into syntax tree


Problem description

I am trying to parse several hundred sentences into their syntax trees, and I need to do that fast. The problem is that if I use NLTK then I need to define a grammar, and I can't know that; I only know it's going to be English. I tried using this statistical parser, and it works great for my purposes, but the speed could be a lot better. Is there a way to use NLTK parsing without a grammar? In this snippet I am using a processing pool to do the processing in "parallel", but the speed leaves a lot to be desired.

import pickle
import re
from stat_parser.parser import Parser
from multiprocessing import Pool
import HTMLParser

def multy(a):
    # a is an [index, text] pair; split the text into sentences with a
    # regex and parse only the first one.
    global parser
    lst = re.findall(r'(\S.+?[.!?])(?=\s+|$)', a[1])
    if len(lst) == 0:
        lst.append(a[1])
    try:
        ssd = parser.norm_parse(lst[0])
    except Exception:
        ssd = ['NNP', 'nothing']
    # Append the pickled [index, parse] pair, framed by markers, to the
    # shared output file.
    with open('/var/www/html/internal', 'a') as f:
        f.write("[[ss")
        pickle.dump([a[0], ssd], f)
        f.write("ss]]")

if __name__ == '__main__':
    parser = Parser()
    with open('/var/www/html/interface') as f:
        data = f.read()
    data = data.split("\n")
    p = Pool(len(data))  # one worker per input line
    listed = list()
    h = HTMLParser.HTMLParser()
    # Truncate the output file before the workers append to it.
    with open('/var/www/html/internal', 'w') as f:
        f.write("")
    for ind, each in enumerate(data):
        # Strip non-ASCII characters and unescape HTML entities.
        listed.append([str(ind), h.unescape(re.sub(r'[^\x00-\x7F]+', '', each))])
    p.map(multy, listed)
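
As an aside on the snippet above: the hand-rolled regex will split on abbreviations such as "Dr.". NLTK ships a trained sentence tokenizer that handles those cases. A minimal sketch, assuming the punkt model has been downloaded once via nltk.download('punkt'):

# Sentence splitting with NLTK's trained tokenizer instead of the
# regex above; requires the punkt model: nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Dr. Smith went home. He was tired!"
for sent in sent_tokenize(text):
    print sent  # the regex above would split incorrectly after "Dr."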

Recommended answer

Parsing is a fairly computationally intensive operation. You can probably get much better performance out of a more polished parser, such as bllip. It is written in C++ and benefits from a team having worked on it over a prolonged period. There is a Python module that interacts with it.

Here's an example comparing bllip and the parser you are using:

from timeit import Timer

# setup stat_parser
from stat_parser import Parser
parser = Parser()

# setup bllip
from bllipparser import RerankingParser
from bllipparser.ModelFetcher import download_and_install_model
# download model (only needs to be done once)
model_dir = download_and_install_model('WSJ', '/tmp/models')
# Loading the model is slow, but only needs to be done once
rrp = RerankingParser.from_unified_model_dir(model_dir)

sentence = "In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language."

if __name__ == '__main__':
    # Time five parses of the same sentence with each parser.
    t_bllip = Timer(lambda: rrp.parse(sentence))
    t_stat = Timer(lambda: parser.parse(sentence))
    print "bllip", t_bllip.timeit(number=5)
    print "stat", t_stat.timeit(number=5)

And it runs about 10 times faster on my computer:

(vs)[jonathan@ ~]$ python /tmp/test.py 
bllip 2.57274985313
stat 22.748554945
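
To get an actual tree out of bllip rather than just timing it, the bllipparser module has a convenience method that returns the single best parse (a sketch; parse() gives the full n-best list with scores if you need alternatives):

# Best parse as a Penn Treebank-style bracketed string, something
# like "(S1 (S (NP ...) (VP ...) (. .)))"; reuses rrp loaded above.
print rrp.simple_parse("Parse this sentence.")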

Also, there's a pull request pending to integrate the bllip parser into NLTK: https://github.com/nltk/nltk/pull/605
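
For reference, the wrapper proposed there exposes bllip through NLTK's own parser interface. A sketch reusing model_dir and sentence from the timing example, assuming the pull request lands with the names it proposes (they may differ in a released NLTK):

# Sketch of the NLTK wrapper from the pull request above; class and
# module names are as proposed there and may change before release.
from nltk.parse.bllip import BllipParser

nltk_rrp = BllipParser.from_unified_model_dir(model_dir)
# parse() takes a tokenized sentence and yields nltk.Tree objects,
# best parse first.
for tree in nltk_rrp.parse(sentence.split()):
    print tree
    break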

Also, you state in your question: "I can't know that; I only know it's going to be English." If by this you mean it needs to parse other languages as well, it will be much more complicated. These statistical parsers are trained on some input, often parsed content from the WSJ in the Penn Treebank. Some parsers also provide trained models for other languages, but you'll need to identify the language first and load the appropriate model into the parser.
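
If you do need to handle multiple languages, a small language-identification step in front of the parser is enough to route each sentence to the right model. A minimal sketch, assuming the third-party langdetect package (not part of NLTK; the per-language parser table is a hypothetical placeholder):

# Route each sentence to a parser trained for its language; assumes
# the third-party langdetect package (pip install langdetect).
from langdetect import detect

def parse_by_language(sentence, parsers):
    # parsers maps a language code such as 'en' to a loaded parser;
    # sentences in languages without a model are skipped here.
    lang = detect(sentence)
    if lang in parsers:
        return parsers[lang].simple_parse(sentence)
    return None

# Usage with the English bllip model loaded above:
# parse_by_language("The cat sat on the mat.", {'en': rrp})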
