How to speed up NE recognition with Stanford NER with python nltk

Problem description

First I tokenize the file content into sentences and then call Stanford NER on each of the sentences. But this process is really slow. I know it would be faster if I called it on the whole file content, but I'm calling it on each sentence because I want to index each sentence before and after NE recognition.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent) # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent) # tokenize each sentence into words
        ne_tags = st.tag(words) # get tagged NEs from Stanford NER (one call per sentence)

This is probably due to calling st.tag() for each sentence, but is there any way to make it run faster?

EDIT

The reason I want to tag each sentence separately is that I want to write the sentences to a file (like a sentence index), so that given the NE-tagged sentence at a later stage, I can get back the unprocessed sentence (I'm also doing lemmatizing here).

File format:

(sent_number, orig_sentence, NE_and_lemmatized_sentence)
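The lemmatizing step mentioned above isn't shown in the question. A minimal sketch of building the NE_and_lemmatized_sentence field, assuming NLTK's WordNetLemmatizer (the lemmatizer actually used is not specified, so this choice is an assumption):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def ne_and_lemmatize(tagged_tokens):
    # tagged_tokens is the Stanford NER output for one sentence:
    # a list of (word, tag) pairs such as [('Rami', 'PERSON'), ('ran', 'O')]
    # Lemmatize each word while keeping its NE label.
    return [(lemmatizer.lemmatize(word.lower()), tag) for word, tag in tagged_tokens]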

Solution

StanfordNERTagger has a tag_sents() function that tags a batch of tokenized sentences in one call; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

>>> st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)] # one token list per sentence
>>> st.tag_sents(tokenized_sents) # tags every sentence in a single call
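tag_sents() helps because NLTK's Stanford tagger wrappers run the NER jar as an external Java process, and each tag() call launches a fresh JVM; tagging sentence by sentence therefore pays that startup cost for every sentence, while tag_sents() handles all the sentences in one invocation. A per-file sketch that keeps the sentence indexing described in the edit above (the file reading and record writing are assumptions, not part of the original answer):

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')

for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)
    tokenized_sents = [word_tokenize(sent) for sent in sentences]
    tagged_sents = st.tag_sents(tokenized_sents) # one Java call for the whole file
    for j, (orig, tagged) in enumerate(zip(sentences, tagged_sents)):
        # write out (sent_number, orig_sentence, NE_and_lemmatized_sentence);
        # the lemmatizing step would plug in here
        print((j, orig, tagged))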
