Stanford NER with python NLTK fails with strings containing multiple "!!"s?


Question

Suppose this is my filecontent:

When they are over 45 years old!! It would definitely help Michael Jordan.

Below is my code for tagging sentences.

from nltk.tag.stanford import NERTagger
from nltk.tokenize import sent_tokenize, word_tokenize

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
taggedsents = st.tag_sents(tokenized_sents)

I would expect both tokenized_sents and taggedsents to contain the same number of sentences.

But here is what they contain:

for ts in tokenized_sents:
    print "tok   ", ts

for ts in taggedsents:
    print "tagged    ",ts

>> tok    ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!']
>> tok    ['It', 'would', 'definitely', 'help', 'Michael', 'Jordan', '.']
>> tagged     [(u'When', u'O'), (u'they', u'O'), (u'are', u'O'), (u'over', u'O'), (u'45', u'O'), (u'years', u'O'), (u'old', u'O'), (u'!', u'O')]
>> tagged     [(u'!', u'O')]
>> tagged     [(u'It', u'O'), (u'would', u'O'), (u'definitely', u'O'), (u'help', u'O'), (u'Michael', u'PERSON'), (u'Jordan', u'PERSON'), (u'.', u'O')]

This is due to having a double "!" at the end of the supposed first sentence. Do I have to remove double "!"s before using st.tag_sents()?
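As a workaround (a sketch, not the accepted answer's approach): since the tagger only re-splits sentences but keeps every token, you can flatten its output and re-chunk it using the token counts of the original tokenized sentences. The `realign` helper below is hypothetical; the sample data mirrors the output shown above.

```python
def realign(tokenized_sents, taggedsents):
    """Re-chunk tagger output so it matches the original sentence split.

    Flattens all (token, tag) pairs, then slices them back into
    sentences using the length of each originally tokenized sentence.
    Assumes the tagger preserved every token, only the splits differ.
    """
    flat = [pair for sent in taggedsents for pair in sent]
    realigned = []
    i = 0
    for sent in tokenized_sents:
        realigned.append(flat[i:i + len(sent)])
        i += len(sent)
    return realigned

# Sample data with the token/tag counts from the question:
tokenized = [
    ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!'],
    ['It', 'would', 'definitely', 'help', 'Michael', 'Jordan', '.'],
]
tagged = [
    [('When', 'O'), ('they', 'O'), ('are', 'O'), ('over', 'O'),
     ('45', 'O'), ('years', 'O'), ('old', 'O'), ('!', 'O')],
    [('!', 'O')],
    [('It', 'O'), ('would', 'O'), ('definitely', 'O'), ('help', 'O'),
     ('Michael', 'PERSON'), ('Jordan', 'PERSON'), ('.', 'O')],
]
fixed = realign(tokenized, tagged)
```

After realigning, `fixed` has two sentences again, matching `tokenized` one-to-one.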

How should I resolve this?

Answer

If you follow my solution from the other question instead of using nltk, you will get JSON that properly splits this text into two sentences.

Link to the previous question: How to speed up NE recognition with stanford NER with python nltk
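For illustration, the JSON returned by a Stanford CoreNLP server (annotators `tokenize,ssplit,ner`, `outputFormat=json`) groups tokens under a `sentences` array, so the sentence split comes from the server rather than NLTK. The field names below follow CoreNLP's documented output format, but the response itself is a hand-written sample, not real server output:

```python
import json

# Minimal sample in the shape of CoreNLP's JSON output; illustrative only.
response = json.loads("""
{"sentences": [
  {"index": 0, "tokens": [
     {"word": "When", "ner": "O"}, {"word": "they", "ner": "O"},
     {"word": "are", "ner": "O"}, {"word": "over", "ner": "O"},
     {"word": "45", "ner": "O"}, {"word": "years", "ner": "O"},
     {"word": "old", "ner": "O"}, {"word": "!", "ner": "O"},
     {"word": "!", "ner": "O"}]},
  {"index": 1, "tokens": [
     {"word": "It", "ner": "O"}, {"word": "would", "ner": "O"},
     {"word": "definitely", "ner": "O"}, {"word": "help", "ner": "O"},
     {"word": "Michael", "ner": "PERSON"}, {"word": "Jordan", "ner": "PERSON"},
     {"word": ".", "ner": "O"}]}
]}
""")

# Rebuild the same (token, tag) structure the NLTK code produced,
# but with the server's (correct) two-sentence split.
tagged_sents = [[(t["word"], t["ner"]) for t in s["tokens"]]
                for s in response["sentences"]]
```

Here the "!!" stays attached to the first sentence, so no realignment is needed.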
