Stanford NER with python NLTK fails with strings containing multiple "!!"s?


Question


Suppose this is my filecontent:

When they are over 45 years old!! It would definitely help Michael Jordan.

Below is my code for tagging sentences.

from nltk.tag.stanford import NERTagger   # Stanford NER wrapper in older NLTK (pre-3.1)
from nltk.tokenize import sent_tokenize, word_tokenize

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]  # sentences, then tokens
taggedsents = st.tag_sents(tokenized_sents)

I would expect both tokenized_sents and taggedsents to contain the same number of sentences.

But here is what they contain:

for ts in tokenized_sents:
    print "tok   ", ts

for ts in taggedsents:
    print "tagged    ",ts

>> tok    ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!']
>> tok    ['It', 'would', 'definitely', 'help', 'Michael', 'Jordan', '.']
>> tagged     [(u'When', u'O'), (u'they', u'O'), (u'are', u'O'), (u'over', u'O'), (u'45', u'O'), (u'years', u'O'), (u'old', u'O'), (u'!', u'O')]
>> tagged     [(u'!', u'O')]
>> tagged     [(u'It', u'O'), (u'would', u'O'), (u'definitely', u'O'), (u'help', u'O'), (u'Michael', u'PERSON'), (u'Jordan', u'PERSON'), (u'.', u'O')]
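
Comparing the lengths directly makes the mismatch concrete (a trivial check added here for illustration):

# Two tokenized sentences go in, but three tagged "sentences" come back,
# so the two lists no longer line up one-to-one.
print len(tokenized_sents), len(taggedsents)   # prints: 2 3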

This is due to the double "!" at the end of the supposed first sentence: judging by the output, Stanford NER re-splits the text and treats the second "!" as a sentence of its own, so three tagged sentences come back for two tokenized ones. Do I have to remove the double "!"s before using st.tag_sents()?

How should I resolve this?

Solution

If you follow my solution from the other question instead of using nltk, you will get JSON that properly splits this text into two sentences.
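
For reference, the approach in the linked answer runs Stanford CoreNLP as a server. Below is a minimal sketch of querying such a server over HTTP; the host, port, and annotator list are assumptions for illustration, not details taken from the original answer:

import json
import requests

filecontent = "When they are over 45 years old!! It would definitely help Michael Jordan."

# Assumes a CoreNLP server was started separately, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data=filecontent.encode("utf-8"))
doc = resp.json()

# The server's JSON keeps its own sentence boundaries, so the "!!"
# stays inside the first sentence instead of being split off.
for sentence in doc["sentences"]:
    print [(token["word"], token["ner"]) for token in sentence["tokens"]]

Because tokenization, sentence splitting, and NER all happen in one pass on the server side, there is no separate tokenized list to fall out of alignment with the tagged output.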

Link to previous question: how to speed up NE recognition with stanford NER with python nltk
