nltk StanfordNERTagger : How to get proper nouns without capitalization


Problem Description


I am trying to use the StanfordNERTagger and nltk to extract keywords from a piece of text.

import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]

str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') 
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') 
stanfordPosTagList=[word for word,pos in stp.tag(str.split()) if pos == 'NNP']

print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged

This gives me:

John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]

So clearly, things like Short and Term were tagged as NNP. The data that I have contains many such instances where non-NNP words are capitalized. This might be due to typos, or maybe they are headers. I don't have much control over that.

How can I parse or clean up the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be categorized as NNP.

Also, I am not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNPs in my data? Could that be having an effect on how the StanfordNERTagger treats everything else?

Update: one possible solution

Here is what I plan to do:

  1. Take each word and convert to lower case
  2. Tag the lowercase word
  3. If the tag is NNP then we know that the original word must also be an NNP
  4. If not, then the original word was mis-capitalized

Here is what I tried to do:

str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') 
for word in str.split():
    wl = word.lower()
    print wl
    w,pos = stp.tag(wl)
    print pos
    if pos=="NNP":
        print "Got NNP"
        print w

but this gives me an error:

John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
  File "X:crp.py", line 37, in <module>
    w,pos = stp.tag(wl)
ValueError: too many values to unpack

I have tried multiple approaches but some error always shows up. How can I tag a single word?

I don't want to convert the whole string to lower case and then tag it. If I do that, the StanfordPOSTagger returns an empty string.
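
For reference, here is a minimal sketch of the per-word check described in the update above, assuming the same NLTK setup as in the question. Note that StanfordPOSTagger.tag expects a list of tokens, so passing a bare string makes it tag each character, which is what causes the "too many values to unpack" error. Also, as the answer below shows, a lower-cased word tagged in isolation will rarely come back as NNP, so this mostly illustrates the API usage rather than a working fix.

stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in str.split():
    wl = word.lower()
    # tag() takes a list of tokens and returns a list of (word, pos) tuples
    tagged = stp.tag([wl])          # e.g. [('john', 'NN')]
    if tagged:
        w, pos = tagged[0]
        if pos == 'NNP':
            print "Got NNP"
            print word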

Solution

Firstly, see your other question on setting up Stanford CoreNLP so that it can be called from the command line or Python: nltk : How to prevent stemming of proper nouns.
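
(If a CoreNLP server from that setup is not already running, it can typically be started from the CoreNLP distribution directory with something along the lines of java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000, so that the snippets below can reach http://localhost:9000.)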

For the properly cased sentence, we see that the NER works properly:

>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner',  'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O

And for the lower-cased sentence, you will not get NNP POS tags nor any NER tags:

>>> for token in annotated_sent1['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O

So the questions you should ask yourself are:

  • What is the ultimate aim of your NLP application?
  • Why is your input lower-cased? Was it your doing, or is that how the data was provided?

And after answering those questions, you can move on to decide what you really want to do with the NER tags, i.e.

  • If the input is lower-cased and it's because of how you structured your NLP tool chain, then

    • DO NOT do that!!! Perform the NER on the normal text, without the distortions you've created. The NER model was trained on normal text, so it won't really work outside the context of normal text.
    • Also, try not to mix NLP tools from different suites; they usually will not play nicely together, especially at the end of your NLP tool chain.
  • If the input is lower-cased because that's how the original data was, then:

  • If the input has erroneous casing, e.g. some words are capitalized and some are not, but not all of the capitalized ones are proper nouns, then

    • Try the truecasing solution too; see the sketch after this list.
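
As a rough illustration of the truecasing suggestion, here is a sketch that reuses the same CoreNLP server as above, adding the truecase annotator (which requires pos and lemma); the truecaseText field name is an assumption and may differ across CoreNLP versions:

>>> lower_text = 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics'
>>> output = nlp.annotate(lower_text, properties={'annotators': 'tokenize,ssplit,pos,lemma,truecase', 'outputFormat': 'json'})
>>> for token in output['sentences'][0]['tokens']:
...     # fall back to the original word if this CoreNLP version does not emit truecaseText
...     print token['word'], token.get('truecaseText', token['word'])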
