nltk StanfordNERTagger: How to get proper nouns without capitalization
Problem description
I am trying to use the StanfordNERTagger and nltk to extract keywords from a piece of text.
import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."
words = re.split(r"\W+", docText)
stops = set(stopwords.words("english"))
# remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]
str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
stanfordPosTagList = [word for word, pos in stp.tag(str.split()) if pos == 'NNP']
print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged
This gives me:
John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]
So clearly, things like Short and Term were tagged as NNP. The data that I have contains many such instances where non-NNP words are capitalized. This might be due to typos, or maybe they are headers; I don't have much control over that.

How can I parse or clean up the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be categorized as NNP.

Also, I am not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNPs in my data? Could that be having an effect on how the StanfordNERTagger treats everything else?
Update, one possible solution
Here is what I plan to do:
- Take each word and convert it to lower case
- Tag the lower-cased word
- If the tag is NNP, then we know the original word must also be an NNP
- If not, then the original word was mis-capitalized
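The plan above can be sketched as follows. Note that `pos_of` here is only a stand-in lookup for illustration (a configured `StanfordPOSTagger` would replace it in real use), and its lexicon assumes the tagger can tag lower-cased words correctly, which, as the answer below demonstrates, a real tagger trained on normally cased text often cannot:

```python
# Sketch of the lowercase-then-tag plan. pos_of is a hypothetical
# stand-in tagger; swap in StanfordPOSTagger.tag([word])[0][1] in real use.
def pos_of(word):
    # illustrative lexicon standing in for a trained tagger
    lexicon = {"john": "NNP", "donk": "NNP", "short": "JJ", "term": "NN"}
    return lexicon.get(word, "NN")

def true_proper_nouns(words):
    keep = []
    for word in words:
        # tag the lower-cased form; if it is still NNP,
        # the original capitalization was genuine
        if pos_of(word.lower()) == "NNP":
            keep.append(word)
    return keep

print(true_proper_nouns(["John", "Donk", "Short", "Term"]))
# ['John', 'Donk']
```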
Here is what I tried:
str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in str.split():
    wl = word.lower()
    print wl
    w, pos = stp.tag(wl)
    print pos
    if pos == "NNP":
        print "Got NNP"
        print w
but this gives me an error:
John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
File "X:crp.py", line 37, in <module>
w,pos = stp.tag(wl)
ValueError: too many values to unpack
I have tried multiple approaches, but some error always shows up. How can I tag a single word?

I don't want to convert the whole string to lower case and then tag it. If I do that, the StanfordPOSTagger returns an empty string.
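For what it's worth, the `ValueError` above happens because `StanfordPOSTagger.tag` expects a list of tokens and returns a list of `(word, tag)` pairs; passing a bare string makes it iterate over characters, so there are too many pairs to unpack. A minimal sketch of the correct call shape, with a hypothetical stub in place of the real tagger:

```python
# Stub with the same shape as StanfordPOSTagger.tag:
# takes a list of tokens, returns a list of (word, tag) pairs.
# The lexicon and tags are illustrative assumptions only.
def tag(tokens):
    lexicon = {"john": "NN", "metrics": "NNS"}
    return [(t, lexicon.get(t, "NN")) for t in tokens]

# Wrong: tag("john") would iterate over the four characters of the string.
# Right: wrap the single word in a list and unpack the first pair.
w, pos = tag(["john"])[0]
print(w, pos)  # john NN
```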
Answer

Firstly, see your other question on setting up Stanford CoreNLP to be called from the command line or Python: nltk: How to prevent stemming of proper nouns.
For the properly cased sentence, we see that the NER works properly:
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O
And for the lower-cased sentence, you will not get the NNP POS tag nor any NER tags:
>>> for token in annotated_sent1['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
So the questions to ask about your problem should be:
- What is the ultimate aim of your NLP application?
- Why is your input lower-cased? Was it your doing or how the data was provided?
And after answering those questions, you can move on to decide what you really want to do with the NER tags, i.e.
If the input is lower-cased and it's because of how you structured your NLP tool chain, then
- DO NOT do that!!! Perform the NER on the normal text without the distortions you've created. The NER was trained on normal text, so it won't really work outside the context of normal text.
- Also, try not to mix NLP tools from different suites; they will usually not play nicely together, especially at the end of your NLP tool chain.
If the input is lower-cased because that's how the original data was, then:
- Annotate a small portion of the data, or find annotated data that was lowercased and then retrain a model.
- Work around it and train a truecaser with normal text then apply the truecasing model to the lower-cased text. See https://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
If the input has erroneous casing, e.g. `Some big Some Small` where not all capitalized words are proper nouns, then:
- Try the truecasing solution too.
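A frequency-based truecaser along the lines of Lita et al. can be sketched in a few lines: learn the most common surface form of each token from well-cased text, then map lower-cased tokens back to that form. The training sentences here are a toy assumption; a real model would be trained on a large, normally cased corpus:

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    # Count the surface forms of each lower-cased token in well-cased text.
    forms = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split():
            forms[tok.lower()][tok] += 1
    # Keep the most frequent casing per token.
    return {low: counts.most_common(1)[0][0] for low, counts in forms.items()}

def truecase(sentence, model):
    # Fall back to the input form for unseen tokens.
    return " ".join(model.get(tok, tok) for tok in sentence.split())

model = train_truecaser(["John Donk works for POI .",
                         "Brian Jones wants to meet Xyz Corp ."])
print(truecase("john donk wants to meet brian jones", model))
# John Donk wants to meet Brian Jones
```

The truecased output can then be fed to the POS tagger and NER as normal text.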