Why does the CoreNLP NER tagger join separated numbers together?
Question
Here is the code snippet:
In [390]: t
Out[390]: ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
In [391]: ner_tagger.tag(t)
Out[391]:
[('my', 'O'),
('phone', 'O'),
('number', 'O'),
('is', 'O'),
('1111\xa01111\xa01111', 'NUMBER')]
What I expected is:
Out[391]:
[('my', 'O'),
('phone', 'O'),
('number', 'O'),
('is', 'O'),
('1111', 'NUMBER'),
('1111', 'NUMBER'),
('1111', 'NUMBER')]
As you can see, the artificial phone number is joined by \xa0, which is a non-breaking space. Can I keep the numbers separate by configuring CoreNLP, without changing its other default rules?
The ner_tagger is defined as:
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
Recommended answer
TL;DR
NLTK joins the list of tokens into a single string before passing it to the CoreNLP server. CoreNLP then retokenizes that string and concatenates the number-like tokens with \xa0 (non-breaking space).
Let's walk through the code. If we look at the tag() function of CoreNLPParser, we see that it calls tag_sents(), which converts each input list of strings into a single string before calling raw_tag_sents(). That is what allows CoreNLP to retokenize the input, see https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348:
def tag_sents(self, sentences):
    """
    Tag multiple sentences.
    Takes multiple sentences as a list where each sentence is a list of
    tokens.
    :param sentences: Input sentences to tag
    :type sentences: list(list(str))
    :rtype: list(list(tuple(str, str))
    """
    # Converting list(list(str)) -> list(str)
    sentences = (' '.join(words) for words in sentences)
    return [sentences[0] for sentences in self.raw_tag_sents(sentences)]

def tag(self, sentence):
    """
    Tag a list of tokens.
    :rtype: list(tuple(str, str))
    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
    >>> parser.tag(tokens)
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> tokens = "What is the airspeed of an unladen swallow ?".split()
    >>> parser.tag(tokens)
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
    ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
    ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    """
    return self.tag_sents([sentence])[0]
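To make the effect of the join step concrete, here is a minimal sketch that mirrors only the `' '.join(words)` conversion from tag_sents(). The helper name join_tokens and the second token list are hypothetical, introduced for illustration; nothing here contacts a CoreNLP server.

```python
def join_tokens(sentences):
    """Mirror the conversion in tag_sents(): list(list(str)) -> list(str)."""
    return [' '.join(words) for words in sentences]


# The token list from the question.
question_tokens = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
# A hypothetical, differently tokenized version of the same text.
merged_tokens = ['my', 'phone', 'number', 'is', '1111 1111 1111']

joined = join_tokens([question_tokens, merged_tokens])
print(joined[0])               # my phone number is 1111 1111 1111
print(joined[0] == joined[1])  # True
```

Both token lists collapse to the same string, so the original token boundaries are not recoverable from what is sent. The server has to tokenize again, and its tokenizer makes its own decisions about number-like sequences.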
raw_tag_sents() then passes the input to the server using api_call():
def raw_tag_sents(self, sentences):
    """
    Tag multiple sentences.
    Takes multiple sentences as a list where each sentence is a string.
    :param sentences: Input sentences to tag
    :type sentences: list(str)
    :rtype: list(list(list(tuple(str, str)))
    """
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,' }
    # Supports only 'pos' or 'ner' tags.
    assert self.tagtype in ['pos', 'ner']
    default_properties['annotators'] += self.tagtype
    for sentence in sentences:
        tagged_data = self.api_call(sentence, properties=default_properties)
        yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
               for tagged_sentence in tagged_data['sentences']]
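The properties that raw_tag_sents() sends can be reproduced in isolation. The sketch below copies only the dictionary-building lines from the method; the function name build_default_properties is hypothetical, and a ValueError stands in for the original assert.

```python
def build_default_properties(tagtype):
    """Rebuild the properties dict that raw_tag_sents() sends to the server."""
    # Supports only 'pos' or 'ner' tags, as in the original method.
    if tagtype not in ('pos', 'ner'):
        raise ValueError("tagtype must be 'pos' or 'ner'")
    properties = {'ssplit.isOneSentence': 'true',
                  'annotators': 'tokenize,ssplit,'}
    properties['annotators'] += tagtype
    return properties


props = build_default_properties('ner')
print(props)
# {'ssplit.isOneSentence': 'true', 'annotators': 'tokenize,ssplit,ner'}

# No tokenizer option is set, so the server tokenizes with its defaults.
print(any(key.startswith('tokenize.') for key in props))  # False
```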
So the question is how to resolve the issue and get the tokens back exactly as they were passed in.
If we look at the options for the tokenizer in CoreNLP, we see the tokenize.whitespace option:
- https://stanfordnlp.github.io/CoreNLP/tokenize.html#options
- Preventing tokens from containing a space in Stanford CoreNLP
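One way to try the tokenize.whitespace option without editing NLTK is to call api_call() directly, since the quoted source shows it accepting a properties argument. The following is a sketch under that assumption, not NLTK's own API: the function tag_pretokenized is hypothetical, and `parser` is any object exposing api_call(data, properties=...) the way CoreNLPParser does in the code above.

```python
def tag_pretokenized(parser, tokens, tagtype='ner'):
    """Tag tokens while asking CoreNLP to split on whitespace only.

    `parser` is assumed to expose api_call(data, properties=...), as the
    CoreNLPParser code quoted above does. This helper is illustrative.
    """
    properties = {
        'annotators': 'tokenize,ssplit,' + tagtype,
        'ssplit.isOneSentence': 'true',
        # Tokenize on whitespace only, so the joined string splits back
        # into the tokens that were passed in.
        'tokenize.whitespace': 'true',
    }
    response = parser.api_call(' '.join(tokens), properties=properties)
    return [
        (token['word'], token[tagtype])
        for sentence in response['sentences']
        for token in sentence['tokens']
    ]
```

With a server running, this would be called as tag_pretokenized(ner_tagger, t). Tokens that themselves contain spaces would still be split, because the server only ever sees the joined string.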
If we change the code so that additional properties can be passed through to api_call(), we can force CoreNLP to split the space-joined string back into exactly the tokens that were passed in. For example, with these changes to the code:
def tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.
    Takes multiple sentences as a list where each sentence is a list of
    tokens.
    :param sentences: Input sentences to tag
    :type sentences: list(list(str))
    :rtype: list(list(tuple(str, str))
    """
    # Converting list(list(str)) -> list(str)
    sentences = (' '.join(words) for words in sentences)
    # Default to whitespace tokenization so the input tokens are preserved.
    if properties is None:
        properties = {'tokenize.whitespace': 'true'}
    return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

def tag(self, sentence, properties=None):
    """
    Tag a list of tokens.
    :rtype: list(tuple(str, str))
    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
    >>> parser.tag(tokens)
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> tokens = "What is the airspeed of an unladen swallow ?".split()
    >>> parser.tag(tokens)
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
    ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
    ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    """
    return self.tag_sents([sentence], properties)[0]

def raw_tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.
    Takes multiple sentences as a list where each sentence is a string.
    :param sentences: Input sentences to tag
    :type sentences: list(str)
    :rtype: list(list(list(tuple(str, str)))
    """
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,' }
    # Caller-supplied properties override or extend the defaults.
    default_properties.update(properties or {})
    # Supports only 'pos' or 'ner' tags.
    assert self.tagtype in ['pos', 'ner']
    default_properties['annotators'] += self.tagtype
    for sentence in sentences:
        tagged_data = self.api_call(sentence, properties=default_properties)
        yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
               for tagged_sentence in tagged_data['sentences']]
After making the changes above:
>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]
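If patching NLTK is not an option, a different approach is to leave the server's tokenization alone and split the joined tokens afterwards. This is a swapped-in post-processing workaround, not part of the answer's fix. The helper split_joined_tokens is hypothetical, and it assumes \xa0 appears only where CoreNLP joined tokens.

```python
NBSP = '\xa0'  # non-breaking space used by CoreNLP to join the number tokens


def split_joined_tokens(tagged):
    """Split every (word, tag) pair on NBSP, repeating the tag for each piece."""
    result = []
    for word, tag in tagged:
        for piece in word.split(NBSP):
            result.append((piece, tag))
    return result


# Output of ner_tagger.tag(t) as reported in the question.
tagged = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
          ('1111\xa01111\xa01111', 'NUMBER')]

print(split_joined_tokens(tagged))
# [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
#  ('1111', 'NUMBER'), ('1111', 'NUMBER'), ('1111', 'NUMBER')]
```

Because the tagging still happens on the joined token, each piece keeps the NUMBER tag the questioner expected, whereas the whitespace-tokenized run above labels each 1111 as DATE.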