How to tokenize new vocab in spaCy?


Question

I am using spaCy for its dependency parsing, and I am having trouble getting the spaCy tokenizer to tokenize the new vocabulary I am adding. This is my code:

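import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

# Attempt to register the multi-word term as a vocab entry
nlp.vocab['bone morphogenetic protein (BMP)-2']

# Replace the pipeline's tokenizer with one built from the vocab alone
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

print([(token.text, token.tag_) for token in doc])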

Output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]

Desired output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NN'), ('for', 'IN'), ('BMP receptor type IB', 'NNP'), ('(', '('), ('BMPRIB', 'NNP'), (')', ')'), ('.', '.')]

How can I make spaCy tokenize the new vocabulary I added?
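A note on the output in the question: the token '(BMPRIB).' staying in one piece is consistent with how a `Tokenizer` constructed from the vocab alone behaves. It has no prefix, suffix, or infix rules, so it splits on whitespace only, and looking up a string in `nlp.vocab` returns a `Lexeme` without registering any tokenization rule. The sketch below illustrates the difference. It assumes `spacy.blank("en")` as a stand-in for the loaded model (to avoid a model download) and uses a shortened sample sentence.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline stands in for en_core_web_md here;
# it carries the default English tokenizer rules but no tagger or parser.
nlp = spacy.blank("en")
text = "mRNAs for BMP receptor type IB (BMPRIB)."

# Default tokenizer: punctuation rules split the brackets and the period.
default_tokens = [token.text for token in nlp.tokenizer(text)]

# Tokenizer built from the vocab only: no prefix/suffix/infix rules,
# so the text is split on whitespace and nothing else.
bare_tokenizer = Tokenizer(nlp.vocab)
bare_tokens = [token.text for token in bare_tokenizer(text)]

# Indexing the vocab with a string returns a Lexeme; it does not tell
# the tokenizer to keep the phrase together.
lexeme = nlp.vocab["bone morphogenetic protein (BMP)-2"]

print(default_tokens)
print(bare_tokens)
```

This is why the answer below keeps the model's default tokenizer and merges tokens after the fact.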

Recommended answer

See if Doc.retokenize() helps:

import spacy
nlp = spacy.load("en_core_web_md")
text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

# Merge tokens 6 through 10 (the span covering the BMP-2 phrase) into one token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[6:11])

print([(token.text, token.tag_) for token in doc])

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
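The hard-coded slice `doc[6:11]` only works for this one sentence. As a rough sketch of how the same `Doc.retokenize()` idea could be generalized, the example below looks up the phrases with `PhraseMatcher` and merges every match. The `terms` list and the `merge_terms` helper are hypothetical names introduced for illustration, the `matcher.add` call assumes the spaCy v3 signature, and `spacy.blank("en")` is used only so the example runs without a model download (load `en_core_web_md` instead if tags are needed).

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Assumption: a blank pipeline is enough to show tokenization;
# replace with spacy.load("en_core_web_md") to also get tags and parses.
nlp = spacy.blank("en")

# Hypothetical list of multi-word terms that should become single tokens.
terms = ["bone morphogenetic protein (BMP)-2", "BMP receptor type IB"]

# Patterns are tokenized with the same tokenizer as the text they match.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TERMS", [nlp.make_doc(term) for term in terms])


def merge_terms(doc):
    """Merge every matched term in doc into a single token (illustrative helper)."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        # filter_spans drops overlapping matches, keeping the longest ones,
        # because the retokenizer cannot merge overlapping spans.
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc


text = ("This study describes the distributions of bone morphogenetic "
        "protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).")
doc = merge_terms(nlp(text))
tokens = [token.text for token in doc]
print(tokens)
```

With a full model, `merge_terms` could be called on the parsed `doc` in the same way; the tag assigned to each merged token depends on the model and is not guaranteed to match the desired output in the question.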
