Converting POS tags from TextBlob into Wordnet compatible inputs

Problem description

I'm using Python with nltk and TextBlob for some text analysis. WordNet lets you pass a POS to make a search for synonyms more specific, but unfortunately the tags produced by both nltk and TextBlob aren't "compatible" with the kind of input that WordNet expects for its synset lookup.

For example, wn.synsets() requires that the POS you give it is one of n, v, a, r, like so:

wn.synsets("dog", pos="n")

But a standard POS tagging from upenn_treebank looks like

JJ, VBD, VBZ, etc.

So I'm looking for a good way to convert between the two.

Does anyone know of a good way to make this conversion happen, besides brute force?

Solution

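If TextBlob is using the Penn Treebank (PTB) tagset, you can map to the WordNet POS from the first character of the PTB tag: N maps to 'n', V to 'v' and R (the RB adverb tags) to 'r'. Adjectives are the one exception, since the PTB tags JJ, JJR and JJS start with J while WordNet expects 'a'.

The WordNet POS tagset is 'a' = adjective, 's' = satellite adjective, 'r' = adverb, 'n' = noun and 'v' = verb.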
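The mapping described above can be written as a small helper. This is only a sketch: the function name `penn_to_wordnet` is made up for illustration, and it assumes the tagger emits Penn Treebank tags. It matches on `RB` rather than a bare `R` because the particle tag `RP` also starts with R and has no WordNet counterpart.

```python
def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn Treebank tag, or None.

    Hypothetical helper, not part of nltk or TextBlob.
    """
    if tag.startswith('JJ'):   # JJ, JJR, JJS -> adjective
        return 'a'
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return 'v'
    if tag.startswith('NN'):   # NN, NNS, NNP, NNPS -> noun
        return 'n'
    if tag.startswith('RB'):   # RB, RBR, RBS -> adverb (RP is excluded)
        return 'r'
    return None                # DT, IN, MD, ... have no WordNet POS
```

A non-None result can be passed as the `pos` argument of `wn.synsets()`; tokens that map to None are best skipped.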

try:

>>> from nltk import word_tokenize, pos_tag
>>> from nltk.corpus import wordnet as wn
>>> text = 'this is a pos tagset in some foo bar paradigm'
>>> pos_tag(word_tokenize(text))
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('pos', 'NN'), ('tagset', 'NN'), ('in', 'IN'), ('some', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('paradigm', 'NN')]
>>> for tok, pos in pos_tag(word_tokenize(text)):
...     pos = pos[0].lower()
...     if pos == 'j':  # PTB adjective tags (JJ, JJR, JJS) are 'a' in WordNet
...         pos = 'a'
...     if pos in ['a', 'n', 'v', 'r']:
...         wn.synsets(tok, pos)
... 
[Synset('be.v.01'), Synset('be.v.02'), Synset('be.v.03'), Synset('exist.v.01'), Synset('be.v.05'), Synset('equal.v.01'), Synset('constitute.v.01'), Synset('be.v.08'), Synset('embody.v.02'), Synset('be.v.10'), Synset('be.v.11'), Synset('be.v.12'), Synset('cost.v.01')]
[Synset('polonium.n.01'), Synset('petty_officer.n.01'), Synset('po.n.03'), Synset('united_states_post_office.n.01')]
[]
[]
[Synset('barroom.n.01'), Synset('bar.n.02'), Synset('bar.n.03'), Synset('measure.n.07'), Synset('bar.n.05'), Synset('prevention.n.01'), Synset('bar.n.07'), Synset('bar.n.08'), Synset('legal_profession.n.01'), Synset('stripe.n.05'), Synset('cake.n.01'), Synset('browning_automatic_rifle.n.01'), Synset('bar.n.13'), Synset('bar.n.14'), Synset('bar.n.15')]
[Synset('paradigm.n.01'), Synset('prototype.n.01'), Synset('substitution_class.n.01'), Synset('paradigm.n.04')]
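To separate the tag conversion from the WordNet lookup, the tagged sentence can first be reduced to (word, WordNet POS) pairs. The sketch below is an illustration under assumptions: `PTB_PREFIX_TO_WN` and `to_wordnet_pairs` are hypothetical names, and the input is assumed to be a list of (word, PTB tag) tuples, which is the shape shown in the session above.

```python
# Two-letter PTB prefixes and the WordNet POS letter each one corresponds to.
PTB_PREFIX_TO_WN = (('JJ', 'a'), ('NN', 'n'), ('VB', 'v'), ('RB', 'r'))


def to_wordnet_pairs(tagged):
    """Keep only tokens whose PTB tag has a WordNet POS; return (word, wn_pos) pairs."""
    pairs = []
    for word, tag in tagged:
        for prefix, wn_pos in PTB_PREFIX_TO_WN:
            if tag.startswith(prefix):
                pairs.append((word, wn_pos))
                break  # at most one WordNet POS per token
    return pairs


# Tagger output copied from the session above.
tagged = [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('pos', 'NN'),
          ('tagset', 'NN'), ('in', 'IN'), ('some', 'DT'), ('foo', 'NN'),
          ('bar', 'NN'), ('paradigm', 'NN')]

print(to_wordnet_pairs(tagged))
# [('is', 'v'), ('pos', 'n'), ('tagset', 'n'), ('foo', 'n'), ('bar', 'n'), ('paradigm', 'n')]
```

With TextBlob, the `tags` property of a `TextBlob` object should yield tuples of the same shape, assuming its tagger produces Penn Treebank tags; each resulting pair can then be looked up with `wn.synsets(word, pos=wn_pos)`.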
