NLTK convert tokenized sentence to synset format

Question

I'm looking to get the similarity between a single word and each word in a sentence using NLTK.

NLTK can compute the similarity between two specific words, as shown below. This method needs a specific reference to each word, in this case 'dog.n.01', where dog is a noun and we want the first (01) NLTK definition.

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.path_similarity(cat))
>> 0.2
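
A name like 'dog.n.01' packs three fields separated by dots: the lemma, a one-letter part-of-speech code, and a two-digit sense number. As a minimal sketch in plain Python, the hypothetical helper `split_synset_name` (made up for this example, not part of NLTK) shows how the pieces line up:

```python
def split_synset_name(name):
    """Split a synset name such as 'dog.n.01' into (lemma, pos, sense).

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    # Split from the right, so a lemma that itself contains a dot stays whole.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


print(split_synset_name("dog.n.01"))  # ('dog', 'n', 1)
```

So the missing piece in the question is the middle field: a way to get from a tag like 'NN' to the letter 'n'.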

The problem is that I need the part-of-speech information for each word in the sentence. NLTK can tag each word in a sentence with its part of speech, as shown below. However, these tags ('NN', 'VB', 'PRP', ...) don't match the format that synset() takes as a parameter.

from nltk import pos_tag, word_tokenize

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Is it possible to get synset-formatted data from the pos_tag() results in NLTK? By synset-formatted I mean a format like dog.n.01.

Recommended answer

You can use a simple conversion function:

from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
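
To see what this mapping keeps and what it drops, here is a sketch that applies the same prefix rule to the tagged sentence from the question. It uses literal one-letter codes rather than the `wn` constants so that it runs without the WordNet data; that relies on the assumption that `wn.ADJ`, `wn.NOUN`, `wn.ADV`, and `wn.VERB` are the strings 'a', 'n', 'r', and 'v'. The names `PENN_PREFIX_TO_WN` and `penn_to_wn_letter` are made up for this example.

```python
# First letter of the Penn Treebank tag -> WordNet part-of-speech letter.
# Assumption: these letters match wn.ADJ, wn.NOUN, wn.ADV and wn.VERB.
PENN_PREFIX_TO_WN = {"J": "a", "N": "n", "R": "r", "V": "v"}


def penn_to_wn_letter(tag):
    """Return the WordNet POS letter for a Penn tag, or None if it has none."""
    return PENN_PREFIX_TO_WN.get(tag[:1])


# Tagger output copied from the question.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Keep only the words whose tag maps to a WordNet part of speech.
mapped = [(word, penn_to_wn_letter(tag)) for word, tag in tagged
          if penn_to_wn_letter(tag) is not None]
print(mapped)
# [('refuse', 'v'), ('permit', 'v'), ('obtain', 'v'), ('refuse', 'n'), ('permit', 'n')]
```

Pronouns, determiners, and "to" fall through to None, which is why the loop that uses the mapping has to skip them.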

After tagging a sentence, you can tie each word in it to a synset using this function. Here's an example:

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

sentence = "I am going to buy some gifts"
tagged = pos_tag(word_tokenize(sentence))

synsets = []
lemmatzr = WordNetLemmatizer()

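for word, tag in tagged:
    wn_tag = penn_to_wn(tag)
    if not wn_tag:
        # No WordNet part of speech for this tag (pronouns, determiners, ...).
        continue

    lemma = lemmatzr.lemmatize(word, pos=wn_tag)
    candidates = wn.synsets(lemma, pos=wn_tag)
    if candidates:
        # Take the first listed sense; skip words WordNet does not know.
        synsets.append(candidates[0])

print(synsets)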

Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
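
To get back to the original goal, the similarity between one word and each word in the sentence, a list of synsets like this can be compared against a target synset. The following is only a sketch: `similarity_to_each` and `demo` are names made up for this example, and picking 'dog.n.01' as the target is an arbitrary choice. `demo` needs NLTK with the WordNet corpus installed, so it is defined but not called.

```python
def similarity_to_each(target, synsets):
    """Pair each synset's name with its path similarity to `target`.

    path_similarity can return None when no path connects two synsets
    (for example a noun and a verb), so scores are passed through as-is.
    """
    return [(s.name(), target.path_similarity(s)) for s in synsets]


def demo():
    """Example run; requires NLTK and its WordNet data."""
    from nltk.corpus import wordnet as wn

    target = wn.synset('dog.n.01')
    others = [wn.synset('cat.n.01'), wn.synset('gift.n.01')]
    for name, score in similarity_to_each(target, others):
        print(name, score)
```

Passing the `synsets` list built above in place of `others` gives one score per tagged word that WordNet recognises.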
