How do I calculate the shortest path (geodesic) distance between two adjectives in WordNet using Python NLTK?


Computing the semantic similarity between two synsets in WordNet can be easily done with several built-in similarity measures, such as:

synset1.path_similarity(synset2)

synset1.lch_similarity(synset2), Leacock-Chodorow Similarity

synset1.wup_similarity(synset2), Wu-Palmer Similarity

(as seen here)

However, all of these exploit WordNet's taxonomic relations, which are relations for nouns and verbs. Adjectives and adverbs are related via synonymy, antonymy and pertainyms. How can one measure the distance (number of hops) between two adjectives?

I tried path_similarity(), but as expected, it returns 'None':

from nltk.corpus import wordnet as wn
x = wn.synset('good.a.01')
y = wn.synset('bad.a.01')
print(wn.path_similarity(x, y))

If there is any way to compute the distance between one adjective and another, pointing it out would be greatly appreciated.

Solution

There's no easy way to get similarity between words that are not nouns/verbs.

As noted, noun/verb similarities are easily extracted with the built-in measures:

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.1')
>>> cat = wn.synset('cat.n.1')
>>> car = wn.synset('car.n.1')
>>> wn.path_similarity(dog, cat)
0.2
>>> wn.path_similarity(dog, car)
0.07692307692307693
>>> wn.wup_similarity(dog, cat)
0.8571428571428571
>>> wn.wup_similarity(dog, car)
0.4
>>> wn.lch_similarity(dog, car)
1.072636802264849
>>> wn.lch_similarity(dog, cat)
2.0281482472922856

For adjectives it's hard, so you would need to build your own text-similarity device. The easiest way is to use a vector space model, where every word is represented by a vector of floating-point numbers, e.g.

>>> import numpy as np
>>> blue = np.array([0.2, 0.2, 0.3])
>>> red = np.array([0.1, 0.2, 0.3])
>>> pink = np.array([0.1001, 0.221, 0.321])
>>> car = np.array([0.6, 0.9, 0.5])
>>> def cosine(x,y):
...     return np.dot(x,y) / (np.linalg.norm(x) * np.linalg.norm(y))
... 
>>> cosine(pink, red)
0.99971271929384864
>>> cosine(pink, blue)
0.96756147991512709
>>> cosine(blue, red)
0.97230558532824662
>>> cosine(blue, car)
0.91589118863996888
>>> cosine(red, car)
0.87469454283170045
>>> cosine(pink, car)
0.87482313596223782

To train a bunch of vectors for something like pink = np.array([0.1001, 0.221, 0.321]), you should try googling for

  • Latent semantic indexing / Latent semantic analysis
  • Bag of Words
  • Vector space model semantics
  • Word2Vec, Doc2Vec, Wiki2Vec
  • Neural Nets
  • cosine similarity natural language semantics
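As a toy illustration of the count-based (bag-of-words) flavour of these ideas, word vectors can be built from nothing more than co-occurrence counts. The three-sentence corpus below is hypothetical and purely for illustration:

```python
import numpy as np

# Toy corpus: each word is represented by its co-occurrence counts with
# every other word in the same sentence (a crude count-based vector space).
corpus = [["pink", "red", "color"],
          ["blue", "red", "color"],
          ["car", "fast", "vehicle"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for w in sent:
        for c in sent:
            if w != c:
                M[idx[w], idx[c]] += 1

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(M[idx["pink"]], M[idx["blue"]]))  # ~1.0, shared contexts: red, color
print(cosine(M[idx["pink"]], M[idx["car"]]))   # 0.0, no shared contexts
```

Real systems (LSA, Word2Vec, etc.) refine this basic scheme with dimensionality reduction or prediction objectives, but the cosine-over-vectors idea is the same.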

You can also try some off-the-shelf software / libraries.

Other than vector space models, you can also try a graphical model that puts words into a graph and uses something like PageRank to walk the graph, giving you a similarity measure.
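A minimal sketch of that graph idea, with a hypothetical adjacency matrix standing in for WordNet's adjective relations and a personalized PageRank restarted from "good":

```python
import numpy as np

# Hypothetical word graph: edges stand in for WordNet relations
# (similar-to, antonymy, ...) between adjectives.
words = ["good", "great", "fine", "bad"]
A = np.array([[0, 1, 1, 1],   # good  -- great, fine (similar), bad (antonym)
              [1, 0, 1, 0],   # great -- good, fine
              [1, 1, 0, 0],   # fine  -- good, great
              [1, 0, 0, 0]])  # bad   -- good
P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
v = np.array([1.0, 0, 0, 0])          # restart distribution at "good"
r = v.copy()
for _ in range(100):                   # power iteration with 0.15 restart
    r = 0.15 * v + 0.85 * r @ P
print(dict(zip(words, r.round(3))))    # stationary scores ~ similarity to "good"
```

Words that sit closer to "good" in the graph accumulate more of the random walker's mass, so the stationary scores can be read as a similarity ranking.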

