How do I calculate the shortest path (geodesic) distance between two adjectives in WordNet using Python NLTK?
Question
Computing the semantic similarity between two synsets in WordNet can be easily done with several built-in similarity measures, such as:
- synset1.path_similarity(synset2), Path Similarity
- synset1.lch_similarity(synset2), Leacock-Chodorow Similarity
- synset1.wup_similarity(synset2), Wu-Palmer Similarity
However, all of these exploit WordNet's taxonomic relations, which exist only for nouns and verbs. Adjectives and adverbs are instead related via synonymy, antonymy and pertainyms. How can one measure the distance (number of hops) between two adjectives?
I tried path_similarity(), but as expected, it returns None:
from nltk.corpus import wordnet as wn
x = wn.synset('good.a.01')
y = wn.synset('bad.a.01')
print(wn.path_similarity(x,y))
If there is any way to compute the distance between one adjective and another, pointing it out would be greatly appreciated.
There's no easy way to get similarity between words that are not nouns/verbs.
As noted, noun/verb similarity is easily extracted:
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.1')
>>> cat = wn.synset('cat.n.1')
>>> car = wn.synset('car.n.1')
>>> wn.path_similarity(dog, cat)
0.2
>>> wn.path_similarity(dog, car)
0.07692307692307693
>>> wn.wup_similarity(dog, cat)
0.8571428571428571
>>> wn.wup_similarity(dog, car)
0.4
>>> wn.lch_similarity(dog, car)
1.072636802264849
>>> wn.lch_similarity(dog, cat)
2.0281482472922856
For adjectives it's hard, so you would need to build your own text similarity measure. The easiest way is to use a vector space model, in which every word is represented by a vector of floating point numbers, e.g.
>>> import numpy as np
>>> blue = np.array([0.2, 0.2, 0.3])
>>> red = np.array([0.1, 0.2, 0.3])
>>> pink = np.array([0.1001, 0.221, 0.321])
>>> car = np.array([0.6, 0.9, 0.5])
>>> def cosine(x,y):
... return np.dot(x,y) / (np.linalg.norm(x) * np.linalg.norm(y))
...
>>> cosine(pink, red)
0.99971271929384864
>>> cosine(pink, blue)
0.96756147991512709
>>> cosine(blue, red)
0.97230558532824662
>>> cosine(blue, car)
0.91589118863996888
>>> cosine(red, car)
0.87469454283170045
>>> cosine(pink, car)
0.87482313596223782
To train a set of vectors for something like pink = np.array([0.1001, 0.221, 0.321]), you should try googling for:
- Latent semantic indexing / Latent semantic analysis
- Bag of Words
- Vector space model semantics
- Word2Vec, Doc2Vec, Wiki2Vec
- Neural Nets
- cosine similarity natural language semantics
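As a tiny, self-contained illustration of the latent semantic analysis entry above (the counts below are invented for the example, not real data): factor a small term-document count matrix with SVD, keep the strongest latent dimensions, and compare words by cosine in the reduced space:

```python
import numpy as np

# toy term-document count matrix: rows = words, columns = documents
words = ["good", "great", "bad", "awful"]
counts = np.array([
    [2, 1, 0],   # good
    [1, 2, 0],   # great
    [0, 0, 3],   # bad
    [0, 1, 2],   # awful
], dtype=float)

# truncated SVD keeps the k strongest latent dimensions
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]   # each row is a reduced word vector

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

idx = {w: j for j, w in enumerate(words)}
# words with similar document profiles end up close together
print(cosine(word_vecs[idx["good"]], word_vecs[idx["great"]]))
print(cosine(word_vecs[idx["good"]], word_vecs[idx["bad"]]))
```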
You can also try some off-the-shelf software / libraries, such as:
- Gensim https://radimrehurek.com/gensim/
- http://webcache.googleusercontent.com/search?q=cache:u5y4He592qgJ:takelab.fer.hr/sts/+&cd=2&hl=en&ct=clnk&gl=sg
Other than vector space models, you can also try graph-based models that put words into a graph and use something like PageRank to walk the graph, which gives you a similarity measure.
See also:
- Compare similarity of terms/expressions using NLTK?
- check if two words are related to each other
- How to determine semantic hierarchies / relations in using NLTK?
- Is there an algorithm that tells the semantic similarity of two phrases
- Semantic Relatedness Algorithms - python