Python NLTK WUP Similarity Score not unity for exact same word
The simple code below gives a similarity score of 0.75 in both cases. As you can see, the two words are exactly the same. To avoid any confusion, I also compared a word with itself. The score refuses to budge from 0.75. What is going on here?
from nltk.corpus import wordnet as wn
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)
similarity = actual.wup_similarity(actual)
print(similarity)
This is an interesting problem.
TL;DR:
Sorry there's no short answer to this problem =(
Too long, want to read:
Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchy in lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).
Normally, the lowest common hypernym of a synset and itself should be the synset itself:
>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]
But in the case of orange, it returns fruit too:
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]
We'll have to take a look at lowest_common_hypernyms(). From its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:
Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned or if there are multiple such synsets at the same depth they are all returned
However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned
So let's try lowest_common_hypernyms() with use_min_depth=False:
>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]
That seems to resolve the ambiguity of the tied path. But the wup_similarity() API doesn't have a use_min_depth parameter:
>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'
Note the difference: when use_min_depth==False, lowest_common_hypernyms() compares the maximum depth of each candidate synset, but when use_min_depth==True it compares the minimum depth, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602
So if we trace the lowest_common_hypernyms() code:
>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]
# if use_min_depth==True
>>> max_depth = max(x.min_depth() for x in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>>
# if use_min_depth==False
>>> max_depth = max(x.max_depth() for x in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
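To see why the tie appears, here is a minimal sketch on a toy hypernym graph. The graph, the node names and the helper functions are made up for illustration and are not real WordNet data or NLTK code. The only thing it borrows from the trace above is the selection rule: keep the common hypernyms whose depth equals the largest depth found.

```python
# Hypothetical toy graph: each node maps to its direct hypernyms (parents).
# "edible_fruit" has a long path (via "fruit") and a short path (via "food").
HYPERNYMS = {
    "entity": [],
    "object": ["entity"],
    "plant_part": ["object"],
    "fruit": ["plant_part"],
    "food": ["entity"],
    "edible_fruit": ["fruit", "food"],
    "orange": ["edible_fruit"],
}


def min_depth(node):
    """Length of the shortest path from the node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest path from the node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def self_and_ancestors(node):
    """The node itself plus every hypernym reachable from it."""
    found = {node}
    for parent in HYPERNYMS[node]:
        found |= self_and_ancestors(parent)
    return found


def lowest_common(a, b, use_min_depth):
    """Common hypernyms whose depth equals the largest depth among them."""
    depth = min_depth if use_min_depth else max_depth
    common = self_and_ancestors(a) & self_and_ancestors(b)
    deepest = max(depth(n) for n in common)
    return sorted(n for n in common if depth(n) == deepest)


tied = lowest_common("orange", "orange", use_min_depth=True)
unique = lowest_common("orange", "orange", use_min_depth=False)
print(tied)    # fruit and orange share the same minimum depth
print(unique)  # only orange has the largest maximum depth
```

In this toy graph the short path through "food" pulls the minimum depth of "orange" down to that of "fruit", which is the same kind of tie the trace shows for orange.n.01 and fruit.n.01.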
This odd behaviour of wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:
# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)
The first subsumer in the list is then selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:
subsumer = subsumers[0]
Naturally, in the case of the orange synset, fruit is selected since it is first in the list of tied lowest common hypernyms.
To conclude, the default parameter is more of a feature than a bug: it preserves reproducibility with NLTK v2.x.
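The arithmetic behind the 0.75 can be sketched as follows. This assumes the score is computed as 2 * depth / (len1 + len2), where depth is the subsumer's maximum depth plus one and each length is the shortest path from a synset to the subsumer plus that depth; that is my reading of the NLTK source, not a quote from it. The depth of 8 for fruit.n.01 and the 3 edges between orange.n.01 and fruit.n.01 are also assumptions, inferred from the depths in the trace above. The function name is hypothetical.

```python
def wup_from_subsumer(subsumer_max_depth, dist1, dist2):
    """Wu-Palmer score given the chosen subsumer's depth and both path distances.

    Assumed formula (my reading of NLTK, not verified here):
    2 * depth / (len1 + len2), with depth = subsumer_max_depth + 1.
    """
    depth = subsumer_max_depth + 1
    len1 = dist1 + depth
    len2 = dist2 + depth
    return (2.0 * depth) / (len1 + len2)


# Subsumer = fruit.n.01: assumed depth 8, assumed 3 edges from orange.n.01.
score_via_fruit = wup_from_subsumer(8, 3, 3)    # 2 * 9 / (12 + 12)

# Subsumer = orange.n.01 itself: depth 11 (from the trace), distance 0.
score_via_self = wup_from_subsumer(11, 0, 0)    # 2 * 12 / (12 + 12)

print(score_via_fruit, score_via_self)
```

Under these assumptions, picking fruit as the subsumer reproduces the 0.75 from the question, while picking orange itself gives 1.0.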
So one solution might be to manually change the NLTK source to force use_min_depth=False:
https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
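If editing the installed NLTK source is not an option, a rough alternative is to recompute the score outside the library. The function below is a hypothetical helper, not part of NLTK. It only calls Synset methods that I believe exist (lowest_common_hypernyms, max_depth, shortest_path_distance) and it assumes the same formula as wup_similarity(). It also skips the simulate_root handling that NLTK applies, so treat it as a sketch for noun synsets only.

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer score with the subsumer chosen via use_min_depth=False.

    Hypothetical helper; expects objects that behave like NLTK Synsets.
    Returns None when no common hypernym or no connecting path is found.
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None
    subsumer = subsumers[0]

    # Depth is counted in nodes, hence the +1 on the edge-based max_depth().
    depth = subsumer.max_depth() + 1
    dist1 = synset1.shortest_path_distance(subsumer)
    dist2 = synset2.shortest_path_distance(subsumer)
    if dist1 is None or dist2 is None:
        return None
    return (2.0 * depth) / (dist1 + dist2 + 2 * depth)
```

With the WordNet corpus installed, calling wup_similarity_max_depth(x, x) for x = wn.synsets('orange')[0] should then use orange.n.01 as its own subsumer.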
EDITED
To resolve the problem, you can possibly do an ad-hoc check for the same synset:
def wup_similarity_hacked(synset1, synset2):
    if synset1 == synset2:
        return 1.0
    else:
        return synset1.wup_similarity(synset2)