Getting the closest noun from a stemmed word

Question

Short version:
If I have a stemmed word:
Say 'comput' for 'computing', or 'sugari' for 'sugary'
Is there a way to construct its closest noun form?
That is 'computer', or 'sugar' respectively

Longer version:
I'm using Python with NLTK and WordNet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so that I can get an estimate of their similarity (instead of the None that normally gets returned for adjectives).

I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?

Recommended answer

First extract all the possible candidates from wordnet synsets. Then use difflib to compare the strings against your target stem.

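>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
>>> # Collect every lemma name from noun synsets that have at least one lemma containing the stem.
>>> # (In NLTK 3, lemma_names is a method and must be called.)
>>> candidates = set(chain.from_iterable(ss.lemma_names() for ss in wn.all_synsets('n') if any(target in name for name in ss.lemma_names())))
>>> # Best difflib match; raises IndexError if nothing is close enough.
>>> gcm(target, candidates)[0]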

A more human-readable way to compute the candidates:

candidates = set()
for ss in wn.all_synsets('n'):
  names = ss.lemma_names()  # all lemma names for this synset
  if any(target in name for name in names):
    candidates.update(names)  # keep every lemma of a matching synset
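
The two steps above (filter candidates, then rank them with difflib) can be wrapped in a small helper. This is only a sketch: the function name closest_noun and the list sample_lemmas are made up for illustration, and the list is a hand-written stand-in for WordNet noun lemma names so the snippet runs without the corpus. For simplicity it filters individual lemma names rather than whole synsets.

```python
from difflib import get_close_matches


def closest_noun(stem, noun_lemmas, cutoff=0.6):
    """Return the lemma most similar to `stem`, or None if nothing clears `cutoff`."""
    # Keep only lemmas that contain the stem as a substring.
    candidates = {lemma for lemma in noun_lemmas if stem in lemma}
    # Ask difflib for the single best match at or above the cutoff.
    matches = get_close_matches(stem, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical stand-in for WordNet noun lemma names (an assumption, not real corpus output).
sample_lemmas = ["computer", "computation", "computing", "sugar", "sugariness", "table"]

print(closest_noun("comput", sample_lemmas))  # computer
print(closest_noun("sugari", sample_lemmas))  # sugariness
print(closest_noun("xyzzy", sample_lemmas))   # None
```

Note that the substring filter drops "sugar" for the stem "sugari", because "sugari" is not contained in "sugar". If shorter nouns like that matter, one option is to skip the filter and let difflib rank all noun lemmas, at the cost of a slower search.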
