NLTK Most common synonym (Wordnet) for each word

Question

Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence using the most common synonyms of each word on it.

If a word used in the sentence is already the most common word from its group of synonyms, it shouldn't be changed.

Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word of its group of synonyms.

Input: "Hello my valued friend"
Return: "Hi my dear friend"

Recommended answer

Synonyms are tricky, but if you are starting out with a synset from Wordnet and you simply want to choose the most common member in the set, it's pretty straightforward: Just build your own frequency list from a corpus, and look up each member of the synset to pick the maximum.

NLTK lets you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:

import nltk
from nltk.corpus import brown
# Count lowercased tokens across the whole Brown corpus.
freqs = nltk.FreqDist(w.lower() for w in brown.words())

You can then look up the frequency of a word like this:

>>> print(freqs["valued"]) 
14
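
The "pick the maximum" step can be sketched as follows. This is only an illustration under stated assumptions: `most_common_synonym`, `simplify`, `toy_freqs` and `toy_groups` are hypothetical names, the counts are made up, and a plain `Counter` stands in for the `nltk.FreqDist` built above (both return 0 for unseen words). In practice the candidate list for a word might come from something like `synset.lemma_names()`.

```python
from collections import Counter

def most_common_synonym(word, synonyms, freqs):
    """Return the most frequent candidate among `word` and its synonyms.

    The original word is kept unless a synonym is strictly more frequent.
    """
    best = word.lower()
    for candidate in synonyms:
        candidate = candidate.lower()
        if freqs[candidate] > freqs[best]:
            best = candidate
    return best

def simplify(sentence, groups, freqs):
    """Replace each whitespace-separated word by its most frequent synonym."""
    return " ".join(
        most_common_synonym(w, groups.get(w.lower(), []), freqs)
        for w in sentence.split()
    )

# Made-up counts and synonym groups, for illustration only.
toy_freqs = Counter({"hi": 50, "hello": 20, "dear": 45, "valued": 14,
                     "friend": 100, "ally": 5})
toy_groups = {
    "hello": ["hi", "hello"],
    "valued": ["dear", "valued"],
    "friend": ["friend", "ally"],
}

print(simplify("Hello my valued friend", toy_groups, toy_freqs))
# hi my dear friend
```

Note that this sketch lowercases everything and ignores word sense and part of speech, so it would need refinement before being used on real text.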

Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (wordnet provides n, v, a, and r, resp. noun, verb, adjective and adverb), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.

>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...         brown.tagged_words(tagset="universal"))

>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
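
One way the "adjusting for the different tagset notations" step might look is sketched below. The mapping from WordNet's POS letters to universal tags, the helper `most_common_for_pos`, and the toy table `toy_freq2` are all assumptions made for illustration. The ADJ counts echo the output above, while the VERB counts are invented. A `defaultdict(Counter)` stands in for the `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

# WordNet POS letters mapped to universal tagset labels ("s" is WordNet's
# satellite-adjective marker, treated here as a plain adjective).
WORDNET_TO_UNIVERSAL = {"n": "NOUN", "v": "VERB", "a": "ADJ", "s": "ADJ", "r": "ADV"}

def most_common_for_pos(word, wn_pos, candidates, cond_freqs):
    """Pick the candidate most frequent under the tag matching `wn_pos`.

    The original word is kept unless a candidate is strictly more frequent.
    """
    counts = cond_freqs[WORDNET_TO_UNIVERSAL[wn_pos]]
    best = word.lower()
    for candidate in candidates:
        candidate = candidate.lower()
        if counts[candidate] > counts[best]:
            best = candidate
    return best

# Toy stand-in for freq2; the VERB counts are invented for illustration.
toy_freq2 = defaultdict(Counter)
toy_freq2["ADJ"].update({"dear": 45, "valued": 0})
toy_freq2["VERB"].update({"valued": 14, "prized": 2})

print(most_common_for_pos("valued", "a", ["dear", "valued"], toy_freq2))    # dear
print(most_common_for_pos("valued", "v", ["prized", "valued"], toy_freq2))  # valued
```

The same word can therefore be replaced when it is used as an adjective yet left alone when it is used as a verb, which is the point of counting per part of speech.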
