Methods for automated synonym detection


Question

I am currently working on a neural-network-based approach to short document classification. Since the documents in my corpora are usually around ten words long, the standard statistical document classification methods are of limited use. Because of this, I am attempting to implement some form of automated synonym detection for the matches provided in the training data. More specifically, my question is about resolving a situation like the following:

Say I have two classifications, "Involving Food" and "Involving Spheres", and a data set as follows:

"Eating Apples" (Food); "Eating Marbles" (Spheres); "Eating Oranges" (Food, Spheres);
"Throwing Baseballs" (Spheres); "Throwing Apples" (Food); "Throwing Balls" (Spheres);
"Spinning Apples" (Food); "Spinning Baseballs";

I am looking for an incremental method that would move towards the following linkages:

Eating --> Food
Apples --> Food
Marbles --> Spheres
Oranges --> Food, Spheres
Throwing --> Spheres
Baseballs --> Spheres
Balls --> Spheres
Spinning --> Neutral
Involving --> Neutral

I do realize that in this specific case these might be slightly suspect matches, but they illustrate the problem I am having. My first thought was to increment a word's score each time it appears opposite the words in a category, but then I would end up incidentally linking everything to the word "Involving". I then thought I would simply decrement a word's score for appearing in conjunction with multiple synonyms, or with non-synonyms, but then I would lose the link between "Eating" and "Food". Does anyone have any clue as to how I could put together an algorithm that would move me in the directions indicated above?

Recommended answer

There is an unsupervised bootstrapping approach to this that was explained to me.

There are different ways of applying this approach, and several variants, but here is a simplified version.

Start by assuming that if two words are synonyms, they will appear in similar settings in your corpus (eating grapes, eating sandwiches, etc.).

(In this variant I will use co-occurrence as the setting.)

We keep two lists:

  • one list will hold the words that co-occur with food items
  • one list will hold the food words

Start by seeding one of the lists. For instance, I might write the word "Apple" on the food items list.

Now let the computer take over.

It will first find all words in the corpus that appear just before "Apple", and sort them by how often they occur, most frequent first.

Take the top two (or however many you want) and add them to the "co-occurs with food items" list. For example, perhaps "eating" and "delicious" are the top two.

Now use that list to find the next top two food words by ranking the words that appear immediately to the right of each word in the list.

Continue this process, expanding each list in turn, until you are happy with the results.

(You may need to manually remove some things from the lists as you go that are clearly wrong.)
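The loop described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the corpus is already tokenized, the "setting" is just the single word immediately to the left, and the function name `bootstrap`, its parameters, and the toy corpus are all made up for this example.

```python
from collections import Counter

def bootstrap(corpus, seeds, rounds=2, top_n=2):
    """Alternately grow a list of category words and a list of context words.

    corpus: list of token lists; seeds: initial category words.
    The "context" of a word is simply the word immediately to its left.
    """
    items = [w.lower() for w in seeds]
    contexts = []
    # All adjacent word pairs in the corpus, lowercased.
    bigrams = [(s[i].lower(), s[i + 1].lower())
               for s in corpus for i in range(len(s) - 1)]
    for _ in range(rounds):
        # Rank words that appear just before any known category word.
        left = Counter(a for a, b in bigrams if b in items and a not in contexts)
        contexts += [w for w, _ in left.most_common(top_n)]
        # Rank words that appear just after any known context word.
        right = Counter(b for a, b in bigrams if a in contexts and b not in items)
        items += [w for w, _ in right.most_common(top_n)]
    return items, contexts

# Hypothetical toy corpus, one short document per line.
corpus = [line.split() for line in [
    "eating apples", "eating apples", "delicious apples",
    "eating grapes", "eating grapes", "delicious grapes",
    "eating sandwiches", "throwing baseballs",
]]
food_words, food_contexts = bootstrap(corpus, seeds=["apples"])
print(food_words)     # ['apples', 'grapes', 'sandwiches']
print(food_contexts)  # ['eating', 'delicious']
```

Note that a phrase such as "throwing apples" would pull "throwing" into the context list, which is exactly the kind of entry you would prune by hand between rounds.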

This procedure can be made quite effective if you take into account the grammatical setting of the keywords.

Subj ate NounPhrase
NounPhrase are/is Moldy

The workers harvested the Apples. 
   subj       verb     Apples 

That might imply harvested is an important verb for distinguishing foods.

Then look for other occurrences of subj harvested nounPhrase
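As a rough illustration of searching for "subj harvested nounPhrase", the sketch below counts the word that follows a given verb, skipping one optional determiner. It is a crude regular-expression stand-in for real grammatical analysis (a parser or chunker would be needed to find actual noun phrases), and the helper name `objects_of`, the determiner list, and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter

# Small, hand-picked determiner list (an assumption, not exhaustive).
DETERMINERS = r"(?:the|a|an|some|their|his|her)"

def objects_of(verb, sentences):
    """Count the word following `verb`, skipping one optional determiner."""
    pattern = re.compile(
        rf"\b{re.escape(verb)}\s+(?:{DETERMINERS}\s+)?(\w+)",
        re.IGNORECASE,
    )
    counts = Counter()
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts

sentences = [
    "The workers harvested the Apples.",
    "Farmers harvested wheat in August.",
    "They harvested the apples early.",
]
print(objects_of("harvested", sentences))
# Counter({'apples': 2, 'wheat': 1})
```

Words that show up repeatedly in the object slot of a verb already associated with a category become candidates for that category's list.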

You can extend this process to sort words into several categories at each step, instead of a single category.
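One possible way to read the multi-category extension is sketched below: build a left-context profile for each category from its seed words, then assign each unseeded word to the category whose contexts it shares most, leaving ambiguous words unassigned. This is a single-pass illustration of the idea, not the method used by any particular system; `assign_categories`, the scoring rule, and the toy data are all assumptions.

```python
from collections import Counter, defaultdict

def assign_categories(bigrams, seeds):
    """bigrams: list of (left_word, right_word); seeds: {category: set of words}.

    Returns {word: category} for unseeded right-hand words whose left
    contexts favour exactly one category.
    """
    known = set().union(*seeds.values())
    # Context profile per category: how often each left word precedes a seed.
    profiles = {cat: Counter(a for a, b in bigrams if b in words)
                for cat, words in seeds.items()}
    scores = defaultdict(Counter)
    for left, right in bigrams:
        if right in known:
            continue
        for cat, profile in profiles.items():
            scores[right][cat] += profile[left]
    assigned = {}
    for word, by_cat in scores.items():
        ranked = by_cat.most_common(2)
        # Keep the word only if one category clearly wins.
        if ranked[0][1] > 0 and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            assigned[word] = ranked[0][0]
    return assigned

bigrams = [("eating", "apples"), ("eating", "grapes"),
           ("throwing", "baseballs"), ("throwing", "balls"),
           ("spinning", "tops")]
seeds = {"Food": {"apples"}, "Spheres": {"baseballs"}}
print(assign_categories(bigrams, seeds))
# {'grapes': 'Food', 'balls': 'Spheres'}
```

Here "tops" stays unassigned because "spinning" never precedes a seed word, which matches the idea of a neutral context.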

This approach was used in a system developed at the University of Utah a few years back, which successfully compiled a decent list of weapon words, victim words, and place words just by looking at news articles.

It is not a neural network approach, but it is an intriguing methodology that had good results.

The system at the University of Utah was called AutoSlog-TS. A short slide about it can be seen here, towards the end of the presentation, and a paper about it is linked here.
