Finding related words (specifically physical objects) to a specific word


Question

I am trying to find words (specifically physical objects) related to a single word. For example:

Tennis: tennis racket, tennis ball, tennis shoe

Snooker: snooker cue, snooker ball, chalk

Chess: chessboard, chess piece

Bookcase: book

I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent, as the results below show:

Tennis: serve, volley, foot-fault, set point, return, advantage

Snooker: nothing

Chess: chess move, checkerboard (whose own meronym relationships show 'square' and 'diagonal')

Bookcase: shelve

Weighting of terms will eventually be required, but that is not really a concern now.

Does anyone have any suggestions on how to do this?

Just an update: I ended up using a mixture of both Jeff's and StompChicken's answers.

The quality of information retrieved from Wikipedia is excellent. In particular (and unsurprisingly) it contains a great deal of relevant information, compared with some corpora in which terms such as 'blog' and 'ipod' do not exist.

The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):

  • Golf: [ball, iron, tee, bag, club]
  • Photography: [camera, film, photograph, art, image]
  • Fishing: [fish, net, hook, trap, bait, lure, rod]

The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource, as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
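One way to picture the artefact check is to walk a term's hypernym (is-a) chain and see whether it reaches an "artifact" node. The sketch below is only an illustration under stated assumptions: the small `HYPERNYMS` table is hand-written as a stand-in for WordNet, and the function name `is_physical_artifact` and its three-valued result are hypothetical, not taken from the software described above. It returns `None` for a term that is missing from the table, which is the 'ipod' situation.

```python
from typing import Optional

# Hypothetical toy hypernym (is-a) table standing in for WordNet.
HYPERNYMS = {
    "racket": "sports_equipment",
    "sports_equipment": "equipment",
    "equipment": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "physical_object",
    "serve": "tennis_stroke",
    "tennis_stroke": "stroke",
    "stroke": "action",
    "action": "event",
}
ROOTS = {"physical_object", "event"}  # top nodes of the toy hierarchy


def is_physical_artifact(word: str) -> Optional[bool]:
    """Return True/False for a known word, None if it is out of vocabulary."""
    if word not in HYPERNYMS and word not in ROOTS:
        return None  # unknown term: cannot be classified either way
    node = word
    seen = set()  # guards against accidental cycles in the table
    while node is not None and node not in seen:
        if node == "artifact":
            return True
        seen.add(node)
        node = HYPERNYMS.get(node)  # None once a root is passed
    return False


if __name__ == "__main__":
    for term in ("racket", "serve", "ipod"):
        print(term, is_physical_artifact(term))
```

Keeping "unknown" separate from "not an artefact" makes it possible to send out-of-vocabulary terms to another resource instead of silently discarding them.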

Recommended answer

I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:

  1. Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. They will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
  2. Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc, so I can't speak to how complete it is or how easy it is to use.
  3. n-gram frequency analysis, as mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
  4. Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
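To make the "tree walk" in option 1 concrete, here is a minimal sketch of a path-length score over a toy is-a hierarchy. The `PARENT` table and the function names are made up for illustration; this is not the WordNet::Similarity API, and real measures differ in how they count and weight edges. The score is 1 / (1 + number of edges between the two terms through their lowest common ancestor).

```python
# Hypothetical toy is-a hierarchy; WordNet's real hierarchy is far larger.
PARENT = {
    "tennis_ball": "ball",
    "snooker_ball": "ball",
    "ball": "game_equipment",
    "racket": "sports_implement",
    "game_equipment": "equipment",
    "sports_implement": "equipment",
    "equipment": "artifact",
}


def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """Return 1 / (1 + edges between a and b), or None if they share no ancestor."""
    depth_b = {node: i for i, node in enumerate(ancestors(b))}
    # Walking up from a, the first node also above b is the lowest common ancestor.
    for i, node in enumerate(ancestors(a)):
        if node in depth_b:
            return 1.0 / (1 + i + depth_b[node])
    return None


if __name__ == "__main__":
    print(path_similarity("tennis_ball", "snooker_ball"))
    print(path_similarity("tennis_ball", "racket"))
    print(path_similarity("tennis_ball", "ipod"))
```

A term that is absent from the hierarchy has no path to anything else, so the score is undefined (`None` here). That is the limitation noted in option 1.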

[...]

Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.

I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball), and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
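The difference between the two signals can be sketched on a tiny made-up corpus. Note the assumptions: plain sentence-level co-occurrence counting is used here as a much simpler stand-in for LSA (which would apply a singular value decomposition to a term-document matrix), the hand-written stopword set is only a crude placeholder for a part-of-speech filter, and all names and sentences below are invented for illustration.

```python
from collections import Counter

# Tiny made-up corpus, one sentence per string (illustration only).
SENTENCES = [
    "she hit the tennis ball over the net",
    "a tennis racket and a tennis ball lay by the net",
    "his serve sent the tennis ball wide",
    "the tennis racket needs new strings",
]

# Crude placeholder for a part-of-speech filter.
STOPWORDS = frozenset({"a", "and", "by", "the", "his", "she", "over"})


def bigram_followers(target, sentences):
    """Count words that immediately follow `target` (bigram frequency)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for left, right in zip(tokens, tokens[1:]):
            if left == target:
                counts[right] += 1
    return counts


def sentence_cooccurrence(target, sentences, stopwords=frozenset()):
    """Count words sharing a sentence with `target`, once per sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.split())
        if target in tokens:
            counts.update(tokens - {target} - stopwords)
    return counts


tight = bigram_followers("tennis", SENTENCES)
loose = sentence_cooccurrence("tennis", SENTENCES, STOPWORDS)

if __name__ == "__main__":
    print("bigram followers:", tight.most_common())
    print("same-sentence words:", loose.most_common())
```

On this corpus the bigram counts surface only "ball" and "racket", while the sentence-level counts also pick up "net" and "serve", along with non-noun noise such as "hit" and "sent" that a part-of-speech filter would be expected to remove.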
