Finding related words (specifically physical objects) to a specific word

Question

I am trying to find words (specifically physical objects) related to a single word. For example:

Tennis: tennis racket, tennis ball, tennis shoe

Snooker: snooker cue, snooker ball, chalk

Chess: chessboard, chess piece

Bookcase: book

I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:

Tennis: serve, volley, foot-fault, set point, return, advantage

Snooker: nothing

Chess: chess move, checkerboard (whose own meronym relationships show 'square' and 'diagonal')

Bookcase: shelf

Weighting of terms will eventually be required, but that is not really a concern now.

Anyone have any suggestions on how to do this?

Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.

The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).

The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):

  • golf: [ball, iron, tee, bag, club]
  • photography: [camera, film, photograph, art, image]
  • fishing: [fish, net, hook, trap, bait, lure, rod]

The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
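
The coverage problem can be illustrated with a small sketch. It assumes the "is this a physical object?" check is done by walking a term's hypernym chain up to a root concept. The taxonomy below is a tiny hand-written stand-in (every entry is hypothetical and not taken from WordNet), and `is_physical_object` is an illustrative helper, not part of any library.

```python
# Toy hypernym links (child -> parent). Hand-written for illustration only;
# these entries are NOT taken from WordNet.
TOY_HYPERNYMS = {
    "racket": "sports_equipment",
    "sports_equipment": "artifact",
    "artifact": "physical_object",
    "serve": "action",
    "action": "abstraction",
}


def is_physical_object(term, hypernyms=TOY_HYPERNYMS, root="physical_object"):
    """Return True/False for known terms, or None if the term is unknown."""
    known = set(hypernyms) | set(hypernyms.values())
    if term not in known:
        return None  # the resource simply has no entry, so it cannot decide
    seen = set()
    # Walk upwards until we hit the root, run out of links, or detect a cycle.
    while term != root and term in hypernyms and term not in seen:
        seen.add(term)
        term = hypernyms[term]
    return term == root


for word in ("racket", "serve", "ipod"):
    print(word, is_physical_object(word))
```

The `None` result for a missing term such as 'ipod' is the case described above: the lookup gives no answer at all, so a second source would be needed for such words.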

Recommended answer

I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:

  1. Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
  2. Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc, so I can't speak to how complete it is or how easy it is to use.
  3. n-gram frequency analysis, as mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
  4. Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
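
As a rough illustration of option 3, the sketch below counts which words directly follow a target word (bigram frequencies) in a corpus. The four-sentence corpus and the helper name `related_by_bigram` are made up for this example. A real run would use a large corpus such as a Wikipedia dump.

```python
import re
from collections import Counter


def related_by_bigram(corpus, target, top_n=5):
    """Count the words that immediately follow `target` in the corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Slide over adjacent token pairs (bigrams).
        for left, right in zip(tokens, tokens[1:]):
            if left == target:
                counts[right] += 1
    return counts.most_common(top_n)


# Toy corpus, invented for illustration.
corpus = [
    "She bought a new tennis racket and a tennis ball.",
    "The tennis ball bounced over the net.",
    "His tennis shoe split during the match.",
    "Tennis is played on a court.",
]

print(related_by_bigram(corpus, "tennis"))
```

On this toy corpus 'ball' ranks first, but the function word 'is' also shows up among the followers, which is the kind of noise mentioned in item 3. Stop-word or part-of-speech filtering would be needed to remove it.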

[...]

Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet, then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.

I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
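
To make the document-level side of this concrete, here is a minimal sketch that builds the term-by-document counts LSA starts from and compares terms by cosine similarity. It deliberately skips the SVD step that defines LSA, so it is plain co-occurrence similarity rather than LSA itself. The three documents and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_vectors(docs):
    """Map each term to a sparse {doc_index: count} vector."""
    vectors = {}
    for i, doc in enumerate(docs):
        for term, n in Counter(re.findall(r"[a-z]+", doc.lower())).items():
            vectors.setdefault(term, {})[i] = n
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def related_terms(docs, target, top_n=5):
    """Rank other terms by how similarly they are spread across documents."""
    vectors = term_vectors(docs)
    if target not in vectors:
        return []
    scores = [
        (term, cosine(vectors[target], vec))
        for term, vec in vectors.items()
        if term != target
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Toy documents, invented for illustration.
docs = [
    "tennis players serve over the net",
    "the tennis net was replaced before the serve",
    "snooker players chalk the cue",
]

print(related_terms(docs, "tennis", top_n=3))
```

Here 'net' and 'serve' rank highest for 'tennis' because they occur in exactly the same documents, while 'cue' scores zero. A noun filter from a part-of-speech tagger could then be applied to the ranked list, as suggested above.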
