Synonyms using Lucene

Problem description

What is the best way to handle synonyms (phrases) using Lucene? Especially when I need to execute queries like: a OR b OR c NOT d

How about adding a new field called "synonyms" to each document at index time? This field's value would hold a list of all the document's synonyms. It would be added to a document only when that document has any synonyms.

I would then execute an "OR" search query that looks for the search keyword in this field along with the other fields.

Can this approach work well for any kind of query?

FYI, the synonyms in my application are entirely custom and do not come from an English dictionary; e.g. "Global Leader in Finance" could also mean "Top Investment Bank" or "Fortune 500 Finance company", etc.

Please advise.

Thanks.

Recommended answer

There is a contribution to the Lucene project called "wordnet". According to its documentation:

This package uses synonyms defined by WordNet to build a Lucene index storing them, which in turn can be used for query expansion. You normally run Syns2Index once to build the query index/"database", and then call SynExpand.expand(...) to expand a query.

It includes an example:

If you pass in the query "big dog" then it prints out:

Query: big adult^0.9 bad^0.9 bighearted^0.9 boastful^0.9 boastfully^0.9 bounteous^0.9 bountiful^0.9 braggy^0.9 crowing^0.9 freehanded^0.9 giving^0.9 grown^0.9 grownup^0.9 handsome^0.9 large^0.9 liberal^0.9 magnanimous^0.9 momentous^0.9 openhanded^0.9 prominent^0.9 swelled^0.9 vainglorious^0.9 vauntingly^0.9 dog andiron^0.9 blackguard^0.9 bounder^0.9 cad^0.9 chase^0.9 click^0.9 detent^0.9 dogtooth^0.9 firedog^0.9 frank^0.9 frankfurter^0.9 frump^0.9 heel^0.9 hotdog^0.9 hound^0.9 pawl^0.9 tag^0.9 tail^0.9 track^0.9 trail^0.9 weenie^0.9 wiener^0.9 wienerwurst^0.9

You see that the original words ("big" and "dog") have no weighting attached to them. The synonyms, however, have a weighting (0.9) that you can configure yourself.

The wordnet package comes bundled with the standard distribution of Lucene, in the "contrib" directory.
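
Since the synonyms in the question are custom phrases rather than WordNet entries, the same weighting idea could also be applied with a hand-built synonym map. Below is a minimal sketch in plain Java, not the contrib package's implementation: SynonymExpander, add and expand are hypothetical names made up for illustration, and the code only produces a query string in Lucene's classic query-parser syntax (quoted phrases, OR, and ^boost) that would still have to be passed to a query parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: expands one phrase into a
// boosted OR group written in Lucene's classic query-parser syntax.
class SynonymExpander {
    private final Map<String, List<String>> synonyms = new LinkedHashMap<>();
    private final float boost;

    SynonymExpander(float boost) {
        this.boost = boost;
    }

    // Register custom synonyms for a phrase (lookup is case-insensitive).
    void add(String phrase, String... alternatives) {
        synonyms.put(phrase.toLowerCase(Locale.ROOT), Arrays.asList(alternatives));
    }

    // The original phrase keeps the default weight; each synonym gets ^boost.
    String expand(String phrase) {
        List<String> clauses = new ArrayList<>();
        clauses.add(quote(phrase));
        List<String> alternatives = synonyms.getOrDefault(
                phrase.toLowerCase(Locale.ROOT), Collections.<String>emptyList());
        for (String alternative : alternatives) {
            clauses.add(quote(alternative) + "^" + boost);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    // Wrap a phrase in double quotes, escaping any embedded quotes.
    private static String quote(String s) {
        return "\"" + s.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        SynonymExpander expander = new SynonymExpander(0.9f);
        expander.add("Global Leader in Finance",
                "Top Investment Bank", "Fortune 500 Finance company");
        // prints ("Global Leader in Finance" OR "Top Investment Bank"^0.9 OR "Fortune 500 Finance company"^0.9)
        System.out.println(expander.expand("Global Leader in Finance"));
        // prints ("hedge fund") because no synonyms are registered for it
        System.out.println(expander.expand("hedge fund"));
    }
}
```

Because each expansion is a self-contained parenthesised group, a query such as a OR b OR c NOT d could presumably be built by expanding each term separately and joining the groups with the original operators. Whether expanding the query this way works better than indexing a separate "synonyms" field depends on how often the synonym list changes, since query-time expansion avoids reindexing.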
