Synonyms using Lucene

Question

What is the best way to handle synonyms (including multi-word phrases) using Lucene? In particular, I need to execute queries like: a OR b OR c NOT d

How about adding a new field called "synonyms" to each document while indexing? This field's value would be a list of all the synonyms that apply to the document. It would be added only when the document contains a term that has synonyms.

I would then execute an "OR" search query that looks for the search keyword in this field as well as in the other fields.

Can this approach work well for any kind of query?

FYI, the synonyms in my application are entirely custom and do not come from an English dictionary. For example, "Global Leader in Finance" could also mean "Top Investment Bank" or "Fortune 500 Finance company", etc.
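The indexing idea described in the question can be sketched roughly as follows. This is only an illustration in plain Python, not Lucene code: a dict stands in for a Lucene document, the `content` field name and the helper names are hypothetical, and phrase detection is a naive lowercase substring match. The query helper merely builds a string in Lucene's classic query-parser syntax.

```python
# Hypothetical custom synonym groups: every phrase in a group means the same thing.
SYNONYM_GROUPS = [
    {"global leader in finance", "top investment bank", "fortune 500 finance company"},
]


def synonyms_for(text):
    """Return the synonyms of any group phrase found in text, sorted.

    Phrases that already occur in the text are not repeated.
    """
    lowered = text.lower()
    found = set()
    for group in SYNONYM_GROUPS:
        present = {phrase for phrase in group if phrase in lowered}
        if present:
            found |= group - present
    return sorted(found)


def build_document(text):
    """Build a dict standing in for a document.

    The "synonyms" field is added only when the text has synonyms.
    """
    doc = {"content": text}
    synonyms = synonyms_for(text)
    if synonyms:
        doc["synonyms"] = synonyms
    return doc


def build_query(phrase):
    """Build a query string that searches both fields, combined with OR."""
    quoted = '"' + phrase.replace('"', '\\"') + '"'
    return f"content:{quoted} OR synonyms:{quoted}"
```

One thing this sketch makes visible: an exclusion such as `NOT d` would presumably need to be applied to the "synonyms" field as well as to the main field, otherwise a document that contains only a synonym of `d` would still be returned.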

Please advise.

Thanks.

Recommended answer

There is a contribution to the Lucene project called "wordnet". According to its documentation:

This package uses synonyms defined by WordNet to build a Lucene index storing them, which in turn can be used for query expansion. You normally run Syns2Index once to build the query index/"database", and then call SynExpand.expand(...) to expand a query.

It includes an example:

If you pass in the query "big dog" then it prints out:

Query: big adult^0.9 bad^0.9 bighearted^0.9 boastful^0.9 boastfully^0.9 bounteous^0.9 bountiful^0.9 braggy^0.9 crowing^0.9 freehanded^0.9 giving^0.9 grown^0.9 grownup^0.9 handsome^0.9 large^0.9 liberal^0.9 magnanimous^0.9 momentous^0.9 openhanded^0.9 prominent^0.9 swelled^0.9 vainglorious^0.9 vauntingly^0.9 dog andiron^0.9 blackguard^0.9 bounder^0.9 cad^0.9 chase^0.9 click^0.9 detent^0.9 dogtooth^0.9 firedog^0.9 frank^0.9 frankfurter^0.9 frump^0.9 heel^0.9 hotdog^0.9 hound^0.9 pawl^0.9 tag^0.9 tail^0.9 track^0.9 trail^0.9 weenie^0.9 wiener^0.9 wienerwurst^0.9

Notice that the original words ("big" and "dog") have no weighting attached to them. The synonyms, however, carry a weighting (0.9) that you can configure yourself.

The wordnet package comes bundled with the standard distribution of Lucene, in the "contrib" directory.
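For custom synonyms that are not in WordNet, the same weighted-expansion idea can be applied to a hand-built mapping. The following is a minimal sketch in plain Python under stated assumptions: it is not the wordnet package's API, the `SYNONYMS` mapping and the function names are hypothetical, and it only produces a string in Lucene's classic query-parser syntax (quoted phrases, `^` boosts, `OR`/`NOT`). The original term stays unboosted, each synonym gets a configurable boost, and excluded terms are expanded without boosts.

```python
# Hypothetical custom mapping from a lowercase phrase to its synonyms.
SYNONYMS = {
    "global leader in finance": [
        "top investment bank",
        "fortune 500 finance company",
    ],
}


def _term(text):
    """Quote multi-word phrases so they are parsed as phrase queries."""
    return f'"{text}"' if " " in text else text


def expand(term, boost=0.9):
    """Return the term plus its synonyms, joined with OR.

    The original term is left unboosted. Synonyms get ^boost, or no boost
    at all when boost is None. A term without synonyms is returned as is.
    """
    parts = [_term(term)]
    for synonym in SYNONYMS.get(term.lower(), []):
        suffix = "" if boost is None else f"^{boost}"
        parts.append(_term(synonym) + suffix)
    if len(parts) == 1:
        return parts[0]
    return "(" + " OR ".join(parts) + ")"


def expand_query(include, exclude=(), boost=0.9):
    """Build a query of the form 'a OR b OR c NOT d', expanding every term."""
    query = " OR ".join(expand(term, boost) for term in include)
    for term in exclude:
        # Boosts do not matter for excluded clauses, so they are omitted.
        query += " NOT " + expand(term, boost=None)
    return query
```

For instance, `expand_query(["bank", "global leader in finance"], exclude=["bakery"], boost=0.5)` keeps `bank` and `bakery` unchanged and replaces the finance phrase with a parenthesized group of the phrase and its two boosted synonyms.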
