Lucene: best way to do "starts-with" queries


Question

I want to be able to do the following types of queries:

The data to index consists of (let's say) music videos where only the title is interesting. I simply want to index these and then create queries for them such that, whatever word or words the user typed, the documents containing those words, in that order, at the beginning of the title are returned first, followed (in no particular order) by documents containing at least one of the searched words anywhere in the title. All of this should also be case insensitive.

Example:

For the documents:

  • Video1Title = Sea is blue
  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever
  • Video4Title = Seaside Whatever

If I search "sea" I want to get

  • Video1Title = Sea is blue

first, followed by all the other documents that contain "sea" in the title, but not at the beginning.

If I search "Wild sea" I want to get

  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever

first, followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as the title prefix.

If I search "Seasi" I don't want to get anything (I don't care for keyword tokenization and prefix queries).
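The ordering described above can be pinned down with a small reference implementation that does not use Lucene at all. This is only a sketch of the expected behaviour, useful for checking whatever Lucene query is eventually chosen. It assumes whitespace tokenization and lower-casing as a stand-in for a real analyzer, and the class name, method names and sample titles are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class StartsWithRanking {

    // Lower-case and split on whitespace; a stand-in for a real analyzer.
    static List<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True when the first tokens of the title equal the query tokens, in order.
    static boolean startsWith(List<String> title, List<String> query) {
        return !query.isEmpty()
                && title.size() >= query.size()
                && title.subList(0, query.size()).equals(query);
    }

    // Prefix matches first, then titles containing at least one query word.
    static List<String> search(List<String> titles, String query) {
        List<String> q = tokens(query);
        List<String> prefixHits = new ArrayList<>();
        List<String> otherHits = new ArrayList<>();
        for (String title : titles) {
            List<String> t = tokens(title);
            if (startsWith(t, q)) {
                prefixHits.add(title);
            } else if (t.stream().anyMatch(q::contains)) {
                otherHits.add(title);
            }
        }
        prefixHits.addAll(otherHits);
        return prefixHits;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Sea is blue", "Wild sea", "Wild sea Whatever", "Seaside Whatever");
        System.out.println(search(titles, "sea"));      // [Sea is blue, Wild sea, Wild sea Whatever]
        System.out.println(search(titles, "Wild sea")); // [Wild sea, Wild sea Whatever, Sea is blue]
        System.out.println(search(titles, "Seasi"));    // []
    }
}
```

Whole-token comparison is what makes "Seasi" return nothing: "seaside" is a different token, so neither branch matches.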

Now, AFAIK, there's no actual way to tell Lucene "find me documents where word1 and word2 and so on are in positions 1 and 2 and 3 and so on".

There are "workarounds" to simulate that behaviour:

  • Index the field twice. In field1 the title is tokenized into words (using perhaps StandardAnalyzer), and in field2 the whole title is kept as a single token (using KeywordAnalyzer). Then search with something like:

+(field1:word1 word2 word3) (field2:"word1 word2 word3*")

effectively telling Lucene: "documents must contain word1 or word2 or word3 in the title, and those that match 'title starts with >word1 word2 word3<' are better (get a higher score)".

  • Add a "lucene_start_token" to the beginning of the field when indexing, so that Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea", and so on for the rest.

Then run a query such as:

+(title:sea) (title:"lucene_start_token sea")

so that Lucene returns all documents which contain my search word(s) in the title and also gives a better score to those that matched "lucene_start_token + search words".
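As a rough illustration of the start-token workaround, the two string transformations involved (what gets indexed, and what gets sent as the query) might look like the following. This is a sketch under assumptions: the class and method names are hypothetical, the user input is assumed to be plain words with no query-syntax characters to escape, and the query uses the field-grouping form `title:(...)` so that every word is applied to the title field.

```java
import java.util.Locale;

public class StartTokenQuery {

    // Marker token from the question, prepended to every indexed title.
    static final String START = "lucene_start_token";

    // Text to put into the "title" field at index time.
    static String indexedTitle(String title) {
        return START + " " + title.trim();
    }

    // Builds the query string: the first clause is required (at least one
    // word somewhere in the title), the second clause is optional and only
    // raises the score of titles that begin with the whole phrase.
    static String buildQuery(String userInput) {
        String words = userInput.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ");
        return "+title:(" + words + ") title:\"" + START + " " + words + "\"";
    }

    public static void main(String[] args) {
        System.out.println(indexedTitle("Wild sea"));
        // lucene_start_token Wild sea
        System.out.println(buildQuery("Wild  Sea"));
        // +title:(wild sea) title:"lucene_start_token wild sea"
    }
}
```

The phrase clause can only match at the start of a title because the marker token is only ever indexed in the first position.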

My question is then: are there indeed better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?

Recommended answer

You can use Lucene payloads for that. They let you give a custom boost to every term of the field value.

So, when you index your titles, you can start with a boost factor of 3 (for example) on the first term and lower it for each following term:

title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5

title: sea|3.0 creatures|2.5

Indexing this way, you boost the terms nearest to the start of the title.

The main problem with this approach is that you have to tokenize the text yourself and add all this boost information "manually", because the Analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
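The "manual" step could be sketched as a small helper that turns a title into the `term|boost` form shown above. This is only an illustration: the class name, the method name and the `start`, `step` and `floor` parameters are hypothetical, whitespace tokenization is assumed, and the output is presumably what a delimiter-based payload filter (such as Lucene's DelimitedPayloadTokenFilter) would then consume.

```java
import java.util.Locale;
import java.util.StringJoiner;

public class PayloadTitle {

    // Produces "term|boost" pairs: the first term gets `start`, each later
    // term gets `step` less, and no term drops below `floor`.
    static String withBoosts(String title, double start, double step, double floor) {
        StringJoiner out = new StringJoiner(" ");
        String trimmed = title.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] terms = trimmed.split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            double boost = Math.max(floor, start - i * step);
            out.add(terms[i] + "|" + String.format(Locale.ROOT, "%.1f", boost));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(withBoosts("Wild creatures blue sea", 3.0, 0.5, 1.0));
        // wild|3.0 creatures|2.5 blue|2.0 sea|1.5
        System.out.println(withBoosts("Sea creatures", 3.0, 0.5, 1.0));
        // sea|3.0 creatures|2.5
    }
}
```

The floor keeps long titles from ending up with zero or negative boosts on their last terms. Whether a linear decay is the right shape is a tuning decision the answer does not address.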
