Hybrid search and indexing: words and token metadata in Solr


Question

I am building a set of plugins for Solr to enable a "hybrid" search that matches either words or token-level (not document-level!) metadata, namely specific ID numbers. The same word may have different ID numbers in different contexts; the IDs are generated at indexing time by an external application. For example, "run" may have 12345 in one case and 54321 in another, depending on the context. The ID numbers should carry more weight in the search. (They will be provided in the query at search time by the same external application.)

I read about custom fields for documents and wondered whether we could store a blob with these IDs there, but I am not sure how to include it in the search.

Or should I just pretend these IDs are "synonyms" (maybe wrapping them in some kind of unique marking, like [:12345:]) and use the synonym filter factories?

I am new to Solr, but I have read the relevant documentation, so I think I understand how it all works conceptually. Performance does not matter at this stage; this is a PoC. This looks somewhat similar to the question "Search different tokens on different fields in Solr", but not exactly. Oh, and I want to tokenise the text myself, too, but that's not an issue.

[Removed the bit about payloads; it is irrelevant here. Sorry about the confusion.]

Recommended answer

Unless I've misunderstood, you have already generated the magic tokens, so the only requirement is to check whether a magic token value is present in a field and, if it is, to score that match higher.

Index the magic token values into one field and the textual values into another. Use boosting to prioritise matches in the magic token field over matches in the textual values field. Based on your description, the magic token field can probably be an integer field (for example, one based on the tint field type).
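As a rough sketch of the two-field layout described above, the snippet below builds one document with a `text` field and a `token` field. The field names `text` and `token` are taken from the queries in this answer; the `id` field, the multi-valued `token` field, the helper name `build_doc`, and the shape of the annotated input are assumptions made for illustration only.

```python
import json


def build_doc(doc_id, annotated_tokens):
    """Build a Solr-style document from (word, token_id) pairs.

    annotated_tokens is a hypothetical input format: a list of
    (word, token_id) tuples, where token_id is None for words that
    the external application did not annotate.
    """
    return {
        "id": doc_id,  # assumed unique key field
        # Plain words go into the textual field.
        "text": " ".join(word for word, _ in annotated_tokens),
        # ID numbers go into a separate (assumed multi-valued) integer field.
        "token": [tid for _, tid in annotated_tokens if tid is not None],
    }


doc = build_doc("doc-1", [("run", 12345), ("fast", 32145), ("today", None)])

# A JSON array of documents, which could be posted to Solr's update handler.
print(json.dumps([doc]))
```

Keeping the IDs in their own field means they never collide with ordinary words, so no special marking such as [:12345:] should be needed.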

When searching, you can generate the search string as:

q=(token:12345^5 OR text:run) AND (token:32145^5 OR text:fast)

The ^5 boost weights a match in the token field five times as heavily as a match in the text field. If you don't care whether 12345 is also matched against the text field, you can instead let the dismax or edismax query parser spread the terms over both fields through its qf parameter:

q=12345 run 32145 fast&defType=edismax&qf=text token^5

You might have to tweak mm (the minimum-should-match parameter) to get the required number of hits, depending on what your application needs.
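Since the external application supplies the ID numbers at search time, both query forms can be generated from the same list of (word, ID) pairs. The following is a minimal sketch under that assumption; the function names `boolean_query` and `edismax_params` are hypothetical, the `mm` value is an arbitrary illustration, and the sketch does not escape special query characters in the words.

```python
from urllib.parse import urlencode


def boolean_query(pairs, boost=5):
    """Explicit form: each word may match by its boosted ID or by its text."""
    clauses = [f"(token:{tid}^{boost} OR text:{word})" for word, tid in pairs]
    return " AND ".join(clauses)


def edismax_params(pairs, boost=5, mm=None):
    """qf form: all terms are searched in both fields, token field boosted."""
    terms = []
    for word, tid in pairs:
        terms.extend([str(tid), word])
    params = {
        "q": " ".join(terms),
        "defType": "edismax",  # qf and mm are (e)dismax parameters
        "qf": f"text token^{boost}",
    }
    if mm is not None:
        params["mm"] = mm  # minimum-should-match, tuned per application
    return params


pairs = [("run", 12345), ("fast", 32145)]

print(boolean_query(pairs))
# URL-encoded parameter string for the select handler.
print(urlencode(edismax_params(pairs, mm="50%")))
```

The explicit form ties each ID to its own word, while the qf form is shorter but lets any term match in either field.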
