Lucene filter with docIds

Problem description

I'm trying to do the following: I want to create a set of candidates by querying each field separately and then adding the top k matches to this set. After I'm done with that, I need to run another query on this candidate set. The way I implemented it right now is using a QueryWrapperFilter with a BooleanQuery that matches the unique id field of each candidate document. However, this means I have to call IndexSearcher.doc().get("docId") for each candidate document before I can add it to my BooleanQuery, which is the major bottleneck. I'm only loading the docId field via MapFieldSelector("docId").

I wanted to create my own Filter class, but I can't use the internal Lucene doc ids directly, because they are specified per segment. Any thoughts on how to approach this?
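For reference, a minimal sketch of the setup the question describes, assuming Lucene 3.x (where MapFieldSelector lives). The "docId" field name comes from the post; the class, method, and variable names are illustrative. The stored-field lookup inside the loop is the bottleneck being asked about.

```java
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class CandidateQueryApproach {

    public static TopDocs searchWithinCandidates(IndexSearcher searcher,
                                                 Query[] perFieldQueries,
                                                 Query secondQuery,
                                                 int k) throws Exception {
        // Only load the unique-id field when fetching stored documents.
        MapFieldSelector idOnly = new MapFieldSelector("docId");
        BooleanQuery candidates = new BooleanQuery();

        for (Query q : perFieldQueries) {
            for (ScoreDoc hit : searcher.search(q, k).scoreDocs) {
                // Stored-field access for every candidate hit -- the slow part.
                String id = searcher.doc(hit.doc, idOnly).get("docId");
                candidates.add(new TermQuery(new Term("docId", id)),
                               BooleanClause.Occur.SHOULD);
            }
        }
        // Note: more than 1024 candidates would also require raising
        // BooleanQuery.setMaxClauseCount.

        // Run the second query restricted to the candidate documents.
        return searcher.search(secondQuery, new QueryWrapperFilter(candidates), k);
    }
}
```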

Solution

Instead of reading the stored docId, index the field (it probably already is indexed) and use the FieldCache to retrieve docIds much faster. Then, instead of using the docIds in a BooleanQuery, try using a TermsFilter or FieldCacheTermsFilter; the documentation of the latter describes the performance trade-offs.
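A minimal sketch of that suggestion, again assuming Lucene 3.x: the "docId" field is indexed (it need not be stored), the FieldCache replaces the per-hit stored-field lookup, and a FieldCacheTermsFilter restricts the second query to the candidate set. Everything except the Lucene classes (field, class, method, and variable names) is illustrative.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldCacheTermsFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

import java.util.LinkedHashSet;
import java.util.Set;

public class CandidateFilterApproach {

    public static TopDocs searchWithinCandidates(IndexSearcher searcher,
                                                 Query[] perFieldQueries,
                                                 Query secondQuery,
                                                 int k) throws Exception {
        IndexReader reader = searcher.getIndexReader();

        // The FieldCache maps each internal Lucene doc number to its external
        // "docId" term, so no stored-field loading (IndexSearcher.doc) is needed.
        String[] externalIds = FieldCache.DEFAULT.getStrings(reader, "docId");

        // Collect the external ids of the top-k hits of every per-field query.
        Set<String> candidateIds = new LinkedHashSet<String>();
        for (Query q : perFieldQueries) {
            for (ScoreDoc hit : searcher.search(q, k).scoreDocs) {
                candidateIds.add(externalIds[hit.doc]);
            }
        }

        // Restrict the second query to the candidate set. A TermsFilter built
        // from Term("docId", id) would work the same way; the javadoc of
        // FieldCacheTermsFilter discusses the trade-offs between the two.
        FieldCacheTermsFilter candidateFilter = new FieldCacheTermsFilter(
                "docId", candidateIds.toArray(new String[candidateIds.size()]));

        return searcher.search(secondQuery, candidateFilter, k);
    }
}
```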
