Combining hits from multiple documents into a single hit in Lucene

Problem description

I am trying to get a particular search to work and it is proving problematic. The actual source data is quite complex but can be summarised by the following example:

I have articles that are indexed so that they can be searched. Each article also has multiple properties associated with it, which are also indexed and searchable. When users search, they can get hits in either the main article or its associated properties. Regardless of where a hit occurs, the article is returned as the search hit (i.e. a property is never a hit in its own right).

Now for the complexity:

Each property has security on it, which means that for any given user, they may or may not be able to see the property. If a user cannot see a property, they obviously do not get a search hit in it. This security check is proprietary and cannot be done using the typical mechanism of storing a role in the index alongside the other fields in the document.

I currently have an index that contains the articles and properties indexed separately (i.e. an article is indexed as a document, and each property has its own document). When a search happens, a hit in article A or a hit in any of the properties of article A should be classed as a hit for article A alone, with the scores combined.

To achieve this originally, Lucene v1.3 was modified: BooleanQuery was changed to use a custom Scorer that applies the security check and treats two hits in different documents as a single hit in one document. I am trying to upgrade to the latest version (v2.3.2; I am using Lucene.Net), ideally without having to modify Lucene in any way.

An additional problem occurs with AND searches. If an article contains the word foo and one of its properties contains the word bar, then searching for "foo AND bar" should return the article as a hit, even though no single document contains both words. My current code deals with this inside the custom Scorer.

Any ideas how/if this can be done?

I am thinking along the lines of passing a custom HitCollector into the search, but with the boolean search "foo AND bar", execution never reaches my HitCollector because the ConjunctionScorer filters out all of the results from the sub-queries before they get there.


EDIT:

Whether or not a user can see a property is not based on the property itself, but on the value of the property. I cannot therefore put the extra security conditions into the query upfront as I don't know the value to filter by.

As an example:

+---------+------------+------------+
| Article | Property 1 | Property 2 |
+---------+------------+------------+
|    A    |     X      |     J      |
|    B    |     Y      |     K      |
|    C    |     Z      |     L      |
+---------+------------+------------+

If a user can see everything, then searching for "B AND Y" will return a single search result for article B.

If another user cannot see any property whose value contains Y, then the same search for "B AND Y" will return no hits for that user.

I have no way of knowing upfront which values a user can and cannot see. The only way to tell is to perform the security check (currently done when filtering a hit from a field in the document), which I obviously cannot do for every possible data value for each user.
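
To make the example concrete, here is a small self-contained Java model of the table above. It is only a sketch of the desired behaviour, not Lucene code: `IndexedDoc`, `ValueSecurityExample`, `searchAll` and the `canSeeValue` predicate are names invented for illustration, and the predicate stands in for the proprietary security check. Matching is a plain substring test, and the check is only run on values that actually matched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** One indexed document: an article body or a single property value (hypothetical model). */
final class IndexedDoc {
    final String articleId;   // article this document belongs to
    final boolean isProperty; // false for the article itself
    final String text;        // indexed text or property value

    IndexedDoc(String articleId, boolean isProperty, String text) {
        this.articleId = articleId;
        this.isProperty = isProperty;
        this.text = text;
    }
}

final class ValueSecurityExample {
    // The table from the question: each article and each property is a separate document.
    static final List<IndexedDoc> INDEX = Arrays.asList(
        new IndexedDoc("A", false, "A"), new IndexedDoc("A", true, "X"), new IndexedDoc("A", true, "J"),
        new IndexedDoc("B", false, "B"), new IndexedDoc("B", true, "Y"), new IndexedDoc("B", true, "K"),
        new IndexedDoc("C", false, "C"), new IndexedDoc("C", true, "Z"), new IndexedDoc("C", true, "L"));

    /**
     * Returns the articles that have a visible hit for every term.
     * canSeeValue is an assumed stand-in for the security check.
     */
    static Set<String> searchAll(List<String> terms, Predicate<String> canSeeValue) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> articlesForTerm = new TreeSet<>();
            for (IndexedDoc doc : INDEX) {
                if (!doc.text.contains(term)) {
                    continue; // no match, so no security check is needed
                }
                if (!doc.isProperty || canSeeValue.test(doc.text)) {
                    articlesForTerm.add(doc.articleId); // the article is the hit, never the property
                }
            }
            if (result == null) {
                result = articlesForTerm;
            } else {
                result.retainAll(articlesForTerm); // AND is applied per article, not per document
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }
}
```

With the terms B and Y, a predicate that allows every value gives a set containing only article B, while a predicate that rejects values containing Y gives an empty set, matching the two cases described above.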

Solution

Having now implemented this (after a lot of head-scratching and stepping through Lucene searches), I thought I'd post back on how I achieved it.

Because I am interested in all of the results (i.e. not a page at a time), I can avoid using the Hits object (which has been deprecated in later versions of Lucene anyway). This means I can do my own hit collection using the Search(Weight, Filter, HitCollector) method of IndexSearcher, iterating over all possible results and combining document hits as appropriate. To do this, I had to hook into Lucene's querying mechanism, but only when AND and NOT clauses are present. This is achieved by:

  1. Creating a custom QueryParser and overriding GetBooleanQuery(ArrayList, bool) to return my own implementation.
  2. Creating a custom BooleanQuery (returned from the custom QueryParser) and overriding CreateWeight(Searcher) to return my own implementation.
  3. Creating a custom Weight (returned from the custom BooleanQuery) and overriding Scorer(IndexReader) to return my own implementation.
  4. Creating a custom BooleanScorer2 (returned from the custom Weight) and overriding the Score(HitCollector) method. This is what deals with the custom logic.
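
The four steps above form a chain of single-method overrides, where each custom class hands back the next custom link. The Java sketch below models only that shape. The `Base*` classes are placeholders invented for illustration, not the Lucene types, and the real methods take arguments (the clause list, a Searcher, an IndexReader, a HitCollector) that are left out here.

```java
// Stand-in base types. These are hypothetical placeholders, NOT Lucene classes.
class BaseScorer {
    String describe() { return "default scoring"; }
}

class BaseWeight {
    BaseScorer scorer() { return new BaseScorer(); }
}

class BaseQuery {
    BaseWeight createWeight() { return new BaseWeight(); }
}

class BaseParser {
    BaseQuery getBooleanQuery() { return new BaseQuery(); }
}

// Step 4: the scorer is the only class that carries custom logic.
class CombiningScorer extends BaseScorer {
    @Override
    String describe() { return "combining scoring"; }
}

// Step 3: the weight only exists to return the custom scorer.
class CombiningWeight extends BaseWeight {
    @Override
    BaseScorer scorer() { return new CombiningScorer(); }
}

// Step 2: the query only exists to return the custom weight.
class CombiningQuery extends BaseQuery {
    @Override
    BaseWeight createWeight() { return new CombiningWeight(); }
}

// Step 1: the parser only exists to return the custom query.
class CombiningParser extends BaseParser {
    @Override
    BaseQuery getBooleanQuery() { return new CombiningQuery(); }
}
```

Code that starts from `CombiningParser` ends up at `CombiningScorer` without the library itself being modified, which is the point of the chain.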

This might seem like a lot of classes, but most of them derive from a Lucene class and just override a single method.

The implementation of the Score(HitCollector) method in the custom BooleanScorer2 class now has the responsibility of doing the custom logic. If there are no required sub-scorers, the scoring can be passed to the base Score method and run as normal. If there are required sub-scorers, it means there was a NOT or an AND clause in the query. In this case, the special combination logic mentioned in the question comes into play. I have a class called ConjunctionScorer that does this (this is not related to the ConjunctionScorer in Lucene).

The ConjunctionScorer takes a list of scorers and iterates over them. For each one, I extract the hits and their scores (using the Doc() and Score() methods) and create my own search hits collection containing only those hits that the current user can see after performing the relevant security checks. If a hit has already been found by another scorer, I combine them together (using the mean of their scores for their new score). If a hit is from a prohibited scorer, I remove the hit if it was already found.
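
The combining step can be sketched in plain Java roughly as follows. This is not the code from the answer and uses no Lucene types: `RawHit`, `ClauseHits`, `HitCombiner` and the `canSee` predicate are invented names. It also makes three assumptions the text does not spell out: every non-prohibited clause must match an article for it to be kept, the combined score is the arithmetic mean over all visible contributing hits, and prohibited hits are applied at the end so the result does not depend on scorer order.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Predicate;

/** A hit reported by one sub-scorer, already mapped to the article that owns the document. */
final class RawHit {
    final String articleId;
    final String docId;
    final float score;

    RawHit(String articleId, String docId, float score) {
        this.articleId = articleId;
        this.docId = docId;
        this.score = score;
    }
}

/** All hits produced by one clause of the boolean query. */
final class ClauseHits {
    final boolean prohibited; // true for a NOT clause, false for a required clause
    final List<RawHit> hits;

    ClauseHits(boolean prohibited, List<RawHit> hits) {
        this.prohibited = prohibited;
        this.hits = hits;
    }
}

final class HitCombiner {
    /** Returns articleId -> combined score for articles that survive all clauses. */
    static Map<String, Float> combine(List<ClauseHits> clauses, Predicate<RawHit> canSee) {
        Map<String, float[]> totals = new TreeMap<>();          // articleId -> {score sum, hit count}
        Map<String, Integer> clausesMatched = new HashMap<>();  // required clauses matched per article
        Set<String> banned = new HashSet<>();                   // articles hit by a prohibited clause
        int requiredClauses = 0;

        for (ClauseHits clause : clauses) {
            if (!clause.prohibited) {
                requiredClauses++;
            }
            Set<String> seenInClause = new HashSet<>();
            for (RawHit hit : clause.hits) {
                if (!canSee.test(hit)) {
                    continue; // an invisible hit is treated as if it never matched
                }
                if (clause.prohibited) {
                    banned.add(hit.articleId);
                    continue;
                }
                float[] total = totals.computeIfAbsent(hit.articleId, id -> new float[2]);
                total[0] += hit.score;
                total[1] += 1;
                if (seenInClause.add(hit.articleId)) {
                    clausesMatched.merge(hit.articleId, 1, Integer::sum);
                }
            }
        }

        Map<String, Float> combined = new TreeMap<>();
        for (Map.Entry<String, float[]> entry : totals.entrySet()) {
            String articleId = entry.getKey();
            boolean matchedAll = clausesMatched.get(articleId) == requiredClauses;
            if (matchedAll && !banned.contains(articleId)) {
                combined.put(articleId, entry.getValue()[0] / entry.getValue()[1]);
            }
        }
        return combined;
    }
}
```

For "foo AND bar", an article whose own document matches foo with score 1.0 and whose property document matches bar with score 0.5 comes out as one hit with score 0.75. If the user cannot see that property, the article matches only one of the two clauses and is dropped.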

At the end of all of this, I set the hits onto the HitCollector passed into the BooleanScorer2.Score(HitCollector) method. This is a custom HitCollector that I passed into the IndexSearcher.Search(Query, HitCollector) method to originally perform the search. When this method returns, my custom HitCollector now contains my search results combined together as I wanted.
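
A collector for this purpose can be very small. The sketch below is an assumption about what such a collector might look like, not the one from the answer. `HitCollectorSketch` is a local stand-in with the same shape as the `collect(int doc, float score)` callback of Lucene 2.x's HitCollector so that the example compiles without the library, and `AllHitsCollector` and `docsByScore` are invented names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local stand-in mirroring the collect(int, float) callback; not the Lucene class. */
abstract class HitCollectorSketch {
    abstract void collect(int doc, float score);
}

/** Keeps every hit it is given so the full result set can be read after the search returns. */
final class AllHitsCollector extends HitCollectorSketch {
    private final Map<Integer, Float> scoresByDoc = new HashMap<>();

    @Override
    void collect(int doc, float score) {
        scoresByDoc.put(doc, score); // a later hit for the same document replaces the earlier one
    }

    /** Document ids ordered by descending score, with ties broken by ascending id. */
    List<Integer> docsByScore() {
        List<Integer> docs = new ArrayList<>(scoresByDoc.keySet());
        docs.sort((a, b) -> {
            int byScore = Float.compare(scoresByDoc.get(b), scoresByDoc.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return docs;
    }
}
```

Because all results are wanted at once rather than a page at a time, holding every hit in memory like this is acceptable here. A paged search would need a bounded structure instead.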

Hopefully this information will be useful to someone else faced with the same problem. It sounds like a lot of effort, but it is actually pretty trivial. Most of the work is done in combining the hits together in the ConjunctionScorer. Note that this is for Lucene v2.3.2, and may be different in later versions.
