How does Lucene/Solr achieve high performance in multi-field / faceted search?


Problem description

Context

This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which a search can run along multiple independent dimensions (facets) of an object (for example the size, speed, and price of a car).

When this is implemented with a relational database, multi-field indices are not useful for a large number of facets: facets can be searched in any order, so any one ordered multi-field index will rarely be applicable, and creating indices for all possible orderings is impractical.

Solr is advertised as coping well with faceted search, which, if I understand correctly, must be connected to Lucene (supposedly) performing well on multi-field queries (where the fields of a document correspond to the facets of an object).

Question

Lucene's inverted index could be stored in a relational database, and taking the intersection of the matching documents could likewise be achieved trivially in an RDBMS using single-field indices.
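To make the baseline concrete, here is a minimal sketch of the "intersect the matching documents" approach described above. It is an illustration only, not Lucene's actual implementation: the helper names `build_inverted_index` and `intersect`, and the sample `cars` data, are assumptions made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each (field, value) pair to a sorted list of matching doc ids."""
    index = defaultdict(list)
    for doc_id, fields in enumerate(docs):
        for field, value in fields.items():
            # doc ids arrive in increasing order, so each list stays sorted
            index[(field, value)].append(doc_id)
    return index


def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical sample data: each document is a car with two facets.
cars = [
    {"size": "small", "speed": "fast"},
    {"size": "large", "speed": "fast"},
    {"size": "small", "speed": "slow"},
    {"size": "small", "speed": "fast"},
]
index = build_inverted_index(cars)
matches = intersect(index[("size", "small")], index[("speed", "fast")])
```

Each facet needs only its own single-field posting list, so the order in which facets are combined does not matter.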

Therefore, Lucene presumably has some more advanced technique for multi-field queries than just taking the intersection of matching documents based on the inverted index.

So the question is: what is this technique or trick? More broadly: why can Lucene/Solr, in theory, achieve better faceted search performance than an RDBMS (if it can at all)?

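Note: My first guess would be that Lucene uses some kind of space-partitioning method to partition a vector space whose dimensions are built from the fields of a document, but as far as I understand Lucene is not purely vector-space based.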

Recommended answer

Faceting

There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these is faster than an RDBMS.

  1. Enum faceting. The result of a query is a bit vector in which the ith bit is 1 if the ith document matched. The facet is also a bit vector, so the intersection is just a bitwise AND. I don't think this is a novel approach, and most RDBMSs probably support it.
  2. Field cache. This is just a normal (non-inverted) index. The SQL-style query that is run here looks like:

select facet, count(*) from field_cache where docId in query_results group by facet

Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
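The two faceting styles above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Solr's code: Python integers stand in for bit vectors, a plain list indexed by docId stands in for the field cache (rather than a skip list), and the names `to_bits`, `enum_facet_counts`, and `field_cache_counts` are made up for the example.

```python
from collections import Counter


def to_bits(doc_ids):
    """Build a bit vector (as an int) with bit i set for each doc id i."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def enum_facet_counts(query_bits, facet_bits_by_value):
    """Enum faceting: AND the query bit vector with each facet value's
    bit vector and count the bits that survive."""
    return {
        value: bin(query_bits & bits).count("1")
        for value, bits in facet_bits_by_value.items()
    }


def field_cache_counts(query_doc_ids, field_cache):
    """Field-cache faceting: look up each matching doc's value by docId
    and group/count, like the SQL-style query above."""
    return Counter(field_cache[doc_id] for doc_id in query_doc_ids)


# Hypothetical data: field_cache[docId] is that document's facet value.
field_cache = ["red", "blue", "red", "green", "blue"]
query_results = [0, 1, 2, 4]

# One bit vector per facet value, derived from the same data.
facet_bits = {}
for doc_id, value in enumerate(field_cache):
    facet_bits[value] = facet_bits.get(value, 0) | (1 << doc_id)

enum_counts = enum_facet_counts(to_bits(query_results), facet_bits)
cache_counts = field_cache_counts(query_results, field_cache)
```

In this sketch the enum variant does work proportional to the number of distinct facet values, while the field-cache variant does work proportional to the number of matching documents. Note also that the enum variant reports a zero count for values with no matches, whereas the field-cache variant simply omits them.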

Term search

This is where Lucene shines. Why Lucene's approach works so well is too long to cover here, but I can recommend this post on Lucene Performance, or the papers linked from it.
