Why is Solr so much faster than Postgres?


Problem description

I recently switched from Postgres to Solr and saw a ~50x speedup in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."

I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres, though, it was still using just a single index and then scanning (I assume because it couldn't make use of all the different indices).

As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in memory. So I'm wondering where such a large performance difference comes from.

What differences in architecture would explain this?

Recommended answer

First, Solr doesn't use B-trees. A Lucene index (Lucene is the underlying library used by Solr) is made of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this term dictionary is done with a binary search, so the cost of a single-term lookup is O(log(t)), where t is the number of terms. In contrast, using the index of a standard RDBMS costs O(log(d)), where d is the number of documents. When many documents share the same value for some field, t is much smaller than d, and this can be a big win.
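The lookup described above can be illustrated with a minimal Python sketch. It is a toy model, not Lucene's actual code: the `Segment` class, its `lookup` method, and the sample data are all hypothetical names made up for this example, and each document is assumed to hold exactly one term.

```python
from bisect import bisect_left


class Segment:
    """Toy read-only segment: a sorted term dictionary plus postings lists.

    Hypothetical illustration only; this is not how Lucene lays out data.
    """

    def __init__(self, docs):
        # docs maps a document id to a single field value (one term per doc).
        postings = {}
        for doc_id, term in sorted(docs.items()):
            postings.setdefault(term, []).append(doc_id)
        # The term dictionary: distinct terms in lexicographic order.
        self.terms = sorted(postings)
        # postings[i] lists the documents that contain terms[i].
        self.postings = [postings[t] for t in self.terms]

    def lookup(self, term):
        # Binary search over the t distinct terms, not the d documents.
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.postings[i]
        return []


docs = {1: "mazda", 2: "honda", 3: "mazda", 4: "toyota", 5: "mazda"}
segment = Segment(docs)
mazda_docs = segment.lookup("mazda")
```

Here five documents share only three distinct terms, so the binary search runs over three entries, and the matching documents are then read directly from the postings list.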

Moreover, Lucene committer Uwe Schindler added support for high-performance numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use case seems to rely heavily on numeric range queries, this may explain why Solr is so much faster. (For more information, read the javadocs, which are very interesting and link to the relevant research papers.)
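To give a feel for how storing each value at several precisions helps, here is a simplified Python sketch of the general idea. It is an assumption-laden illustration, not Lucene's implementation: `NumericIndex`, `split_range`, the 4-bit precision step, and the sample mileage data are all hypothetical, and values are assumed to be non-negative integers below 2**32.

```python
def split_range(lo, hi, step=4, max_shift=28):
    """Cover the inclusive range [lo, hi] with (shift, prefix) terms.

    A term (shift, p) stands for every value v with v >> shift == p.
    """
    terms = []
    shift = 0
    mask = (1 << step) - 1
    while lo <= hi:
        if shift >= max_shift:
            # Coarsest level: list the remaining prefixes one by one.
            terms.extend((shift, p) for p in range(lo, hi + 1))
            break
        # Peel off low prefixes that do not start a coarser block.
        while lo <= hi and (lo & mask) != 0:
            terms.append((shift, lo))
            lo += 1
        # Peel off high prefixes that do not end a coarser block.
        while lo <= hi and (hi & mask) != mask:
            terms.append((shift, hi))
            hi -= 1
        # What is left is block-aligned, so move to the coarser level.
        lo >>= step
        hi >>= step
        shift += step
    return terms


class NumericIndex:
    """Indexes every value at several precisions (hypothetical toy class)."""

    def __init__(self, values, step=4, max_shift=28):
        self.step = step
        self.max_shift = max_shift
        self.terms = {}
        for doc_id, value in values.items():
            for shift in range(0, max_shift + 1, step):
                key = (shift, value >> shift)
                self.terms.setdefault(key, set()).add(doc_id)

    def range_query(self, lo, hi):
        hits = set()
        for term in split_range(lo, hi, self.step, self.max_shift):
            hits |= self.terms.get(term, set())
        return hits


mileage = {1: 12000, 2: 49999, 3: 50000, 4: 75000, 5: 0}
index = NumericIndex(mileage)
under_50k = index.range_query(0, 49999)
```

The trade-off in this sketch is more index entries per value in exchange for far fewer terms per range query: a range spanning tens of thousands of distinct values is covered by a small number of coarse and fine terms instead of one term per value.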

But Solr can only do this because it doesn't have all the constraints that an RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
