Strategies for fast searches of billions of small documents in MongoDB

Problem Description

I need to store several billion small data structures (around 200 bytes each). So far, storing each element as a separate document is working well, with Mongo providing around 10,000 results per second. I'm using a 20-byte hash as the _id for each document, and a single index on the _id field. In testing, this is working for data sets with 5,000,000 documents.

In operation, we will be making around 10,000 requests per second, updating existing documents about 1,000 times per second, and inserting new documents maybe 100 times per second or less.

How can we manage larger data sets, when we cannot store an entire index in RAM? Will MongoDB perform better if we combine several elements into each document -- for a faster search through the index, but more data being returned in each query?
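
For concreteness, here is a minimal sketch (Python/pymongo; the collection names, the 8-byte-prefix bucketing scheme, and the local connection string are just illustrative assumptions) of the two layouts being compared:

```python
import os
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical local deployment
db = client["demo"]

element_hash = os.urandom(20)    # stand-in for the 20-byte hash key
element_bytes = os.urandom(200)  # stand-in for the ~200-byte payload

# Current layout: one element per document, keyed by the full 20-byte hash.
db.elements.insert_one({"_id": element_hash, "payload": element_bytes})

# Bucketed layout: group elements that share a key prefix into one document.
# Fewer documents -> a smaller _id index, but every read returns the whole bucket.
db.buckets.update_one(
    {"_id": element_hash[:8]},                                  # bucket key: 8-byte prefix
    {"$set": {"items." + element_hash.hex(): element_bytes}},   # element keyed by full hash
    upsert=True,
)
```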

Unlike other questions on SO, I'm not only interested in how much data we can stuff into Mongo. It can clearly manage the amount of data we're looking at. My concern is how can we maximize the speed of find operations on huge collections, given constrained RAM.

Our searches will tend to be clustered; around 50,000 elements will satisfy about 50% of the queries, but the remaining 50% will be randomly distributed across all of the data. Can we expect a performance gain by moving those 50,000 elements into their own collection, in order to keep a smaller index of the most-used data always in RAM?

Would reducing the size of the _id field from 20 bytes to 8 bytes have a significant impact on MongoDB's indexing speed?

Solution

A few strategies come to mind:

1) Use a distinct collection/database for the 'hot' documents.

If you know which documents are in the hot set then, yes, moving them into a separate collection will help. This ensures that the hot documents are co-resident on the same extents/pages. It also makes the index for those documents more likely to sit completely in memory, since that index is smaller and is used more often.

If the hot documents are randomly mixed in with the other documents, then you will likely have to fault in more of the leaf elements of the B-Tree index when loading a document, because the probability that another document has recently loaded or accessed the same index block is small.
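
As a rough illustration, a pymongo sketch of the split (the collection names, connection string, and the source of the hot-id list are hypothetical): copy the hot documents into their own collection and route lookups there first.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed deployment
db = client["demo"]

def promote_hot(hot_ids):
    """Copy documents whose _id is in hot_ids into a dedicated 'hot' collection."""
    for doc in db.elements.find({"_id": {"$in": list(hot_ids)}}):
        db.elements_hot.replace_one({"_id": doc["_id"]}, doc, upsert=True)

def lookup(_id):
    """Check the small, memory-resident hot collection before the big one."""
    return db.elements_hot.find_one({"_id": _id}) or db.elements.find_one({"_id": _id})
```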

2) Shorten the indexed values.

The shorter the indexed value, the more values fit into a single B-Tree block. (Note: the field names themselves are not stored in the index, only the values.) More entries per bucket means fewer buckets and less total memory needed for the index, which translates into a higher probability that blocks stay in memory, and for longer. In your example, a 20 -> 8 byte reduction is better than a 50% saving. If you can convert those 8 bytes to a long, there is a little more saving, since longs are stored without a string's length prefix (4 bytes) and trailing null (5 bytes saved in total).
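
One way to realize the 20 -> 8 byte reduction (a sketch, not the only option; SHA-1 and the field names are assumptions): fold the hash down to a signed 64-bit integer and use that as _id, keeping the full hash as an ordinary field to disambiguate collisions.

```python
import hashlib
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed deployment
db = client["demo"]

def short_id(payload: bytes) -> int:
    """Fold a 20-byte SHA-1 digest into a signed 64-bit integer (stored as a BSON long)."""
    digest = hashlib.sha1(payload).digest()                  # 20 bytes
    return int.from_bytes(digest[:8], "big", signed=True)    # fits BSON int64

payload = b"example element"
db.elements.insert_one({
    "_id": short_id(payload),                 # 8-byte long key -> smaller _id index
    "h": hashlib.sha1(payload).digest(),      # full hash kept to resolve collisions
    "payload": payload,
})
```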

3) Shorten the key names.

The shorter the field names, the less space each document takes. This has the unfortunate side effect of decreasing readability.
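
A small before/after illustration with hypothetical field names; the saving applies per document, because field names are stored inside every document (though not in the index):

```python
import os
import time

key, ts, node, data = os.urandom(8), int(time.time()), "node-17", os.urandom(180)

# Verbose names are repeated inside every one of the billions of documents.
verbose = {"_id": key, "timestamp": ts, "source_node": node, "payload": data}

# Abbreviated names hold the same data in less space, at a readability cost;
# keep the name mapping in application code or schema documentation.
compact = {"_id": key, "t": ts, "s": node, "p": data}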

4) Shard

This is really the only way to keep performance up in the face of reads across the entire corpus, which exhaust memory and, eventually, disk bandwidth. If you do shard, you will still want to shard the 'hot' collection.
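
If sharding is on the table, a sketch of the usual admin commands (run against a mongos router; the database and collection names are hypothetical), using a hashed shard key on _id so random point lookups spread evenly across shards:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # must point at a mongos router

client.admin.command("enableSharding", "demo")
client.admin.command(
    "shardCollection", "demo.elements_hot",
    key={"_id": "hashed"},   # hashed key distributes random lookups across shards
)
```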

5) Adjust the read-ahead on disk to a small value.

Since the 'non-hot' reads load a random document from disk, we really only want to read/fault into memory that document and as few of the documents around it as possible. Most systems will try to read ahead a large block of data once a user reads from a portion of a file, which is exactly the opposite of what we want here.

If you see your system faulting a lot, but the resident memory for the mongod process does not approach the system's available memory, you are likely seeing the effect of the OS reading in useless data.
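
On Linux the relevant knob is usually the block device's read-ahead setting. A hedged sketch (the device path is hypothetical, the command needs root, and the 32-sector value is just an example of "small"):

```python
import subprocess

DEVICE = "/dev/sdb1"  # hypothetical device holding the MongoDB data files

# Show the current read-ahead, measured in 512-byte sectors.
current = subprocess.run(
    ["blockdev", "--getra", DEVICE],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"read-ahead before: {current} sectors")

# Drop it to 32 sectors (16 KB) so a random document fault pulls in little extra data.
subprocess.run(["blockdev", "--setra", "32", DEVICE], check=True)
```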

6) Try to use monotonically increasing values for the keys.

This will trigger an optimization (for ObjectId-based indexes) whereby, when an index block splits, it does so at 90/10 instead of 50/50. The result is that most of the blocks in your index will be near capacity, and you will need fewer of them.

If you only know the 'hot' 50,000 documents after the fact then adding them to the separate collection in index order will also trigger this optimization.
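
A sketch of both ideas (the counter, collection names, and hot-id list are hypothetical; a single in-process counter would need replacing with something coordination-safe if there are multiple writers):

```python
import itertools
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed deployment
db = client["demo"]

# Monotonically increasing keys: new entries always land at the right edge of the
# index, so blocks split 90/10 and stay close to full.
counter = itertools.count(1)  # hypothetical single-writer counter

def insert_element(payload: bytes):
    db.elements_mono.insert_one({"_id": next(counter), "payload": payload})

# After the fact: load the known hot documents into their own collection in
# ascending _id order, which triggers the same right-leaning split behaviour.
hot_ids = []  # hypothetical: the ~50,000 known hot keys
hot_docs = list(db.elements.find({"_id": {"$in": hot_ids}}).sort("_id", 1))
if hot_docs:
    db.elements_hot.insert_many(hot_docs)  # inserted in index order
```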

Rob.
