Updating a large number of records in a collection

Question

I have a collection called TimeSheet that currently holds a few thousand records. It will eventually grow to 300 million records within a year. In this collection I embed a few fields from another collection called Department. Those fields will mostly never be updated, and only rarely will some records change. By rarely I mean once or twice a year, and not for all records: less than 1% of the records in the collection.

In most cases a department is not updated once it has been created. Even if there is an update, it will usually happen early on, when there are not many related records in TimeSheet.

Now suppose someone updates a department after a year. In the worst case, TimeSheet will hold about 300 million records in total, with about 5 million of them matching the department being updated. The update query condition will be on an indexed field.

Since this update is time-consuming and creates locks, I'm wondering whether there is a better way to do it. One option I'm considering is to run the update query in batches by adding an extra condition such as UpdatedDateTime > somedate && UpdatedDateTime < somedate.
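The batching idea described above can be sketched roughly as follows. This is only an illustration, not a tested production routine: the helper names dateWindows and updateInBatches are hypothetical, and the collection argument is assumed to expose updateMany(filter, update) returning an object with modifiedCount, as the mongo shell does. The sketch uses half-open windows ($gte / $lt) rather than strict > and <, so that records sitting exactly on a boundary are not skipped.

```javascript
// Split [start, end) into consecutive half-open windows of at most stepMs milliseconds.
function dateWindows(start, end, stepMs) {
  if (!(stepMs > 0)) throw new RangeError("stepMs must be positive");
  const windows = [];
  const endMs = end.getTime();
  for (let from = start.getTime(); from < endMs; from += stepMs) {
    const to = Math.min(from + stepMs, endMs);
    windows.push({ from: new Date(from), to: new Date(to) });
  }
  return windows;
}

// Run one multi-document update per window instead of a single huge update.
// Assumption: `collection.updateMany(filter, update)` behaves like the mongo
// shell method and returns an object with a `modifiedCount` field.
function updateInBatches(collection, baseFilter, update, windows) {
  let modified = 0;
  for (const w of windows) {
    // Narrow the base filter to one time window on the UpdatedDateTime field.
    const filter = Object.assign({}, baseFilter, {
      UpdatedDateTime: { $gte: w.from, $lt: w.to },
    });
    modified += collection.updateMany(filter, update).modifiedCount;
  }
  return modified;
}
```

In the shell this might be called as updateInBatches(db.TimeSheet, { "Department.Id": 42 }, { $set: { "Department.Name": "New name" } }, dateWindows(new Date("2013-01-01"), new Date("2014-01-01"), 7 * 24 * 3600 * 1000)), where the Department field names are made up for the example. Each batch is a separate, shorter write, so other operations get a chance to run between batches.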

Other details:

  • A single document could be about 3 or 4 KB
  • We have a replica set containing three replicas

Is there a better way to do this? What do you think of this kind of design? And what would you think if the numbers were smaller, as below?

1) 100 million total records and 100,000 matching records for the update query

2) 10 million total records and 10,000 matching records for the update query

3) 1 million total records and 1,000 matching records for the update query

Note: The collection names Department and TimeSheet, and their purpose, are fictional and not the real collections, but the statistics I have given are true.

Recommended answer

Let me give you a couple of hints based on my general knowledge and experience:

MongoDB stores the key names in every document. This repetition increases disk usage, which can cause performance issues on a very large database like yours. Using shorter key names reduces that overhead.

Pros:

  • Smaller documents, so less disk space
  • More documents fit in RAM (more caching)
  • Index sizes are smaller in some scenarios

Cons:

  • Less readable names
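One way to keep readable names in application code while storing short keys is a small mapping layer. The sketch below is only illustrative: the field names in SHORT_NAMES and the helpers shortenKeys and expandKeys are hypothetical, and it handles top-level keys only.

```javascript
// Hypothetical mapping from readable application names to short stored keys.
const SHORT_NAMES = { employeeName: "en", departmentName: "dn", hoursWorked: "hw" };

// Return a copy of `doc` with every mapped top-level key replaced by its short form.
function shortenKeys(doc, mapping) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(mapping, key) ? mapping[key] : key;
    out[mapped] = value;
  }
  return out;
}

// Reverse the mapping when reading documents back.
function expandKeys(doc, mapping) {
  const inverse = {};
  for (const [longName, shortName] of Object.entries(mapping)) {
    inverse[shortName] = longName;
  }
  return shortenKeys(doc, inverse);
}
```

A document would be passed through shortenKeys before insert and through expandKeys after a read. The saving is a few bytes per key per document, which only becomes noticeable at the hundreds of millions of documents discussed in the question.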

The smaller the index, the more of it fits in RAM and the fewer index misses occur. Consider the SHA1 hash of a git commit, for example. A git commit is often identified by the first 5-6 characters of its hash, so simply store those 5-6 characters instead of the whole hash.

Updates that grow a document can cause a costly document move. A document move deletes the old document, writes the updated one to a new empty location, and updates the indexes, all of which is expensive.

We need to make sure documents don't move when an update happens. Each collection has a padding factor that determines, at insert time, how much extra space to allocate beyond the actual document size.

You can see the collection's padding factor using:

db.collection.stats().paddingFactor

Add padding manually

In your case you will almost certainly start with a small document that grows over time. Updating the document later will then cause multiple document moves, so it is better to add padding to the document. Unfortunately, there is no easy way to add padding. We can do it by adding some random bytes under some key at insert time and then deleting that key in the next update query.
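The insert-then-remove trick described above might look roughly like this. It is a sketch under assumptions: the key name _padding and both helper names are made up, a fixed filler string stands in for the random bytes mentioned in the answer, and it only helps on a storage engine that behaves as the answer describes, where a document that outgrows its allocated space is moved.

```javascript
// Return a copy of `doc` with a throwaway filler field of `paddingBytes` characters.
// Assumption: "_padding" is a key the application never uses for real data.
function withPadding(doc, paddingBytes) {
  return Object.assign({}, doc, { _padding: "x".repeat(paddingBytes) });
}

// Build an update document that adds the real fields and drops the filler
// in the same operation, so the document can grow into the reserved space.
function growAndDropPadding(newFields) {
  return { $set: newFields, $unset: { _padding: "" } };
}

// Possible shell usage (collection and field names are illustrative):
//   db.TimeSheet.insert(withPadding({ _id: 1, hours: 8 }, 1024));
//   db.TimeSheet.update({ _id: 1 }, growAndDropPadding({ comment: "approved" }));
```

The padding size is a guess about how much the document will grow, so it trades wasted space in documents that never grow against moves avoided in those that do.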

Finally, if you are sure that certain keys will be added to the documents in the future, preallocate those keys with default values so that later updates don't grow the document and cause document moves.
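Preallocating keys could be as simple as merging a set of defaults into every document before insert. The default fields below (approvedBy, approvedAt, overtimeHours) and the helper withDefaults are hypothetical and only meant to show the shape of the idea.

```javascript
// Hypothetical fields that are expected to be filled in later.
const TIMESHEET_DEFAULTS = { approvedBy: "", approvedAt: new Date(0), overtimeHours: 0 };

// Return a new document containing every default key; values already present
// in `doc` take precedence over the defaults.
function withDefaults(doc, defaults) {
  return Object.assign({}, defaults, doc);
}
```

Note that this reserves space reliably only for fixed-size values such as numbers and dates. A string default like "" will still grow when a real value replaces it, so a placeholder of realistic length may be needed for string fields.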

You can get details about the queries causing document moves:

db.system.profile.find({ moved: { $exists : true } })

Large number of collections vs. large number of documents in a few collections

The schema depends on the application requirements. If there is a huge collection in which we only query the latest N days of data, we can choose to use separate collections so that old data can be safely archived. This helps ensure that the data we actually query stays cached in RAM.

Every collection incurs a cost beyond that of creating it. Each collection has a minimum size of a few KB plus one index (8 KB). Every collection has an associated namespace, and by default there are about 24K namespaces. For example, having a collection per user is a bad choice because it does not scale. After some point Mongo won't allow us to create new collections or indexes.

Generally, having many collections carries no significant performance penalty. For example, we can choose to have one collection per month if we know that we always query by month.
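A collection-per-month layout needs a consistent way to pick the collection for a given date. A minimal sketch follows, assuming a naming scheme of base_YYYY_MM in UTC. Both the scheme and the helper name monthlyCollectionName are illustrative choices, not something MongoDB prescribes.

```javascript
// Build a per-month collection name such as "timesheet_2014_01".
// Assumption: months are bucketed in UTC so that all servers agree on the name.
function monthlyCollectionName(base, date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${base}_${year}_${month}`;
}

// Possible shell usage:
//   db.getCollection(monthlyCollectionName("timesheet", new Date())).insert(doc);
```

With this layout, archiving a month means dumping and dropping one collection rather than deleting millions of documents from a shared one.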

It is always recommended to keep all the related data for a query or sequence of queries in the same disk location. You sometimes need to duplicate information across different documents. For example, for a blog post, you'll want to store the post's comments within the post document.

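Pros:

  • The index size will be much smaller, since there are fewer index entries
  • Queries that fetch all the necessary details will be very fast
  • Document size will be comparable to the page size, which means that when we bring this data into RAM, most of the time we are not bringing other data along with the page
  • A document move will free a whole page, not just a tiny chunk of a page that may never be reused by later inserts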

Capped collections behave like circular buffers. They are a special type of fixed-size collection. These collections can sustain very high-speed writes and sequential reads. Because the size is fixed, once the allocated space is filled, new documents are written by deleting the oldest ones. However, document updates are only allowed if the updated document fits within the original document size (play with padding for more flexibility).
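The circular-buffer behaviour can be illustrated with a toy in-memory model. This is not MongoDB code: the class CappedBuffer is hypothetical and is limited by document count, whereas a real capped collection is created in the shell with db.createCollection(name, { capped: true, size: <bytes> }) and is limited by size in bytes.

```javascript
// Toy model of a capped collection: a fixed number of slots, oldest overwritten first.
class CappedBuffer {
  constructor(maxDocs) {
    if (!Number.isInteger(maxDocs) || maxDocs <= 0) {
      throw new RangeError("maxDocs must be a positive integer");
    }
    this.maxDocs = maxDocs;
    this.slots = new Array(maxDocs);
    this.next = 0;  // slot the next insert will write to
    this.count = 0; // number of documents currently stored
  }

  // Write a document, overwriting the oldest one once the buffer is full.
  insert(doc) {
    this.slots[this.next] = doc;
    this.next = (this.next + 1) % this.maxDocs;
    this.count = Math.min(this.count + 1, this.maxDocs);
  }

  // Return the stored documents in insertion order, oldest first.
  toArray() {
    const start = this.count < this.maxDocs ? 0 : this.next;
    const out = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.slots[(start + i) % this.maxDocs]);
    }
    return out;
  }
}
```

Inserting five documents into a buffer of three leaves only the last three, in insertion order, which is the same effect a capped collection gives for recent-data workloads such as logs.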
