Followers - mongodb database design

Question

So I'm using mongodb and I'm unsure if I've got the correct / best database collection design for what I'm trying to do.

There can be many items, and a user can create new groups with these items in. Any user may follow any group!

I have not just added the followers and items into the group collection because there could be 5 items in the group, or there could be 10000 (and the same for followers) and from research I believe that you should not use unbound arrays (where the limit is unknown) due to performance issues when the document has to be moved because of its expanding size. (Is there a recommended maximum for array lengths before hitting performance issues anyway?)
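
For context, a minimal sketch of the split-collection layout this implies (only the user_id / group_id / item_id field names come from the query code below; the schema and model names are assumptions for illustration):

// Hypothetical mongoose schemas for the split-collection design described above.
var mongoose = require('mongoose');

var followerSchema = new mongoose.Schema({
  user_id:  { type: mongoose.Schema.Types.ObjectId, ref: 'User' },   // who is following
  group_id: { type: mongoose.Schema.Types.ObjectId, ref: 'Group' }   // the group they follow
});

var itemGroupSchema = new mongoose.Schema({
  item_id:  { type: mongoose.Schema.Types.ObjectId, ref: 'Item' },   // the item
  group_id: { type: mongoose.Schema.Types.ObjectId, ref: 'Group' }   // the group it was added to
});

var Follower    = mongoose.model('Follower', followerSchema);
var item_groups = mongoose.model('ItemGroup', itemGroupSchema);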

I think with the following design a real performance issue could be when I want to get all of the groups that a user is following for a specific item (based off of the user_id and item_id), because then I have to find all of the groups the user is following, and from that find all of the item_groups with the group_id $in and the item id. (but I can't actually see any other way of doing this)

// First query: all groups the user is following.
Follower
  .find({ user_id: "54c93d61596b62c316134d2e" })
  .exec(function (err, following) {
    if (err) { throw err; }

    // Collect the ids of the followed groups.
    var groups = [];
    for (var i = 0; i < following.length; i++) {
      groups.push(following[i].group_id);
    }

    // Second query: the item_groups for this item within those groups.
    item_groups.find({
      'group_id': { $in: groups },
      'item_id': '54ca9a2a6508ff7c9ecd7810'
    })
    .exec(function (err, itemGroups) {
      if (err) { throw err; }
      res.json(itemGroups);
    });
  });

Is there a better database schema to deal with this type of setup?

UPDATE: Example use case added in comment below.

Any help / advice will be really appreciated.

Many thanks, Mac

Answer

I agree with the general notion of other answers that this is a borderline relational problem.

The key to MongoDB data models is write-heaviness, but that can be tricky for this use case, mostly because of the bookkeeping that would be required if you wanted to link users to items directly (a change to a group that is followed by lots of users would incur a huge number of writes, and you need some worker to do this).
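
To make that bookkeeping concrete, here is a rough sketch of a write-heavy fan-out (collection and field names are illustrative assumptions, not code from this answer):

// Hypothetical fan-out: when an item is added to a group, copy a link document to
// every follower of that group, so reads become a single indexed lookup per user.
async function fanOutNewItem(db, groupId, itemId) {
  // One write per follower of the group – this is the "huge number of writes"
  // that a background worker would have to absorb.
  const followers = await db.collection('followers')
    .find({ group_id: groupId }, { projection: { user_id: 1 } })
    .toArray();

  if (followers.length === 0) return;

  await db.collection('user_item_links').insertMany(
    followers.map(f => ({ user_id: f.user_id, item_id: itemId, group_id: groupId }))
  );
}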

Let's investigate whether the read-heavy model is inapplicable here, or whether we're doing premature optimization.

The use case you're mainly concerned about is this:

a real performance issue could be when I want to get all of the groups that a user is following for a specific item [...] because then I have to find all of the groups the user is following, and from that find all of the item_groups with the group_id $in and the item id.

Let's dissect this:

  • Get all groups that the user is following

That's a simple query: db.followers.find({userId : userId}). We're going to need an index on userId which will make the runtime of this operation O(log n), or blazing fast even for large n.
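
In shell terms, assuming the field and collection names used in this answer:

// Index backing the follower lookup, so it is an index scan rather than a collection scan.
db.followers.createIndex({ userId: 1 })

// "All groups that the user is following"
db.followers.find({ userId: userId })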

from that find all of the item_groups with the group_id $in and the item id

Now this is the trickier part. Let's assume for a moment that it's unlikely for items to be part of a large number of groups. Then a compound index { itemId, groupId } would work best, because we can reduce the candidate set dramatically through the first criterion - if an item is shared in only 800 groups and the user is following 220 groups, MongoDB only needs to find the intersection of these, which is comparatively easy because both sets are small.
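
A minimal shell sketch of that compound index and the lookup it supports (names follow this answer's notation and are assumptions about the actual schema):

// itemId first: the more selective criterion shrinks the candidate set before the $in match.
db.item_groups.createIndex({ itemId: 1, groupId: 1 })

// Intersect the groups containing the item with the groups the user follows.
db.item_groups.find({ itemId: itemId, groupId: { $in: followedGroupIds } })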

We'll need to go deeper than this, though:

The structure of your data is probably that of a complex network. Complex networks come in many flavors, but it makes sense to assume your follower graph is nearly scale-free, which is also pretty much the worst case. In a scale-free network, a very small number of nodes (celebrities, the Super Bowl, Wikipedia) attract a whole lot of 'attention' (i.e. have many connections), while a much larger number of nodes have trouble getting the same amount of attention combined.

The small nodes are no reason for concern: the queries above, including round-trips to the database, are in the 2ms range on my development machine on a dataset with tens of millions of connections and > 5GB of data. Now that data set isn't huge, but no matter what technology you choose, you will be RAM bound because the indices must be in RAM in any case (data locality and separability in networks is generally poor), and the set intersection size is small by definition. In other words: this regime is dominated by hardware bottlenecks.

But what about the supernodes?

Since that would be guesswork and I'm interested in network models a lot, I took the liberty of implementing a dramatically simplified network tool based on your data model to make some measurements. (Sorry it's in C#, but generating well-structured networks is hard enough in the language I'm most fluent in...).

When querying the supernodes, I get results in the range of 7ms tops (that's on 12M entries in a 1.3GB db, with the largest group having 133,000 items in it and a user that follows 143 groups.)

The assumption in this code is that the number of groups followed by a user isn't huge, but that seems reasonable here. If it's not, I'd go for the write-heavy approach.

Feel free to play with the code. Unfortunately, it will need a bit of optimization if you want to try this with more than a couple of GB of data, because it's simply not optimized and does some very inefficient calculations here and there (especially the beta-weighted random shuffle could be improved).

In other words: I wouldn't worry about the performance of the read-heavy approach yet. The problem is often not so much that the number of users grows, but that users use the system in unexpected ways.

The alternative approach is probably to reverse the order of linking:

UserItemLinker
{
  userId,
  itemId,
  groupIds[]  // for faster retrieval of the linker. It's unlikely that this grows large
}

This is probably the most scalable data model, but I wouldn't go for it unless we're talking about HUGE amounts of data where sharding is a key requirement. The key difference here is that we can now efficiently compartmentalize the data by using the userId as part of the shard key. That helps to parallelize queries, shard efficiently and improve data locality in multi-datacenter-scenarios.
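
As a rough sketch of what that could look like (the collection name and shard key below are assumptions for illustration, not something prescribed here):

// A shard key starting with userId keeps each user's linker documents together,
// so "groups this user follows for this item" stays a single-shard, targeted query.
sh.shardCollection("mydb.userItemLinkers", { userId: 1, itemId: 1 })

// Targeted read on one shard, returning only the embedded group ids.
db.userItemLinkers.find({ userId: userId, itemId: itemId }, { groupIds: 1 })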

This could be tested with a more elaborate version of the testbed, but I didn't find the time yet, and frankly, I think it's overkill for most applications.
