Followers - mongodb database design


Problem Description



So I'm using mongodb and I'm unsure if I've got the correct / best database collection design for what I'm trying to do.

There can be many items, and a user can create new groups with these items in. Any user may follow any group!

I have not simply added the followers and items into the group collection, because there could be 5 items in a group or there could be 10,000 (and the same for followers), and from research I believe you should not use unbounded arrays (where the limit is unknown) due to the performance issues that arise when a document has to be moved because of its expanding size. (Is there a recommended maximum for array lengths before hitting performance issues anyway?)
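For reference, a minimal sketch of the collections this implies; the collection and field names are assumptions inferred from the query below rather than a fixed schema:

Group
{
 _id,
 name        // stays small: followers and items live in their own collections
}

Follower
{
 _id,
 user_id,
 group_id    // one document per (user, group) follow relationship
}

item_groups
{
 _id,
 item_id,
 group_id    // one document per (item, group) membership
}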

I think with the following design a real performance issue could be when I want to get all of the groups that a user is following for a specific item (based off of the user_id and item_id), because then I have to find all of the groups the user is following, and from that find all of the item_groups with the group_id $in and the item id. (but I can't actually see any other way of doing this)

// 1) Find all groups the user is following.
Follower
  .find({ user_id: "54c93d61596b62c316134d2e" })
  .exec(function (err, following) {
    if (err) { throw err; }

    var groupIds = [];

    for (var i = 0; i < following.length; i++) {
      groupIds.push(following[i].group_id);
    }

    // 2) Of those groups, find the ones that contain the given item.
    item_groups
      .find({
        'group_id': { $in: groupIds },
        'item_id': '54ca9a2a6508ff7c9ecd7810'
      })
      .exec(function (err, groups) {
        if (err) { throw err; }

        res.json(groups);
      });
  });

Are there any better DB patterns for dealing with this type of setup?

UPDATE: Example use case added in comment below.

Any help / advice will be really appreciated.

Many Thanks, Mac

Solution

I agree with the general notion of other answers that this is a borderline relational problem.

The key to MongoDB data models is write-heaviness, but that can be tricky for this use case, mostly because of the bookkeeping that would be required if you wanted to link users to items directly (a change to a group that is followed by lots of users would incur a huge number of writes, and you need some worker to do this).

Let's investigate whether the read-heavy model is inapplicable here, or whether we're doing premature optimization.

The Read Heavy Approach

Your key concern is the following use case:

a real performance issue could be when I want to get all of the groups that a user is following for a specific item [...] because then I have to find all of the groups the user is following, and from that find all of the item_groups with the group_id $in and the item id.

Let's dissect this:

  • Get all groups that the user is following

    That's a simple query: db.followers.find({userId : userId}). We're going to need an index on userId which will make the runtime of this operation O(log n), or blazing fast even for large n.

  • from that find all of the item_groups with the group_id $in and the item id

    Now this is the trickier part. Let's assume for a moment that it's unlikely for items to be part of a large number of groups. Then a compound index { itemId, groupId } would work best, because we can reduce the candidate set dramatically through the first criterion - if an item is shared in only 800 groups and the user is following 220 groups, MongoDB only needs to find the intersection of these, which is comparatively easy because both sets are small. (A sketch of the indexes and the two-step read follows this list.)
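As a rough sketch of what this looks like in the shell (collection and field names follow the ones used above and are assumptions, not a prescribed schema):

// index so "groups followed by user" is a fast, index-only lookup
db.followers.createIndex({ userId: 1, groupId: 1 })

// compound index so the item narrows the candidate set before the $in
db.item_groups.createIndex({ itemId: 1, groupId: 1 })

// the two-step read described above
var groupIds = db.followers
  .find({ userId: userId }, { groupId: 1, _id: 0 })
  .map(function (doc) { return doc.groupId; });

db.item_groups.find({ itemId: itemId, groupId: { $in: groupIds } })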

We'll need to go deeper than this, though:

The structure of your data is probably that of a complex network. Complex networks come in many flavors, but it makes sense to assume your follower graph is nearly scale-free, which is also pretty much the worst case. In a scale free network, a very small number of nodes (celebrities, super bowl, Wikipedia) attract a whole lot of 'attention' (i.e. have many connections), while a much larger number of nodes have trouble getting the same amount of attention combined.

The small nodes are no reason for concern: the queries above, including round-trips to the database, are in the 2ms range on my development machine, on a dataset with tens of millions of connections and > 5GB of data. Now that data set isn't huge, but no matter what technology you choose, you will be RAM-bound, because the indices must be in RAM in any case (data locality and separability in networks is generally poor), and the set intersection size is small by definition. In other words: this regime is dominated by hardware bottlenecks.

What about the supernodes though?

Since that would be guesswork and I'm interested in network models a lot, I took the liberty of implementing a dramatically simplified network tool based on your data model to make some measurements. (Sorry it's in C#, but generating well-structured networks is hard enough in the language I'm most fluent in...).

When querying the supernodes, I get results in the range of 7ms tops (that's on 12M entries in a 1.3GB db, with the largest group having 133,000 items in it and a user that follows 143 groups.)

The assumption in this code is that the number of groups followed by a user isn't huge, but that seems reasonable here. If it's not, I'd go for the write-heavy approach.

Feel free to play with the code. Unfortunately, it will need a bit of optimization if you want to try this with more than a couple of GB of data, because it's simply not optimized and does some very inefficient calculations here and there (especially the beta-weighted random shuffle could be improved).

In other words: I wouldn't worry about the performance of the read-heavy approach yet. The problem is often not so much that the number of users grows, but that users use the system in unexpected ways.

The Write Heavy Approach

The alternative approach is probably to reverse the order of linking:

UserItemLinker
{
 userId,
 itemId,
 groupIds[]  // for faster retrieval of the linker. It's unlikely that this grows large
}

This is probably the most scalable data model, but I wouldn't go for it unless we're talking about HUGE amounts of data where sharding is a key requirement. The key difference here is that we can now efficiently compartmentalize the data by using the userId as part of the shard key. That helps to parallelize queries, shard efficiently and improve data locality in multi-datacenter-scenarios.
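A rough sketch of what that could look like; the database name, collection name, and shard-key choice here are assumptions for illustration, not part of the original design:

// shard the linker collection on userId so one user's linkers stay together
sh.enableSharding("mydb")
sh.shardCollection("mydb.user_item_links", { userId: 1, itemId: 1 })

// the "groups for this user and item" lookup then targets a single shard
db.user_item_links.find(
  { userId: userId, itemId: itemId },
  { groupIds: 1, _id: 0 }
)

The trade-off is the bookkeeping mentioned at the top: whenever a group's items or followers change, a worker has to fan those changes out into the affected linker documents.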

This could be tested with a more elaborate version of the testbed, but I didn't find the time yet, and frankly, I think it's overkill for most applications.
