Delivering activity feed items in a moderately scalable way

Question

The application I'm working on has an activity feed where each user can see their friends' activity (much like Facebook). I'm looking for a moderately scalable way to show a given users' activity stream on the fly. I say 'moderately' because I'm looking to do this with just a database (Postgresql) and maybe memcached. For instance, I want this solution to scale to 200k users each with 100 friends.

Currently, there is a master activity table that stores the rendered html for the given activity (Jim added a friend, George installed an application, etc.). This master activity table keeps the source user, the html, and a timestamp.

Then, there's a separate ('join') table that simply keeps a pointer to the person who should see this activity in their friend feed, and a pointer to the object in the main activity table.

So, if I have 100 friends, and I do 3 activities, then the join table will then grow to 300 items.
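As a concrete illustration of this fan-out approach, here is a minimal sketch using SQLite in place of PostgreSQL; the table and column names (`activity`, `feed_item`, `viewer_id`) are assumptions for illustration, not the schema from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (
    id          INTEGER PRIMARY KEY,
    source_user INTEGER NOT NULL,
    html        TEXT    NOT NULL,
    ts          INTEGER NOT NULL
);
-- one row per (viewer, activity): the fan-out 'join' table
CREATE TABLE feed_item (
    viewer_id   INTEGER NOT NULL,
    activity_id INTEGER NOT NULL REFERENCES activity(id)
);
""")

def record_activity(source_user, html, ts, friend_ids):
    """Insert the activity once, then fan it out to every friend's feed."""
    cur = conn.execute(
        "INSERT INTO activity (source_user, html, ts) VALUES (?, ?, ?)",
        (source_user, html, ts))
    conn.executemany(
        "INSERT INTO feed_item (viewer_id, activity_id) VALUES (?, ?)",
        [(f, cur.lastrowid) for f in friend_ids])

friends = [2, 3, 4]                  # user 1 has three friends
for i in range(3):                   # user 1 performs three activities
    record_activity(1, f"<li>activity {i}</li>", 1000 + i, friends)

# 3 activities x 3 friends = 9 fan-out rows
fanout_rows = conn.execute("SELECT count(*) FROM feed_item").fetchone()[0]

# reading one user's feed is a single cheap join
feed = conn.execute("""
    SELECT a.html FROM feed_item f
    JOIN activity a ON a.id = f.activity_id
    WHERE f.viewer_id = ?
    ORDER BY a.ts DESC
""", (2,)).fetchall()
```

Three activities by a user with three friends produce nine fan-out rows, which is exactly how the table reaches 300 rows for 100 friends and 3 activities.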

Clearly this table will grow very quickly. It has the nice property, though, that fetching activity to show to a user takes a single (relatively) inexpensive query.

The other option is to just keep the main activity table and query it by saying something like:

select * from activity where source_user in (1, 2, 44, 2423, ... my friend list)

This has the disadvantage that you're querying for users who may never be active, and as your friend list grows, this query can get slower and slower.

I see the pros and the cons of both sides, but I'm wondering if some SO folks might help me weigh the options and suggest one way or the other. I'm also open to other solutions, though I'd like to keep it simple and not install something like CouchDB, etc.

Thanks a lot!

Answer

I'm leaning towards just having the master activity table. If you go with that, this is what I would consider implementing:


  1. You can create several activity tables and do a UNION ALL when fetching the data from the database. For example, roll them over monthly - activity_2010_02, etc. Just going by your example - 200K users x 100 friends x 3 activities = 60 million rows. Not a concern performance-wise for PostgreSQL, but you might consider this purely for convenience now and eventually for effortless future expansion.
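A sketch of the monthly roll-over, again using SQLite for illustration; the table names follow the answer's `activity_2010_02` example and the schema is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("activity_2010_01", "activity_2010_02"):
    conn.execute(f"CREATE TABLE {t} (source_user INTEGER, html TEXT, ts INTEGER)")

conn.execute("INSERT INTO activity_2010_01 VALUES (1, '<li>jan</li>', 100)")
conn.execute("INSERT INTO activity_2010_02 VALUES (2, '<li>feb</li>', 200)")

# UNION ALL stitches the monthly tables back into one feed
rows = conn.execute("""
    SELECT source_user, html, ts FROM activity_2010_01
    UNION ALL
    SELECT source_user, html, ts FROM activity_2010_02
    ORDER BY ts DESC
    LIMIT 20
""").fetchall()
```

On PostgreSQL proper, table partitioning (or table inheritance in older versions) can hide the UNION ALL behind a single parent table, so the application queries one name and the planner prunes the untouched months.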


As for your concern that "you're querying for users who may never be active, and as your friend list grows, this query can get slower and slower":


Are you going to display the entire activity feed, going back to the beginning of time? You haven't provided much detail in the original question, but I'd hazard a guess that you'd be showing the last 10/20/100 items sorted by timestamp. A couple of indexes and the LIMIT clause should be enough to provide an instant response (as I've just tested on a table with about 20 million rows). It can be slower on a busy server, but that is something that should be worked out with hardware and caching solutions; Postgres is not going to be the bottleneck there.

Even if you do provide activity feeds going back to the dawn of time, paginate the output! The LIMIT clause will save you there. If the basic query with a LIMIT on it is not enough, or if your users have a long tail of friends that are no longer active, you could consider limiting the lookup to the last day/week/month first and then provide the list of friend ids:

select * from activity 
  where ts <= 123456789 
    and source_user in (1, 2, 44, 2423, ... my friend list)

If you've got a table spanning months or years back, the search for the friends ids will only be performed within the rows selected by the first WHERE clause.
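One way to set up the supporting index for that bounded query, sketched with SQLite; the composite index, the cutoff value, and the friend list are placeholders, not values from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (source_user INTEGER, html TEXT, ts INTEGER)")
# composite index so (source_user, ts) lookups avoid a full scan
conn.execute("CREATE INDEX activity_user_ts ON activity (source_user, ts)")

# three users, five timestamped activities each
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [(u, f"<li>u{u} t{t}</li>", t) for u in (1, 2, 44) for t in range(5)])

friend_ids = [1, 44]        # placeholder friend list
cutoff = 3                  # e.g. "last week" expressed as a timestamp
placeholders = ",".join("?" * len(friend_ids))
rows = conn.execute(
    f"""SELECT source_user, html, ts FROM activity
        WHERE ts <= ? AND source_user IN ({placeholders})
        ORDER BY ts DESC LIMIT 10""",
    [cutoff] + friend_ids).fetchall()
```

The time-stamp bound narrows the candidate rows first, so the IN list of friend ids is only matched against recent activity rather than the whole history.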

That's just if I choose between the two solutions you are considering now. I would also look at things like:


  1. Reconsidering your denormalisation of the table. Is storing pre-generated HTML output really the best way? Would you be better off performance-wise with a lookup table of activities instead, generating templated output on the fly? Pre-generated HTML can seem better at the outset, but once you consider disk storage, APIs, and future layout changes, storing it may not be that attractive after all. The lookup table could contain your possible activities (added a friend, changed status, etc.), and the activity log would reference that, plus the friend's id if another user is involved in the activity.

  2. Pre-generating the HTML, but not storing it in the database: save the pages on disk as pre-generated files. This is not a silver bullet, however, and largely depends on the ratio of reads to writes on your site. I.e. a typical discussion thread on a public forum could have a dozen messages but be viewed hundreds of times: a good candidate for caching. Whereas if your application is more tuned to immediate status updates, and you'd have to regenerate the HTML page and save it to disk again after every couple of views, then there's little value in this approach.
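The lookup-table idea in point 1 above could look like the following SQLite sketch; the table names, the `{actor}`/`{friend}` template syntax, and the activity types are all assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- one row per kind of activity, holding the render template
CREATE TABLE activity_type (
    id       INTEGER PRIMARY KEY,
    template TEXT NOT NULL
);
-- the log references a type instead of storing rendered HTML
CREATE TABLE activity_log (
    actor_id  INTEGER NOT NULL,
    type_id   INTEGER NOT NULL REFERENCES activity_type(id),
    friend_id INTEGER,            -- NULL unless another user is involved
    ts        INTEGER NOT NULL
);
""")
conn.execute(
    "INSERT INTO activity_type VALUES (1, '{actor} added {friend} as a friend')")
conn.execute("INSERT INTO activity_log VALUES (7, 1, 9, 100)")

# render on the fly instead of reading stored HTML
actor, template, friend = conn.execute("""
    SELECT l.actor_id, t.template, l.friend_id
    FROM activity_log l JOIN activity_type t ON t.id = l.type_id
""").fetchone()
html = template.format(actor=f"user {actor}", friend=f"user {friend}")
```

A layout change then means editing one template row rather than rewriting millions of stored HTML strings.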

Hope this helps.
