How to structure a DynamoDB database to allow queries for trending posts?

Question

I am planning on using the following formula to calculate "trending" posts:

Trending Score = (p - 1) / (t + 2)^1.5

p = votes (points) from users. t = time since submission in hours.
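
As a quick illustration, the formula above can be written as a small Python function. This is only a sketch; the function and parameter names are illustrative and do not come from the question.

```python
def trending_score(points: int, hours_since_submission: float) -> float:
    """Trending Score = (p - 1) / (t + 2)^1.5, as defined in the question."""
    return (points - 1) / (hours_since_submission + 2) ** 1.5


# A post with 10 points submitted 2 hours ago: (10 - 1) / (2 + 2)^1.5 = 9 / 8
print(trending_score(10, 2))
```

Because t grows as a post ages, the score of a post keeps falling even when its points do not change.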

I am looking for advice on how to structure my database tables so that I can query for trending posts with DynamoDB (a NoSQL database service from Amazon).

DynamoDB requires a Primary Key for each item in a table. The Primary Key can consist of 2 parts: the Hash Attribute (string or number) and the Range Attribute (string or number). The Hash Attribute is required. If no Range Attribute is used, the Hash Attribute must be unique for each item; otherwise the combination of Hash Attribute and Range Attribute must be unique. The Range Attribute is optional, but if it is used DynamoDB builds a sorted range index on it.

The structure I have in mind is as follows:

Table Name: Users

HashAttribute:  user_id
RangeAttribute: NONE
OtherFields: first_name, last_name

Table Name: Posts

HashAttribute:  post_id
RangeAttribute: NONE
OtherFields: user_id, title, content, points, categories[ ]

Table Name: Categories

HashAttribute:  category_name
RangeAttribute: post_id
OtherFields: title, content, points

Table Name: Counters

HashAttribute:  counter_name
RangeAttribute: NONE
OtherFields: counter_value

Here is an example of the requests I would make with the table setup above (example: user_id=100):

User Action 1:

The user creates a new post and tags it with 2 categories (baseball, soccer).

Query (1):

Read the current value of the counter with counter_name='post_id', increment it by 1, and use the result as the new post_id

Query (2): Insert the following into the Posts table:

post_id=value_from_query_1, user_id=100, title=user_generated, content=user_generated, points=0, categories=['baseball','soccer']

Query (3):

Insert the following into the Categories table:

category_name='baseball', post_id=value_from_query_1, title=user_generated, content=user_generated, points=0

Query (4):

Insert the following into the Categories table:

category_name='soccer', post_id=value_from_query_1, title=user_generated, content=user_generated, points=0
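
The four requests above can be sketched with plain Python dictionaries standing in for the Counters, Posts and Categories tables. This is an in-memory illustration only, not DynamoDB client code; `create_post` is a hypothetical helper name.

```python
# In-memory stand-ins for the Counters, Posts and Categories tables.
counters = {"post_id": 0}
posts = {}        # keyed by post_id (hash attribute)
categories = {}   # keyed by (category_name, post_id) (hash + range attributes)


def create_post(user_id, title, content, category_names):
    # Query (1): increment the counter and use the new value as post_id.
    counters["post_id"] += 1
    post_id = counters["post_id"]

    # Query (2): insert the post itself.
    posts[post_id] = {
        "user_id": user_id, "title": title, "content": content,
        "points": 0, "categories": list(category_names),
    }

    # Queries (3) and (4): one denormalized item per category.
    for name in category_names:
        categories[(name, post_id)] = {
            "title": title, "content": content, "points": 0,
        }
    return post_id


new_id = create_post(100, "Opening day", "Who is going?", ["baseball", "soccer"])
```

Since points is copied into every Categories item, each upvote would have to update the Posts item and one Categories item per category.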



The end goal is to be able to conduct the following types of queries:

1. Query for trending posts

2. Query for posts in a certain category

3. Query for posts with the highest point values

Does anyone have any idea how I could structure my tables so that I could query for trending posts? Or is this something I give up the ability to do by switching to DynamoDB?

Recommended Answer

I'll start with a note on your comment about timestamp vs. post_id.
Since you are going to use DynamoDB as your post_id generator, there is a scalability issue right there. A sequential counter is inherently hard to scale, because every new post has to update the same counter item, so you are better off using a date object. If you need to create posts at a very high rate, you can start by reading about how Twitter does it: http://blog.twitter.com/2010/announcing-snowflake
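
To make the time-based alternative concrete, here is a simplified sketch of a Snowflake-style ID generator in Python. The bit widths follow the commonly described Snowflake layout (timestamp, 10-bit worker id, 12-bit sequence). The class name and the custom epoch are assumptions for illustration, and unlike a production generator this one borrows the next millisecond instead of waiting when the sequence runs out.

```python
import threading
import time


class TimeOrderedIdGenerator:
    """Simplified Snowflake-style generator: timestamp | worker | sequence."""

    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id: int, epoch_ms: int = 1_325_376_000_000):
        # Default epoch is 2012-01-01 UTC, chosen arbitrarily for this sketch.
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now_ms = max(int(time.time() * 1000), self._last_ms)
            if now_ms == self._last_ms:
                self._sequence += 1
                if self._sequence >= (1 << self.SEQUENCE_BITS):
                    # Sequence exhausted: move on to the next millisecond.
                    now_ms += 1
                    self._sequence = 0
            else:
                self._sequence = 0
            self._last_ms = now_ms
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (
                ((now_ms - self.epoch_ms) << shift)
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


generator = TimeOrderedIdGenerator(worker_id=1)
post_id = generator.next_id()
```

Each application server gets its own worker_id, so IDs can be generated without a shared counter item, and they still sort roughly by creation time.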

Now let's get back to your trending check:
I believe your scenario is misusing DynamoDB.
Let's say you have one HOT category that contains most of the posts. You will basically have to scan the whole set of posts (since the data isn't spread well), look at the points of each one, and do the comparisons on your server. This will either not work or be very expensive, since each run will probably consume all of your provisioned read capacity units.
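
The server-side comparison described above might look like the following sketch. It assumes every item of the category has already been read, and it assumes a `created_at` attribute that is not part of the schema in the question. `top_trending` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def trending_score(points, hours):
    return (points - 1) / (hours + 2) ** 1.5


def top_trending(items, now, limit=10):
    """Rank already-fetched items by score; every item must be read first."""
    def score(item):
        hours = (now - item["created_at"]).total_seconds() / 3600
        return trending_score(item["points"], hours)
    return sorted(items, key=score, reverse=True)[:limit]


now = datetime(2012, 6, 1, 12, 0, tzinfo=timezone.utc)
category_items = [
    {"post_id": 1, "points": 50, "created_at": now - timedelta(hours=48)},
    {"post_id": 2, "points": 10, "created_at": now - timedelta(hours=1)},
    {"post_id": 3, "points": 3, "created_at": now - timedelta(hours=5)},
]
trending_ids = [item["post_id"] for item in top_trending(category_items, now, limit=2)]
```

The ranking itself is cheap. The cost lies in reading every item in the category before it can run.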

The DynamoDB approach for this type of trend checking is to use MapReduce.
Read here how to implement it: http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
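
The shape of such a batch job can be sketched in plain Python. This is not Elastic MapReduce code, only an illustration of the map, group and reduce steps. The `created_hours` field and all function names are assumptions.

```python
import heapq
from collections import defaultdict


def map_post(post, now_hours):
    """Map step: emit (category, (score, post_id)) for each category of a post."""
    age_hours = now_hours - post["created_hours"]
    score = (post["points"] - 1) / (age_hours + 2) ** 1.5
    for category in post["categories"]:
        yield category, (score, post["post_id"])


def reduce_category(scored_posts, limit):
    """Reduce step: keep the highest-scoring posts of one category."""
    return [post_id for _, post_id in heapq.nlargest(limit, scored_posts)]


def trending_by_category(posts, now_hours, limit=10):
    grouped = defaultdict(list)  # shuffle step: group map output by key
    for post in posts:
        for category, scored in map_post(post, now_hours):
            grouped[category].append(scored)
    return {c: reduce_category(v, limit) for c, v in grouped.items()}


posts = [
    {"post_id": 1, "points": 50, "created_hours": 0, "categories": ["baseball"]},
    {"post_id": 2, "points": 10, "created_hours": 47, "categories": ["baseball", "soccer"]},
]
result = trending_by_category(posts, now_hours=48, limit=1)
```

The output of each run could be stored as a small precomputed list per category, so that reads never have to rank posts themselves.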

I can't say how long a run will take, but I believe you will find this approach scalable, though you won't be able to run it often.

On another note, you could keep a list of the "top 10/100" trending questions and update it in "real time" whenever a post is upvoted: fetch the list, check whether it needs to change because of the newly upvoted question, and save it back to the db if it does.
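
A minimal sketch of that update step, assuming the list is stored as (post_id, points) pairs sorted from highest to lowest. Ranking by raw points is a simplification; a trending score could be used as the sort key instead. `update_top_list` is a hypothetical helper.

```python
def update_top_list(top_list, post_id, points, max_size=10):
    """Return the top list after `post_id` has reached `points`.

    `top_list` holds (post_id, points) pairs sorted by points, highest first.
    """
    # Drop any stale entry for this post, then add the fresh one.
    entries = [(pid, pts) for pid, pts in top_list if pid != post_id]
    entries.append((post_id, points))
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries[:max_size]


top = [(7, 40), (3, 25), (9, 12)]
new_top = update_top_list(top, post_id=5, points=30, max_size=3)
needs_save = new_top != top  # write back to the database only when True
```

Most upvotes leave the list unchanged, so the write back to the database happens only occasionally.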
