How to do basic aggregation with DynamoDB?


Question

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.

Let's say we are building a tech blog where users can post articles, and articles can be tagged.

user
{
    id : 1235,
    name : "John",
    ...
}

article
{
    id : 789,
    title: "dynamodb use cases",
    author : 12345, //userid
    tags : ["dynamodb","aws","nosql","document database"]
}

In the user interface we want to show the current user's tags and their respective counts.

How can the following aggregation be achieved?

{
    userid : 12,
    tag_stats:{
        "dynamodb" : 3,
        "nosql" : 8
    }
}

We will provide this data through a REST API, and it will be called frequently; for example, this information is shown on the app's main page.


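  • I can think of fetching all the documents and aggregating at the application level, but I feel I would run out of read capacity units.

  • Tools such as EMR, Redshift, BigQuery, or AWS Lambda could be used, but I think these are meant for data-warehousing purposes.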

I would like to know other, better ways of achieving the same thing. How do people who have chosen DynamoDB as their primary data store, with cost and response time in mind, handle simple dynamic queries like these?

Recommended answer

Long story short: DynamoDB does not support this. It is not built for this use case. It is intended for quick, low-latency data access and does not provide any aggregation functionality.

You have three main options:


  • Export DynamoDB data to Redshift or EMR Hive. You can then run SQL queries on the exported copy. The benefit of this approach is that it consumes RCUs only once, for the export, but you will be working with stale data.

  • Use the DynamoDB connector for Hive and query DynamoDB directly. Again, you can write arbitrary SQL queries, but in this case they access the data in DynamoDB directly. The downside is that every query you run consumes read capacity.

  • Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to the original data, DynamoDB Streams will trigger a Lambda function, or some code on your own hosts, that updates the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
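As a rough illustration of the third option, the sketch below turns a batch of stream records into per-user tag deltas and applies them to an aggregate table. Several things here are assumptions rather than part of the original answer: the stream is assumed to deliver both old and new item images, the function names `tag_deltas` and `apply_deltas` are made up, and the aggregate table is assumed to hold one item per (`userid`, `tag`) pair with a numeric `tag_count` attribute instead of the nested map described above, because the `ADD` update action only works on top-level attributes. `table` is expected to behave like a boto3 `Table` resource.

```python
from collections import Counter


def _author_and_tags(image):
    """Extract (author, Counter of tags) from a DynamoDB-JSON item image."""
    if not image:
        return None, Counter()
    author = image.get("author", {}).get("N")
    tags = Counter(t["S"] for t in image.get("tags", {}).get("L", []))
    return author, tags


def tag_deltas(records):
    """Net change in tag counts per (author, tag) for a batch of stream records."""
    deltas = Counter()
    for record in records:
        data = record.get("dynamodb", {})
        old_author, old_tags = _author_and_tags(data.get("OldImage"))
        new_author, new_tags = _author_and_tags(data.get("NewImage"))
        for tag, n in old_tags.items():
            deltas[(old_author, tag)] -= n  # tag removed or article deleted
        for tag, n in new_tags.items():
            deltas[(new_author, tag)] += n  # tag added or article created
    return {key: d for key, d in deltas.items() if d != 0}


def apply_deltas(table, deltas):
    """Apply deltas to the (hypothetical) aggregate table, one update per tag."""
    for (author, tag), delta in deltas.items():
        table.update_item(
            Key={"userid": int(author), "tag": tag},  # assumed key schema
            UpdateExpression="ADD #c :d",
            ExpressionAttributeNames={"#c": "tag_count"},
            ExpressionAttributeValues={":d": delta},
        )
```

A Lambda handler attached to the stream would call `apply_deltas(table, tag_deltas(event["Records"]))`. Note that stream batches can be retried, so a production version would likely need some form of deduplication to keep the counters exact.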

Of course, you can extract the data at the application level and aggregate it there, but I would not recommend it. Unless your table is small, you will need to think about throttling, about using only part of your provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and about how to distribute the work among multiple workers.

Both Redshift and Hive already know how to do this. Redshift uses multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Both can also be limited to a predefined percentage of your RCU throughput.
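To get a feel for why application-level aggregation is costly, here is a back-of-the-envelope sketch of how long a full table scan takes when it is limited to a fraction of the provisioned read capacity. It assumes the standard RCU sizing (one RCU covers 4 KB per second for strongly consistent reads, or 8 KB per second for eventually consistent reads); the function name and the example figures are made up for illustration.

```python
def full_scan_seconds(table_bytes, provisioned_rcu, budget_fraction=0.2,
                      strongly_consistent=False):
    """Estimate the duration of a full scan under a capped RCU budget."""
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    # Bytes one RCU can read per second, depending on read consistency.
    bytes_per_rcu = 4096 if strongly_consistent else 8192
    usable_rcu = provisioned_rcu * budget_fraction
    return table_bytes / (usable_rcu * bytes_per_rcu)


# Hypothetical numbers: a 10 GiB table, 1000 provisioned RCUs, 20% budget.
seconds = full_scan_seconds(10 * 1024**3, 1000)
print(f"{seconds / 3600:.1f} hours")
```

With these made-up figures the scan takes roughly 1.8 hours per aggregation pass, which is why the pre-aggregated table tends to be the better fit for data shown on a frequently loaded page.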
