How to do basic aggregation with DynamoDB?

Question

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.

Let's say we are building a tech blog where users can post articles, and articles can be tagged.

user
{
    id : 1235,
    name : "John",
    ...
}

article
{
    id : 789,
    title: "dynamodb use cases",
    author : 12345, // userid
    tags : ["dynamodb","aws","nosql","document database"]
}

In the user interface, we want to show the current user's tags and their respective counts.

How can the following aggregation be achieved?

{
    userid : 12,
    tag_stats:{
        "dynamodb" : 3,
        "nosql" : 8
    }
}
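
To make the target concrete, the aggregation amounts to counting tags across one author's articles. A minimal in-memory sketch, assuming the articles have already been loaded as plain dictionaries (the function name and the sample data are hypothetical, not part of the original question):

```python
from collections import Counter


def tag_stats_for_user(articles, user_id):
    """Count how often each tag appears on the articles written by user_id."""
    counts = Counter()
    for article in articles:
        if article["author"] == user_id:
            counts.update(article.get("tags", []))
    return {"userid": user_id, "tag_stats": dict(counts)}


# Hypothetical sample data, shaped like the article documents above.
articles = [
    {"id": 789, "author": 12345, "tags": ["dynamodb", "aws", "nosql"]},
    {"id": 790, "author": 12345, "tags": ["dynamodb", "nosql"]},
    {"id": 791, "author": 99999, "tags": ["aws"]},
]
print(tag_stats_for_user(articles, 12345))
```

The open question is where this counting should run so that it stays cheap and fast.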

We will serve this data through a REST API, and it will be called frequently; for example, this information is shown on the app's main page.

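  • I could fetch all the documents and aggregate at the application level, but I suspect that would exhaust my read capacity units.
  • Tools such as EMR, Redshift, BigQuery, or AWS Lambda could be used, but I think these are meant for data warehousing.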

I would like to know other, better ways of achieving the same thing. Considering cost and response time, how do people who have chosen DynamoDB as their primary data store handle simple dynamic queries like these?

Recommended answer

Long story short: DynamoDB does not support this. It is not built for this use case. It is intended for quick, low-latency data access, and it simply does not provide any aggregation functionality.

You have three main options:

  • Export the DynamoDB data to Redshift or EMR Hive. You can then run SQL queries against the exported copy. The benefit of this approach is that it consumes RCUs only once, but you will be working with stale data.

  • Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case they access the data in DynamoDB directly. The downside is that every query you run consumes read capacity.

  • Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will trigger a Lambda function, or some code on your own hosts, to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
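
The third option can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation. It assumes the stream is configured with the NEW_AND_OLD_IMAGES view type, that `author` is stored as a number, and that `tags` is stored as a list of strings. All function names are hypothetical, and a plain dictionary stands in for the aggregate table; in a real deployment each delta would be written with an UpdateItem call instead.

```python
from collections import Counter, defaultdict


def _author_and_tags(image):
    """Extract (author, tags) from a stream image in DynamoDB JSON form."""
    if not image:
        return None, []
    author = image["author"]["N"]
    tags = [t["S"] for t in image.get("tags", {}).get("L", [])]
    return author, tags


def tag_deltas(records):
    """Turn stream records into {author: {tag: +/-n}} count changes."""
    deltas = defaultdict(Counter)
    for record in records:
        change = record["dynamodb"]
        old_author, old_tags = _author_and_tags(change.get("OldImage"))
        new_author, new_tags = _author_and_tags(change.get("NewImage"))
        for tag in old_tags:   # tags that were on the item before the write
            deltas[old_author][tag] -= 1
        for tag in new_tags:   # tags that are on the item after the write
            deltas[new_author][tag] += 1
    # Keep only tags whose count actually changed.
    result = {}
    for author, counts in deltas.items():
        changed = {tag: n for tag, n in counts.items() if n}
        if changed:
            result[author] = changed
    return result


def apply_deltas(aggregate, deltas):
    """Apply deltas to a dict that stands in for the aggregate table."""
    for author, changes in deltas.items():
        stats = aggregate.setdefault(author, {})
        for tag, n in changes.items():
            stats[tag] = stats.get(tag, 0) + n
            if stats[tag] <= 0:
                del stats[tag]
    return aggregate


def handler(event, context=None):
    """Lambda-style entry point: event["Records"] holds the stream batch."""
    return tag_deltas(event["Records"])


# Hypothetical batch: one article inserted, then one of its tags replaced.
event = {"Records": [
    {"eventName": "INSERT", "dynamodb": {"NewImage": {
        "author": {"N": "12345"},
        "tags": {"L": [{"S": "dynamodb"}, {"S": "aws"}]}}}},
    {"eventName": "MODIFY", "dynamodb": {
        "OldImage": {"author": {"N": "12345"},
                     "tags": {"L": [{"S": "dynamodb"}, {"S": "aws"}]}},
        "NewImage": {"author": {"N": "12345"},
                     "tags": {"L": [{"S": "dynamodb"}, {"S": "nosql"}]}}}},
]}
aggregate = apply_deltas({}, handler(event))
print(aggregate)
```

Working with deltas instead of recounting means the REST API can answer with a single read of the user's aggregate row, at the cost of the extra code the answer mentions.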

Of course, you can extract the data at the application level and aggregate it there, but I would not recommend it. Unless your table is small, you will need to think about throttling, about using only part of your provisioned capacity (you want to spend, say, 20% of your RCUs on aggregation rather than 100%), and about how to distribute the work among multiple workers.
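
To illustrate the capacity-budget concern, here is a minimal sketch of a paginated read that paces itself to a fixed RCU budget. `fetch_page` is a hypothetical callable, not a library API: with boto3 it could wrap a Scan call that requests `ReturnConsumedCapacity` and follows `LastEvaluatedKey`, but here it is faked so the pacing logic can be read in isolation. The figure of 20 RCUs per second assumes a table provisioned with 100 RCUs.

```python
import time
from collections import Counter


def throttled_scan(fetch_page, rcu_budget_per_sec, sleep=time.sleep):
    """Yield items page by page while keeping average consumption under budget.

    fetch_page(start_key) is a hypothetical callable that returns
    (items, consumed_units, next_key); next_key is None on the last page.
    """
    start_key = None
    while True:
        items, consumed, start_key = fetch_page(start_key)
        yield from items
        if start_key is None:
            return
        # Pause long enough that the units just consumed fit the budget.
        sleep(consumed / rcu_budget_per_sec)


# Fake two-page table so the sketch runs without AWS access.
pages = {
    None: ([{"tags": ["aws"]}], 10.0, "k1"),
    "k1": ([{"tags": ["aws", "nosql"]}], 5.0, None),
}
waits = []  # records the pauses instead of actually sleeping
counts = Counter()
for item in throttled_scan(lambda key: pages[key], rcu_budget_per_sec=20,
                           sleep=waits.append):
    counts.update(item["tags"])
print(dict(counts), waits)
```

This covers only the pacing; splitting the scan across several workers and retrying throttled requests would still have to be added on top.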

Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. In addition, both Redshift and Hive can be limited to a predefined percentage of your RCU throughput.
