Is DynamoDB suitable as an S3 metadata index?


Problem description

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the "data lake" architecture, where S3 holds the actual event data and DynamoDB indexes it and provides metadata. This architecture is discussed and recommended in many places, including on the AWS blog.

However, I am struggling to understand how to use DynamoDB to query the event data in S3. The AWS blog post uses the example of storing customer events produced by multiple different servers:

S3 path format: [4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data

Example: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data

And the schema to record this event in DynamoDB looks like:

Customer ID (Partition Key): 87423
Timestamp-Server (Sort Key): 1436055953839-i-31cc02
S3-Key: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data
Size: 1234
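
To make the mapping from S3 path to index record concrete, here is a minimal sketch that derives the DynamoDB item from an S3 key in the path format above. The attribute names (CustomerId, TimestampServer, S3Key, Size) and the function name are illustrative assumptions, and the customer ID is kept as a string here although a numeric type would work equally well.

```python
from typing import Any, Dict


def index_item_from_s3_key(s3_key: str, size: int) -> Dict[str, Any]:
    """Build the index item for one S3 object key.

    Expected key layout (from the path format above):
    [hash]/[server id]/[yyyy-mm-dd-hh-mm]/[customer id]-[epoch ms].data
    """
    parts = s3_key.split("/")
    if len(parts) != 4 or not parts[3].endswith(".data"):
        raise ValueError(f"unexpected S3 key layout: {s3_key!r}")
    _hash, server_id, _minute, filename = parts
    # "87423-1436055953839.data" -> ("87423", "1436055953839")
    customer_id, epoch_ms = filename[: -len(".data")].split("-", 1)
    return {
        "CustomerId": customer_id,                     # partition key (assumed name)
        "TimestampServer": f"{epoch_ms}-{server_id}",  # sort key (assumed name)
        "S3Key": s3_key,
        "Size": size,
    }


item = index_item_from_s3_key(
    "a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data", 1234
)
print(item["CustomerId"], item["TimestampServer"])
# prints: 87423 1436055953839-i-31cc02
```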

I would like to perform a query such as: "Get me all the customer events produced by all servers in the last 24 hours" but as far as I understand, it's impossible to efficiently query DynamoDB without using the partition key. I cannot specify the partition key for this kind of query.

Given this requirement, should I use a database other than DynamoDB to record where my events are in S3? Or do I simply need to use a different type of DynamoDB schema?

Solution

The architecture looks sound and is feasible with DynamoDB. The DynamoDBMapper class (in the AWS SDK for Java) can be used to model the table, and it has useful methods for getting the data from S3.

From the DynamoDBMapper documentation:

getS3ClientCache(): Returns the underlying S3ClientCache for accessing S3.

A DynamoDB table can't be queried without a partition key value. If the partition key is not available, you have to scan the whole table. However, you can create a Global Secondary Index (GSI) on a date/time field and query that index for your use case.

In simple terms, a GSI is similar to an index in any RDBMS. The difference is that you query the GSI directly rather than the main table. Normally, a GSI is needed when you want to query DynamoDB for a use case where the table's partition key is not available. You can project either ALL attributes of the main table into the GSI, or only selected ones.
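
As a rough illustration of the projection choice, the sketch below builds a GSI definition in the shape accepted by the `GlobalSecondaryIndexes` parameter of the DynamoDB CreateTable API. The helper name, the index name, and the EventDay attribute are assumptions made up for this example; a provisioned-capacity table would also need a `ProvisionedThroughput` entry, and the key attributes must be declared in the table's `AttributeDefinitions`.

```python
from typing import Any, Dict, List, Optional


def gsi_definition(
    index_name: str,
    partition_key: str,
    sort_key: Optional[str] = None,
    include_attributes: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one GlobalSecondaryIndexes entry (on-demand table assumed)."""
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    if sort_key is not None:
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
    if include_attributes:
        # Copy only the listed non-key attributes into the index.
        projection: Dict[str, Any] = {
            "ProjectionType": "INCLUDE",
            "NonKeyAttributes": list(include_attributes),
        }
    else:
        # Copy every attribute of the main table into the index.
        projection = {"ProjectionType": "ALL"}
    return {
        "IndexName": index_name,
        "KeySchema": key_schema,
        "Projection": projection,
    }


# Hypothetical index keyed on a coarse date attribute, carrying only
# the S3 location and size alongside the keys.
gsi = gsi_definition("ByEventDay", "EventDay", "TimestampServer", ["S3Key", "Size"])
```

The table's own key attributes are always projected into a GSI, so an INCLUDE projection only needs to list the extra attributes the query should return.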

Global Secondary Index (GSI)

Difference between Scan and Query in DynamoDB

Yes, in this use case it looks like a GSI can't help, because the use case requires a RANGE query on the partition key, and DynamoDB supports only the equality operator on partition keys. Range conditions on the sort key, and filters on other non-key attributes, are supported only once a partition key value is supplied. You may have to scan the table to fulfill this use case, which is a costly operation.
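
For contrast, here is a sketch of the kind of query the existing schema does support: one customer's events in a 24-hour window, expressed as parameters for the low-level Query API. The table and attribute names are assumptions, and the string comparison relies on every timestamp being a 13-digit epoch-millisecond value so that lexicographic order matches numeric order.

```python
from typing import Any, Dict

DAY_MS = 24 * 60 * 60 * 1000


def customer_window_query(
    table: str, customer_id: str, end_ms: int, window_ms: int = DAY_MS
) -> Dict[str, Any]:
    """Query parameters for one customer's events in [end_ms - window_ms, end_ms]."""
    start_ms = end_ms - window_ms
    return {
        "TableName": table,
        # Equality on the partition key, range on the sort key.
        "KeyConditionExpression": (
            "CustomerId = :c AND TimestampServer BETWEEN :lo AND :hi"
        ),
        "ExpressionAttributeValues": {
            ":c": {"S": customer_id},
            ":lo": {"S": str(start_ms)},
            # Sort keys look like "<epoch ms>-<server id>", so end_ms + 1
            # sorts after every key whose timestamp is <= end_ms.
            ":hi": {"S": str(end_ms + 1)},
        },
    }


params = customer_window_query("EventIndex", "87423", 1436055953839)
```

With boto3, these parameters could be passed as `client.query(**params)`. There is no equivalent expression that replaces `CustomerId = :c` with a range or omits it, which is why the "all customers, last 24 hours" request falls back to a scan under this schema.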

You either have to think about an alternate data model that lets you query by partition key, or use some other database.
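
One possible alternate model, sketched here under stated assumptions rather than taken from the question or the AWS blog, is to store a coarse time bucket on each item (a hypothetical HourBucket attribute such as "2015-07-05-00", in UTC) and use it as the partition key of a GSI. A "last 24 hours" request then becomes a small, fixed number of equality queries, one per bucket. All table, index, and attribute names below are illustrative.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

HOUR_MS = 60 * 60 * 1000


def hour_bucket(epoch_ms: int) -> str:
    """UTC hour bucket for a timestamp, e.g. '2015-07-05-00'."""
    moment = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d-%H")


def hour_buckets(start_ms: int, end_ms: int) -> List[str]:
    """Every hour bucket that overlaps [start_ms, end_ms]."""
    first = start_ms - start_ms % HOUR_MS  # round down to the hour
    return [hour_bucket(t) for t in range(first, end_ms + 1, HOUR_MS)]


def bucket_queries(
    table: str, index: str, start_ms: int, end_ms: int
) -> List[Dict[str, Any]]:
    """One Query parameter set per hour bucket in the window."""
    return [
        {
            "TableName": table,
            "IndexName": index,
            "KeyConditionExpression": (
                "HourBucket = :b AND TimestampServer BETWEEN :lo AND :hi"
            ),
            "ExpressionAttributeValues": {
                ":b": {"S": bucket},
                ":lo": {"S": str(start_ms)},
                ":hi": {"S": str(end_ms + 1)},
            },
        }
        for bucket in hour_buckets(start_ms, end_ms)
    ]


end_ms = 1436055953839  # 2015-07-05 00:25:53 UTC
queries = bucket_queries("EventIndex", "ByHourBucket", end_ms - 24 * HOUR_MS, end_ms)
print(len(queries))
# prints: 25
```

A window that does not start on the hour touches 25 buckets, hence 25 queries. The trade-off is that all events written in the same hour share one index partition, so a high write rate may call for a finer bucket or an added shard suffix.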
