Is DynamoDB suitable as an S3 Metadata index?

Question

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an architecture that is talked about and recommended in many places:

  • https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
  • https://www.youtube.com/watch?v=7Px5g6wLW2A
  • https://s3.amazonaws.com/big-data-ipc/AWS_Data-Lake_eBook.pdf

However, I am struggling to understand how to use DynamoDB for the purposes of querying the event data in S3. In the link to the AWS blog above, they use the example of storing customer events produced by multiple different servers:

S3 path format: [4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data

Example: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data

And the schema to record this event in DynamoDB looks like:

Customer ID (Partition Key), Timestamp-Server (Sort Key), S3-Key, Size
87423, 1436055953839-i-31cc02, a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data, 1234
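
As a rough illustration of how one index row could be derived from an S3 key in the format above, here is a minimal Python sketch. The function name and the attribute spellings (`CustomerID`, `Timestamp-Server`, `S3-Key`, `Size`) are assumptions made for this example, as is storing the customer ID as a string; the blog post may name or type them differently.

```python
from typing import Dict, Union


def index_item_from_s3_key(s3_key: str, size: int) -> Dict[str, Union[str, int]]:
    """Build the index row for one S3 object from its key (hypothetical helper)."""
    # Key layout: [hash]/[server id]/[yyyy-mm-dd-hh-mm]/[customer id]-[epoch ms].data
    _hash, server_id, _minute, filename = s3_key.split("/")
    stem = filename[: -len(".data")] if filename.endswith(".data") else filename
    # Split only on the first hyphen: customer id, then the epoch timestamp.
    customer_id, epoch_ms = stem.split("-", 1)
    return {
        "CustomerID": customer_id,                      # partition key
        "Timestamp-Server": f"{epoch_ms}-{server_id}",  # sort key
        "S3-Key": s3_key,
        "Size": size,
    }


item = index_item_from_s3_key(
    "a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data", 1234
)
```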

I would like to perform a query such as: "Get me all the customer events produced by all servers in the last 24 hours" but as far as I understand, it's impossible to efficiently query DynamoDB without using the partition key. I cannot specify the partition key for this kind of query.

Given this requirement, should I use a database other than DynamoDB to record where my events are in S3? Or do I simply need to use a different type of DynamoDB schema?

Recommended answer

The architecture looks fine and is feasible with DynamoDB. The DynamoDBMapper class (in the AWS SDK for Java) can be used to create the model, and it has useful methods for getting the data from S3.

DynamoDBMapper

getS3ClientCache(): Returns the underlying S3ClientCache for accessing S3.

A DynamoDB table can't be queried without a partition key. If the partition key is not available, you have to scan the whole table. However, you can create a Global Secondary Index (GSI) on a date/time field and query the data for your use case.

In simple terms, a GSI is similar to an index in any RDBMS. The difference is that you query the GSI directly rather than the main table. Normally, a GSI is needed when you want to query DynamoDB for a use case where the main table's partition key is not available. A GSI can be configured to include ALL fields of the main table or only selected ones.
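
To make the projection options concrete, here is a minimal Python sketch that builds one entry for the `GlobalSecondaryIndexes` parameter of the DynamoDB `CreateTable` request. The helper name, the index name, and the `EventDate` attribute are hypothetical and chosen only for illustration. `ProvisionedThroughput` is left out, which assumes an on-demand table, and the key attributes would also have to appear in the table's `AttributeDefinitions`.

```python
from typing import Any, Dict, List, Optional


def gsi_definition(
    index_name: str,
    hash_attr: str,
    range_attr: str,
    include: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one GlobalSecondaryIndexes entry (hypothetical helper)."""
    if include is None:
        # Copy every attribute of the main table into the index.
        projection: Dict[str, Any] = {"ProjectionType": "ALL"}
    else:
        # Copy only the listed non-key attributes, plus the keys.
        projection = {"ProjectionType": "INCLUDE", "NonKeyAttributes": include}
    return {
        "IndexName": index_name,
        "KeySchema": [
            {"AttributeName": hash_attr, "KeyType": "HASH"},    # GSI partition key
            {"AttributeName": range_attr, "KeyType": "RANGE"},  # GSI sort key
        ],
        "Projection": projection,
    }


# Index keyed by an assumed date attribute, projecting only the S3 key.
gsi = gsi_definition("EventDate-index", "EventDate", "Timestamp-Server", ["S3-Key"])
```

Projecting only the attributes a query needs keeps the index smaller, at the cost of a second read against the main table when other attributes are required.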

Global Secondary Index (GSI)

Difference between Scan and Query in DynamoDB

Yes, in this use case it looks like a GSI can't help, because the use case requires a RANGE query on the partition key, and DynamoDB supports only the equality operator on partition keys. Once the partition key is specified, DynamoDB supports range conditions on the sort key, and filters on other non-key attributes. You may have to scan the table to fulfill this use case, which is a costly operation.
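
For contrast, the query that the schema does support (one customer, a time range on the sort key) can be sketched as follows. This only builds the request parameters; the table name `S3MetadataIndex`, the attribute name `CustomerID`, and the string-typed partition key are assumptions. The range bounds rely on the sort key starting with a 13-digit epoch-millisecond timestamp, so that string order matches time order.

```python
from typing import Any, Dict

DAY_MS = 24 * 60 * 60 * 1000


def customer_events_query(
    customer_id: str, end_ms: int, window_ms: int = DAY_MS
) -> Dict[str, Any]:
    """Build Query parameters for one customer's events in a time window."""
    start_ms = end_ms - window_ms
    return {
        "TableName": "S3MetadataIndex",  # hypothetical table name
        # Equality on the partition key, range on the sort key.
        "KeyConditionExpression": "#pk = :pk AND #sk BETWEEN :lo AND :hi",
        # Placeholders are needed because "Timestamp-Server" contains a hyphen.
        "ExpressionAttributeNames": {"#pk": "CustomerID", "#sk": "Timestamp-Server"},
        "ExpressionAttributeValues": {
            ":pk": {"S": customer_id},
            ":lo": {"S": str(start_ms)},
            # Sort keys look like "<ms>-<server>", so "<end_ms + 1>" is an
            # upper bound that sorts after every key stamped at or before end_ms.
            ":hi": {"S": str(end_ms + 1)},
        },
    }


params = customer_events_query("87423", 1436055953839)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").query(**params)
```

There is no equivalent expression for "all customers": the `#pk = :pk` part cannot be dropped or turned into a range, which is why that request falls back to a Scan.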

Either you have to think about an alternate data model where you can query by partition key, or use some other database.
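
One possible alternate data model, offered here as an assumption rather than something the answer prescribes, is to store a coarse time bucket (for example the UTC hour) on every item and use it as the partition key of a GSI. "The last 24 hours" then becomes a small, fixed number of equality queries, one per bucket. The sketch below computes the bucket values; the `HourBucket` attribute it implies is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List

BUCKET_FORMAT = "%Y-%m-%d-%H"  # one bucket per UTC hour


def hour_bucket(epoch_ms: int) -> str:
    """Bucket value to store on an item, derived from its event timestamp."""
    moment = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return moment.strftime(BUCKET_FORMAT)


def buckets_between(start_ms: int, end_ms: int) -> List[str]:
    """All hour buckets that overlap the window [start_ms, end_ms]."""
    current = datetime.fromtimestamp(start_ms // 1000, tz=timezone.utc).replace(
        minute=0, second=0, microsecond=0
    )
    end = datetime.fromtimestamp(end_ms // 1000, tz=timezone.utc)
    buckets = []
    while current <= end:
        buckets.append(current.strftime(BUCKET_FORMAT))
        current += timedelta(hours=1)
    return buckets


end_ms = 1436055953839
last_day = buckets_between(end_ms - 24 * 60 * 60 * 1000, end_ms)
# Each value in last_day would be used as the equality condition of one
# Query against the bucket-keyed GSI; the first and last buckets still need
# a sort-key range to trim events outside the exact window.
```

The trade-off is that all writes for the current hour land on one GSI partition key, so a busy system may need finer buckets or an extra shard suffix on the bucket value.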
