What's the recommended index schema for DynamoDB for a typical CRUD application?


Question

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.

I have a simple calendar application, where I have an events table. Here are the columns I have:

id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)

I want to perform the following queries:

  • Get an event by ID
  • Get all events where calendarId = x and ownerId = y
  • Get all events where startTimestamp is between x and y and calendarId = z

DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?

Recommended answer

This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.

Let's start with how DynamoDB is priced (it's related - honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table - the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
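To make the pricing arithmetic concrete, here is a minimal sketch that adds up the three charges using only the rates quoted above. The function name and the example workload are made up for illustration, and real prices vary by region and change over time.

```python
# Rates quoted in the answer above (USD per month). Treat them as
# illustrative: actual DynamoDB prices depend on region and date.
STORAGE_PER_GB = 0.25
PER_WCU = 0.47
PER_RCU = 0.09


def monthly_cost(storage_gb, wcus, rcus):
    """Estimate the monthly bill for a table with provisioned throughput."""
    total = storage_gb * STORAGE_PER_GB + wcus * PER_WCU + rcus * PER_RCU
    return round(total, 2)


# Hypothetical workload: 10 GB at rest, 10 WCUs and 10 RCUs provisioned.
print(monthly_cost(storage_gb=10, wcus=10, rcus=10))
```

Doubling the WCUs in this model adds far more to the bill than doubling the RCUs, which is worth keeping in mind when deciding how many indexes to maintain.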

Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.

If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.

DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
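The difference between a Query and a Scan can be sketched with a toy in-memory model. This is not how DynamoDB is implemented; `ToyTable` and its methods are hypothetical names, and the only point is to show how many items each access pattern has to look at.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table: items grouped by partition key
    and kept sorted by sort key inside each partition."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def put(self, pk, sk, item):
        self.partitions[pk].append((sk, item))
        self.partitions[pk].sort(key=lambda pair: pair[0])

    def query(self, pk, sk_low, sk_high):
        """Touch one partition only. Because the partition is sorted,
        only the matching sort-key range counts as read."""
        rows = self.partitions.get(pk, [])
        matches = [item for sk, item in rows if sk_low <= sk <= sk_high]
        return matches, len(matches)

    def scan(self, predicate):
        """Evaluate every item in every partition."""
        matches, examined = [], 0
        for rows in self.partitions.values():
            for sk, item in rows:
                examined += 1
                if predicate(sk, item):
                    matches.append(item)
        return matches, examined


table = ToyTable()
table.put("cal-1", 100, "standup")
table.put("cal-1", 200, "lunch")
table.put("cal-2", 150, "review")

queried, items_read = table.query("cal-1", 50, 150)
scanned, items_examined = table.scan(lambda sk, item: sk <= 150)
print(queried, items_read)
print(scanned, items_examined)
```

In this model the Query cost grows with the size of the result, while the Scan cost grows with the size of the whole table.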

This is where it gets interesting. DynamoDB takes all the throughput you have purchased and spreads it evenly over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table and 5 partitions: that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, because you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput; that will work, but it's probably not very satisfactory to most engineers.
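The 10-units-over-5-partitions example can be written out as a small calculation. This sketch assumes the simple even-split model described in the paragraph above; the function names are hypothetical and the real service may behave differently.

```python
def per_partition_capacity(table_units, partition_count):
    """Capacity units each partition receives under an even split."""
    return table_units / partition_count


def usable_capacity(table_units, partition_count, partitions_accessed):
    """Capacity you can actually use if traffic only reaches some partitions."""
    return per_partition_capacity(table_units, partition_count) * partitions_accessed


# 10 WCUs spread over 5 partitions.
print(per_partition_capacity(10, 5))   # share of each partition
print(usable_capacity(10, 5, 5))       # traffic spread over every partition
print(usable_capacity(10, 5, 1))       # all traffic hits a single partition
```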

Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).

Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focused on making their partition access perfectly uniform, and end up with a table they can't query quickly.

I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says "where you have many users", which is not correct; the size of the table is irrelevant.)

It's a balance between uniform access and being able to use intuitive, natural queries for your application. What I am saying is, if you are new to DynamoDB, the right answer is probably to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but remember that access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out, and in many cases they can discourage people from using DynamoDB if they get too focused on the uniform access idea.

Most applications will have users. In most applications, the most common query you will do is get data for a user. So the first option for most applications' primary partition key will often be a user ID. That's fine, as long as you don't have a few very high-hitting users and many users that never log in.

Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.

Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key - perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key. You won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
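As a rough illustration of the composite key idea, the sketch below builds a key from a user ID and a creation timestamp. The attribute names `userId` and `createdAt` are assumptions made up for this example. The timestamp is formatted as a fixed-width UTC string so that string order matches chronological order.

```python
from datetime import datetime, timezone


def sort_key(created_at):
    """Fixed-width UTC timestamp, so lexicographic order is chronological."""
    return created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


def primary_key(user_id, created_at):
    """Composite primary key: hash key (userId) plus range key (createdAt)."""
    return {"userId": user_id, "createdAt": sort_key(created_at)}


first = primary_key("user-1", datetime(2020, 1, 5, 9, 0, tzinfo=timezone.utc))
second = primary_key("user-1", datetime(2020, 1, 5, 9, 0, 0, 1, tzinfo=timezone.utc))

# Same hash key, different range keys: both items can live in one partition.
print(first)
print(second)
```

Two items created for the same user in the same microsecond would still collide under this scheme, which is one reason a generated UUID can be the safer range key when ordering is not needed.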

"Aha! But I can just make my partition key totally random, then apply an index with a partition key on the attribute I want to query. That way I get uniform access and fast, intuitive queries."

Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine indexes as a whole new table - that's basically what they are. Indexes are not a workaround for uneven partition access.

Primary Key

Hash Key: Event ID

Range Key: None

Global Secondary Index

Hash Key: Calendar ID

Range Key: startTimestamp
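The schema above could be expressed as a table definition along the following lines. This is a sketch, not a definitive setup: the table name `events`, the index name, the capacity numbers and the `ALL` projection are all assumptions chosen for illustration. The dictionary uses the parameter names of the DynamoDB `CreateTable` API, and building it needs no AWS access.

```python
def events_table_definition(read_units=5, write_units=5):
    """Build CreateTable parameters for the suggested events schema."""
    throughput = {"ReadCapacityUnits": read_units, "WriteCapacityUnits": write_units}
    return {
        "TableName": "events",  # assumed table name
        # Only attributes used in a key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "id", "AttributeType": "S"},
            {"AttributeName": "calendarId", "AttributeType": "S"},
            {"AttributeName": "startTimestamp", "AttributeType": "N"},
        ],
        # Primary key: hash key on the event ID, no range key.
        "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": "calendarId-startTimestamp-index",  # assumed name
                "KeySchema": [
                    {"AttributeName": "calendarId", "KeyType": "HASH"},
                    {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
                ],
                # Project every attribute so ownerId is available on the index.
                "Projection": {"ProjectionType": "ALL"},
                # The index has its own throughput, separate from the table.
                "ProvisionedThroughput": dict(throughput),
            }
        ],
        "ProvisionedThroughput": dict(throughput),
    }


definition = events_table_definition()
print(definition["KeySchema"])
print(definition["GlobalSecondaryIndexes"][0]["KeySchema"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("dynamodb").create_table(**events_table_definition())`.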

Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come into play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).

As for your queries:

Get an event by ID

GetItem using the event ID

Get all events where calendarId = x and ownerId = y

Query by GSI partition key, add a condition on ownerId

Get all events where startTimestamp is between x and y and calendarId = z

Query by GSI partition key, add a condition on range key
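The three access patterns could be sketched as request parameters for the low-level DynamoDB API. The table name `events` and the index name are assumptions, and the helper function names are hypothetical. Since ownerId is not part of the index key, the condition on it is written here as a filter expression, which is applied after the items for the calendar have been read.

```python
TABLE = "events"                            # assumed table name
INDEX = "calendarId-startTimestamp-index"   # assumed index name


def get_event(event_id):
    """GetItem parameters: exact primary key lookup on the table."""
    return {"TableName": TABLE, "Key": {"id": {"S": event_id}}}


def events_for_calendar_and_owner(calendar_id, owner_id):
    """Query parameters: GSI partition key, then filter on ownerId."""
    return {
        "TableName": TABLE,
        "IndexName": INDEX,
        "KeyConditionExpression": "calendarId = :cal",
        "FilterExpression": "ownerId = :owner",
        "ExpressionAttributeValues": {
            ":cal": {"S": calendar_id},
            ":owner": {"S": owner_id},
        },
    }


def events_in_range(calendar_id, start, end):
    """Query parameters: GSI partition key plus a range on the sort key."""
    return {
        "TableName": TABLE,
        "IndexName": INDEX,
        "KeyConditionExpression": (
            "calendarId = :cal AND startTimestamp BETWEEN :lo AND :hi"
        ),
        "ExpressionAttributeValues": {
            ":cal": {"S": calendar_id},
            # Numbers are sent as strings in the low-level API.
            ":lo": {"N": str(start)},
            ":hi": {"N": str(end)},
        },
    }


print(get_event("event-1"))
print(events_for_calendar_and_owner("cal-1", "owner-1"))
print(events_in_range("cal-1", 1600000000, 1600086400))
```

With boto3 available, these dictionaries could be passed to `client.get_item(**...)` and `client.query(**...)` on a `boto3.client("dynamodb")` instance. Note that the filtered query still reads every event in the calendar, so it consumes read capacity for items the filter later discards.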
