What's the recommended index schema for Dynamo for a typical CRUD application?


Question

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.

I have a simple calendar application with an events table. Here are its columns:

id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)

I want to run the following queries:

  • Get an event by ID
  • Get all events where calendarId = x and ownerId = y
  • Get all events where startTimestamp is between x and y and calendarId = z

The DynamoDB docs seem to heavily suggest avoiding the event's ID as a partition/sort key here, so what's the recommended schema?

Recommended answer

This is a problem that everyone wrestles with when they start with DynamoDB (and indeed when they are experienced with it).

Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front: the volume of writes and reads you can perform on your table is limited by the throughput you provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
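To make the pricing concrete, here is a minimal sketch of the monthly bill using only the figures quoted above. The prices are the ones given in this answer and may not match current AWS pricing; the function name is made up for illustration.

```python
# Prices quoted in the answer above (USD per month). Treat them as
# illustrative figures, not current AWS pricing.
STORAGE_PER_GB = 0.25
PER_WCU = 0.47
PER_RCU = 0.09


def monthly_cost(storage_gb: float, wcu: int, rcu: int) -> float:
    """Estimate the monthly cost of one provisioned table, ignoring the free tier."""
    return storage_gb * STORAGE_PER_GB + wcu * PER_WCU + rcu * PER_RCU


# A 1 GB table provisioned with 10 WCUs and 10 RCUs.
print(f"${monthly_cost(1, 10, 10):.2f} per month")
```

With these figures, write capacity dominates the bill for a small table, which is why the rest of this answer cares so much about whether the throughput you pay for is actually usable.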

Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value, the data is sorted by range key if you have specified one.

If you have the exact primary key (the hash key, plus the range key if there is one), you can access an item directly using GetItem. If you have multiple items to get, you can use BatchGetItem.
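As a rough sketch, these are the request parameters the two calls take in boto3's low-level DynamoDB client. The table name `events` and the string key attribute `id` are assumptions borrowed from the question, and the helper names are hypothetical. The functions only build keyword arguments, so they run without AWS credentials.

```python
def get_item_request(event_id: str) -> dict:
    """Keyword arguments for client.get_item(): one item by its exact primary key."""
    return {
        "TableName": "events",  # assumed table name
        "Key": {"id": {"S": event_id}},  # assumed string hash key called "id"
    }


def batch_get_item_request(event_ids: list[str]) -> dict:
    """Keyword arguments for client.batch_get_item(): several items by exact key."""
    return {
        "RequestItems": {
            "events": {"Keys": [{"id": {"S": eid}} for eid in event_ids]}
        }
    }


# With boto3 installed and credentials configured, these would be used as:
#   client = boto3.client("dynamodb")
#   item = client.get_item(**get_item_request("some-guid")).get("Item")
#   found = client.batch_get_item(**batch_get_item_request(["a", "b"]))
```

If the table had a range key as well, both key attributes would have to appear in every `Key` entry.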

DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
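The difference can be pictured with a toy in-memory model. This is not how DynamoDB is implemented, only a sketch of the access pattern: `ToyTable` and its methods are hypothetical names, and the sample data is invented.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTable:
    """In-memory stand-in: partition key -> items kept sorted by sort key."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition key -> [(sort key, item)]

    def put(self, pk, sk, item):
        rows = self.partitions[pk]
        rows.append((sk, item))
        rows.sort(key=lambda row: row[0])  # keep the partition ordered by sort key

    def query(self, pk, sk_low, sk_high):
        """Touch a single partition, then slice it by sort key range."""
        rows = self.partitions.get(pk, [])
        keys = [sk for sk, _ in rows]
        lo, hi = bisect_left(keys, sk_low), bisect_right(keys, sk_high)
        return [item for _, item in rows[lo:hi]]

    def scan(self, predicate):
        """Evaluate every item in every partition."""
        return [
            item
            for rows in self.partitions.values()
            for _, item in rows
            if predicate(item)
        ]


table = ToyTable()
table.put("cal-1", 100, {"name": "standup"})
table.put("cal-1", 200, {"name": "review"})
table.put("cal-2", 150, {"name": "lunch"})

in_range = table.query("cal-1", 50, 150)  # only looks at partition "cal-1"
matches = table.scan(lambda item: item["name"] == "lunch")  # looks at everything
```

The work done by `query` depends on the size of one partition, while the work done by `scan` grows with the whole table.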

This is where it gets interesting. DynamoDB takes all the throughput you have purchased and spreads it evenly over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions: that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, because you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
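The arithmetic in that example can be sketched as follows, assuming the even split described in this answer. The function names are hypothetical.

```python
def per_partition_capacity(table_units: int, partitions: int) -> float:
    """Capacity each partition gets if the table's units are split evenly."""
    return table_units / partitions


def usable_capacity(table_units: int, partitions: int, partitions_hit: int) -> float:
    """Capacity actually reachable when traffic only lands on some partitions."""
    return per_partition_capacity(table_units, partitions) * partitions_hit


print(usable_capacity(10, 5, 5))  # traffic spread over all 5 partitions
print(usable_capacity(10, 5, 1))  # all traffic on a single partition
```

Under this model the first call gives the full 10 units and the second only 2, which is the gap between what was purchased and what a hot partition can use.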

Based on the above, we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up on this, which is not surprising if you read the article I just linked (which you also linked).

Remember that partition keys are what we use in a Query to get our data fast and to avoid regular Scans. Some people get too focussed on making their partition access perfectly uniform, and end up with a table they can't query quickly.

I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says "where you have many users", which is not correct; the size of the table is irrelevant.)

It's a balance between uniform access and being able to use intuitive, natural queries for your application. What I am saying is that if you are new to DynamoDB, the right answer is probably to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but remember that access doesn't have to be perfectly uniform. There are various design patterns that achieve both intuitive and uniform access, but these can be complicated for those starting out, and in many cases they can discourage people from using DynamoDB if they get too focussed on the uniform access idea.

Most applications will have users, and in most applications the most common query is to get data for a user. So the first option for most applications' primary partition key will often be a user ID. That's fine, as long as you don't have a few very high-hitting users and many users that never log in.

Another tip: if your table is called vegetables, your primary partition key will probably be vegetable ID. If your table is called shoes, your primary partition key will probably be shoe ID.

Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option is often to add a date range (sort) key, perhaps the datetime the item was created. This orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key. You won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
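A minimal sketch of the two range key options just described. The attribute names `userId`, `createdAt`, and `itemId` are assumptions made for illustration, as are the helper names.

```python
import uuid
from datetime import datetime, timezone


def dated_key(user_id: str, created_at: datetime) -> dict:
    """Composite key: user ID as hash key, ISO 8601 creation time (UTC) as range key."""
    return {
        "userId": user_id,
        "createdAt": created_at.astimezone(timezone.utc).isoformat(),
    }


def uuid_key(user_id: str) -> dict:
    """Composite key: user ID as hash key, random UUID as range key (no useful order)."""
    return {"userId": user_id, "itemId": str(uuid.uuid4())}


first = dated_key("user-1", datetime(2020, 1, 1, 9, 0, tzinfo=timezone.utc))
second = dated_key("user-1", datetime(2020, 1, 2, 9, 0, tzinfo=timezone.utc))

# Same partition (user), different range keys, ordered by creation time.
assert first["userId"] == second["userId"]
assert first["createdAt"] < second["createdAt"]
```

One caveat with the datetime option: two items written for the same user at exactly the same instant would share a primary key, so the second write would overwrite the first. The UUID option avoids that at the cost of a meaningful sort order.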

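Aha! But I can make my partition key totally random, and then apply an index whose partition key is the attribute I really want to query on. That way I get uniform access and fast, intuitive queries.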

Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine an index as a whole new table, because that's basically what it is. Indexes are not a workaround for uneven partition access.

Primary Key

Hash Key: Event ID

Range Key: None

Global Secondary Index

Hash Key: Calendar ID

Range Key: startTimestamp
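For reference, a hedged sketch of how this schema might be expressed as parameters for boto3's `create_table`. The table name, the index name `calendar-start-index`, and the capacity numbers are placeholders I have chosen; the attribute names come from the question.

```python
def events_table_definition() -> dict:
    """Keyword arguments for client.create_table() matching the schema above."""
    throughput = {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}  # placeholder values
    return {
        "TableName": "events",  # assumed table name
        # Only attributes used in a key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "id", "AttributeType": "S"},
            {"AttributeName": "calendarId", "AttributeType": "S"},
            {"AttributeName": "startTimestamp", "AttributeType": "N"},
        ],
        # Primary key: hash key only, no range key.
        "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": "calendar-start-index",  # hypothetical index name
                "KeySchema": [
                    {"AttributeName": "calendarId", "KeyType": "HASH"},
                    {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": dict(throughput),
            }
        ],
        "ProvisionedThroughput": dict(throughput),
    }


# With boto3 installed and credentials configured:
#   boto3.client("dynamodb").create_table(**events_table_definition())
```

Note that the index carries its own `ProvisionedThroughput`, which is the point made earlier about indexes behaving like separate tables.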

Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much further. Other things that come into play are how fast you want queries to run and how much you are willing to pay (e.g. secondary indexes are expensive).

Your queries:

Get an event by ID

GetItem using the event ID

Get all events where calendarId = x and ownerId = y

Query by GSI partition key, add a condition on ownerId

Get all events where startTimestamp is between x and y and calendarId = z

Query by GSI partition key, add a condition on the range key
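The two index queries might look like this as parameters for boto3's low-level `query` call. This is a sketch under assumptions: the table name `events` and the index name `calendar-start-index` are placeholders, and the helper names are hypothetical.

```python
INDEX_NAME = "calendar-start-index"  # hypothetical GSI name


def events_for_calendar_and_owner(calendar_id: str, owner_id: str) -> dict:
    """Query the GSI by calendarId, then filter the results on ownerId."""
    return {
        "TableName": "events",  # assumed table name
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "calendarId = :cal",
        "FilterExpression": "ownerId = :owner",
        "ExpressionAttributeValues": {
            ":cal": {"S": calendar_id},
            ":owner": {"S": owner_id},
        },
    }


def events_for_calendar_between(calendar_id: str, start: int, end: int) -> dict:
    """Query the GSI by calendarId with a range condition on startTimestamp."""
    return {
        "TableName": "events",
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "calendarId = :cal AND startTimestamp BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":cal": {"S": calendar_id},
            # Numbers are sent as strings in the low-level API.
            ":lo": {"N": str(start)},
            ":hi": {"N": str(end)},
        },
    }


# With boto3 installed and credentials configured:
#   client = boto3.client("dynamodb")
#   items = client.query(**events_for_calendar_between("cal-1", 100, 200))["Items"]
```

The ownerId condition is a filter rather than a key condition, so it is applied after the calendar's events have been read. The query still consumes read capacity for every event in that calendar, not just the ones that match the owner.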
