What's the recommended index schema for DynamoDB for a typical CRUD application?


Question

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.

I have a simple calendar application with an events table. Here are the columns I have:

id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)

I want to perform the following queries:

  • Get an event by ID
  • Get all events where calendarId = x and ownerId = y
  • Get all events where startTimestamp is between x and y and calendarId = z

The DynamoDB docs seem to heavily suggest avoiding the event's ID as a partition/sort key here, so what's the recommended schema?

Recommended answer

This is a problem that everyone wrestles with when they start with DynamoDB (and indeed when they are experienced with it).

Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table: the volume of writes and reads you can perform is limited by the throughput you provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
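To make the pricing arithmetic concrete, here is a minimal sketch that adds up the three charges quoted above. The prices are the figures from this answer, not necessarily current AWS pricing, and `monthly_cost` is a hypothetical helper name.

```python
# Monthly prices in USD, copied from the figures quoted in the answer.
# They are assumptions for illustration, not current AWS pricing.
STORAGE_PER_GB = 0.25
PER_WCU = 0.47
PER_RCU = 0.09


def monthly_cost(storage_gb: float, wcu: int, rcu: int) -> float:
    """Estimate the monthly bill for a table with provisioned throughput."""
    return storage_gb * STORAGE_PER_GB + wcu * PER_WCU + rcu * PER_RCU


if __name__ == "__main__":
    # 10 GB at rest, 10 WCUs and 10 RCUs provisioned.
    print(f"${monthly_cost(10, 10, 10):.2f}")
```

For 10 GB, 10 WCUs and 10 RCUs this comes to 2.50 + 4.70 + 0.90, so the script prints $8.10.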

Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value, the data is sorted by range key, if you have specified one.

If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.

DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
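The difference between GetItem, Query and Scan can be illustrated with a toy in-memory model. This is only a sketch of the access patterns described above, not how DynamoDB is implemented; `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """In-memory stand-in for a table with a hash key and a range key."""

    def __init__(self):
        # hash key value -> {range key value -> item}
        self.partitions = {}

    def put_item(self, hash_key, range_key, item):
        self.partitions.setdefault(hash_key, {})[range_key] = item

    def get_item(self, hash_key, range_key):
        """Exact primary key: a direct lookup of a single item."""
        return self.partitions.get(hash_key, {}).get(range_key)

    def query(self, hash_key, low, high):
        """Read one partition only; results are ordered by range key."""
        partition = self.partitions.get(hash_key, {})
        keys = [k for k in sorted(partition) if low <= k <= high]
        return [partition[k] for k in keys]

    def scan(self, predicate):
        """Evaluate every item in every partition, returning the matches
        and the number of items that had to be examined."""
        examined, matches = 0, []
        for partition in self.partitions.values():
            for item in partition.values():
                examined += 1
                if predicate(item):
                    matches.append(item)
        return matches, examined


if __name__ == "__main__":
    table = ToyTable()
    table.put_item("cal-1", 200, {"name": "standup"})
    table.put_item("cal-1", 100, {"name": "planning"})
    table.put_item("cal-2", 150, {"name": "review"})
    print(table.query("cal-1", 0, 250))
    print(table.scan(lambda item: item["name"] == "review"))
```

The query only ever looks inside the "cal-1" partition, whereas the scan has to examine all three items to find one match.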

This is where it gets interesting. DynamoDB takes all the throughput you have purchased and spreads it evenly over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table and 5 partitions; that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, since you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
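The arithmetic in that example can be written down as a small sketch. It assumes the even split described above and nothing more; both function names are hypothetical.

```python
def per_partition_throughput(total_units: float, partitions: int) -> float:
    """Capacity units available to each partition under an even split."""
    return total_units / partitions


def usable_throughput(total_units: float, partitions: int,
                      partitions_accessed: int) -> float:
    """Capacity you can actually use if traffic only reaches some partitions."""
    return per_partition_throughput(total_units, partitions) * partitions_accessed


if __name__ == "__main__":
    print(usable_throughput(10, 5, 5))  # traffic spread over all partitions
    print(usable_throughput(10, 5, 1))  # traffic hitting a single partition
```

With 10 units over 5 partitions, evenly spread traffic can use all 10.0 units, while traffic to a single partition is limited to 2.0.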

Based on the above, we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up on this, which is not surprising if you read the article I just linked (which you also linked).

Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focused on making their partition access perfectly uniform, and end up with a table they can't query quickly.

I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users, which is not correct; the size of the table is irrelevant.)

It's a balance between uniform access and being able to use intuitive, natural queries for your application. What I am saying is that if you are new to DynamoDB, the right answer is probably to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but remember that access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out, and in many cases they can discourage people from using DynamoDB if they get too focused on the uniform access idea.

Most applications will have users. In most applications, the most common query you will do is get data for a user. So the first option for most applications' primary partition key will often be a user ID. That's fine, as long as you don't have a few very high-hitting users and many users that never log in.

Another tip: if your table is called vegetables, your primary partition key will probably be vegetable ID. If your table is called shoes, your primary partition key will probably be shoe ID.

Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option is often to add a date range (sort) key, perhaps the datetime the item was created. This orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key. You won't use the ordering it gives you, but you can then have many items per user and still use the Query function.

Aha! But I can just make my partition key totally random and then apply an index whose partition key is the attribute I really want to query on. That way I get uniform access and fast, intuitive queries.

Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine indexes as a whole new table; that's basically what they are. Indexes are not a workaround for uneven partition access.

Primary Key

Hash Key: Event ID

Range Key: None

Global Secondary Index

Hash Key: Calendar ID

Range Key: startTimestamp
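As a rough sketch, the schema above could be expressed as the parameters that boto3's low-level `create_table` call accepts. The table name, the index name, the attribute types (string GUIDs, numeric timestamp), the `ALL` projection and the throughput numbers are all assumptions made for illustration; the code only builds the dictionary and does not contact AWS.

```python
TABLE = "events"                               # assumed table name
GSI_NAME = "calendarId-startTimestamp-index"   # hypothetical index name


def events_table_definition(read_units: int = 5, write_units: int = 5) -> dict:
    """Build create_table parameters for the suggested schema."""
    throughput = {"ReadCapacityUnits": read_units,
                  "WriteCapacityUnits": write_units}
    return {
        "TableName": TABLE,
        # Only attributes used in a key schema are declared up front.
        "AttributeDefinitions": [
            {"AttributeName": "id", "AttributeType": "S"},
            {"AttributeName": "calendarId", "AttributeType": "S"},
            {"AttributeName": "startTimestamp", "AttributeType": "N"},
        ],
        # Primary key: hash key only, no range key.
        "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": GSI_NAME,
                "KeySchema": [
                    {"AttributeName": "calendarId", "KeyType": "HASH"},
                    {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
                ],
                # Project every attribute so ownerId and name are readable
                # from the index (an assumption; it increases index cost).
                "Projection": {"ProjectionType": "ALL"},
                # The index carries its own throughput, separate from the table.
                "ProvisionedThroughput": dict(throughput),
            }
        ],
        "ProvisionedThroughput": dict(throughput),
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("dynamodb").create_table(**events_table_definition())`. Note that the index declares its own ProvisionedThroughput, which is where the extra cost of a secondary index comes from.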

Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come into play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).

And your queries:

Get an event by ID

GetItem using the Event ID

Get all events where calendarId = x and ownerId = y

Query by GSI partition key, add a condition on ownerId (a filter, since ownerId is not part of the index key)

Get all events where startTimestamp is between x and y and calendarId = z

Query by GSI partition key, add a condition on range key
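The three queries could be sketched as request parameters for boto3's low-level `get_item` and `query` calls. The table name and index name are hypothetical, the GUIDs are assumed to be stored as strings and the timestamp as a number, and the code only builds dictionaries without calling AWS.

```python
TABLE = "events"                               # assumed table name
GSI_NAME = "calendarId-startTimestamp-index"   # hypothetical index name


def get_event_request(event_id: str) -> dict:
    """GetItem on the table's primary key (the event ID)."""
    return {"TableName": TABLE, "Key": {"id": {"S": event_id}}}


def events_for_owner_request(calendar_id: str, owner_id: str) -> dict:
    """Query the GSI by calendarId, then filter the results on ownerId."""
    return {
        "TableName": TABLE,
        "IndexName": GSI_NAME,
        "KeyConditionExpression": "calendarId = :c",
        # ownerId is not an index key, so it can only be a filter.
        "FilterExpression": "ownerId = :o",
        "ExpressionAttributeValues": {
            ":c": {"S": calendar_id},
            ":o": {"S": owner_id},
        },
    }


def events_in_window_request(calendar_id: str, start: int, end: int) -> dict:
    """Query the GSI by calendarId with a range condition on startTimestamp."""
    return {
        "TableName": TABLE,
        "IndexName": GSI_NAME,
        "KeyConditionExpression":
            "calendarId = :c AND startTimestamp BETWEEN :lo AND :hi",
        # The low-level API transmits numbers as strings.
        "ExpressionAttributeValues": {
            ":c": {"S": calendar_id},
            ":lo": {"N": str(start)},
            ":hi": {"N": str(end)},
        },
    }
```

Each dictionary could be unpacked into the matching client call, for example `client.query(**events_in_window_request("cal-1", 100, 200))`. One design consequence worth knowing: a filter is applied after the items have been read, so the ownerId query still consumes read capacity for every event in the calendar, not just the matching ones.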
