What is the recommended way to delete a large number of items from DynamoDB?


Question


I'm writing a simple logging service in DynamoDB.

I have a logs table that is keyed by a user_id hash and a timestamp (Unix epoch int) range.

When a user of the service terminates their account, I need to delete all items in the table, regardless of the range value.

What is the recommended way of doing this sort of operation (keeping in mind there could be millions of items to delete)?

My options, as far as I can see are:

A: Perform a Scan operation, calling delete on each returned item, until no items are left

B: Perform a BatchGet operation, again calling delete on each item until none are left

Both of these look terrible to me as they will take a long time.

What I ideally want to do is call LogTable.DeleteItem(user_id) - without supplying the range, and have it delete everything for me.

Solution

What I ideally want to do is call LogTable.DeleteItem(user_id) - without supplying the range, and have it delete everything for me.

An understandable request indeed; I can imagine advanced operations like these being added over time by the AWS team (they have a history of starting with a limited feature set and evaluating extensions based on customer feedback), but here is what you should do to avoid at least the cost of a full scan:

  1. Use Query rather than Scan to retrieve all items for user_id - this works regardless of the combined hash/range primary key in use, because HashKeyValue and RangeKeyCondition are separate parameters in this API and the former targets only the attribute value of the hash component of the composite primary key.

    • Please note that you'll have to deal with the query API paging here as usual, see the ExclusiveStartKey parameter:

      Primary key of the item from which to continue an earlier query. An earlier query might provide this value as the LastEvaluatedKey if that query operation was interrupted before completing the query; either because of the result set size or the Limit parameter. The LastEvaluatedKey can be passed back in a new query request to continue the operation from that point.

  2. Loop over all returned items and facilitate DeleteItem on each of them as usual

    • Update: Most likely BatchWriteItem is more appropriate for a use case like this (see below for details).
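The pagination loop from step 1 can be sketched generically. In the sketch below, `query_fn` is a stand-in for whatever query call your client exposes (boto3's `Table.query` follows the same `Items`/`LastEvaluatedKey` contract); the helper name and wiring are illustrative, not part of any SDK:

```python
def query_all_pages(query_fn, **kwargs):
    """Drive a DynamoDB-style Query through every page of results.

    `query_fn` is any callable with Query semantics: it returns a dict
    containing "Items" and, while more pages remain, a "LastEvaluatedKey"
    that must be fed back as ExclusiveStartKey on the next call.
    """
    items = []
    start_key = None
    while True:
        if start_key is not None:
            kwargs["ExclusiveStartKey"] = start_key
        page = query_fn(**kwargs)
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:  # no LastEvaluatedKey means this was the last page
            return items
```

With each page's items in hand, step 2 is then a plain DeleteItem per returned key (or, per the update below, a batched delete). For millions of items you would of course delete page by page rather than accumulate everything in memory first.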


Update

As highlighted by ivant, the BatchWriteItem operation enables you to put or delete several items across multiple tables in a single API call [emphasis mine]:

To upload one item, you can use the PutItem API and to delete one item, you can use the DeleteItem API. However, when you want to upload or delete large amounts of data, such as uploading large amounts of data from Amazon Elastic MapReduce (EMR) or migrating data from another database into Amazon DynamoDB, this API offers an efficient alternative.

Please note that this still has some relevant limitations, most notably:

  • Maximum operations in a single request — You can specify a total of up to 25 put or delete operations; however, the total request size cannot exceed 1 MB (the HTTP payload).

  • Not an atomic operation — Individual operations specified in a BatchWriteItem are atomic; however BatchWriteItem as a whole is a "best-effort" operation and not an atomic operation. That is, in a BatchWriteItem request, some operations might succeed and others might fail. [...]

Nevertheless this obviously offers a potentially significant gain for use cases like the one at hand.
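To make the 25-operation limit and the best-effort semantics concrete, here is a minimal sketch of a batched delete; `batch_write_fn` stands in for a client call with `batch_write_item` semantics, and the helper name and parameters are illustrative assumptions, not an SDK API:

```python
def batch_delete(batch_write_fn, table_name, keys, batch_size=25):
    """Delete items in BatchWriteItem-sized chunks (at most 25 per request).

    `batch_write_fn` follows the batch_write_item contract: it accepts
    RequestItems and returns a dict whose "UnprocessedItems" holds any
    requests the service skipped, which must be retried by the caller
    because the batch is best-effort rather than atomic.
    """
    pending = [{"DeleteRequest": {"Key": key}} for key in keys]
    while pending:
        chunk, pending = pending[:batch_size], pending[batch_size:]
        resp = batch_write_fn(RequestItems={table_name: chunk})
        # Re-queue anything the service left unprocessed; real code should
        # back off (e.g. exponentially) before retrying these.
        pending.extend(resp.get("UnprocessedItems", {}).get(table_name, []))
```

Note the retry of UnprocessedItems: because BatchWriteItem is best-effort, a response can succeed overall while individual delete requests are deferred, so they have to be fed back into the queue.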
