What is the recommended way to delete a large number of items from DynamoDB?


Problem description

I'm writing a simple logging service in DynamoDB.

I have a logs table that is keyed by a user_id hash and a timestamp (Unix epoch int) range.

When a user of the service terminates their account, I need to delete all items in the table, regardless of the range value.

What is the recommended way of doing this sort of operation (keeping in mind there could be millions of items to delete)?

My options, as far as I can see, are:

A: Perform a Scan operation, calling delete on each returned item, until no items are left

B: Perform a BatchGet operation, again calling delete on each item until none are left

Both of these look terrible to me as they will take a looooong time.

What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me.

Any thoughts?

Thanks

Solution

"What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me."

An understandable request indeed; I can imagine advanced operations like these being added over time by the AWS team (they have a history of starting with a limited feature set and evaluating extensions based on customer feedback). Until then, here is what you should do to at least avoid the cost of a full scan:

  1. Use Query (http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Query.html) rather than Scan to retrieve all items for user_id. This works regardless of the combined hash/range primary key in use, because HashKeyValue and RangeKeyCondition are separate parameters in this API, and the former only targets the attribute value of the hash component of the composite primary key.

    • Please note that you'll have to deal with the Query API paging here as usual; see the ExclusiveStartKey parameter:

      Primary key of the item from which to continue an earlier query. An earlier query might provide this value as the LastEvaluatedKey if that query operation was interrupted before completing the query; either because of the result set size or the Limit parameter. The LastEvaluatedKey can be passed back in a new query request to continue the operation from that point.

  2. Loop over all returned items and delete each one with DeleteItem (http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_DeleteItem.html) as usual.

    • Update: Most likely BatchWriteItem is more appropriate for a use case like this (see below for details).
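
The two steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: `client` is assumed to be a boto3 low-level client (`boto3.client("dynamodb")`), the function names are hypothetical, and the key attributes are assumed to be `user_id` (string) and `timestamp` (number). It uses the current form of the Query API, where the hash key condition is written as a `KeyConditionExpression` instead of `HashKeyValue`; paging still works through `LastEvaluatedKey` and `ExclusiveStartKey`.

```python
from typing import Any, Dict, Iterator

Key = Dict[str, Dict[str, Any]]


def iter_user_keys(client: Any, table_name: str, user_id: str) -> Iterator[Key]:
    """Yield the primary key of every item stored under one hash key."""
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        # Only the hash key is constrained; the range key is left open.
        "KeyConditionExpression": "#uid = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        # Fetch just the key attributes. Names go through placeholders
        # to avoid clashes with DynamoDB reserved words.
        "ProjectionExpression": "#uid, #ts",
        "ExpressionAttributeNames": {"#uid": "user_id", "#ts": "timestamp"},
    }
    while True:
        page = client.query(**kwargs)
        for item in page.get("Items", []):
            yield {"user_id": item["user_id"], "timestamp": item["timestamp"]}
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        # Continue the query from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def delete_user_items(client: Any, table_name: str, user_id: str) -> int:
    """Delete every item for user_id one by one; return how many were deleted."""
    deleted = 0
    for key in iter_user_keys(client, table_name, user_id):
        client.delete_item(TableName=table_name, Key=key)
        deleted += 1
    return deleted
```

Usage would look like `delete_user_items(boto3.client("dynamodb"), "logs", "some-user")`, where the table name `logs` is also an assumption. Each delete is a separate request here, so this is only practical for modest item counts.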


Update

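As highlighted by ivant (http://stackoverflow.com/questions/9154264/what-is-the-recomended-way-to-delete-a-large-number-of-items-from-dynamodb/9159431#comment18716993_9159431), the BatchWriteItem operation (http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_BatchWriteItem.html) "enables you to put or delete several items across multiple tables in a single API call":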

To upload one item, you can use the PutItem API and to delete one item, you can use the DeleteItem API. However, when you want to upload or delete large amounts of data, such as uploading large amounts of data from Amazon Elastic MapReduce (EMR) or migrate data from another database in to Amazon DynamoDB, this API offers an efficient alternative.

Please note that this still has some relevant limitations, most notably:

  • Maximum operations in a single request — You can specify a total of up to 25 put or delete operations; however, the total request size cannot exceed 1 MB (the HTTP payload).

  • Not an atomic operation — Individual operations specified in a BatchWriteItem are atomic; however BatchWriteItem as a whole is a "best-effort" operation and not an atomic operation. That is, in a BatchWriteItem request, some operations might succeed and others might fail. [...]

Nevertheless, this offers a potentially significant gain for use cases like the one at hand.
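
As a rough sketch of how the two limitations quoted above shape the code: keys are sent in groups of at most 25, and because the call is best-effort, whatever comes back under `UnprocessedItems` is resent after a short backoff. The names `chunked` and `batch_delete`, the retry count, and the delay values are illustrative assumptions; `client` is again assumed to be a boto3 low-level DynamoDB client, and `keys` is any iterable of primary keys in DynamoDB's attribute-value format (for example, those collected from a paginated Query).

```python
import time
from typing import Any, Dict, Iterable, Iterator, List

MAX_BATCH = 25  # per-request operation limit quoted above


def chunked(items: Iterable[Any], size: int = MAX_BATCH) -> Iterator[List[Any]]:
    """Split an iterable into lists of at most `size` elements."""
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def batch_delete(
    client: Any,
    table_name: str,
    keys: Iterable[Dict[str, Any]],
    max_retries: int = 5,      # assumed value, tune for your workload
    base_delay: float = 0.05,  # seconds; assumed value
) -> int:
    """Delete keys in batches, resending unprocessed items; return the count."""
    deleted = 0
    for batch in chunked(keys):
        pending = {table_name: [{"DeleteRequest": {"Key": k}} for k in batch]}
        attempt = 0
        while pending:
            response = client.batch_write_item(RequestItems=pending)
            unprocessed = response.get("UnprocessedItems") or {}
            left = len(unprocessed.get(table_name, []))
            deleted += len(pending[table_name]) - left
            # Only the requests the service did not process are sent again.
            pending = unprocessed if left else {}
            if not pending:
                break
            attempt += 1
            if attempt > max_retries:
                raise RuntimeError(f"{left} delete requests still unprocessed")
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return deleted
```

Deletes issued this way still consume write capacity for every item, so the gain is in fewer round trips, not in lower throughput cost.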
