Retrieve distinct values from the hash key - DynamoDB


Question

I have a DynamoDB table that stores email attribute information. The hash key is the email, and the range key is a timestamp (number). The original reason for using the email as the hash key was to query all records for a given email. Now I also want to retrieve all distinct email ids (the hash key values). I am using boto for this, but I am unsure how to retrieve the distinct email ids.

My current code to pull 10,000 email records is:

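import boto.dynamodb2
from boto.dynamodb2.table import Table

conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
s = email_attributes.scan(limit=10000, attributes=['email'])  # return only the 'email' attribute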

But to retrieve the distinct records, I would have to do a full table scan and then pick out the distinct records in code. Another idea is to maintain a second table that stores only these emails, using conditional writes so that an email id is written only if it does not already exist. I am not sure whether this would be more expensive, given that every write would be a conditional write.

Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?


Answer

Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a hash+range (H+R) table of email_id+timestamp called stamped_emails, the list of all unique email_ids is a materialized view of that table. You could enable a DynamoDB Stream on the stamped_emails table and subscribe a Lambda function to that Stream which does a PutItem (email_id) to a hash-only table called emails_only. Then you could Scan emails_only and get no duplicates.
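The two options above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. It assumes the hash key attribute is named `email` (as in the question's scan), that the Stream delivers the standard `Records` / `dynamodb.Keys` event shape, and that the Lambda function uses boto3 rather than the question's boto. The function names are hypothetical.

```python
def distinct_hash_keys(items, hash_key="email"):
    """Option 1: de-duplicate scanned items on the client, keeping first-seen order."""
    seen = set()
    distinct = []
    for item in items:
        value = item[hash_key]
        if value not in seen:
            seen.add(value)
            distinct.append(value)
    return distinct


def emails_from_stream_event(event, hash_key="email"):
    """Option 2, step 1: collect hash key values from INSERT records of a Stream event."""
    emails = set()
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # updates and deletes do not introduce new email ids
        keys = record["dynamodb"]["Keys"]
        emails.add(keys[hash_key]["S"])  # assumes a string-typed hash key
    return emails


def lambda_handler(event, context):
    """Option 2, step 2: write each email id into the hash-only table."""
    import boto3  # imported lazily so the helpers above run without AWS dependencies

    # Assumption: emails_only exists and uses 'email' as its hash key.
    table = boto3.resource("dynamodb").Table("emails_only")
    for email in emails_from_stream_event(event):
        # PutItem overwrites an existing item with the same key, so repeats are harmless.
        table.put_item(Item={"email": email})
```

With the question's code, `distinct_hash_keys(s)` would iterate over the scan results and return each email once. It still pays for the full scan, which is what the emails_only table avoids.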

Finally, regarding your question about cost: first, Scan reads entire items even if you request only certain projected attributes from them. Second, Scan has to read every item, including items that a FilterExpression then filters out. Third, Scan reads items sequentially, so each Scan call is treated as one big read for metering purposes. The cost implication is that a Scan call that reads 200 different items will not necessarily cost 100 RCU (what 200 separate eventually consistent reads at 0.5 RCU each would cost). If each of those items is 100 bytes, that Scan call will cost ROUND_UP((20000 bytes / 1024 bytes per KB) / 8 KB per eventually consistent RCU) = 3 RCU. Even if the call returns only 123 items, you still incur 3 RCU if the Scan had to read 200 items.
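The arithmetic above can be checked with a small sketch. It mirrors the answer's formula and should be treated as an estimate only. The function names are hypothetical, and the 4 KB figure for strongly consistent reads is added here for comparison.

```python
import math


def scan_rcu(items_read, item_size_bytes, eventually_consistent=True):
    """Estimate RCU for one Scan call, metered on the total bytes read."""
    total_kb = items_read * item_size_bytes / 1024.0
    # Per the answer: 8 KB per RCU for eventually consistent reads
    # (assumed 4 KB per RCU for strongly consistent reads).
    kb_per_rcu = 8 if eventually_consistent else 4
    return int(math.ceil(total_kb / kb_per_rcu))


def individual_reads_rcu(items_read, item_size_bytes):
    """Estimate RCU if each item were read separately with eventual consistency.

    Each item is rounded up to a 4 KB unit, and each unit costs 0.5 RCU.
    """
    units_per_item = int(math.ceil(item_size_bytes / 4096.0))
    return items_read * units_per_item * 0.5


print(scan_rcu(200, 100))              # 3
print(individual_reads_rcu(200, 100))  # 100.0
```

The estimate depends on the number of items read, not the number returned, so a filter that returns only 123 of the 200 items leaves it unchanged.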
