DynamoDB get_item to read 400KB data in milliseconds
Problem description
I have a DynamoDB table called events in which I store all user event details, like product_view, add_to_cart, and product_purchase.
In this events table, I have some items whose size has reached the 400KB item limit.
Problem:
response = self._table.get_item(
    Key={
        PARTITION_KEY: <pk>,
        SORT_KEY: <sk>,
    },
    ConsistentRead=False,
)
When I access the (400KB) item with the DynamoDB get_item method, it takes around 5 seconds to return the result.
I have already used DAX.
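For reference, the DAX reads go through the amazon-dax-client package, roughly like this (a sketch; the cluster endpoint below is a placeholder):

import amazondax

# placeholder for the real DAX cluster endpoint
dax = amazondax.AmazonDaxClient.resource(
    endpoint_url="daxs://my-cluster.xxxx.dax-clusters.us-east-1.amazonaws.com"
)
table = dax.Table("events")
response = table.get_item(
    Key={"partition_key": "user_id1111", "sort_key": "version_1"}
)

Even with DAX, the full 400KB still has to be transferred and deserialized on the client, which is presumably why it didn't help.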
Goal
I want to read the 400KB item in less than 1 second.
Important information:
The data in DynamoDB is stored in this format:
{
    "partition_key": "user_id1111",
    "sort_key": "version_1",
    "attributes": {
        "events": [
            {
                "t": "1614712316",
                "a": "product_view",
                "i": "1275"
            },
            {
                "t": "1614712316",
                "a": "product_add",
                "i": "1275"
            },
            {
                "t": "1614712316",
                "a": "product_purchase",
                "i": "1275"
            },
            ...
        ]
    }
}
- t is a timestamp
- a may be product_view, product_add, or product_purchase
- i is the product_id
As you can see in the item above, events is a list, and new events keep getting appended to it.
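New events are appended with an update along these lines (a sketch using the boto3 Table resource and the key names from the item above):

table.update_item(
    Key={"partition_key": "user_id1111", "sort_key": "version_1"},
    UpdateExpression="SET attributes.events = list_append(attributes.events, :new)",
    ExpressionAttributeValues={
        ":new": [{"t": "1614712316", "a": "product_view", "i": "1275"}]
    },
)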
I have an item that has grown to 400KB because of the number of events in the events list.
I wrote a small script to measure the time; the results are given below.
import boto3
import datetime

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('events')

pk = "user_id1111"
sk = "version_1"

t_load_start = datetime.datetime.now()
response = table.get_item(
    Key={
        "partition_key": pk,
        "sort_key": sk,
    },
    ReturnConsumedCapacity="TOTAL"
)
capacity_units = response["ConsumedCapacity"]["CapacityUnits"]
t_load_end = datetime.datetime.now()
seconds = (t_load_end - t_load_start).total_seconds()
print(f"Elapsed time is::{seconds}sec and {capacity_units} capacity units")
This is the output I get:
Elapsed time is::5.676799sec and 50.0 capacity units
Can anyone suggest a solution for this?
Recommended answer
tl;dr: Increase your function's memory to at least 1024MB, see update 2 below.

I was curious, so I did some measurements. I created a script that creates a big boi item of pretty much exactly 400KB in a fresh table.
Then I tested two reads from Python, one with the resource API and the other with the lower-level client, with eventually consistent reads in both cases.
This is what I measured:
Reading Big Boi from a Table Resource took 0.366508s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.301585s and consumed 50.0 RCUs
If we extrapolate from the RCUs, the item that was read was about 50 * 2 * 4KB = 400KB in size (an eventually consistent read consumes 0.5 RCUs per 4KB).
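The same back-of-the-envelope calculation in code:

rcus = 50.0          # ConsumedCapacity reported by get_item
kb_per_rcu = 2 * 4   # eventually consistent reads: 0.5 RCU per 4KB block
print(rcus * kb_per_rcu)  # 400.0 KB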
I ran it a few times locally from Germany against eu-central-1 (Frankfurt, Germany), and the highest latency I saw was about 900ms. (This is without DAX.)
That's why I think you should show us how you're taking your measurements. Here is my script:
from datetime import datetime

import boto3

TABLE_NAME = "big-boi-test"
BIG_BOI_PK = "f0ba8d6c"

TABLE_RESOURCE = boto3.resource("dynamodb").Table(TABLE_NAME)
DDB_CLIENT = boto3.client("dynamodb")


def create_table():
    DDB_CLIENT.create_table(
        AttributeDefinitions=[{"AttributeName": "PK", "AttributeType": "S"}],
        TableName=TABLE_NAME,
        KeySchema=[{"AttributeName": "PK", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST"
    )


def create_big_boi_item() -> dict:
    # based on calculations here: https://zaccharles.github.io/dynamodb-calculator/
    template = {
        "PK": {
            "S": BIG_BOI_PK
        },
        "bigBoi": {
            "S": ""
        }
    }  # This is 16 bytes
    big_boi = "X" * (1024 * 400 - 16)
    template["bigBoi"]["S"] = big_boi
    return template


def store_big_boi():
    big_boi = create_big_boi_item()
    DDB_CLIENT.put_item(
        Item=big_boi,
        TableName=TABLE_NAME
    )


def get_big_boi_with_table_resource():
    start = datetime.now()
    response = TABLE_RESOURCE.get_item(
        Key={"PK": BIG_BOI_PK},
        ReturnConsumedCapacity="TOTAL"
    )
    end = datetime.now()
    seconds = (end - start).total_seconds()
    capacity_units = response["ConsumedCapacity"]["CapacityUnits"]
    print(f"Reading Big Boi from a Table Resource took {seconds}s and consumed {capacity_units} RCUs")


def get_big_boi_with_client():
    start = datetime.now()
    response = DDB_CLIENT.get_item(
        Key={"PK": {"S": BIG_BOI_PK}},
        ReturnConsumedCapacity="TOTAL",
        TableName=TABLE_NAME
    )
    end = datetime.now()
    seconds = (end - start).total_seconds()
    capacity_units = response["ConsumedCapacity"]["CapacityUnits"]
    print(f"Reading Big Boi from a Client took {seconds}s and consumed {capacity_units} RCUs")


if __name__ == "__main__":
    # create_table()
    # store_big_boi()
    get_big_boi_with_table_resource()
    get_big_boi_with_client()
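A side note on measurement: datetime.now() is good enough here, but time.perf_counter() is the more robust choice for elapsed-time measurements, since it's monotonic and unaffected by system clock adjustments. A minimal variant, reusing the names from the script above:

import time

start = time.perf_counter()
response = DDB_CLIENT.get_item(
    Key={"PK": {"S": BIG_BOI_PK}},
    TableName=TABLE_NAME,
)
print(f"get_item took {time.perf_counter() - start:.3f}s")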
Update

I did the same measurements again with an item that looks more like the one you're using, and I'm still below 1000ms on average, no matter which way I request it:
Reading Big Boi from a Table Resource took 1.492829s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.871583s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.857513s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.769432s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.690172s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.670099s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.633489s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.605999s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.598635s and consumed 50.0 RCUs
Reading Big Boi from a Table Resource took 0.606553s and consumed 50.0 RCUs
Reading Big Boi from a Client took 1.66636s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.921605s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.831735s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.707082s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.668602s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.648401s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.5695s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.592073s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.611436s and consumed 50.0 RCUs
Reading Big Boi from a Client took 0.553827s and consumed 50.0 RCUs
Average latency over 10 requests with the table resource: 0.7796304s
Average latency over 10 requests with the client: 0.7770621s
The item mirrors your format; its structure is built in create_big_boi_item below. Here is the full test script for you to verify:
import statistics
from datetime import datetime

import boto3

TABLE_NAME = "big-boi-test"
BIG_BOI_PK = "NestedBoi"

TABLE_RESOURCE = boto3.resource("dynamodb").Table(TABLE_NAME)
DDB_CLIENT = boto3.client("dynamodb")


def create_table():
    DDB_CLIENT.create_table(
        AttributeDefinitions=[{"AttributeName": "PK", "AttributeType": "S"}],
        TableName=TABLE_NAME,
        KeySchema=[{"AttributeName": "PK", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST"
    )


def create_big_boi_item() -> dict:
    # based on calculations here: https://zaccharles.github.io/dynamodb-calculator/
    template = {
        "PK": {
            "S": "NestedBoi"
        },
        "bigBoiContainer": {
            "M": {
                "bigBoiList": {
                    "L": []
                }
            }
        }
    }  # 43 bytes
    item = {
        "M": {
            "t": {"S": "1614712316"},
            "a": {"S": "product_view"},
            "i": {"S": "1275"}
        }
    }  # 36 bytes
    number_of_items = int((1024 * 400 - 43) / 36)
    for _ in range(number_of_items):
        template["bigBoiContainer"]["M"]["bigBoiList"]["L"].append(item)
    return template


def store_big_boi():
    big_boi = create_big_boi_item()
    DDB_CLIENT.put_item(
        Item=big_boi,
        TableName=TABLE_NAME
    )


def get_big_boi_with_table_resource():
    start = datetime.now()
    response = TABLE_RESOURCE.get_item(
        Key={"PK": BIG_BOI_PK},
        ReturnConsumedCapacity="TOTAL"
    )
    end = datetime.now()
    seconds = (end - start).total_seconds()
    capacity_units = response["ConsumedCapacity"]["CapacityUnits"]
    print(f"Reading Big Boi from a Table Resource took {seconds}s and consumed {capacity_units} RCUs")
    return seconds


def get_big_boi_with_client():
    start = datetime.now()
    response = DDB_CLIENT.get_item(
        Key={"PK": {"S": BIG_BOI_PK}},
        ReturnConsumedCapacity="TOTAL",
        TableName=TABLE_NAME
    )
    end = datetime.now()
    seconds = (end - start).total_seconds()
    capacity_units = response["ConsumedCapacity"]["CapacityUnits"]
    print(f"Reading Big Boi from a Client took {seconds}s and consumed {capacity_units} RCUs")
    return seconds


if __name__ == "__main__":
    # create_table()
    # store_big_boi()
    n_experiments = 10
    experiments_with_table_resource = [get_big_boi_with_table_resource() for i in range(n_experiments)]
    experiments_with_client = [get_big_boi_with_client() for i in range(n_experiments)]
    print(f"Average latency over {n_experiments} requests with the table resource: {statistics.mean(experiments_with_table_resource)}s")
    print(f"Average latency over {n_experiments} requests with the client: {statistics.mean(experiments_with_client)}s")
If I increase n_experiments, it tends to get even faster, probably because DDB caches internally.

Still: can't reproduce your numbers.
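One thing worth isolating on your end is deserialization: the low-level client returns raw DynamoDB JSON, while the resource API additionally converts it into plain Python types. For an item with thousands of nested maps, that conversion is pure CPU work. A sketch that measures that step on its own, reusing the names from the script above and boto3's TypeDeserializer:

import time
from boto3.dynamodb.types import TypeDeserializer

raw_item = DDB_CLIENT.get_item(
    Key={"PK": {"S": BIG_BOI_PK}},
    TableName=TABLE_NAME,
)["Item"]

deserializer = TypeDeserializer()
start = time.perf_counter()
# convert DynamoDB JSON into plain Python types, like the resource API does
item = {k: deserializer.deserialize(v) for k, v in raw_item.items()}
print(f"Deserialization alone took {time.perf_counter() - start:.3f}s")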
Update 2

After learning that you're running Lambda functions, I ran the tests again inside Lambda with different memory configurations:
| Memory | n_experiments | Average time with resource | Average time with client |
|---|---|---|---|
| 128MB | 10 | 6.28s | 5.06s |
| 256MB | 10 | 3.26s | 2.61s |
| 512MB | 10 | 1.62s | 1.33s |
| 1024MB | 10 | 0.84s | 0.68s |
| 2048MB | 10 | 0.52s | 0.43s |
| 4096MB | 10 | 0.51s | 0.41s |
As mentioned in the comments, CPU and network performance scale with the amount of memory you assign to a function. You can solve your problem by throwing money at it :-)
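If you want to raise the memory programmatically rather than in the console, this is the relevant boto3 call (the function name is a placeholder):

import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="my-events-reader",  # placeholder name
    MemorySize=1024,  # MB; CPU and network throughput scale with this
)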