DynamoDB get_item to read 400KB data in milliseconds


Question

I have a DynamoDB table called events in which I store all user event details, such as product_view, add_to_cart, and product_purchase.

In this events table, I have some items whose size has reached the 400KB limit.
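
Note that 400KB is DynamoDB's hard limit for a single item (attribute names plus values). A rough client-side guard can warn before writes start failing; the sketch below is an assumption on my part (compact JSON length only approximates DynamoDB's real size formula, and the helper names are made up):

    import json

    # Rough guard (assumption: compact JSON length roughly tracks
    # DynamoDB's item-size formula of attribute-name + value bytes).
    DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024

    def rough_item_size_bytes(item: dict) -> int:
        return len(json.dumps(item, separators=(",", ":")).encode("utf-8"))

    def is_near_limit(item: dict, headroom: float = 0.9) -> bool:
        # True once the item uses more than 90% of the 400KB limit
        return rough_item_size_bytes(item) > headroom * DYNAMODB_ITEM_LIMIT_BYTES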

Problem:

        response = self._table.get_item(
            Key={
                PARTITION_KEY: <pk>,
                SORT_KEY: <sk>,
            },
            ConsistentRead=False,
        )

When I use the DynamoDB get_item method to access this item (400KB), it takes around 5 seconds to return the result.

I have already used DAX.

Goal

I want to read the 400KB item in less than 1 second.

Important information:

The data in DynamoDB is stored in this format:

{
 "partition_key": "user_id1111",
 "sort_key": "version_1",
 "attributes": {
  "events": [
   {
    "t": "1614712316",  
    "a": "product_view",   
    "i": "1275"
   },
   {
    "t": "1614712316",  
    "a": "product_add",   
    "i": "1275"
   },
   {
    "t": "1614712316",  
    "a": "product_purchase",   
    "i": "1275"
   },
    ...

  ]
 }
}

  • t is a timestamp
  • a may be product_view, product_add, or product_purchase
  • i is the product_id

As you can see in the item above, events is a list, and it is appended to with new events.
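
For illustration, appending a new event to that list would look something like this minimal sketch (the table and key names come from the question; the update expression and the event payload are my assumptions, not the asker's code):

    import boto3

    # Minimal sketch: append one new event to the nested
    # attributes.events list with list_append.
    table = boto3.resource("dynamodb").Table("events")

    new_event = {"t": "1614712316", "a": "product_view", "i": "1275"}

    table.update_item(
        Key={"partition_key": "user_id1111", "sort_key": "version_1"},
        # #a.#e aliases attributes.events to stay clear of reserved words
        UpdateExpression="SET #a.#e = list_append(#a.#e, :evt)",
        ExpressionAttributeNames={"#a": "attributes", "#e": "events"},
        ExpressionAttributeValues={":evt": [new_event]},
    )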

I have an item that has grown to 400KB because of the number of events in its events list.

I wrote a script to measure the read time; the results are given below.

    import datetime

    import boto3

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('events')

    pk = "user_id1111"
    sk = "version_1"

    # Measure the wall-clock time of a single get_item call
    t_load_start = datetime.datetime.now()

    response = table.get_item(
        Key={
            "partition_key": pk,
            "sort_key": sk,
        },
        ReturnConsumedCapacity="TOTAL",
    )
    capacity_units = response["ConsumedCapacity"]["CapacityUnits"]

    t_load_end = datetime.datetime.now()
    seconds = (t_load_end - t_load_start).total_seconds()

    print(f"Elapsed time is::{seconds}sec and {capacity_units} capacity units")

This is the output I got:

      Elapsed time is::5.676799sec and 50.0 capacity units
      

Can anyone suggest a solution for this?

Answer

tl;dr: Increase your function's memory to at least 1024MB, see update 2 below.

I was curious, so I did some measurements. I created a script that creates a big boi item of pretty much exactly 400KB in a fresh table.

Then I test two reads from Python - one with the resource API and the other with the lower-level client - eventually consistent reads in both cases.

This is what I measured:

      Reading Big Boi from a Table Resource took 0.366508s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.301585s and consumed 50.0 RCUs
      

If we extrapolate from the RCUs, the item that was read was about 50 * 2 * 4KB = 400KB in size (an eventually consistent read consumes 0.5 RCUs per 4KB).
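
As a quick sanity check of that arithmetic:

    # Eventually consistent reads cost 0.5 RCU per 4KB chunk,
    # so each consumed RCU corresponds to 8KB read.
    consumed_rcus = 50.0
    item_size_kb = consumed_rcus * 2 * 4
    print(item_size_kb)  # 400.0 -> matches the ~400KB item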

I ran it a few times locally from Germany against eu-central-1 (Frankfurt, Germany) and the highest latency I saw was about 900ms. (This is without DAX.)

That's why I think you should show us how you measure.

    from datetime import datetime

    import boto3

    TABLE_NAME = "big-boi-test"
    BIG_BOI_PK = "f0ba8d6c"

    TABLE_RESOURCE = boto3.resource("dynamodb").Table(TABLE_NAME)
    DDB_CLIENT = boto3.client("dynamodb")

    def create_table():
        DDB_CLIENT.create_table(
            AttributeDefinitions=[{"AttributeName": "PK", "AttributeType": "S"}],
            TableName=TABLE_NAME,
            KeySchema=[{"AttributeName": "PK", "KeyType": "HASH"}],
            BillingMode="PAY_PER_REQUEST"
        )

    def create_big_boi_item() -> dict:
        # based on calculations here: https://zaccharles.github.io/dynamodb-calculator/
        template = {
            "PK": {
                "S": BIG_BOI_PK
            },
            "bigBoi": {
                "S": ""
            }
        }  # This is 16 bytes

        # Pad the item to exactly 400KB
        big_boi = "X" * (1024 * 400 - 16)
        template["bigBoi"]["S"] = big_boi
        return template

    def store_big_boi():
        big_boi = create_big_boi_item()

        DDB_CLIENT.put_item(
            Item=big_boi,
            TableName=TABLE_NAME
        )

    def get_big_boi_with_table_resource():
        start = datetime.now()
        response = TABLE_RESOURCE.get_item(
            Key={"PK": BIG_BOI_PK},
            ReturnConsumedCapacity="TOTAL"
        )
        end = datetime.now()
        seconds = (end - start).total_seconds()
        capacity_units = response["ConsumedCapacity"]["CapacityUnits"]

        print(f"Reading Big Boi from a Table Resource took {seconds}s and consumed {capacity_units} RCUs")

    def get_big_boi_with_client():
        start = datetime.now()
        response = DDB_CLIENT.get_item(
            Key={"PK": {"S": BIG_BOI_PK}},
            ReturnConsumedCapacity="TOTAL",
            TableName=TABLE_NAME
        )
        end = datetime.now()
        seconds = (end - start).total_seconds()
        capacity_units = response["ConsumedCapacity"]["CapacityUnits"]

        print(f"Reading Big Boi from a Client took {seconds}s and consumed {capacity_units} RCUs")

    if __name__ == "__main__":
        # create_table()
        # store_big_boi()
        get_big_boi_with_table_resource()
        get_big_boi_with_client()

Update

I did the same measurements again with an item that looks more like the one you're using, and I'm still below 1000ms on average no matter which way I request it:

      Reading Big Boi from a Table Resource took 1.492829s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.871583s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.857513s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.769432s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.690172s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.670099s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.633489s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.605999s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.598635s and consumed 50.0 RCUs
      Reading Big Boi from a Table Resource took 0.606553s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 1.66636s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.921605s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.831735s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.707082s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.668602s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.648401s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.5695s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.592073s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.611436s and consumed 50.0 RCUs
      Reading Big Boi from a Client took 0.553827s and consumed 50.0 RCUs
      Average latency over 10 requests with the table resource: 0.7796304s
      Average latency over 10 requests with the client: 0.7770621s
      

This is what the item looks like (it is built by create_big_boi_item in the script below).

Here is the full test script for you to verify:

    import statistics
    from datetime import datetime

    import boto3

    TABLE_NAME = "big-boi-test"
    BIG_BOI_PK = "NestedBoi"

    TABLE_RESOURCE = boto3.resource("dynamodb").Table(TABLE_NAME)
    DDB_CLIENT = boto3.client("dynamodb")

    def create_table():
        DDB_CLIENT.create_table(
            AttributeDefinitions=[{"AttributeName": "PK", "AttributeType": "S"}],
            TableName=TABLE_NAME,
            KeySchema=[{"AttributeName": "PK", "KeyType": "HASH"}],
            BillingMode="PAY_PER_REQUEST"
        )

    def create_big_boi_item() -> dict:
        # based on calculations here: https://zaccharles.github.io/dynamodb-calculator/
        template = {
            "PK": {
                "S": BIG_BOI_PK
            },
            "bigBoiContainer": {
                "M": {
                    "bigBoiList": {
                        "L": []
                    }
                }
            }
        }  # 43 bytes

        item = {
            "M": {
                "t": {
                    "S": "1614712316"
                },
                "a": {
                    "S": "product_view"
                },
                "i": {
                    "S": "1275"
                }
            }
        }  # 36 bytes

        # Fill the list until the item is just under the 400KB limit
        number_of_items = int((1024 * 400 - 43) / 36)

        for _ in range(number_of_items):
            template["bigBoiContainer"]["M"]["bigBoiList"]["L"].append(item)

        return template

    def store_big_boi():
        big_boi = create_big_boi_item()

        DDB_CLIENT.put_item(
            Item=big_boi,
            TableName=TABLE_NAME
        )

    def get_big_boi_with_table_resource():
        start = datetime.now()
        response = TABLE_RESOURCE.get_item(
            Key={"PK": BIG_BOI_PK},
            ReturnConsumedCapacity="TOTAL"
        )
        end = datetime.now()
        seconds = (end - start).total_seconds()
        capacity_units = response["ConsumedCapacity"]["CapacityUnits"]

        print(f"Reading Big Boi from a Table Resource took {seconds}s and consumed {capacity_units} RCUs")

        return seconds

    def get_big_boi_with_client():
        start = datetime.now()
        response = DDB_CLIENT.get_item(
            Key={"PK": {"S": BIG_BOI_PK}},
            ReturnConsumedCapacity="TOTAL",
            TableName=TABLE_NAME
        )
        end = datetime.now()
        seconds = (end - start).total_seconds()
        capacity_units = response["ConsumedCapacity"]["CapacityUnits"]

        print(f"Reading Big Boi from a Client took {seconds}s and consumed {capacity_units} RCUs")

        return seconds

    if __name__ == "__main__":
        # create_table()
        # store_big_boi()

        n_experiments = 10
        experiments_with_table_resource = [get_big_boi_with_table_resource() for i in range(n_experiments)]
        experiments_with_client = [get_big_boi_with_client() for i in range(n_experiments)]
        print(f"Average latency over {n_experiments} requests with the table resource: {statistics.mean(experiments_with_table_resource)}s")
        print(f"Average latency over {n_experiments} requests with the client: {statistics.mean(experiments_with_client)}s")
      

If I increase n_experiments, it tends to get even faster, probably because DDB caches internally.

Still: can't reproduce.

Update 2

After learning that you're running Lambda functions, I ran the tests again inside of Lambda with different memory configurations:

    Memory   n_experiments   Avg. time with resource   Avg. time with client
    128MB    10              6.28s                     5.06s
    256MB    10              3.26s                     2.61s
    512MB    10              1.62s                     1.33s
    1024MB   10              0.84s                     0.68s
    2048MB   10              0.52s                     0.43s
    4096MB   10              0.51s                     0.41s

As mentioned in the comments, CPU and network performance scale with the amount of memory you assign to a function. You can solve your problem by throwing money at it :-)
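
If you don't manage the function through an IaC tool, raising the memory is a single API call; here is a minimal sketch (the function name is a placeholder):

    import boto3

    # Raise the Lambda function's memory so it gets a larger share of
    # CPU and network. "my-events-function" is a made-up name.
    lambda_client = boto3.client("lambda")
    lambda_client.update_function_configuration(
        FunctionName="my-events-function",
        MemorySize=1024,  # MB; see the table above for measured latencies
    )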
