Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

Problem description

I need to perform an initial upload of roughly 130 million items (5+ GB in total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.

Long story short, importing that very average (for EMR) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (it took about 20 minutes to process a 2 MB test chunk, and a 700 MB test file did not finish within 12 hours).

I have already contacted Amazon Premium Support, but so far they have only said that "for some reason DynamoDB import is slow".

I have tried the following statements in my interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

Various flags don't seem to have any visible effect. I have tried the following settings instead of the defaults:

SET dynamodb.throughput.write.percent=1.0;
SET dynamodb.throughput.read.percent=1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks=100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max=100;
SET hive.exec.reducers.min=50;

The same statements, run against an HDFS target instead of DynamoDB, complete in seconds.

This seems like a simple task and a very basic use case, and I really wonder what I could be doing wrong here.

Recommended answer

Here is the answer I recently got from AWS support. I hope it helps someone in a similar situation:

EMR workers are currently implemented as single-threaded workers, and each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes one write capacity unit (IOP).

This means you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be cheaper performance-wise (but the same price, if I understand it correctly). This is something we are aware of and will probably implement in EMR in the future, but we cannot offer a timeline.
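For comparison, here is a minimal sketch of what batched writes look like from application code, using boto3 (this is not part of the original answer; the item fields reuse column names from the Hive table above, and the loop range and values are placeholders). boto3's batch_writer buffers puts into BatchWriteItem requests of up to 25 items and resends unprocessed items automatically:

import boto3

# Region and table name are taken from the Hive script above;
# the item contents below are placeholder values for illustration.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("my_ddb_table")

# batch_writer groups puts into BatchWriteItem calls of up to 25 items
# and automatically retries any unprocessed items.
with table.batch_writer() as batch:
    for i in range(1000):
        batch.put_item(Item={
            "hash_key": "item-%d" % i,
            "range_key": i,
            "field_6": "example",
        })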

As stated before, the main problem here is that your table in DynamoDB is hitting its provisioned throughput, so try to increase it temporarily for the import and then feel free to lower it back to whatever level you need.
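A minimal sketch of raising the table's provisioned write capacity around the import and lowering it afterwards, again with boto3 (the capacity numbers are assumptions for illustration, not figures from AWS support):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

# Raise write capacity before the bulk import (placeholder values).
client.update_table(
    TableName="my_ddb_table",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 10000},
)

# ... wait for the table to return to ACTIVE, then run the Hive INSERT job ...

# Scale back down once the import has finished.
client.update_table(
    TableName="my_ddb_table",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)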

This may sound a bit convenient, but there was a problem with the alerts while you were doing this, which is why you never received one. The problem has since been fixed.
