Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

Problem description

I need to perform an initial upload of roughly 130 million items (5+ GB in total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.

Long story short, importing that very average (for EMR) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (it took about 20 minutes to process a 2 MB test chunk, and a 700 MB test file did not finish within 12 hours).

I have already contacted Amazon Premium Support, but so far they have only said that "for some reason DynamoDB import is slow".

I have tried the following statements in my interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

Various flags don't seem to have any visible effect. I have tried the following settings instead of the defaults:

SET dynamodb.throughput.write.percent=1.0;
SET dynamodb.throughput.read.percent=1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks=100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max=100;
SET hive.exec.reducers.min=50;

The same statements, run against an HDFS target instead of DynamoDB, complete in seconds.

This seems like a simple task and a very basic use case, and I really wonder what I could be doing wrong here.

Recommended answer

Here is the answer I recently got from AWS support. I hope it helps someone in a similar situation:

EMR workers are currently implemented as single-threaded workers, and each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes one write capacity unit (IOP).

This means you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be cheaper performance-wise (but the same price, if I understand it correctly). This is something we are aware of and will probably implement in EMR in the future, but we cannot offer a timeline.
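For comparison, here is a minimal sketch of what batched writes look like from application code, using boto3 (this is not part of the original answer; the item fields reuse column names from the Hive table above, and the loop range and values are placeholders). boto3's batch_writer buffers puts into BatchWriteItem requests of up to 25 items and resends unprocessed items automatically:

import boto3

# Region and table name are taken from the Hive script above;
# the item contents below are placeholder values for illustration.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("my_ddb_table")

# batch_writer groups puts into BatchWriteItem calls of up to 25 items
# and automatically retries any unprocessed items.
with table.batch_writer() as batch:
    for i in range(1000):
        batch.put_item(Item={
            "hash_key": "item-%d" % i,
            "range_key": i,
            "field_6": "example",
        })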

As stated before, the main problem here is that your table in DynamoDB is hitting its provisioned throughput, so try to increase it temporarily for the import and then feel free to lower it back to whatever level you need.
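A minimal sketch of raising the table's provisioned write capacity around the import and lowering it afterwards, again with boto3 (the capacity numbers are assumptions for illustration, not figures from AWS support):

import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

# Raise write capacity before the bulk import (placeholder values).
client.update_table(
    TableName="my_ddb_table",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 10000},
)

# ... wait for the table to return to ACTIVE, then run the Hive INSERT job ...

# Scale back down once the import has finished.
client.update_table(
    TableName="my_ddb_table",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)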

This may sound a bit convenient, but there was a problem with the alerts while you were doing this, which is why you never received one. The problem has since been fixed.
