Performance tuning for Amazon EMR / Hive processing a large number of files in S3

Problem description

I am trying to use Amazon EMR with Hive to process a rather large number of log files generated by ad tracking servers. The performance is far worse than I would expect, and I am hoping someone can give me pointers for improvement.

The tracking servers upload log files every few minutes to S3 folders partitioned by day (e.g., "2014-05-20"). Roughly 3,000 files are uploaded per day in total, at roughly 20K per file.

Using Hive, I have successfully created external tables referencing the data in S3 and set up partitions for 30 days' worth of log files. I have verified that the partitioning is working correctly, and simple queries (e.g., "SELECT * FROM click WHERE dt='2014-05-19' LIMIT 10") work correctly and respond quickly.
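
For reference, the external table and its partitions look roughly like the sketch below; the bucket path, row format, and column list are illustrative placeholders rather than the exact DDL:

CREATE EXTERNAL TABLE click (
    clickId string
    -- ...
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-tracking-bucket/logs/';

-- one partition is added per day folder, e.g.:
ALTER TABLE click ADD PARTITION (dt='2014-05-19')
    LOCATION 's3n://my-tracking-bucket/logs/2014-05-19/';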

I am loading the data into temporary HDFS tables for subsequent queries. To do so, I run an HQL job that is essentially this (note that click is the external table in S3):

CREATE TABLE tmp_click (
    clickId string,
    -- ...
    dt string
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE tmp_click
    SELECT
        clickId,
        -- ...
        k.dt
    FROM
        click k
    WHERE
        k.dt >= '${START_DAY}' AND
        k.dt <= '${END_DAY}'
;
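
The ${START_DAY} and ${END_DAY} values are filled in by Hive variable substitution; on EMR I pass them as -d definitions, so a typical invocation (with a hypothetical script name) looks roughly like:

hive -d START_DAY=2014-05-01 -d END_DAY=2014-05-30 -f load_tmp_click.hql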

This operation takes upwards of an hour with 25 xlarge instances working as core/task nodes. Given that there is basically no processing going on here -- it's just copying the data over, right? -- I feel like there must be something I'm missing. Can anyone give me any tips to investigate?

I've considered that perhaps the large number of files (~3,000 per day), or the gz compression of the log files, might be the problem, but I have no ability to control the input.

Solution

Your query surely has to both list the files in S3 over the S3N protocol and handle the compression. Try using s3distcp to copy the files from S3 to HDFS faster, and then create a table over the copied files.
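
As a rough sketch (the jar path, bucket names, and regex are illustrative placeholders), an s3distcp step that also concatenates the many small .gz files while copying might look like this:

# copy one day of logs from S3 to HDFS, merging the small files on the way
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --src s3n://my-tracking-bucket/logs/2014-05-19/ \
    --dest hdfs:///data/click/dt=2014-05-19/ \
    --groupBy '.*(2014-05-19).*' \
    --outputCodec none

Once the data is on HDFS, pointing the table at the copied files avoids having each Hive job list and open thousands of small S3 objects.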
