Hive sort operation on a high-volume skewed dataset

Question

I am working on a large dataset of around 3 TB on Hortonworks 2.6.5. The layout of the dataset is straightforward.

The data hierarchy is as follows:

-Country
   -Warehouse
      -Product
          -Product Type
              -Product Serial Id

We have transaction data in the above hierarchy for 30 countries. Each country has more than 200 warehouses, and a single country, the USA, contributes around 75% of the entire dataset.

Questions:

1) For each warehouse, the transaction data has a transaction date column (trans_dt). I need to sort trans_dt in ascending order within each warehouse using Hive (version 1.1.2) on MapReduce. I created a partition at the Country level and then applied DISTRIBUTE BY Warehouse SORT BY trans_dt ASC. The sort takes around 8 hours to finish, and the last 6 hours are spent with the reducers at 99%. I see a lot of shuffling at this stage.

2) We run many GROUP BY queries on this combination: Country, Warehouse, Product, Product Type, Product Serial Id. Any suggestions for optimizing this operation would be very helpful.

3) How should the skewed data for the USA be handled?
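For reference, the sort described in item 1 can be sketched roughly as below. Only trans_dt and the DISTRIBUTE BY / SORT BY clause come from the description; the table names (transactions_raw, transactions_sorted), the remaining column names, the column types, and the ORC storage format are assumptions made for illustration.

```sql
-- Hypothetical target table, partitioned at the country level.
CREATE TABLE IF NOT EXISTS transactions_sorted (
  warehouse         STRING,
  product           STRING,
  product_type      STRING,
  product_serial_id STRING,
  trans_dt          DATE
)
PARTITIONED BY (country STRING)
STORED AS ORC;

-- Dynamic partitioning, as in the property list used for this job.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- All rows of one warehouse go to the same reducer (DISTRIBUTE BY),
-- and each reducer sorts the rows it receives by trans_dt (SORT BY).
INSERT OVERWRITE TABLE transactions_sorted PARTITION (country)
SELECT warehouse, product, product_type, product_serial_id, trans_dt, country
FROM transactions_raw
DISTRIBUTE BY warehouse
SORT BY trans_dt ASC;
```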

We are using the Hive properties below.

SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
SET hive.fetch.task.conversion=more;
SET hive.fetch.task.aggr=true;
SET hive.fetch.task.conversion.threshold=1024000000;

Recommended answer

For US and non-US data, use the same query but process the two sets independently.

-- UNION ALL keeps every row; Hive releases before 1.2.0 support only UNION ALL.
SELECT * FROM Table WHERE Country = 'US'
UNION ALL
SELECT * FROM Table WHERE Country <> 'US'
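Processing the two subsets independently can also mean running two separate statements instead of one combined query. The following is a minimal sketch under assumptions: transactions_raw and transactions_sorted are hypothetical tables (the latter assumed to already exist, partitioned by country), and the column names other than trans_dt are illustrative.

```sql
-- Run 1: the skewed country on its own, written to a static partition.
-- The partition column is omitted from the SELECT list for a static partition.
INSERT OVERWRITE TABLE transactions_sorted PARTITION (country = 'US')
SELECT warehouse, product, product_type, product_serial_id, trans_dt
FROM transactions_raw
WHERE country = 'US'
DISTRIBUTE BY warehouse
SORT BY warehouse, trans_dt ASC;

-- Run 2: every other country, written through dynamic partitioning.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE transactions_sorted PARTITION (country)
SELECT warehouse, product, product_type, product_serial_id, trans_dt, country
FROM transactions_raw
WHERE country <> 'US'
DISTRIBUTE BY country, warehouse
SORT BY country, warehouse, trans_dt ASC;
```

SORT BY orders rows within each reducer, and one reducer can receive several warehouses, so the sketch puts warehouse ahead of trans_dt in the sort key to keep each warehouse's rows contiguous and date-ordered.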

You can also process them with a script that runs the query for one country at a time, which reduces the volume of data processed in any single run.

INSERT INTO TABLE <AggregateTable>
SELECT * FROM <SourceTable>
  WHERE Country in ('${hiveconf:ProcessCountry}')
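A per-country run along these lines could be kept in its own script file and driven from the command line. This is a sketch under assumptions: the file name process_country.hql, the tables transactions_raw and transactions_sorted, and the column names other than trans_dt are hypothetical. ProcessCountry is the hiveconf variable used in the statement above.

```sql
-- File: process_country.hql (hypothetical name).
-- A driver script would call it once per country, for example:
--   hive --hiveconf ProcessCountry=US -f process_country.hql
--   hive --hiveconf ProcessCountry=CA -f process_country.hql

-- ${hiveconf:ProcessCountry} is substituted textually before the query is
-- parsed, so it can be used in both the partition spec and the filter.
INSERT INTO TABLE transactions_sorted
  PARTITION (country = '${hiveconf:ProcessCountry}')
SELECT warehouse, product, product_type, product_serial_id, trans_dt
FROM transactions_raw
WHERE country = '${hiveconf:ProcessCountry}'
DISTRIBUTE BY warehouse
SORT BY warehouse, trans_dt ASC;
```

Each invocation shuffles and sorts only one country's rows, so the US run can be scheduled and tuned separately from the smaller countries.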
