Hive splits ORC files into small parts


Problem description


create table n_data (MARKET string, CATEGORY string, D map<string,string>, monthid int, value DOUBLE)
STORED AS ORC;

I loaded data into it (over 45,000,000 rows) and looked at the Hive warehouse directory.

The resulting table consists of 5 files of 10 MB to 20 MB each, but dfs.block.size is set to 128 MB. Storing such small files is not optimal, because each one uses a whole block!

How can I make Hive write files of about 128 MB?

EDIT: the insert query:

insert into n_data
select tmp.market, tmp.category, d, adTable.monthid, tmp.factperiod[adTable.monthid] as fact
from (select market, category, d, factperiod, map_keys(factperiod) as month_arr
      from n_src
      where market is not null) as tmp
LATERAL VIEW explode(month_arr) adTable AS monthid;

Solution

You have to set the following Hive configuration parameters:

hive.merge.mapfiles = true
hive.merge.mapredfiles = true
hive.merge.tezfiles = true
hive.merge.smallfiles.avgsize = 16000000

I had the exact same problem until I found this source. You can try setting these parameters manually in a Hive session with the "set" command, like this:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
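
If the goal is files close to the 128 MB block size, two size-related settings may be worth adjusting alongside the flags above. As I read the Hive configuration documentation, hive.merge.smallfiles.avgsize is the average output file size below which Hive starts the extra merge job, and hive.merge.size.per.task is the target size of the merged files. The byte values below are illustrative assumptions chosen to match a 128 MB block, not values from the original answer:

```sql
-- Assumption: 134217728 bytes = 128 * 1024 * 1024, chosen to match dfs.block.size.
-- Start the merge job whenever the average output file is smaller than 128 MB.
set hive.merge.smallfiles.avgsize=134217728;
-- Aim for merged files of roughly 128 MB each.
set hive.merge.size.per.task=134217728;
```

With the 16000000 threshold shown earlier, a job whose output files average more than about 16 MB would not trigger the merge step at all, which matters here because the files in question are 10 MB to 20 MB.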

If you just type "set;" in a Hive session console, you can check whether the parameters above were set correctly. After testing, I recommend changing them in your hive-site.xml config file or via Ambari (if you're using the Hortonworks distribution). Cheers!
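
The merge settings only apply to jobs that run after they are set, so the five small files already in n_data stay as they are. One possible way to compact them is Hive's ALTER TABLE ... CONCATENATE statement, which the Hive DDL documentation describes for ORC tables from release 0.14.0 onward. A minimal sketch, assuming that version or newer and the non-partitioned table from the question:

```sql
-- Print the current value of one parameter instead of scanning the full "set;" listing.
set hive.merge.mapfiles;

-- Merge the existing small ORC files of the table (done at the stripe level for ORC).
alter table n_data concatenate;
```

Whether the result ends up as a single file depends on the Hive version and the data, so it is worth listing the table's warehouse directory again afterwards.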
