Hive splits ORC files into small parts
create table n_data(MARKET string,CATEGORY string,D map<string,string>,monthid int,value DOUBLE)
STORED AS ORC
;
I loaded data into it (over 45,000,000 rows) and looked at the Hive warehouse directory.
The resulting table consists of 5 files of 10 MB to 20 MB each, but dfs.block.size is set to 128 MB. Storing such small files is not optimal, because each one uses a whole block!
How can I make Hive write files of about 128 MB instead?
EDIT: the insert query:
insert into n_data
select tmp.market,tmp.category,d,adTable.monthid,tmp.factperiod[adTable.monthid] as fact
from (select market,category,d,factperiod,map_keys(factperiod) as month_arr from n_src where market is not null) as tmp
LATERAL VIEW explode(month_arr) adTable AS monthid
You have to set the following config parameters for Hive:
hive.merge.mapfiles = true
hive.merge.mapredfiles = true
hive.merge.tezfiles = true
hive.merge.smallfiles.avgsize = 16000000
I had the exact same problem until I found this source. You can try setting these parameters manually in a Hive session with the "set" command, like this:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
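If the merge still does not kick in, the size thresholds may be worth a look. As far as I understand it, hive.merge.smallfiles.avgsize is the average output file size below which Hive starts the extra merge job, and hive.merge.size.per.task is the target size of the merged files. With a threshold of 16000000, output files that average 10 MB to 20 MB may sit right at the limit and not be merged. The following is only a sketch, assuming you want files close to the 128 MB block size from the question; the value 134217728 (128 * 1024 * 1024) is my assumption, not something taken from the source above:

```sql
-- Sketch only: threshold values are assumptions chosen to match a 128 MB dfs.block.size.
set hive.merge.mapfiles=true;                  -- merge small files after map-only jobs
set hive.merge.mapredfiles=true;               -- merge small files after map-reduce jobs
set hive.merge.tezfiles=true;                  -- merge small files after Tez jobs
set hive.merge.smallfiles.avgsize=134217728;   -- merge when the average output file is below ~128 MB
set hive.merge.size.per.task=134217728;        -- aim for merged files of ~128 MB
```

These settings only affect jobs run after they are set, so the insert query has to be run again for the table to be rewritten with larger files.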
If you just type "set;" in a Hive session console, you can check whether the above-mentioned parameters were set correctly. After testing, I recommend changing them in your hive-site.xml config file or via Ambari (if you're using the Hortonworks distribution). Cheers!
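Since a bare "set;" prints every property in the session, it can be easier to ask for the merge parameters one at a time. A minimal sketch, assuming the same Hive session as above; "set" followed by a property name and no value should print that property's current value:

```sql
-- Print the current value of each merge-related property (no value given, so nothing is changed).
set hive.merge.mapfiles;
set hive.merge.mapredfiles;
set hive.merge.tezfiles;
set hive.merge.smallfiles.avgsize;
```

If any of these comes back with a value other than the one you set, the parameter was probably overridden elsewhere or mistyped.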