Multiple parts created while inserting in Hive table


Problem Description


I have a Hive table (with compression) with a definition like:

create table temp1 (col1 string, col2 int)
partitioned by (col3 string, col4 string) 
row format delimited 
fields terminated by ',' 
escaped by '\\' 
lines terminated by '\n'
stored as sequencefile;

When I do a simple select-and-insert (no reducers running) from another Hive table into this table, I see a peculiar pattern: the data in this compressed table gets split into a very high number of very small files (TABLE 1). At times 1 GB of data gets split over 200-300 files, increasing the number of blocks consumed even though it should have spanned only 16 blocks (1 GB / 64 MB = 16), and because of this a very high number of maps are spawned when I query the new table. The file size never goes beyond 245 MB (TABLE 2). Is there a setting to restrict the output files to 64 MB (or a multiple of 64 MB, or just a single file)? My block size is 64 MB, so excess blocks would then not get created.
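For reference, the kind of map-only insert described above might look like the following sketch; the source table name and the partition values are hypothetical, not from the question:

-- Hypothetical map-only insert: a plain SELECT with no joins, group-bys or
-- ORDER BY launches mappers only, so each mapper writes its own output file.
insert overwrite table temp1 partition (col3='2012', col4='01')
select col1, col2 from source_table;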

TABLE 1

Name     | Type | Size    | Block Size
000000_0 | file | 30.22MB | 64 MB
000001_0 | file | 26.19MB | 64 MB
000002_0 | file | 25.19MB | 64 MB
000003_0 | file | 24.74MB | 64 MB
000004_0 | file | 24.54MB | 64 MB
..........
000031_0 | file | 0.9MB   | 64 MB

TABLE 2

Name     | Type | Size     | Block Size
000000_0 | file | 245.02MB | 64 MB
000001_0 | file | 245.01MB | 64 MB
000002_0 | file | 244.53MB | 64 MB
000003_0 | file | 244.4MB  | 64 MB
000004_0 | file | 198.21MB | 64 MB

Solution

I have solved the issue, thanks to Bryan, who highlighted the Hive variables that control the query output format. I tested setting the following Hive variables in a session:

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
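For context, a minimal sketch of a full session applying these settings; the source table name and partition values are hypothetical, and hive.merge.mapfiles is added here as an assumption because the insert in the question is map-only (it defaults to true in most Hive versions, whereas hive.merge.mapredfiles defaults to false):

-- Hypothetical session sketch, not verbatim from the answer.
set hive.merge.mapfiles=true;                -- merge small output files of map-only jobs
set hive.merge.mapredfiles=true;             -- merge small output files of map-reduce jobs
set hive.merge.size.per.task=256000000;      -- target size (~256 MB) of the merged files
set hive.merge.smallfiles.avgsize=256000000; -- merge when the average output file is below this

-- With merging enabled, Hive adds an extra merge stage after the insert,
-- so each partition ends up with ~256 MB files instead of dozens of small ones.
insert overwrite table temp1 partition (col3='2012', col4='01')
select col1, col2 from source_table;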

So now inside a partition I am getting compressed files of size ~256 MB. To set these variables permanently, create a .hiverc file with the same statements in the home directory of that user.
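A minimal sketch of that .hiverc, assuming just the three statements from the answer (the Hive CLI runs this file at startup):

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;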

Hope this helps
