How to improve the performance of loading data from a non-partitioned table into an ORC partitioned table in Hive

Problem description

I'm new to Hive querying and I'm looking for best practices for retrieving data from Hive tables. We have enabled Tez as the execution engine and enabled vectorization.

We want to build reports from Hive tables, and I read in the Tez documentation that it can be used for real-time reporting. The scenario is that my web application should show the result of a Hive query (SELECT * from a Hive table) in the UI, but any query run from the Hive command prompt takes a minimum of 20-60 seconds, even though the Hive table has 60 GB of data.

1) Can anyone tell me how to do real-time reporting by querying a Hive table and showing the results in the UI within 10-30 seconds?

2) Another problem we have identified: initially we have an unpartitioned table pointing to a blob/file in HDFS. It is 60 GB in size with 200 columns. When we dump the data from the unpartitioned table into an ORC table (the ORC table is partitioned), it takes 3+ hours. Is there a way to improve the performance of dumping data into the ORC table?

3) When we query a non-partitioned table with bucketing, inserting into the Hive table and querying it take less time than a SELECT query on the ORC table. But as the number of records in the Hive table increases, the ORC table's SELECT query performs better than the bucketed table. Is there a way to improve performance for small data sets as well? Since this is the initial phase, we load 50 GB of data into the Hive table every month, but that can increase, and we are looking to improve the performance of loading data into the ORC partitioned table.

4) Tez supports interactive, low-latency queries and drill-down support for reports. How can I enable my drill-down reports to get data from Hive interactively, within human response time, i.e. 5-40 seconds?

We are testing with 4 nodes. Each node is a VM with 4 CPU cores, 7 GB of RAM, and 3 attached disks.

Thanks, Mahender

Solution

To improve the speed of inserting data into an ORC table, you can try tuning the following parameters:

hive.exec.orc.memory.pool 
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size 
hive.exec.orc.default.buffer.size
dfs.blocksize
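
As a rough sketch, these parameters can be set per session before running the load. The values below are illustrative assumptions, not recommendations. Defaults differ between Hive versions, and with only 7 GB of RAM per node the memory-related settings in particular need testing on your own cluster.

```sql
-- Illustrative values only (assumptions); change one setting at a time and measure the load.

-- Fraction of the heap that ORC writers may use.
SET hive.exec.orc.memory.pool=0.5;

-- ORC stripe size in bytes (64 MB here).
SET hive.exec.orc.default.stripe.size=67108864;

-- Block size for ORC files in bytes (256 MB here).
SET hive.exec.orc.default.block.size=268435456;

-- ORC buffer size in bytes (256 KB here).
SET hive.exec.orc.default.buffer.size=262144;

-- HDFS block size in bytes for files written by this session (256 MB here).
SET dfs.blocksize=268435456;
```

Setting them in the session rather than in hive-site.xml keeps the experiment local to one load job, so different combinations can be compared without affecting other queries.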

Also, check whether compression helps. For example:

SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
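
To put these settings in context, here is a minimal sketch of what the load itself might look like. Everything in it is an assumption that does not come from the question: the table names events_raw and events_orc, the columns, the partition column load_month, and the use of dynamic partitioning. As far as I know, the compression of the ORC files themselves is governed by the table's orc.compress property, so the sketch sets Snappy there as well.

```sql
-- Hypothetical table and column names throughout; adapt to your own schema.

-- Let Hive derive the target partitions from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partitioned ORC target table with Snappy-compressed ORC files.
CREATE TABLE IF NOT EXISTS events_orc (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (load_month STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Load from the unpartitioned source; the partition column must come last in the SELECT.
INSERT OVERWRITE TABLE events_orc PARTITION (load_month)
SELECT id, payload, load_month
FROM events_raw;
```

With the monthly loads described in the question, each run would write only the partitions present in that month's data.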

Hope it helps!
