Is it possible to execute Hive queries in parallel by writing a separate MapReduce program?


Problem description

I have asked several questions about improving the performance of Hive queries. Some of the answers concerned the number of mappers and reducers. I tried using multiple mappers and reducers, but I didn't see any difference in execution time. I don't know why; maybe I didn't do it the right way, or I missed something else.

I would like to know whether it is possible to execute Hive queries in parallel. What I mean is that the queries normally get executed in a queue, one after another. For instance:

query1

query2

query3

. . .

query n

Running them this way takes too long, and I want to reduce the total execution time.

I need to know whether, if we use a MapReduce program inside a Hive JDBC program, it is possible to execute the queries in parallel. I don't know whether that would work, but that is what I am aiming for.

I am restating my questions below:

1) If it is possible to run multiple Hive queries in parallel, does that require multiple Hive Thrift Servers?

2) Is it possible to start multiple Hive Thrift Servers?

3) I assume it is not possible to start multiple Hive Thrift Servers on the same port?

4) Can we start multiple Hive Thrift Servers on different ports?

Please suggest a solution for this. If you have any other alternative, I will try that as well.

Solution

As you might already know, Hive is a SQL-like front end to Hadoop and MapReduce. Any non-trivial Hive query gets compiled to MapReduce jobs and run on Hadoop. MapReduce is a parallel processing framework, so each of your Hive queries already runs and processes data in parallel. By default, the jobs Hive submits are scheduled on Hadoop with a FIFO scheduler, so only one Hive query executes at a given time and the next one starts when the first is done. In most circumstances, I would suggest optimizing individual Hive queries instead of parallelizing multiple Hive queries. If you are inclined towards parallelizing Hive queries, it might be an indication that your cluster is being used inefficiently. To further analyze the performance and resource usage of your Hive queries, you can install a distributed monitoring system like Ganglia to monitor the usage of your cluster (Amazon EMR supports it too).
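To make the distinction between submitting queries concurrently and running them concurrently more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `run_query` is a hypothetical stand-in for a blocking Hive client call and does not talk to Hive at all. The client threads hand in all queries at once, but whether the resulting jobs overlap on the cluster still depends on the Hadoop scheduler; with FIFO scheduling they would queue up regardless.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_query(sql: str) -> str:
    """Hypothetical stand-in for a blocking Hive client call; it only sleeps."""
    time.sleep(0.1)  # pretend the cluster is working on the query
    return f"done: {sql}"


def run_all(queries, max_workers=3):
    """Submit each query from its own worker thread; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, queries))


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_all(["query1", "query2", "query3"])
    elapsed = time.perf_counter() - start
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

In a real setup each worker would need its own connection to the Hive server, and any speed-up would come from the cluster having spare capacity, not from the client threads themselves.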

Long story short, you don't have to write a MapReduce program; avoiding that is why you are using Hive in the first place. However, if you know something about the data that Hive does not, your Hive queries may perform sub-optimally. For example, your data might be sorted by some column and Hive might not know about it. In such cases, if you can't set that additional meta-information in Hive, it might make sense to write a MapReduce job that takes the additional information into account and potentially gives you better performance. In most cases, I have found Hive performance to be on par with that of MapReduce jobs corresponding to the Hive query.
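As a toy illustration of why knowing the sort order can matter, the Python sketch below compares a range lookup that checks every row with one that relies on the rows being sorted. It is plain in-memory Python, not Hive or MapReduce code, and both function names are made up for this example.

```python
from bisect import bisect_left, bisect_right


def range_scan(rows, lo, hi):
    """Full scan: inspects every row, as needed when nothing is known about order."""
    return [r for r in rows if lo <= r <= hi]


def range_sorted(rows, lo, hi):
    """Assumes rows are sorted ascending and jumps straight to the matching slice."""
    return rows[bisect_left(rows, lo):bisect_right(rows, hi)]


if __name__ == "__main__":
    rows = list(range(0, 100, 5))  # already sorted
    print(range_scan(rows, 10, 30))
    print(range_sorted(rows, 10, 30))
```

Both functions return the same rows on sorted input, but the second only does two binary searches instead of touching every row, which is the kind of saving a hand-written job could exploit when the engine is unaware of the ordering.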
