How to increase hive concurrent mappers to more than 4?


Summary

When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are being used for mapping. I would like to use the whole cluster.

Details

I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).

I have a table of several billion rows. When I perform a simple

select count(*) from mytable;

hive creates hundreds of map tasks, but only 4 are running simultaneously.

This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing into the nodes in use and they are not utilizing CPU or memory fully. Our cluster is backed by Infiniband networking and Isilon file storage, neither of which seems very loaded at all.

We are using mapreduce as the engine. I have tried removing any limits to resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).

The memory settings are as follows:

yarn.nodemanager.resource.memory-mb     188928  MB
yarn.scheduler.minimum-allocation-mb    20992   MB
yarn.scheduler.maximum-allocation-mb    188928  MB
yarn.app.mapreduce.am.resource.mb       20992   MB
mapreduce.map.memory.mb                 20992   MB
mapreduce.reduce.memory.mb              20992   MB

and we are running on 41 nodes. By my calculation I should be able to get 41 * 188928 / 20992 = 369 concurrent map/reduce tasks. Instead I get 4.
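For reference, that arithmetic as a quick shell check (a rough sketch only: it assumes homogeneous nodes, every container requesting the full 20992 MB, and it ignores vcores and any scheduler or queue limits):

# Rough upper bound on concurrent containers from the memory settings above.
echo $(( 188928 / 20992 ))        # containers per node     -> 9
echo $(( 41 * 188928 / 20992 ))   # containers cluster-wide -> 369 (one of which is the AM)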

Vcore settings:

yarn.nodemanager.resource.cpu-vcores       24
yarn.scheduler.minimum-allocation-vcores   1
yarn.scheduler.maximum-allocation-vcores   24
yarn.app.mapreduce.am.resource.cpu-vcores  1
mapreduce.map.cpu.vcores                   1
mapreduce.reduce.cpu.vcores                1

  • Is there a way to get hive/mapreduce to use more of my cluster?
  • How would I go about figuring out the bottleneck? (See the commands sketched just below this list.)
  • Could it be that Yarn is not assigning tasks fast enough?
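As a starting point for the bottleneck question above, one approach is to compare what YARN reports per node with what is actually being allocated. This is a hedged sketch using the stock YARN CLI and the ResourceManager web UI; the node id, host name and port 8088 are placeholders based on default settings:

# How many containers is each node actually running?
yarn node -list -all

# Memory and vcore usage vs. capacity for one node (node id is a placeholder).
yarn node -status <node-id>

# Which applications are running and how much of the cluster has each been allocated?
yarn application -list -appStates RUNNING

# The scheduler page shows per-queue capacity, user limits and pending requests
# (default ResourceManager web port assumed):
#   http://<resourcemanager-host>:8088/cluster/scheduler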

I guess that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).

Solution

How many tasks run in parallel depends on your memory settings in YARN. For example, if you have 4 data nodes and your YARN memory properties are defined as below:

yarn.nodemanager.resource.memory-mb     1 GB
yarn.scheduler.minimum-allocation-mb    1 GB
yarn.scheduler.maximum-allocation-mb    1 GB
yarn.app.mapreduce.am.resource.mb       1 GB
mapreduce.map.memory.mb                 1 GB
mapreduce.reduce.memory.mb              1 GB

According to this setting you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb available for launching containers is 4 GB. Since each container takes 1 GB of memory, you can launch at most 4 containers at any given point in time. One of them will be used by the application master, so you can have at most 3 mapper or reducer tasks running at any given point in time, since the application master, each mapper and each reducer all use 1 GB of memory.
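The same bookkeeping written out as a quick shell sketch, using the toy 4-node, 1 GB figures from this example (illustrative only, not measured values):

# 4 data nodes, 1 GB usable per node, 1 GB requested per container.
nodes=4; node_mb=1024; container_mb=1024
total_containers=$(( nodes * node_mb / container_mb ))   # 4 containers cluster-wide
concurrent_tasks=$(( total_containers - 1 ))             # minus 1 for the application master -> 3
echo "$total_containers containers, $concurrent_tasks concurrent map/reduce tasks"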

So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.

P.S. - Here we are talking about the maximum number of tasks that can be launched; in practice it may be somewhat less than that.
