How to reduce the number of containers in the query


Question

I have a query that uses too many containers and too much memory (97% of the memory is used). Is there a way to set the number of containers used by the query and to limit the maximum memory? The query is running on Tez.

Thanks in advance.

Recommended answer

Controlling the number of mappers:

The number of mappers depends on various factors, such as how the data is distributed among nodes, the input format, the execution engine and configuration parameters. See also "How initial task parallelism works".

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:

set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

Increase these values to reduce the number of mappers running.
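As a rough sketch of what "increasing" might look like, the settings below raise both bounds so that Tez groups more input into each split. The specific sizes are assumptions chosen for illustration, not recommendations; suitable values depend on the data volume and the cluster.

```sql
-- Illustrative values only; tune them to your data and cluster.
set tez.grouping.min-size=268435456;  -- 256 MB min split (256 * 1024 * 1024)
set tez.grouping.max-size=2147483648; -- 2 GB max split (2 * 1024 * 1024 * 1024)
```

Larger grouped splits mean fewer mapper tasks, each processing more data, so each task is likely to run longer.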

Mappers also run on the data nodes where the data is located, which is why manually controlling the number of mappers is not an easy task: it is not always possible to combine input.

Controlling the number of reducers:

The number of reducers is determined according to the following settings:

mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. When this property is set to -1, Hive automatically works out the number of reducers.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.

hive.exec.reducers.max - The maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive uses this as the upper limit when automatically determining the number of reducers.

Simply set hive.exec.reducers.max=<number> to limit the number of reducers running.

If you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
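The two directions can be sketched as follows. The numbers are assumptions picked for illustration. Broadly, Hive estimates the reducer count from the input size divided by bytes per reducer and then caps it at hive.exec.reducers.max, so with 1 GB per reducer a 10 GB input would get roughly 10 reducers.

```sql
-- Fewer reducers: cap the count and give each reducer more input.
set mapreduce.job.reduces=-1;                         -- let Hive estimate the count
set hive.exec.reducers.max=20;                        -- upper limit (illustrative)
set hive.exec.reducers.bytes.per.reducer=1073741824;  -- 1 GB of input per reducer

-- More reducer parallelism: raise the cap and shrink the per-reducer input.
-- set hive.exec.reducers.max=200;
-- set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB of input per reducer
```

Only one of the two groups should be active in a session; the second is left commented out here for that reason.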

Memory settings:

set tez.am.resource.memory.mb=8192;
set tez.am.java.opts=-Xmx6144m;
set tez.reduce.memory.mb=6144;
set hive.tez.container.size=9216;
set hive.tez.java.opts=-Xmx6144m;
      

With the default settings, the actual Tez task uses the mapper's memory settings:

hive.tez.container.size = mapreduce.map.memory.mb
hive.tez.java.opts = mapreduce.map.java.opts
      

Read this for more details: Demystify Apache Tez Memory Tuning - Step by Step
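To limit the memory each task container may use, the container size and the JVM heap are normally lowered together. The sketch below assumes a 4 GB container and follows a commonly cited rule of thumb of setting the heap to roughly 80% of the container size; both figures are assumptions for illustration, not values taken from the question.

```sql
-- Illustrative values only: a smaller container with a matching heap.
set hive.tez.container.size=4096;   -- container size in MB
set hive.tez.java.opts=-Xmx3276m;   -- JVM heap, about 80% of the container
```

The heap is kept below the container size to leave headroom for non-heap memory. Setting it too close to the container size risks YARN killing the container.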

I would suggest optimizing the query first. Use map joins where possible, use vectorized execution, add distribute by on the partition key if you are writing to a partitioned table (this reduces memory consumption on the reducers), and of course write good SQL.
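A minimal sketch of those suggestions, assuming a dynamic-partition insert. The table and column names (sales_staging, sales_partitioned, sale_date, id, amount) are hypothetical and only serve the example.

```sql
set hive.auto.convert.join=true;                 -- let Hive convert joins on small tables to map joins
set hive.vectorized.execution.enabled=true;      -- vectorized execution (typically with ORC data)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical tables: each reducer receives whole partitions,
-- so it keeps fewer partition writers open at once.
insert overwrite table sales_partitioned partition (sale_date)
select id, amount, sale_date
from sales_staging
distribute by sale_date;
```

If a few partitions hold most of the data, distributing by the partition key alone can overload individual reducers, so check the key distribution first.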
