How can I increase the OpenFabrics memory limit for Torque jobs?

Question

When I run an MPI job over InfiniBand, I get the following warning. We use the Torque resource manager.

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:              host1

Registerable memory:     65536 MiB

Total memory:            196598 MiB

Your MPI job will continue, but may be behave poorly and/or hang.

--------------------------------------------------------------------------

I've read the page linked in the warning message. This is what I've done so far:

  1. Appended options mlx4_core log_num_mtt=20 log_mtts_per_seg=4 to /etc/modprobe.d/mlx4_en.conf.
  2. Made sure the following lines are in /etc/security/limits.conf:
    • * soft memlock unlimited
    • * hard memlock unlimited

Can anyone help me find out what I'm missing?

Recommended answer

Your mlx4_core parameters allow registering only 2^20 * 2^4 * 4 KiB = 64 GiB. With 192 GiB of physical memory per node, and given the recommendation to make at least twice the physical memory registerable, you should set log_num_mtt to 23. That raises the limit to 512 GiB, the smallest power of two greater than or equal to twice the amount of RAM. Be sure to reboot the node(s), or unload and then reload the kernel module, so the new value takes effect.
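The arithmetic above can be checked with a short shell sketch. The function name max_reg_gib is made up for illustration, and the 4 KiB page size is the same assumption the calculation above uses; adjust it if your nodes use a different page size.

```shell
#!/bin/sh
# Registerable memory in GiB for a given pair of mlx4_core parameters.
# max_reg_gib is a hypothetical helper name; 4 KiB pages are assumed.
max_reg_gib() {
    log_num_mtt=$1
    log_mtts_per_seg=$2
    page_kib=4
    # 2^log_num_mtt * 2^log_mtts_per_seg * page size in KiB, converted to GiB
    echo $(( (1 << (log_num_mtt + log_mtts_per_seg)) * page_kib / 1024 / 1024 ))
}

max_reg_gib 20 4   # current setting   -> 64
max_reg_gib 23 4   # suggested setting -> 512
```

With the suggested value, the line in /etc/modprobe.d/mlx4_en.conf would read options mlx4_core log_num_mtt=23 log_mtts_per_seg=4.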

You should also submit a simple Torque job script that executes ulimit -l, to check the locked-memory limit as seen inside a job and make sure no such limit is in effect. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory; it removes the limit on the size of core dump files.
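A minimal job script for that check might look like the sketch below. The job name memlock-check and the single-core resource request are placeholders, not something from the question; adapt them to your queue setup.

```shell
#!/bin/bash
#PBS -N memlock-check
#PBS -l nodes=1:ppn=1
#PBS -j oe

# Report the locked-memory limit as seen from inside the job.
# "unlimited" is the desired result; a number is a limit in KiB.
locked=$(ulimit -l)
echo "host: $(uname -n)"
echo "max locked memory: $locked"
```

Submit it with qsub and look at the job's output file. If it prints a number instead of unlimited, that suggests the limits.conf settings are not being applied to processes started by Torque on that node.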
