
Lifting the locked-memory limit for Kubernetes containers

Reference: https://access.redhat.com/solutions/1257953

Problem

Running mpirun inside the nccl-tests container docker.io/library/nccl-tests:24.12 with the buffer size set to NCCL_BUFFSIZE=503316480
fails with an out of memory error:

pod-1:78:91 [0] include/alloc.h:114 NCCL WARN Cuda failure 'out of memory'
pod-1:78:91 [0] include/alloc.h:119 NCCL WARN Failed to CUDA host alloc -268435456 bytes
pod-1:78:91 [0] NCCL INFO transport/net.cc:517 -> 1
pod-1:78:91 [0] NCCL INFO transport/net.cc:719 -> 1
pod-1:78:93 [0] NCCL INFO transport.cc:193 -> 1
pod-1:78:93 [0] NCCL INFO group.cc:133 -> 1
pod-1:78:93 [0] NCCL INFO group.cc:75 -> 1 [Async thread]
pod-1:78:91 [0] proxy.cc:1620 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
pod-1:78:78 [0] NCCL INFO group.cc:426 -> 1
pod-1:78:78 [0] NCCL INFO group.cc:566 -> 1
pod-1:78:78 [0] NCCL INFO group.cc:106 -> 1
pod-1: Test NCCL failure sendrecv.cu:57 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. pod-1 pid 78: Test failure common.cu:383
 .. pod-1 pid 78: Test failure common.cu:592
 .. pod-1 pid 78: Test failure sendrecv.cu:103
 .. pod-1 pid 78: Test failure common.cu:625
 .. pod-1 pid 78: Test failure common.cu:1123
 .. pod-1 pid 78: Test failure common.cu:893

Confirming the cause

Running ulimit -a inside the container shows that max locked memory is only 64 KB.
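The check can be narrowed to the one limit that matters here. The following is a minimal sketch, not part of the original note: report_memlock is a hypothetical helper name, and it relies only on the shell builtin ulimit -l, which prints the soft locked-memory limit in KiB or the word "unlimited".

```shell
# Hypothetical helper: report the locked-memory limit of the current shell.
# `ulimit -l` prints the soft limit in KiB, or "unlimited".
report_memlock() {
  limit=$(ulimit -l)
  if [ "$limit" = "unlimited" ]; then
    echo "max locked memory: unlimited"
  else
    echo "max locked memory: ${limit} KiB"
  fi
}

report_memlock
```

Inside an affected container this should report 64 KiB, matching the ulimit -a output above. From outside the pod, something like kubectl exec pod-1 -- sh -c 'ulimit -l' should give the same number, assuming the pod is named pod-1 as the log prefix suggests.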

Lifting the container's max locked memory limit

In /etc/systemd/system/docker.service, add LimitMEMLOCK=infinity to the [Service] section.
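As an alternative to editing the unit file by hand, the same setting can be placed in a systemd drop-in file. This is a sketch under assumptions rather than the method used in this note: write_memlock_dropin and the file name memlock.conf are hypothetical, and the function takes the unit directory as an argument so it can be tried against a scratch directory first.

```shell
# Hypothetical helper: write a systemd drop-in that sets LimitMEMLOCK for docker.
# $1 is the systemd unit directory; it defaults to /etc/systemd/system,
# which requires root. Pass a scratch directory for a dry run.
write_memlock_dropin() {
  unit_dir="${1:-/etc/systemd/system}"
  dropin_dir="$unit_dir/docker.service.d"
  mkdir -p "$dropin_dir" || return 1
  # Limit* directives belong in the [Service] section of the unit.
  printf '[Service]\nLimitMEMLOCK=infinity\n' > "$dropin_dir/memlock.conf"
}
```

Calling write_memlock_dropin as root with no argument writes /etc/systemd/system/docker.service.d/memlock.conf. The drop-in keeps the change separate from the unit file itself. It takes effect only after the daemon reload and Docker restart.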
Then restart Docker:

systemctl daemon-reload
systemctl restart docker
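After the restart it is worth confirming that the new limit actually reaches containers. The sketch below is an illustration, not part of the original procedure: memlock_is_unlimited and verify_docker_memlock are hypothetical helper names, and busybox is only an assumed throwaway image for the check.

```shell
# Hypothetical helper: succeed if a reported limit value means "no limit".
# `ulimit -l` prints "unlimited"; systemd spells the same setting "infinity".
memlock_is_unlimited() {
  [ "$1" = "unlimited" ] || [ "$1" = "infinity" ]
}

# Hypothetical helper: check the limit the docker service and a fresh container see.
# Needs systemctl, a running Docker daemon, and the (assumed) busybox image.
verify_docker_memlock() {
  # Limit that systemd applies to the docker service.
  systemctl show docker --property=LimitMEMLOCK
  # Limit inherited by a newly started container.
  value=$(docker run --rm busybox sh -c 'ulimit -l') || return 1
  if memlock_is_unlimited "$value"; then
    echo "container memlock: unlimited"
  else
    echo "container memlock: ${value} KiB (still limited)"
  fi
}
```

Run verify_docker_memlock on the node after the restart. Containers that were already running may keep their old limit, so the nccl-tests pod likely needs to be recreated before ulimit -a inside it reflects the change.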