
RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.


1. GPU out-of-memory error when deploying a vLLM service

Error message:

ERROR 05-10 09:27:22 [core.py:400] RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
Process EngineCore_0:
Traceback (most recent call last):
  File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1580, in _dummy_sampler_run
    sampler_output = self.sampler(logits=logits,
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/v1/sample/sampler.py", line 49, in forward
    sampled = self.sample(logits, sampling_metadata)
  File "/workspace/vllm/vllm/v1/sample/sampler.py", line 115, in sample
    random_sampled = self.topk_topp_sampler(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/v1/sample/ops/topk_topp_sampler.py", line 91, in forward_native
    logits = apply_top_k_top_p(logits, k, p)
  File "/workspace/vllm/vllm/v1/sample/ops/topk_topp_sampler.py", line 189, in apply_top_k_top_p
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 298.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 308.31 MiB is free. Including non-PyTorch memory, this process has 29.94 GiB memory in use. Of the allocated memory 28.69 GiB is allocated by PyTorch, with 75.88 MiB allocated in private pools (e.g., CUDA Graphs), and 32.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 404, in run_engine_core
    raise e
  File "/workspace/vllm/vllm/v1/engine/core.py", line 391, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 333, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/workspace/vllm/vllm/v1/engine/core.py", line 72, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
    self.model_executor.initialize_from_config(kv_cache_configs)
  File "/workspace/vllm/vllm/v1/executor/abstract.py", line 65, in initialize_from_config
    self.collective_rpc("compile_or_warm_up_model")
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/workspace/vllm/vllm/utils.py", line 2555, in run_method
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 254, in compile_or_warm_up_model
    self.model_runner._dummy_sampler_run(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1584, in _dummy_sampler_run
    raise RuntimeError(
RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
[rank0]:[W510 09:27:22.759524463 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 33, in <module>
    sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
  File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop-0.21.0-py3.12-linux-x86_64.egg/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1077, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 151, in from_vllm_config
    return cls(
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 649, in __init__
    super().__init__(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 400, in __init__
    self._wait_for_engine_startup()
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 432, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

2. Solution

Cause: the GPU does not have enough free memory.
Approach: limit how much GPU memory vLLM is allowed to use, or add "virtual" GPU memory by offloading/swapping to CPU RAM.
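Before changing any flags, it helps to confirm how much GPU memory is actually free. A minimal check, assuming PyTorch is installed in the environment (nvidia-smi gives the same information):

import torch

# Free vs. total memory on GPU 0, reported in GiB
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free_b / 1024**3:.2f} GiB free of {total_b / 1024**3:.2f} GiB total")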
Suppose this is the script used to launch vLLM:
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model

Adding one of the following options usually solves the problem:
Option 1: limit GPU memory usage (0.8 means 80%): --gpu-memory-utilization 0.8
Option 2: add "virtual" GPU memory: --cpu-offload-gb 10 --swap-space 10

--swap-space : CPU swap space size (in GiB) per GPU. Default: 4
--cpu-offload-gb : The space in GiB to offload to CPU, per GPU. The default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, you can virtually treat it as a 34 GB GPU and load a 13B model with BF16 weights, which needs at least 26 GB of GPU memory. Note that this requires a fast CPU-GPU interconnect, because part of the model is loaded from CPU memory into GPU memory on the fly during every forward pass. Default: 0
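The error message itself also suggests lowering `max_num_seqs` (fewer concurrent sequences means a smaller sampler warmup batch). For reference, here is a minimal sketch of the same knobs through vLLM's offline Python API; the model path is the one from the examples above and the values are illustrative, not tuned:

from vllm import LLM

# Illustrative values only; pick what fits your GPU.
llm = LLM(
    model="/opt/modules/Qwen3-8B",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,  # use at most ~80% of GPU memory
    swap_space=10,               # GiB of CPU swap space per GPU
    cpu_offload_gb=10,           # GiB offloaded to CPU per GPU
    max_num_seqs=128,            # fewer dummy requests during sampler warmup
)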

Final script 1:
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model --gpu-memory-utilization 0.8

Final script 2 (does not take effect on some machines; script 1 is recommended):
vllm serve /opt/modules/Qwen3-8B --trust-remote-code \
--served-model-name chat_model --cpu-offload-gb 10 --swap-space 10
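Once the server starts without the OOM error, a quick smoke test confirms the served model name is reachable. A sketch assuming the default port 8000 and the openai Python client:

from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="chat_model",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)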

If you know of other ways to solve this, please share them.