Deploying the full-size DeepSeek model with llama.cpp: a study

Technical documentation

ipex-llm/ipex-llm: Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc. (official source repository)

ipex-llm/docs/mddocs/Overview/install_gpu.md at main · intel/ipex-llm (ipex-llm installation guide)

I. Hardware environment:

Motherboard: X13SWA-TF

CPU: Intel Xeon w9-3575X

GPU: Intel Arc A770

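II. Software environment:

Operating system: Ubuntu 25.04, Intel architecture (x86-64) build

Ubuntu supports Intel GPUs better starting with 25.04, so I personally recommend 25.04 or later. That said, llama.cpp runs at roughly the same speed on Ubuntu 22.04, 24.04, 24.10, and 25.04.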

III. Prerequisites:

1. Install the Intel GPU driver:

Client GPU:
Installing Client GPUs — Intel® software for general purpose GPU capabilities documentation
Data center GPU:
Installing Data Center GPU: LTS Releases — Intel® software for general purpose GPU capabilities documentation

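Note: follow the data center GPU guide only for the Intel® Data Center GPU Max series and the Intel® Data Center GPU Flex series.
For every other series, such as the Intel® Arc™ A-series, follow the client GPU guide. (Good news: the client GPU installation is much easier than the data center one.)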

2. Install Intel oneAPI

Get the Intel® oneAPI Base Toolkit

After the installation finishes, be sure to run the following two commands in a terminal; otherwise llama.cpp will not detect the SYCL backend:

sudo apt update
sudo apt -y install cmake pkg-config build-essential
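
To confirm that the SYCL backend is actually visible after these steps, a check along the following lines may help. This is a minimal sketch, assuming oneAPI was installed to its default system-wide prefix /opt/intel/oneapi (set ONEAPI_SETVARS to override). check_sycl is a hypothetical helper name; setvars.sh and sycl-ls come with the oneAPI Base Toolkit.

```shell
#!/usr/bin/env bash
# Assumed default location of the oneAPI environment script; adjust if
# the toolkit was installed somewhere else.
ONEAPI_SETVARS="${ONEAPI_SETVARS:-/opt/intel/oneapi/setvars.sh}"

# Hypothetical helper: load the oneAPI environment, then list SYCL devices.
check_sycl() {
    if [ ! -f "$ONEAPI_SETVARS" ]; then
        echo "setvars.sh not found at $ONEAPI_SETVARS" >&2
        return 1
    fi
    # Load compiler and runtime paths into the current shell.
    source "$ONEAPI_SETVARS" >/dev/null 2>&1
    if ! command -v sycl-ls >/dev/null 2>&1; then
        echo "sycl-ls not on PATH" >&2
        return 1
    fi
    # The Arc GPU should appear in this list if the driver is working.
    sycl-ls
}
```

Running check_sycl in the same terminal that will later start llama.cpp should list the Arc A770 among the devices; if it does not, the driver or oneAPI installation is the first thing to recheck.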

IV. llama.cpp version:

llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-xeon.tgz

Download: Release 2.3.0 nightly build · ipex-llm/ipex-llm

Usage: see

ipex-llm/docs/mddocs/Quickstart/llamacpp_portable_zip_gpu_quickstart.md at main · intel/ipex-llm

and https://github.com/ggml-org/llama.cpp/tree/master/tools/server

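V. Findings:

Inference settings: 44 threads, one per physical core (the CPU has 88 logical cores), with the CPU kept at or below 80℃ and the GPU at or below 60℃. Running the deepseek-r1-671b Q4_K_M model with the flash-moe command, speed holds steady at 5.5 t/s.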
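
One possible way to restrict a run to one logical CPU per physical core is sketched below. The source does not say which tool was used for pinning, so this is an assumption: it parses lscpu output and passes the result to taskset. physical_cpu_list is a hypothetical helper, and the binary name and model path in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "CPU,CORE,SOCKET" rows (the format printed by
# `lscpu -p=CPU,CORE,SOCKET`) on stdin and prints the first logical CPU of
# each physical core as a comma-separated list.
physical_cpu_list() {
    awk -F, '
        /^#/ { next }              # skip the header comments lscpu prints
        NF >= 3 {
            key = $2 "," $3        # one key per (core, socket) pair
            if (!(key in seen)) {
                seen[key] = 1
                list = list sep $1
                sep = ","
            }
        }
        END { print list }
    '
}

# Example use on the target machine (not executed here):
#   cpus="$(lscpu -p=CPU,CORE,SOCKET | physical_cpu_list)"
#   taskset -c "$cpus" ./llama-server -m /path/to/model.gguf -t 44
```

On the 44-core CPU described here the list should contain 44 entries, matching the thread count passed with -t. Building the list from lscpu rather than hard-coding a range avoids relying on how the kernel numbers hyperthread siblings.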

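1. Cooling: when the CPU reaches 100℃ during inference, speed is roughly 4.5 to 5.0 t/s. When the CPU holds at 80℃, speed is 5.5 to 6.0 t/s.

2. Threads: running on physical cores is much faster. Pinning threads to fixed physical cores (5.5 t/s) is about 1.0 t/s faster than letting threads land on arbitrary cores (4.5 t/s). Speed peaks at a suitable thread count, generally half the number of logical cores.

3. When deploying deepseek-r1-671b Q4_K_M, llama.cpp runs almost entirely on the CPU. GPU memory goes mainly to three things: the context (4 GB of VRAM holds about 4,000 tokens, or about 8,000 individual characters), the compute buffer (about 2 GB per card), and the weights offloaded to VRAM automatically (9.5 GB in total for this model).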
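
The VRAM figures in item 3 can be combined into a rough budget. The sketch below is only an illustration built from those three reported numbers; estimate_vram_gb is a hypothetical helper, and scaling the context cost linearly with token count is an assumption rather than something measured here.

```shell
#!/usr/bin/env bash
# Hypothetical estimator of total VRAM (GB, summed over all cards):
#   context:        4 GB per 4,000 tokens (linear scaling is an assumption)
#   compute buffer: about 2 GB per card
#   offloaded:      9.5 GB in total for deepseek-r1-671b Q4_K_M
estimate_vram_gb() {
    local ctx_tokens="$1" cards="$2"
    awk -v t="$ctx_tokens" -v c="$cards" 'BEGIN {
        context = t / 4000 * 4
        compute = 2 * c
        offload = 9.5
        printf "%.1f\n", context + compute + offload
    }'
}

estimate_vram_gb 4000 1   # prints 15.5
```

With one card and a 4,000-token context the estimate is 15.5 GB, which would sit just under the 16 GB of the larger Arc A770 variant. Treat the result as a planning aid rather than a guarantee.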

VI. Useful links

llama.cpp/docs/development/token_generation_performance_tips.md at master · ggml-org/llama.cpp (possible causes of inference speed bottlenecks)

Intel Xeon performance on R1 671B quants? · ggml-org/llama.cpp · Discussion #12088 (reference table of inference speeds for full-size DeepSeek)

ipex-llm/docs/mddocs/Quickstart/llamacpp_portable_zip_gpu_quickstart.md at main · intel/ipex-llm (usage guide)

llama.cpp/tools/server at master · ggml-org/llama.cpp (server parameters)

Release 2.3.0 nightly build · ipex-llm/ipex-llm (release packages)
