Fixing the error when running LLaMA-Factory with multiple GPUs
Running LLaMA-Factory with multiple GPUs fails with the following error:
Traceback (most recent call last):
  File "/home/Mmm/anaconda3/envs/llama/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/data/Mmm/LLaMA-Factory/src/llamafactory/cli.py", line 130, in main
    process = subprocess.run(
  File "/home/Mmm/anaconda3/envs/llama/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '4', '--master_addr', '127.0.0.1', '--master_port', '33837', '/data/Mmm/LLaMA-Factory/src/llamafactory/launcher.py', '/data/Mmm/Params/train_2025-07-23-13-30-31/training_args.yaml']' returned non-zero exit status 1.
Fix: add the following to LLaMA-Factory/src/llamafactory/launcher.py:
import os
# Pin the GPU ID to use (e.g., use only GPU 0)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
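Placement matters: CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA runtime initializes, so the safest spot is the very top of the file, before anything imports torch. Below is a minimal sketch of what the edited launcher.py might look like; the run_exp import and launch() wrapper are illustrative and may differ by version, so keep whatever your copy of the file actually contains:

import os

# Must be set before any torch/CUDA import so that only GPU 0 is
# visible to this process and everything it spawns.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llamafactory.train.tuner import run_exp  # illustrative import


def launch():
    run_exp()


if __name__ == "__main__":
    launch()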
Then, before running llamafactory-cli webui in the terminal, run these two commands first:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
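If you would rather not re-export these in every new shell, an alternative (a sketch, not part of the original fix) is to set them in launcher.py next to CUDA_VISIBLE_DEVICES. NCCL reads these variables when it initializes its communicators, which happens after the launcher process starts, so this is equivalent to the two export commands:

import os

# Disable P2P and InfiniBand transports, which RTX 4000-series cards
# do not support for NCCL; same effect as the shell exports above.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"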
Otherwise you will hit this error:
File \"/home/Mmm/anaconda3/envs/llama/lib/python3.10/site-packages/accelerate/state.py\", line 311, in __init__
raise NotImplementedError(
NotImplementedError: Using RTX 4000 series doesn\'t support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE=\"1\"` and `NCCL_IB_DISABLE=\"1\" or use `accelerate launch` which will do this automatically.
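As a quick sanity check (a sketch; run it in the same shell you will launch from), you can confirm the environment variables actually took effect before starting the web UI:

import os
import torch

# Both should print "1" after the two export commands above.
print(os.environ.get("NCCL_P2P_DISABLE"))
print(os.environ.get("NCCL_IB_DISABLE"))

# Prints 1 only if CUDA_VISIBLE_DEVICES=0 is also exported in this
# shell; the launcher.py edit applies to training runs, not this check.
print(torch.cuda.device_count())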
With the steps above, the error is resolved by falling back to a single GPU instead of multi-GPU training.