Deploying the DeepSeek BF16 Model on Four Nodes with vLLM-Ascend

The vLLM version used is 0.13.rc1.

This post skips loading the Docker image and bringing up the Docker container, and goes straight to the vLLM launch scripts.

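The main parameters to set are tp, dp, dp local, and dp start rank:

  • TP (Tensor Parallelism): the tensor parallel size
  • DP (Data Parallelism): the data parallel size
  • DP local: the number of local DP processes on each node
  • DP start rank: the global rank of the first DP process on each node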
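  1. Think of each node as a set of NPU groups. With 8 NPUs per machine, tp = 8 puts all of the node's NPUs into a single group, while tp = 4 splits them into two groups (cards 0-3 and cards 4-7). dp local is the number of these TP groups on the node.

    It follows that dp local * tp = the number of NPUs per node.
    Likewise, the dp local values of all nodes add up to dp, so when every node is identical, dp / number of nodes = dp local.

  2. dp start rank is the global index of the node's first group. With 8 NPUs per machine, tp = 4, and dp local = 2, node0 holds global groups 0 and 1, so the dp start rank of the first worker node (node1) is the global index of its first group, which is 2.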

Node 0 (8 NPUs):
  |- DP process 0: uses NPUs 0-3, global rank 0
  |- DP process 1: uses NPUs 4-7, global rank 1

Node 1 (8 NPUs):
  |- DP process 0: uses NPUs 0-3, global rank 2
  |- DP process 1: uses NPUs 4-7, global rank 3
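
The two rules above can be checked with a little shell arithmetic. This is a minimal sketch, not part of vLLM: `dp_local` and `dp_start_rank` are hypothetical helper names, and it assumes every node has the same number of NPUs and that tp divides that number evenly.

```shell
#!/bin/sh
# Hypothetical helpers, not part of vLLM: they only restate the arithmetic above.
# Assumes every node has the same NPU count and that tp divides it evenly.

# dp_local NPUS_PER_NODE TP -> number of DP (TP) groups on one node
dp_local() {
    npus_per_node=$1
    tp=$2
    if [ $((npus_per_node % tp)) -ne 0 ]; then
        echo "tp must divide the number of NPUs per node" >&2
        return 1
    fi
    echo $((npus_per_node / tp))
}

# dp_start_rank NODE_RANK DP_LOCAL -> global rank of the node's first DP group
dp_start_rank() {
    node_rank=$1
    local_groups=$2
    echo $((node_rank * local_groups))
}

# Example from the text: 8 NPUs per node, tp=4, two nodes.
groups=$(dp_local 8 4)
for node in 0 1; do
    echo "node$node: dp local=$groups, dp start rank=$(dp_start_rank "$node" "$groups")"
done
```

For the 8-NPU, tp = 4 example this prints a dp start rank of 0 for node0 and 2 for node1, matching the diagram.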

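In the configuration below I set tp to 8, so dp local is 1 on every node and there are 4 groups in total. The dp start ranks of the four nodes are therefore 0 (left unset on the main node), 1, 2, and 3.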

Main node (node0)

Its arguments include neither --headless nor --data-parallel-start-rank.

#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
local_ip="192.168.70.5"
nic_name="enp67s0f0np0"

node0_ip="192.168.70.5"
NODE_RANK=0

# [Optional] jemalloc
# jemalloc is for better performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export VLLM_USE_MODELSCOPE=True

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
export VLLM_ASCEND_ENABLE_MLAPO=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /NFS/models/DeepSeek-R1-0528-BF16 \
--host 0.0.0.0 \
--port 8004 \
--tensor-parallel-size 8 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'

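Worker node (node1)

This is the first worker node. Compared with the main node, it adds --headless and --data-parallel-start-rank; only local_ip and NODE_RANK change otherwise.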

#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
local_ip="192.168.70.6"
nic_name="enp67s0f0np0"

node0_ip="192.168.70.5"
NODE_RANK=1

# [Optional] jemalloc
# jemalloc is for better performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export VLLM_USE_MODELSCOPE=True

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
export VLLM_ASCEND_ENABLE_MLAPO=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /NFS/models/DeepSeek-R1-0528-BF16 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--tensor-parallel-size 8 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-start-rank $NODE_RANK \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'

Worker node (node2)

#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
local_ip="192.168.70.7"
nic_name="enp67s0f0np0"

node0_ip="192.168.70.5"
NODE_RANK=2

# [Optional] jemalloc
# jemalloc is for better performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export VLLM_USE_MODELSCOPE=True

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
export VLLM_ASCEND_ENABLE_MLAPO=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /NFS/models/DeepSeek-R1-0528-BF16 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--tensor-parallel-size 8 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-start-rank $NODE_RANK \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'

Worker node (node3)

#!/bin/sh
# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
local_ip="192.168.70.8"
nic_name="enp67s0f0np0"

node0_ip="192.168.70.5"
NODE_RANK=3

# [Optional] jemalloc
# jemalloc is for better performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export VLLM_USE_MODELSCOPE=True

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
export VLLM_ASCEND_ENABLE_MLAPO=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /NFS/models/DeepSeek-R1-0528-BF16 \
--host 0.0.0.0 \
--port 8004 \
--headless \
--tensor-parallel-size 8 \
--data-parallel-size 4 \
--data-parallel-size-local 1 \
--data-parallel-start-rank $NODE_RANK \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
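
The four scripts differ only in local_ip, NODE_RANK, and the two worker-only flags, so the per-node part can be derived from NODE_RANK instead of being maintained in four copies. The following is a sketch under that assumption: `node_flags` is a hypothetical helper, and it only prints the flags rather than launching anything.

```shell
#!/bin/sh
# Sketch only: derives the per-node flags from NODE_RANK.
# node_flags is a hypothetical helper; the flags themselves are the ones used above.

# node_flags NODE_RANK DP_LOCAL -> extra flags for that node
node_flags() {
    node_rank=$1
    dp_local=$2
    if [ "$node_rank" -eq 0 ]; then
        # Main node: serves the API, so no --headless and no start rank.
        echo ""
    else
        # Worker node: headless, starting at its first global DP rank.
        echo "--headless --data-parallel-start-rank $((node_rank * dp_local))"
    fi
}

NODE_RANK=${NODE_RANK:-0}
echo "extra flags for node$NODE_RANK: $(node_flags "$NODE_RANK" 1)"
# In a real script the result would be appended to the shared command, e.g.
#   vllm serve /NFS/models/DeepSeek-R1-0528-BF16 <shared flags> $(node_flags "$NODE_RANK" 1)
```

Running it with NODE_RANK=2 prints `--headless --data-parallel-start-rank 2`, the same flags as in the node2 script above.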

Start the nodes in order, beginning with the main node, so that the cluster is networked correctly.
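
Once all four nodes have been launched, one way to confirm that the service came up is to poll the main node's HTTP port. This is a rough sketch rather than part of the original setup: `health_url` and `wait_for_server` are hypothetical helpers, the retry count and 10-second interval are arbitrary, and it assumes `curl` is installed and that the `/health` endpoint of vLLM's OpenAI-compatible server is reachable on the port configured above.

```shell
#!/bin/sh
# Sketch only: polls the main node after all nodes have been launched.
# health_url and wait_for_server are hypothetical helpers; /health is assumed
# to be the liveness endpoint of vLLM's OpenAI-compatible server.

# health_url HOST PORT -> URL to poll
health_url() {
    echo "http://$1:$2/health"
}

# wait_for_server HOST PORT [RETRIES] -> 0 once the endpoint answers, 1 on timeout
wait_for_server() {
    url=$(health_url "$1" "$2")
    retries=${3:-60}
    while [ "$retries" -gt 0 ]; do
        # -s: silent, -f: treat HTTP errors as failure
        if curl -sf -o /dev/null "$url"; then
            echo "server is up: $url"
            return 0
        fi
        retries=$((retries - 1))
        sleep 10
    done
    echo "timed out waiting for $url" >&2
    return 1
}

echo "will poll $(health_url 192.168.70.5 8004)"
```

Calling `wait_for_server 192.168.70.5 8004` from any machine that can reach node0 blocks until the endpoint responds or the retries run out.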
