Author: Ascend PAE Technical Support Team

Introduction to the Ascend case library: https://agent.blog.csdn.net/article/details/155446713
Early preview of Ascend cases: https://gitcode.com/invite/link/8791cccc43cb4ee589e8
(If you have questions about this article, please file an issue in the case library; a dedicated engineer will respond.)


vLLM-Ascend is the Ascend NPU backend officially supported by the vLLM community, used to run vLLM on Ascend NPUs. The following are parameter settings commonly used during environment deployment.

1. Feature list

No.  Key feature
1    ACL Graph
2    Quantization
3    task_queue operator dispatch queue optimization
4    Jemalloc
5    HCCL AIV mode
6    torch_npu virtual memory
7    prefix cache
8    chunked prefill
9    Weight NZ
10   FlashComm
11   Dense-model general optimizations
12   MLP weight prefetch
13   Mooncake pooling
14   Torchair graph mode optimization

2. How to enable

2.1 ACL Graph mode

Starting from v0.9.1rc1, ACL Graph is enabled by default under the V1 Engine.

Enable the V1 Engine:

export VLLM_USE_V1=1

If you run into problems during execution, you can switch back to eager mode as follows to help isolate the issue.

Offline:

from vllm import LLM

model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")

Online:

vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager

2.2 Quantization

When using quantized weights, specify quantization="ascend"; when using floating-point weights, remove this parameter.

Offline:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Online:

vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend

2.3 task_queue operator dispatch queue optimization

For details, see the documentation: https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_007.html

export TASK_QUEUE_ENABLE=0 # disable the optimization
export TASK_QUEUE_ENABLE=1 # enable level-1 optimization
export TASK_QUEUE_ENABLE=2 # enable level-2 optimization

2.4 Jemalloc

For Jemalloc installation, see: https://www.hiascend.com/document/detail/zh/mindie/21RC2/mindieservice/servicedev/mindie_service0381.html

export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

2.5 HCCL AIV mode

For details, see the documentation: https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/maintenref/envvar/envref_07_0096.html

export HCCL_OP_EXPANSION_MODE="AIV"

2.6 torch_npu virtual memory

For details, see the documentation: https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_012.html

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

2.7 prefix cache

Offline:

llm = LLM(model="lmsys/longchat-13b-16k", enable_prefix_caching=True)

Online:

Enabled by default; it can be disabled by specifying --no-enable-prefix-caching.

vllm serve /home/weight/Qwen2.5-32B-Instruct --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --no-enable-prefix-caching # disable prefix cache

2.8 chunked prefill

Enabled by default with the V1 scheduler; the chunk size can be adjusted via max_num_batched_tokens:

Offline:

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_batched_tokens=2048) # 超过2048才会切

Online:

vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --max-num-batched-tokens 2048

2.9 Weight NZ

vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --additional-config '{"enable_weight_nz_layout":true}'
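
Weight NZ keeps weights in the Ascend NZ (fractal) layout, which can reduce layout-conversion overhead at runtime; the command above enables it for online serving. For offline use, the following is a minimal sketch, assuming the same additional-config content is also accepted by the Python LLM constructor as a dict (the model path simply mirrors the command above):

from vllm import LLM

# Minimal offline sketch (assumption): the additional-config JSON used on the
# command line is passed as a Python dict via additional_config.
llm = LLM(model="/home/models/Qwen3-8B-w4a8",
          quantization="ascend",
          max_model_len=4096,
          additional_config={"enable_weight_nz_layout": True})

print(llm.generate("Hello, how are you?")[0].outputs[0].text)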

2.10 Dense-model general optimizations

Master switch for dense-model optimizations; it needs to be used together with the specific features below:

  1. FlashComm
  2. MLP weight prefetch

# Whether to enable dense model and general optimizations for better performance.
# Since we modified the base parent class `linear`, this optimization is also applicable to other model types.
# However, there might be hidden issues, and it is currently recommended to prioritize its use with dense models.
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1

2.11 FlashComm

This must be used together with tensor parallelism (TP) and is suited to high-concurrency scenarios.

# Whether to enable FlashComm optimization when tensor parallel is enabled.
# This feature will get better performance when concurrency is large.
export VLLM_ASCEND_ENABLE_FLASHCOMM=1

2.12 MLP weight prefetch

Suited to low-concurrency scenarios.

# Whether to enable MLP weight prefetch, only used in small concurrency.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
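
The switches in sections 2.10 to 2.12 are environment variables, so for offline (Python API) runs they need to be set before vLLM is initialized. Below is a minimal sketch, assuming the variables are read when the engine starts (the model path is only a placeholder); enable FlashComm for high-concurrency TP deployments or MLP weight prefetch for low-concurrency ones, together with the master switch:

import os

# Set the switches before importing vLLM (assumption: they are read at
# engine initialization).
os.environ["VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE"] = "1"  # master switch (2.10)
os.environ["VLLM_ASCEND_ENABLE_FLASHCOMM"] = "1"       # high concurrency + TP (2.11)
# os.environ["VLLM_ASCEND_ENABLE_PREFETCH_MLP"] = "1"  # low concurrency (2.12)

from vllm import LLM

llm = LLM(model="/home/weight/Qwen2.5-32B-Instruct", tensor_parallel_size=4)
print(llm.generate("Hello, how are you?")[0].outputs[0].text)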

2.13 Mooncake pooling

For usage instructions, see: https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md

Online:

# Prefill node
vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
  --host localhost \
  --port 8100 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000  \
  --max-num-batched-tokens 2000  \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 2 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8  \
  --kv-transfer-config  \
  '{"kv_connector": "MooncakeConnectorV1",
  "kv_buffer_device": "npu",
  "kv_role": "kv_producer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_rank": 0,
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 2
             },
             "decode": {
                    "dp_size": 2,
                    "tp_size": 2
             }
      }
  }'

# Decode node
vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
  --host localhost \
  --port 8200 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000  \
  --max-num-batched-tokens 2000  \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 2 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8  \
  --kv-transfer-config  \
  '{"kv_connector": "MooncakeConnectorV1",
  "kv_buffer_device": "npu",
  "kv_role": "kv_consumer",
  "kv_parallel_size": 1,
  "kv_port": "20002",
  "engine_id": "1",
  "kv_rank": 1,
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 2,
                    "tp_size": 2
             },
             "decode": {
                    "dp_size": 2,
                    "tp_size": 2
             }
      }
  }'

2.14 Torchair graph mode optimization

Only the DeepSeek series and PanguProMoE models are supported.

Online:

vllm serve /mnt/share/weight/DeepSeek-R1-0528_w8a8_MTP_float \
 --port 20002 \
 --data-parallel-size 1 \
 --tensor-parallel-size 16 \
 --enable-expert-parallel \
 --seed 1024 \
 --served-model-name dsv3 \
 --max-model-len 5200 \
 --max-num-batched-tokens 2048 \
 --max-num-seqs 16 \
 --quantization ascend \
 --speculative-config '{"num_speculative_tokens":2, "method":"deepseek_mtp"}' \
 --trust-remote-code \
 --gpu-memory-utilization 0.9 \
 --additional-config \
'{"torchair_graph_config":{"enabled":true,"enable_multistream_moe":true,"enable_super_kernel":true,"use_cached_graph":true,"enable_multistream_mla":true,"graph_batch_sizes":[16]},"chunked_prefill_for_mla":true,"enable_weight_nz_layout":true}'