1. vLLM是什么

        官网文档:https://docs.vllm.ai/en/stable/Easy, fast, and cheap LLM serving for everyone.

        相对于Ollama来说,vLLM使用就没有那么easy, 特别硬件资源不足的时候,我手头只有一块nvidia L20, 显存48G,大一点的模型运行起来就超过显存总量,各种参数调整才能搞定。

2. vLLM安装

        pip安装是很简单的,但是一定要搞一个虚拟环境,因为尝试不同大模型时,需要的各种库版本是不一样的,要各种大模型版本尝试,最好是创建几个虚拟环境隔离开来,测试运行成功的模型环境就不要改动了。

  python3 -m venv vLLm

  pip install vllm

  pip list

        安装起来很简单,只要你GPU硬件比较新,驱动版本比较新,基本不要考虑指定老版本来安装:

$ nvidia-smi 

        在后面尝试各种大模型时,vllm、transformers、tokenizers这3个关键库的版本,有时会出现版本不兼容,需要调整。

3. 大模型下载

        虽然vllm可以直接指定模型名运行时去下载,但还是喜欢先下载到本地,然后直接指定本地模型路径来运行。

        一般下载开源大模型流行的我知道的就2个:HuggingFace和modelscope,地址是:

        https://huggingface.co/  基本访问不成功。

        https://github.com/huggingface

        https://hf-mirror.com/  HuggingFace的镜像,基本是可以访问的,偶尔会失败。

        https://modelscope.cn/my/overview  魔塔社区,网络没问题。

        我基本上只要魔塔上有要的模型就会先去这里下载,写一个脚本跑后台:

$ cat download.sh

#!/usr/bin/bash

source /home/honya/llm/vLLM/bin/activate

rm -f ./nohup_download.out

nohup modelscope download \
	--model Qwen/Qwen3.5-27B-FP8 \
	--local_dir ./Qwen3.5-27B-FP8 \
	 >./nohup_download.out 2>&1 &

4. 模型运行

        每个模型名去各家AI网页上查一下,会给具体运行命令参数的,如果你的GPU够好显存够大,就不用考虑那么多烦心事,不然如我这48G显存,很多今年发布的大模型,默认FP16都无法安装。

        下面把我调测的大模型执行命令列一下:

#!/bin/bash


# 安装vLLM
# pip install vllm


stop_process()
{
	# 定义模型路径关键字
	MODEL_PATH="$1"

	# 1. 查找主进程PID
	MAIN_PID=$(ps -ef | grep "$MODEL_PATH" | grep -v grep | awk '{print $2}')

	if [ -z "$MAIN_PID" ]; then
		echo "未找到与模型路径 $MODEL_PATH 相关的进程"
		exit 1
	fi

	echo "找到主进程PID: $MAIN_PID"

	# 2. 查找所有子进程
	CHILD_PIDS=$(ps -o pid --ppid "$MAIN_PID" --no-headers | awk '{printf "%s ", $0}' | sed 's/ $//')

	if [ -z "$CHILD_PIDS" ]; then
		echo "主进程 $MAIN_PID 没有子进程"
		kill -9 $MAIN_PID
	else
		echo "子进程PID: $CHILD_PIDS"
		kill -9 $CHILD_PIDS $MAIN_PID
	fi
}


start_qwen35_9b_sft()
{
	rm -f ./nohup_vllm_qwen35_9b_sft.out 
	echo "begin to start model qwen3.5-9b-sft ......"
	nohup python -m vllm.entrypoints.openai.api_server \
	  --model /home/honya/llm/LlamaFactory/saves/my_model/qwen3.5-9b-sft \
	  --served-model-name my-qwen3.5-9b-sft \
	  --tensor-parallel-size 1 \
	  --trust-remote-code \
	  --max-model-len 65536 \
	  --enable-auto-tool-choice \
	  --tool-call-parser qwen3_xml \
	  --port 38080 \
	  --host 0.0.0.0 >./nohup_vllm_qwen35_9b_sft.out 2>&1 &
	echo "end to start."
}

stop_qwen35_9b_sft()
{
	echo "begin to stop model qwen3.5-9b-sft ......"
	stop_process "/home/honya/llm/LlamaFactory/saves/my_model/qwen3.5-9b-sft"
	echo "end to stop."
}

start_qwen3_coder_next()
{
	echo "begin to start model qwen3-coder-next ......"
	export CUDA_VISIBLE_DEVICES=0

	rm -f ./nohup_vllm_qwen3_coder_next.out 

	nohup vllm serve /home/honya/modelscope_models/Qwen3-Coder-30B-A3B-Instruct \
		--host 0.0.0.0 \
		--port 38081 \
		--dtype bfloat16 \
		--kv-cache-dtype fp8 \
		--tensor-parallel-size 1 \
		--gpu-memory-utilization 0.7 \
		--max-model-len 16384 \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--enable-prefix-caching \
		--enable-auto-tool-choice \
		--tool-call-parser qwen3_coder \
		--max-num-seqs 8 \
		--trust-remote-code >./nohup_vllm_qwen3_coder_next.out 2>&1 &

	echo "end to start."
}

stop_qwen3_coder_next()
{
	echo "begin to stop model qwen3_coder_next ......"
	stop_process "/home/honya/modelscope_models/Qwen3-Coder-30B-A3B-Instruct"
	echo "end to stop."
}

# start Qwen3.5-27B-FP8
start_qwen35_27b()
{
	echo "begin to start model Qwen3.5-27B-FP8......"
	export PYTORCH_ALLOC_CONF=expandable_segments:True

	nohup python -m vllm.entrypoints.openai.api_server \
	  	--model /home/honya/modelscope_models/Qwen3.5-27B-FP8 \
		--served-model-name my-coder-llm \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--host 0.0.0.0 \
		--port 38081 \
		--tensor-parallel-size 1 \
		--kv-cache-dtype fp8_e4m3 \
		--gpu-memory-utilization 0.85 \
		--max-model-len 32768 \
		--enable-auto-tool-choice \
		--tool-call-parser qwen3_coder \
		--reasoning-parser qwen3 \
		--max-num-seqs 8 \
		--trust-remote-code \
		>>./qwen35_27b_fp8.out 2>&1 &

	echo "end to start."
}

stop_qwen35_27b()
{
	stop_process "/home/honya/modelscope_models/Qwen3.5-27B"
}

# start Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
start_qwen35_27b_claude_46_opus_reasoning_distilled_v2()
{
	echo "begin to start model Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 ......"
	export CUDA_VISIBLE_DEVICES=0
	export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

	# 模型内部使用的 FLA 加速算子与当前 vLLM 版本或 PyTorch 环境不兼容,导致输入张量的维度格式解析错误。
	# 禁用 FLA(Flash Linear Attention)
	# export VLLM_USE_FLA=OFF

	# 强制限制最大并发请求数,减少预热时的显存压力。
	# --max-num-seqs 3 \

	# vLLM 默认会尝试使用 CUDA Graphs 来加速。如果你不需要极致的推理性能,或者为了避开这个特定的算子 Bug,可以先禁用它。
	# 使用 --enforce-eager 会牺牲约 10%-20% 的推理吞吐量(Token 生成速度变慢)。
	# 如果上述命令能成功启动,你可以尝试去掉 --enforce-eager,并保持较低的 --max-num-seqs,看是否能正常启动。
	# --enforce-eager \

	nohup python -m vllm.entrypoints.openai.api_server \
	  	--model /home/honya/modelscope_models/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 \
		--served-model-name my-coder-llm \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--host 0.0.0.0 \
		--port 38081 \
		--dtype auto \
		--load-format auto \
		--tensor-parallel-size 1 \
		--gpu-memory-utilization 0.90 \
		--max-model-len 16384 \
		--kv-cache-dtype auto \
		--reasoning-parser qwen3 \
		--tool-call-parser qwen3_coder \
		--tokenizer-mode slow \
		--max-num-seqs 8 \
		--trust-remote-code >>./qwen35_27b_claude46_opus.out 2>&1 &

	echo "end to start."
}

stop_qwen35_27b_claude_46_opus_reasoning_distilled_v2()
{
	stop_process "/home/honya/modelscope_models/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
}

# start Qwen3.5-35B-A3B-FP8
start_qwen35_35b_a3b_fp8()
{
	echo "begin to start model Qwen3.5-35B-A3B-FP8 ......"
	echo "need vllm == 0.18.0 and transformers == 4.57.6, please: pip install vllm==0.18.0 transformers==4.57.6"

	export CUDA_VISIBLE_DEVICES=0
	export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

	# 模型内部使用的 FLA 加速算子与当前 vLLM 版本或 PyTorch 环境不兼容,导致输入张量的维度格式解析错误。
	# 禁用 FLA(Flash Linear Attention)
	export VLLM_USE_FLA=OFF

	# 强制限制最大并发请求数,减少预热时的显存压力。
	# --max-num-seqs 3 \

	# vLLM 默认会尝试使用 CUDA Graphs 来加速。如果你不需要极致的推理性能,或者为了避开这个特定的算子 Bug,可以先禁用它。
	# 使用 --enforce-eager 会牺牲约 10%-20% 的推理吞吐量(Token 生成速度变慢)。
	# 如果上述命令能成功启动,你可以尝试去掉 --enforce-eager,并保持较低的 --max-num-seqs,看是否能正常启动。
	# --enforce-eager \

	nohup python -m vllm.entrypoints.openai.api_server \
	  	--model /home/honya/modelscope_models/Qwen3.5-35B-A3B-FP8 \
		--served-model-name my-coder-llm \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--host 0.0.0.0 \
		--port 38081 \
		--tensor-parallel-size 1 \
		--max-model-len 262144 \
		--reasoning-parser qwen3 \
		--gpu-memory-utilization 0.90 \
		--enable-auto-tool-choice \
		--trust-remote-code \
		--max-num-seqs 3 \
		--tool-call-parser qwen3_coder >>./qwen35_35b_a3b_fp8.out 2>&1 &

	echo "end to start."
}

stop_qwen35_35b_a3b_fp8()
{
	stop_process "/home/honya/modelscope_models/Qwen3.5-35B-A3B-FP8"
}

# start Qwen2.5-14B-Instruct-AWQ
start_qwen25_14b_instruct_awq()
{
	echo "begin to start model Qwen2.5-14B-Instruct-AWQ ......"
	export CUDA_VISIBLE_DEVICES=0
	export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

	# rm -f ./nohup_vllm_qwen25_14b_instruct_awq.out 

	# nohup vllm serve /home/honya/modelscope_models/Qwen2.5-14B-Instruct-AWQ \

	nohup python -m vllm.entrypoints.openai.api_server \
	  	--model /home/honya/modelscope_models/Qwen2.5-14B-Instruct-AWQ \
		--served-model-name my-coder-llm \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--host 0.0.0.0 \
		--port 38081 \
		--tensor-parallel-size 1 \
		--dtype auto \
		--max-model-len 131072 \
		--gpu-memory-utilization 0.90 \
		--quantization awq \
		--enable-auto-tool-choice \
		--guided-decoding-backend lm-format-enforcer \
		--trust-remote-code \
		--tool-call-parser hermes >>./nohup_vllm_qwen25_14b_instruct_awq.out 2>&1 &

	echo "end to start."
}

stop_qwen25_14b_instruct_awq()
{
	stop_process "/home/honya/modelscope_models/Qwen2.5-14B-Instruct-AWQ"
}

# ZhipuAI/GLM-4.7-Flash
# cyankiwi/GLM-4.7-Flash-AWQ-4bit
start_glm47_flash()
{
	echo "begin to start model cyankiwi/GLM-4.7-Flash-AWQ-4bit ......"
	export CUDA_VISIBLE_DEVICES=0
	export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
	export PYTORCH_ALLOC_CONF=expandable_segments:True

	nohup python -m vllm.entrypoints.openai.api_server \
	  	--model /home/honya/modelscope_models/GLM-4.7-Flash-AWQ-4bit \
		--served-model-name my-coder-llm \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--host 0.0.0.0 \
		--port 38081 \
		--trust-remote-code \
		--max-model-len 200000 \
		--gpu-memory-utilization 0.85 \
		--enable-auto-tool-choice \
		--tool-call-parser glm47 \
		--speculative-config.method mtp \
		--speculative-config.num_speculative_tokens 1 \
		 >>./glm47_flash.out 2>&1 &

	echo "end to start."
}

stop_glm47_flash()
{
	stop_process "/home/honya/modelscope_models/GLM-4.7-Flash-AWQ-4bit"
}

# start Qwen2.5-Coder-7B-Instruct
start_qwen25_coder_7b_instruct()
{
	echo "begin to start model Qwen2.5-Coder-7B-Instruct ......"
	export CUDA_VISIBLE_DEVICES=0

	rm -f ./nohup_vllm_qwen25_coder_7b_instruct.out 

	nohup vllm serve /home/honya/modelscope_models/Qwen2.5-Coder-7B-Instruct \
		--served-model-name my-qwen2.5-coder-7b-instruct \
		--host 0.0.0.0 \
		--port 38081 \
		--dtype bfloat16 \
		--gpu-memory-utilization 0.85 \
		--max-model-len 32768 \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--enable-auto-tool-choice \
		--tool-call-parser hermes \
		--max-num-seqs 8 \
		--trust-remote-code >./nohup_vllm_qwen25_coder_7b_instruct.out 2>&1 &

	echo "end to start."
}

stop_qwen25_coder_7b_instruct()
{
	echo "begin to stop model Qwen2.5-Coder-7B-Instruct ......"
	stop_process "/home/honya/modelscope_models/Qwen2.5-Coder-7B-Instruct"
	echo "end to stop."
}

start_deepseek_coder_67b_instruct()
{
	echo "begin to start model deepseek-ai/deepseek-coder-6.7b-instruct ......"
	export CUDA_VISIBLE_DEVICES=0
	export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

	python -m vllm.entrypoints.openai.api_server \
	  	--model /home/honya/modelscope_models/deepseek-coder-6.7b-instruct \
		--served-model-name my-coder-llm \
		--api-key b7f5b00cd851463392aeacef6eaf1d0e \
		--host 0.0.0.0 \
		--port 38081 \
		--tensor-parallel-size 1 \
		--gpu-memory-utilization 0.9 \
		--max-model-len 32768 \
		--dtype bfloat16 \
		--enable-prefix-caching \
		--enable-auto-tool-choice \
		--tool-call-parser pythonic \
		--trust-remote-code \
		--max-num-seqs 256 \
		--max-num-batched-tokens 8192 \
		--max-num-seqs 8 \
		>./deepseek_coder_67b_instruct.out 2>&1 &

	echo "end to start."
}

stop_deepseek_coder_67b_instruct()
{
	stop_process "/home/honya/modelscope_models/deepseek-coder-6.7b-instruct"
}

list_status()
{
	ps -ef | grep honya | grep -v grep | egrep -e "vllm|vLLM|VLLM"
}

usage()
{
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen35_9b_sft]"
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen35_35b_a3b_fp8]"
	echo "usage: ./run_vllm.sh [start | stop | status] [glm47_flash]"
	echo "usage: ./run_vllm.sh [start | stop | status] [deepseek_coder_67b_instruct]"
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen25_14b_instruct_awq]"
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen25_coder_7b_instruct]"
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen3_coder_next]"
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen35_27b_claude46_opus]"
	echo "usage: ./run_vllm.sh [start | stop | status] [qwen35_27b]"
}


if [ "$#" -lt "1" ]; then
	usage
elif [ "$1" = "start" -a "$2" = "qwen35_27b" ]; then
	start_qwen35_27b
elif [ "$1" = "stop" -a "$2" = "qwen35_27b" ]; then
	stop_qwen35_27b
elif [ "$1" = "start" -a "$2" = "qwen35_27b_claude46_opus" ]; then
	start_qwen35_27b_claude_46_opus_reasoning_distilled_v2
elif [ "$1" = "stop" -a "$2" = "qwen35_27b_claude46_opus" ]; then
	stop_qwen35_27b_claude_46_opus_reasoning_distilled_v2
elif [ "$1" = "start" -a "$2" = "qwen35_9b_sft" ]; then
	start_qwen35_9b_sft
elif [ "$1" = "start" -a "$2" = "qwen35_35b_a3b_fp8" ]; then
	start_qwen35_35b_a3b_fp8
elif [ "$1" = "stop" -a "$2" = "qwen35_35b_a3b_fp8" ]; then
	stop_qwen35_35b_a3b_fp8
elif [ "$1" = "start" -a "$2" = "glm47_flash" ]; then
	start_glm47_flash
elif [ "$1" = "start" -a "$2" = "deepseek_coder_67b_instruct" ]; then
	start_deepseek_coder_67b_instruct
elif [ "$1" = "start" -a "$2" = "qwen25_14b_instruct_awq" ]; then
	start_qwen25_14b_instruct_awq
elif [ "$1" = "start" -a "$2" = "qwen3_coder_next" ]; then
	start_qwen3_coder_next
elif [ "$1" = "start" -a "$2" = "qwen25_coder_7b_instruct" ]; then
	start_qwen25_coder_7b_instruct
elif [ "$1" = "stop" -a "$2" = "qwen35_9b_sft" ]; then
	stop_qwen35_9b_sft
elif [ "$1" = "stop" -a "$2" = "qwen3_coder_next" ]; then
	stop_qwen3_coder_next
elif [ "$1" = "stop" -a "$2" = "qwen25_coder_7b_instruct" ]; then
	stop_qwen25_coder_7b_instruct
elif [ "$1" = "status" ]; then
	list_status
else 
	usage
fi

4.1 Qwen3-Coder-30B-A3B-Instruct

        失败,显存不足,没去折腾,放弃。

(EngineCore_DP0 pid=15622) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 44.39 GiB of whic    h 389.25 MiB is free. Including non-PyTorch memory, this process has 44.01 GiB memory in use. Of the allocated memory 43.63 GiB is allocated by PyTo    rch, and 33.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_seg    ments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
181 [rank0]:[W326 16:14:21.090413420 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can     leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

4.2 Qwen2.5-Coder-7B-Instruct

        失败,放弃。

(EngineCore_DP0 pid=24176) ValueError: Free memory on device cuda:0 (34.49/44.39 GiB) on startup is less than desired GPU memory utilization (0.85,     37.73 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
 87 [rank0]:[W327 14:47:34.674962665 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can     leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

4.3 deepseek-coder-6.7b-instruct

        成功。

4.4 Qwen2.5-14B-Instruct-AWQ

        成功。

4.5 qwen3.5-9b-sft

        成功,参数太小,基本不用。

4.6 cyankiwi/GLM-4.7-Flash-AWQ-4bit

        失败,没细查失败原因,放弃。

(EngineCore_DP0 pid=121118)   File "/data/llm/vLLM/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/utils.py     ", line 126, in find_matched_target
1459 (EngineCore_DP0 pid=121118)     raise ValueError(
1460 (EngineCore_DP0 pid=121118) ValueError: Unable to find matching target for model.layers.1.self_attn.fused_qkv_a_proj in the compressed-tensors conf     ig.
1461 [rank0]:[W329 12:18:45.529142484 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can      leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

4.7 Qwen3.5-35B-A3B-FP8

        成功,其中--max-num-seqs参数会极大影响缓存,默认是256,很多AI网页版都没讲到点子上,千问提示正确解决。

4.8 Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

        失败,最头痛这个版本,Tokenizers总是不行,vllm、transformers、tokenizers库不停换版本,最后也没解决。

(APIServer pid=524191)   File "/data/llm/vLLM2/lib/python3.12/site-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
247 (APIServer pid=524191)     tokenizer = AutoTokenizer.from_pretrained(
248 (APIServer pid=524191)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
249 (APIServer pid=524191)   File "/data/llm/vLLM2/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretr    ained
250 (APIServer pid=524191)     raise ValueError(
251 (APIServer pid=524191) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

4.9 Qwen3.5-27B-FP8

        成功,目前用它来做AI coding,编程时执行速度很慢,一个大型一点的重构,跑上2个小时以上,tokens生成数很低。

5. 结束语

        刚接触LLM才2周多,太多概念不熟悉,太多细节不清楚,也没时间去调研学习,客户现在被小龙虾风暴搞得激情起来,主动提出要一个大模型项目,怎么在业务流程中使用LangChain还没明白,先把开源模型选项和vLLM部署搞一下,抛砖引玉,希望有经验者来指点和讨论。

Logo

昇腾计算产业是基于昇腾系列(HUAWEI Ascend)处理器和基础软件构建的全栈 AI计算基础设施、行业应用及服务,https://devpress.csdn.net/organization/setting/general/146749包括昇腾系列处理器、系列硬件、CANN、AI计算框架、应用使能、开发工具链、管理运维工具、行业应用及服务等全产业链

更多推荐