Getting Started Quickly with the Qwen3-VL-30B-A3B Models on Ascend

weixin_51827225 · Published 2025-11-03 15:09:33

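On October 4, 2025, the Qwen3 series released and open-sourced its new generation of multimodal models: Qwen3-VL-30B-A3B-Thinking and Qwen3-VL-30B-A3B-Instruct. Qwen3-VL is a family of multimodal vision-language models. Building on its predecessors, Qwen3-VL delivers markedly better visual understanding while retaining strong text-only capabilities.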

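Ascend has consistently supported the Qwen model series in step with its releases. As soon as Qwen3-VL-30B-A3B-Thinking and Qwen3-VL-30B-A3B-Instruct were open-sourced, they worked out of the box in LLaMA Factory and vLLM, achieving 0-day adaptation.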

The model weights are also available on the Modelers community (魔乐社区), and developers are welcome to download and try them.

Modelers community link: https://usercenter.modelers.cn/register?client_id=658a392ad997cd15f2612a60&scope=openid%20profile%20email%20phone%20address%20username%20id_token&redirect_uri=https%3A%2F%2Fmodelers.cn%2F%3Futm_source%3Dactivity_HUAWEI_912%26utm_source%3Dactivity_HUAWEI_912%26utm_medium%3Dregister&response_mode=query&state=8b580d81971849e39916580308ba02e1&lang=zh

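Introduction to the Qwen3-VL-30B-A3B-Thinking/Instruct models

Qwen3-VL is the most powerful vision-language model in the Qwen series to date. The newly open-sourced Qwen3-VL-30B-A3B-Thinking and Qwen3-VL-30B-A3B-Instruct models are smaller yet still perform strongly, and they bring together the full set of Qwen3-VL capabilities.

With only 3 billion parameters activated, the models rival or even surpass GPT-5-Mini and Claude4-Sonnet in areas such as STEM, visual question answering (VQA), optical character recognition (OCR), video understanding, and agent tasks.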

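Getting started with the Qwen3-VL-30B-A3B model on Ascend

This tutorial walks you step by step through training and inference deployment for the Qwen3-VL-30B-A3B-Instruct model. It provides detailed instructions and best practices so that you can get up and running quickly.

Training quick start with LLaMA Factory

Environment setup

Development environment

Software	Version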
Python	3.10.18
PyTorch	2.5.1
Transformers	main
LLaMA Factory	main
CANN	8.2.RC

Install the Ascend CANN Toolkit and kernels

Follow the installation guide:

https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Ubuntu&Software=cannToolKit

Install LLaMA Factory

Run the following commands to install LLaMA Factory and its dependencies. This step installs torch and torch_npu automatically.

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]" --no-build-isolation
cd ..
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
pip install qwen_vl_utils

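Model fine-tuning

Dataset preparation

Fine-tuning supports two mainstream dataset formats, alpaca and sharegpt, stored as JSON files. The llava-en-zh-2k dataset is used here for demonstration.
Once the training data is prepared in the required format, write a data/dataset_info.json file that describes the dataset. For example, the configuration below defines a dataset named llava_2k_en, located at llava-en-zh-2k/en/train-00000-of-00001.parquet, in sharegpt format, and maps the dataset's field names to the standard names. The llava_2k_zh entry is defined the same way.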

{
  "llava_2k_en": {
    "file_name": "llava-en-zh-2k/en/train-00000-of-00001.parquet",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },
  "llava_2k_zh": {
    "file_name": "llava-en-zh-2k/zh/train-00000-of-00001.parquet",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
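Because the two entries differ only in their language directory, the same structure can be generated rather than typed by hand. The following is a minimal sketch, not part of LLaMA Factory: the helper names sharegpt_entry and build_dataset_info are hypothetical, and the field values simply mirror the configuration above.

```python
import json


def sharegpt_entry(parquet_path):
    """Build one dataset_info.json entry for a sharegpt-style image dataset."""
    return {
        "file_name": parquet_path,
        "formatting": "sharegpt",
        # Map the dataset's own column names to the standard names.
        "columns": {"messages": "messages", "images": "images"},
        # Describe which keys and role values appear inside each message.
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }


def build_dataset_info(languages=("en", "zh")):
    """Return the full dataset_info mapping for the llava-en-zh-2k splits."""
    return {
        f"llava_2k_{lang}": sharegpt_entry(
            f"llava-en-zh-2k/{lang}/train-00000-of-00001.parquet"
        )
        for lang in languages
    }


dataset_info = build_dataset_info()
dataset_info_json = json.dumps(dataset_info, indent=2)
# Save this text as data/dataset_info.json to use it for training.
print(dataset_info_json)
```

Generating the file this way keeps the two entries consistent if a column or tag name changes later.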

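Training configuration

LLaMA Factory offers a low-code, configuration-driven way to launch training: you only need to write a train_sample.yaml file that defines the training parameters. LoRA fine-tuning of the Qwen3-VL-30B-A3B-Instruct model is used as the example here, and the train_sample.yaml file is as follows: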

### model
model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
flash_attn: disabled
disable_gradient_checkpointing: false

### dataset
dataset: llava_2k_en, llava_2k_zh
template: qwen3_vl_nothink
cutoff_len: 1024
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
enable_thinking: false

### output
output_dir: saves/qwen3_vl-30b/lora/sft
logging_steps: 1
save_steps: 50000
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
max_steps: 500
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
seed: 1234

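Key parameters:
The configuration sets the path of the original model, uses LoRA as the fine-tuning method, sets the template to qwen3_vl_nothink, references the datasets defined in the dataset preparation section, sets the per-device batch size to 8, and saves the model to saves/qwen3_vl-30b/lora/sft.
For a full description of the parameters, see the parameter reference in the official framework documentation.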
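To make these parameters concrete, the sketch below works out the effective global batch size and related quantities implied by train_sample.yaml. It assumes 16 processes (the num_processes value in the fsdp_config.yaml used at training launch) and assumes warmup steps are max_steps × warmup_ratio rounded up. The helper name training_budget is illustrative only.

```python
import math


def training_budget(per_device_bs, grad_accum, world_size, max_steps, warmup_ratio):
    """Derive batch-size and schedule quantities from the training config."""
    # Samples consumed per optimizer step across all devices.
    global_batch = per_device_bs * grad_accum * world_size
    return {
        "global_batch_size": global_batch,
        "samples_seen": global_batch * max_steps,
        # Assumption: warmup length is the ratio of total steps, rounded up.
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
    }


budget = training_budget(
    per_device_bs=8,    # per_device_train_batch_size
    grad_accum=1,       # gradient_accumulation_steps
    world_size=16,      # assumed: num_processes in fsdp_config.yaml
    max_steps=500,
    warmup_ratio=0.1,
)
print(budget)
# {'global_batch_size': 128, 'samples_seen': 64000, 'warmup_steps': 50}
```

If you run on fewer NPUs, the global batch size shrinks proportionally unless gradient_accumulation_steps is raised to compensate.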

Launching training

(1) Adjust the num_processes parameter in examples/accelerate/fsdp_config.yaml to match your NPU devices. The complete fsdp_config.yaml file is as follows:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16 # or fp16
num_machines: 1 # the number of nodes
num_processes: 16 # the number of NPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

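(2) Pass in the path of the YAML file from the training configuration section and run the following command to train the Qwen3-VL-30B-A3B-Instruct model with LLaMA Factory and FSDP.

accelerate launch \
  --config_file examples/accelerate/fsdp_config.yaml \
  src/train.py train_sample.yaml

Training prints logs that include the loss. When training finishes, the fine-tuned LoRA weights are available in the output_dir set in the configuration.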
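Step (1) asks you to edit num_processes by hand. If you script your setup, the edit can be automated. The following is a minimal sketch using only the standard library. The function set_num_processes is a hypothetical helper, not part of Accelerate or LLaMA Factory, and the sample text is an excerpt of the config shown above.

```python
import re


def set_num_processes(config_text, num_npus):
    """Return config_text with the top-level num_processes value replaced."""
    if num_npus < 1:
        raise ValueError("num_npus must be a positive integer")
    # Match the key at the start of a line and keep any trailing comment.
    pattern = re.compile(r"^(num_processes:\s*)\d+", flags=re.MULTILINE)
    new_text, count = pattern.subn(rf"\g<1>{num_npus}", config_text)
    if count != 1:
        raise ValueError(f"expected exactly one num_processes key, found {count}")
    return new_text


sample = """\
mixed_precision: bf16 # or fp16
num_machines: 1 # the number of nodes
num_processes: 16 # the number of NPUs in all nodes
rdzv_backend: static
"""

updated = set_num_processes(sample, 8)
print(updated)
```

A plain text substitution is used here instead of a YAML parser so that the comments in the file survive the edit.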

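Model merging

Write an export_sample.yaml configuration file that defines the parameters needed to merge the model weights. For a full description of the parameters, see the model export section of the official framework documentation.

### model
model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
adapter_name_or_path: saves/qwen3_vl-30b/lora/sft
template: qwen3_vl
finetuning_type: lora

### export
export_dir: saves/qwen3_vl-30b/export
export_device: cpu
export_legacy_format: false

Run the following command to start merging the model weights.

llamafactory-cli export export_sample.yaml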

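Inference with the fine-tuned model

Replace the model path and image path in the script below with your actual paths to run inference with the fine-tuned model.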

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Note: Qwen3-VL-30B-A3B is a MoE model. If loading fails with this class,
# transformers also provides Qwen3VLMoeForConditionalGeneration.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "saves/qwen3_vl-30b/export", torch_dtype="auto", device_map="npu:0"
)
processor = AutoProcessor.from_pretrained("saves/qwen3_vl-30b/export")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "test_picture.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Inference result before fine-tuning:

['This is a dynamic, action-oriented photograph capturing two football players in the midst of a game.\n\n- **The Players**: The image features two male athletes wearing matching red and white uniforms with a prominent logo on their chests.\n - The player in the foreground is captured mid-celebration or reaction, with his mouth open as if shouting or cheering, and his right arm raised high.\n - The second player stands slightly behind him, looking towards the first player with an expression that could be interpreted as surprise, excitement, or encouragement.\n\n- **Setting and Atmosphere**: The scene takes place on a football field, indicated by the green']

Inference result after fine-tuning:

['The image features two football players wearing red and white uniforms, standing on a field with excitement and enthusiasm. One player is pointing towards the crowd, while his teammate stands next to him, possibly cheering or celebrating with the first player.\n\nThere is a large audience present in the scene, as people can be seen in various positions around the players, such as sitting or standing. The crowd adds to the energetic atmosphere of the football match.']
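The inference script above decodes only the newly generated tokens: generate returns the prompt followed by the continuation, so each output sequence is sliced at the length of its input. The sketch below isolates that step with plain Python lists. The token IDs are made up for illustration and the function name trim_prompt_tokens is hypothetical.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Strip the prompt prefix from each generated sequence."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Two prompts of equal (padded) length, each followed by new tokens.
prompts = [[101, 7, 8], [101, 9, 10]]
generated = [[101, 7, 8, 42, 43], [101, 9, 10, 55]]

trimmed = trim_prompt_tokens(prompts, generated)
print(trimmed)  # [[42, 43], [55]]
```

Without this step the decoded text would repeat the chat template and the user prompt before the model's answer.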

Inference quick start with vLLM

Here we run the Qwen3-VL-30B-A3B-Instruct model on Ascend using the vLLM Ascend image.

Offline inference on NPU

Start the container

# The offline script below uses tensor_parallel_size=2, so two NPUs are mapped
# into the container (/dev/davinci0 and /dev/davinci1). Adjust the device IDs
# to match your machine.
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0
docker run --rm \
  --name vllm-ascend \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -p 8000:8000 \
  -it $IMAGE bash

Set environment variables

# Load the model from ModelScope to speed up the download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

Run the script

pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/

import gc

import torch
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen3-VL-30B-A3B-Instruct"


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(max_tokens=512)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

del llm
clean_up()

If the script runs successfully, you will see output like the following:

This image displays a logo, likely for a company or organization, featuring a combination of a graphic icon and text.

- **The Icon**: On the left side, there is a geometric logo composed of several interlocking triangular shapes, forming a three-dimensional, hexagonal-like structure. The design has a light blue or cyan outline against a white background.
- **The Text**: To the right of the icon, the name is presented in two separate lines.
  - The top line reads "TONGYI" in a blue, sans-serif font.
  - The bottom line reads "Qwen" in a black, sans-serif font. The letter "Q" is capitalized.
- **Layout**: The overall design is simple and modern, with the blue and black elements contrasting with the white background.
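The offline script sets tensor_parallel_size=2, which splits the model across two NPUs. As a rough illustration of why that helps, the sketch below estimates the weight memory per device. It assumes about 30 billion parameters (taken from the model name), 2 bytes per parameter (bf16), and an even split across ranks. It ignores the KV cache, activations, and any components that are replicated rather than sharded, so treat the numbers as a lower bound and not a measurement.

```python
def weight_memory_gb(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Estimate weight memory per device in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 1e9


# Assumption: roughly 30 billion parameters in total, stored in bf16.
NUM_PARAMS = 30_000_000_000

single_device = weight_memory_gb(NUM_PARAMS)
per_rank = weight_memory_gb(NUM_PARAMS, tensor_parallel_size=2)

print(f"one device:  {single_device:.0f} GB of weights")
print(f"two devices: {per_rank:.0f} GB of weights each")
```

Although only about 3 billion parameters are active per token, all experts must still be resident in memory, which is why the total parameter count drives this estimate.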

Multi-NPU online serving

Run a Docker container to start the vLLM service on multiple NPUs:

# Update the vllm-ascend image
# --tensor-parallel-size 2 needs two NPUs, so /dev/davinci0 and /dev/davinci1
# are both mapped. Adjust the device IDs to match your machine.
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0
docker run --rm \
  --name vllm-ascend \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
  -it $IMAGE \
  vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
  --served-model-name qwen3vl \
  --dtype bfloat16 \
  --max_model_len 16384 \
  --max-num-batched-tokens 16384 \
  --tensor-parallel-size 2 \
  --enable_expert_parallel

If the service starts successfully, you will see output like the following:

INFO: Started server process [44610]
INFO: Waiting for application startup.
INFO: Application startup complete.

Once the server is up, you can query the model with an input prompt:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3vl",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
      ]}
    ]
  }'

If the server responds normally, the vLLM server side prints logs like the following:

INFO: 127.0.0.1:45880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-23 19:44:05 [loggers.py:123] Engine 000: Avg prompt throughput: 6.5 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-23 19:44:15 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

The client side receives a response like the following:

{"id":"chatcmpl-7d35682041384faeb147660c93bd13f8","object":"chat.completion","created":1758627832,"model":"qwen3vl","choices":[{"index":0,"message":{"role":"assistant","content":"TONGYI Qwen","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":65,"total_tokens":72,"completion_tokens":7,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
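The same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper names build_payload, extract_answer, and ask are hypothetical, and it assumes the server is reachable at http://localhost:8000 with the served model name qwen3vl, as in the serving command above. The offline check at the end uses an abbreviated copy of the response shown above, so it runs without a server.

```python
import json
import urllib.request


def build_payload(image_url, question, model="qwen3vl"):
    """Build a chat-completions request body with one image and one question."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    }


def extract_answer(response):
    """Pull the assistant's text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def ask(image_url, question, base_url="http://localhost:8000"):
    """POST the request to a running server and return the answer text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(image_url, question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_answer(json.load(reply))


# Offline check against an abbreviated copy of the response shown above.
sample_response = json.loads(
    '{"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"TONGYI Qwen"},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":65,"total_tokens":72,"completion_tokens":7}}'
)
print(extract_answer(sample_response))  # TONGYI Qwen
```

With the server running, ask("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png", "What is the text in the illustrate?") would issue the same query as the curl command.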
