Distributed Deployment of Qwen3-235B on Ascend 910B

LubuBlue · Published 2026-03-17 14:33:36

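😊 Background

Claude 4.6 assisted throughout
Based on two Ascend 910B servers
vllm-ascend 0.11.0 image
Deploys the Qwen3-235B non-reasoning (Instruct) model
No shared storage; the model weights were copied to each server separately
Deployed with the Ray approach provided by the official vllm-ascend documentation
Does not cover basics such as the Ascend driver or the image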

Terminology

Head node -> node 1, server 1
Worker node -> node 2, server 2

I. Verify communication between nodes

1.1 Node 1

 # Check the remote switch ports
 for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done 
 # Get the link status of the Ethernet ports (UP or DOWN)
 for i in {0..7}; do hccn_tool -i $i -link -g ; done
 # Check the network health status
 for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
 # View the network detected IP configuration
 for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
 # View gateway configuration
 for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
 # View NPU network configuration
 cat /etc/hccn.conf

1.2 Node 2

Same commands as above.

1.3 Output (cause: the NPUs have no IP configured)

Ifname: 200GE1/0/1 
Ifname: 200GE1/0/2 
...
---------------------------------
link status: UP
link status: UP
...
---------------------------------
net health status: Init
net health status: Init 
...
---------------------------------
netdetect address: 0.0.0.0
netdetect address: 0.0.0.0
...
---------------------------------
Get gateway failed, because no gateway was preset there!
Get gateway failed, because no gateway was preset there!
...
---------------------------------
xsfp_reset_max_times_1=0
xsfp_reset_wait_times_1=7
...
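
The 0.0.0.0 netdetect addresses and the missing gateways above are what an unconfigured NPU network looks like. A minimal sketch for flagging this automatically follows; count_unset_netdetect is a hypothetical helper, and it assumes the "netdetect address: <ip>" line format shown in the output above.

```shell
# Hypothetical helper: read `hccn_tool -netdetect -g` output on stdin and
# count the lines whose address is still the 0.0.0.0 default.
count_unset_netdetect() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence `|| true`
    grep -c 'netdetect address: 0\.0\.0\.0' || true
}

# Possible use on a server (not run here):
#   for i in 0 1 2 3 4 5 6 7; do hccn_tool -i $i -netdetect -g; done | count_unset_netdetect
```

A result of 8 means none of the NPUs has a detection address yet, which is the situation the next section fixes.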

II. Assign IPs to the NPUs

2.1 Node 1

# ========== Server 1 NPU IP configuration script ==========
# NPU0-7 are assigned 192.168.100.1-8

echo ">>> Configuring NPU IPs on server 1..."

# Set the IP address of each NPU
hccn_tool -i 0 -ip -s address 192.168.100.1 netmask 255.255.255.0
hccn_tool -i 1 -ip -s address 192.168.100.2 netmask 255.255.255.0
hccn_tool -i 2 -ip -s address 192.168.100.3 netmask 255.255.255.0
hccn_tool -i 3 -ip -s address 192.168.100.4 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.100.5 netmask 255.255.255.0
hccn_tool -i 5 -ip -s address 192.168.100.6 netmask 255.255.255.0
hccn_tool -i 6 -ip -s address 192.168.100.7 netmask 255.255.255.0
hccn_tool -i 7 -ip -s address 192.168.100.8 netmask 255.255.255.0

echo ">>> IP configuration done, setting netdetect peer addresses..."

# Set the peer detection address (pointing at server 2's NPU IPs)
hccn_tool -i 0 -netdetect -s address 192.168.100.9
hccn_tool -i 1 -netdetect -s address 192.168.100.10
hccn_tool -i 2 -netdetect -s address 192.168.100.11
hccn_tool -i 3 -netdetect -s address 192.168.100.12
hccn_tool -i 4 -netdetect -s address 192.168.100.13
hccn_tool -i 5 -netdetect -s address 192.168.100.14
hccn_tool -i 6 -netdetect -s address 192.168.100.15
hccn_tool -i 7 -netdetect -s address 192.168.100.16

echo ">>> Verifying the configuration..."
echo ""
echo "--- Configured IP addresses ---"
for i in {0..7}; do hccn_tool -i $i -ip -g; done

echo ""
echo "--- netdetect peer addresses ---"
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done

echo ""
echo "--- Contents written to hccn.conf ---"
cat /etc/hccn.conf

2.2 Node 2

# ========== Server 2 NPU IP configuration script ==========
# NPU0-7 are assigned 192.168.100.9-16
echo ">>> Configuring NPU IPs on server 2..."
# Set the IP address of each NPU
hccn_tool -i 0 -ip -s address 192.168.100.9  netmask 255.255.255.0
hccn_tool -i 1 -ip -s address 192.168.100.10 netmask 255.255.255.0
hccn_tool -i 2 -ip -s address 192.168.100.11 netmask 255.255.255.0
hccn_tool -i 3 -ip -s address 192.168.100.12 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.100.13 netmask 255.255.255.0
hccn_tool -i 5 -ip -s address 192.168.100.14 netmask 255.255.255.0
hccn_tool -i 6 -ip -s address 192.168.100.15 netmask 255.255.255.0
hccn_tool -i 7 -ip -s address 192.168.100.16 netmask 255.255.255.0
echo ">>> IP configuration done, setting netdetect peer addresses..."
# Set the peer detection address (pointing at server 1's NPU IPs)
hccn_tool -i 0 -netdetect -s address 192.168.100.1
hccn_tool -i 1 -netdetect -s address 192.168.100.2
hccn_tool -i 2 -netdetect -s address 192.168.100.3
hccn_tool -i 3 -netdetect -s address 192.168.100.4
hccn_tool -i 4 -netdetect -s address 192.168.100.5
hccn_tool -i 5 -netdetect -s address 192.168.100.6
hccn_tool -i 6 -netdetect -s address 192.168.100.7
hccn_tool -i 7 -netdetect -s address 192.168.100.8
echo ">>> Verifying the configuration..."
echo ""
echo "--- Configured IP addresses ---"
for i in {0..7}; do hccn_tool -i $i -ip -g; done
echo ""
echo "--- netdetect peer addresses ---"
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
echo ""
echo "--- Contents written to hccn.conf ---"
cat /etc/hccn.conf
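
The two scripts differ only in the last octet of each address, so the commands can also be generated with a loop. The sketch below only prints the commands for review; gen_npu_ip_cmds is a hypothetical helper, and the 192.168.100.0/24 range and eight NPUs per server are taken from the scripts above.

```shell
# Hypothetical helper: print the hccn_tool commands for one server.
# $1 = last octet of this server's NPU0, $2 = last octet of the peer's NPU0.
gen_npu_ip_cmds() (
    own_base=$1
    peer_base=$2
    i=0
    while [ "$i" -lt 8 ]; do
        echo "hccn_tool -i $i -ip -s address 192.168.100.$((own_base + i)) netmask 255.255.255.0"
        echo "hccn_tool -i $i -netdetect -s address 192.168.100.$((peer_base + i))"
        i=$((i + 1))
    done
)

# Server 1: gen_npu_ip_cmds 1 9
# Server 2: gen_npu_ip_cmds 9 1
```

Unlike the scripts above, the IP and netdetect commands come out interleaved per NPU. After checking the printed commands, they can be piped into sh to apply them.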

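2.3 Verify connectivity

Check the health status; run on each of the two servers

# Wait 30 seconds for the HCCL network to initialize
sleep 30
# Check the network health status (goal: every NPU healthy, which hccn_tool normally prints as "Success")
echo "=== $(hostname) network health status ==="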
for i in {0..7}; do
    result=$(hccn_tool -i $i -net_health -g)
    echo "  NPU${i}: ${result}"
done
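
A fixed 30-second sleep may be too short or longer than needed. As a sketch, the health output can instead be checked in a polling loop; all_npus_healthy is a hypothetical helper, and the healthy status word "Success" is an assumption that can be overridden with the first argument.

```shell
# Hypothetical helper: succeed only if every non-empty line on stdin reports
# the expected status. "Success" is an assumed default; pass another word as $1.
all_npus_healthy() (
    want=${1:-Success}
    total=0
    ok=0
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        total=$((total + 1))
        case $line in
            *"net health status: $want"*) ok=$((ok + 1)) ;;
        esac
    done
    [ "$total" -gt 0 ] && [ "$ok" -eq "$total" ]
)

# Possible polling loop on a server (not run here):
#   until for i in 0 1 2 3 4 5 6 7; do hccn_tool -i $i -net_health -g; done | all_npus_healthy; do
#       sleep 5
#   done
```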

Cross-node NPU communication

Node 1

echo "=== Server 1 -> server 2 cross-node NPU ping test ==="
for i in {0..7}; do
    peer_ip=$((i + 9))
    result=$(hccn_tool -i ${i} -ping -g address 192.168.100.${peer_ip} 2>&1)
    echo "  NPU${i}(192.168.100.$((i+1))) -> 192.168.100.${peer_ip}: ${result}"
done

Node 2

echo "=== Server 2 -> server 1 cross-node NPU ping test ==="
for i in {0..7}; do
    peer_ip=$((i + 1))
    result=$(hccn_tool -i ${i} -ping -g address 192.168.100.${peer_ip} 2>&1)
    echo "  NPU${i}(192.168.100.$((i+9))) -> 192.168.100.${peer_ip}: ${result}"
done

III. Start the vllm-ascend container

docker run --rm \
--name qwen3_235B \
--net=host \
--shm-size=512g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/models/Qwen/Qwen3-235B-A22B-Instruct-2507:/root/models/Qwen/Qwen3-235B-A22B-Instruct-2507 \
-it quay.io/ascend/vllm-ascend:v0.11.0-openeuler bash

This step ran into the following problems 😂

3.1 Node 1 reported an OpenBLAS warning

OpenBLAS blas_thread_init: pthread_create failed for thread 1 of 64: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max

Claude's reply: this error is not fatal; add the --ulimit nproc=65535:65535 parameter.

3.2 No NPU found inside the node 1 container

[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 workspace]# npu-smi info | grep -E "NPU|Memory-Usage|Aicore"
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 workspace]#

[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 workspace]# npu-smi info
dcmi module initialize failed. ret is -8005
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 workspace]#

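Claude's reply:

🔴 Diagnosis: dcmi initialization failed

dcmi module initialize failed. ret is -8005

Meaning of error -8005: DCMI (device communication management interface) failed to initialize, usually because the dcmi library version inside the container does not match the host driver version, or because the dcmi daemon on the host is not running properly.

Step 1: exit the container and diagnose on the host

# Exit the container first
exit
# === Run the following on the server 1 host ===
# 1. Check that npu-smi works on the host
echo "=== Host npu-smi test ==="
npu-smi info | head -30
# 2. Check the Ascend driver version
echo "=== Driver version ==="
cat /usr/local/Ascend/driver/version.info
# 3. List the contents of the dcmi directory
echo "=== dcmi library files ==="
ls -la /usr/local/dcmi/
# 4. Check dcmi-related daemons
echo "=== dcmi-related processes ==="
ps aux | grep -E "dcmi|davinci|ascend" | grep -v grep
# 5. Check the status of Ascend-related services
echo "=== Ascend service status ==="
systemctl list-units | grep -i ascend

Step 2: compare the driver versions on the two servers

exit  # leave the server 2 container
cat /usr/local/Ascend/driver/version.info
ls -la /usr/local/dcmi/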

Output:

 echo "=== Host npu-smi test ==="
=== Host npu-smi test ===
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# npu-smi info | head -30
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.0.3                 Version: 24.1.0.3                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B4-1             | OK            | 90.3        34                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3397 / 65536         |
+===========================+===============+====================================================+
| 1     910B4-1             | OK            | 90.6        36                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3397 / 65536         |
+===========================+===============+====================================================+
| 2     910B4-1             | OK            | 88.9        36                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3394 / 65536         |
+===========================+===============+====================================================+
| 3     910B4-1             | OK            | 93.5        34                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3396 / 65536         |
+===========================+===============+====================================================+
| 4     910B4-1             | OK            | 91.7        37                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3393 / 65536         |
+===========================+===============+====================================================+
| 5     910B4-1             | OK            | 89.3        38                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3383 / 65536         |
+===========================+===============+====================================================+
| 6     910B4-1             | OK            | 97.2        40                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3393 / 65536         |
+===========================+===============+====================================================+
| 7     910B4-1             | OK            | 94.9        41                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3393 / 65536         |
+===========================+===============+====================================================+
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# # 2. Check the Ascend driver version
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# echo "=== Driver version ==="
=== Driver version ===
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# cat /usr/local/Ascend/driver/version.info
Version=24.1.0.3
ascendhal_version=7.35.23
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C19SPC005B220
compatible_version=[V100R001C13],[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19]
compatible_version_fw=[7.0.0,7.6.99]
package_version=24.1.0.3
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# # 3. List the contents of the dcmi directory
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# echo "=== dcmi library files ==="
=== dcmi library files ===
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# ls -la /usr/local/dcmi/
total 556
drwxr-xr-x 2 root root 52 Nov 12 17:39 .
drwxr-xr-x. 14 root root 157 Nov 12 17:39 ..
-r--r--r-- 1 root root 73771 Nov 12 17:39 dcmi_interface_api.h
-r--r--r-- 1 root root 489640 Nov 12 17:39 libdcmi.so
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# # 4. Check dcmi-related daemons
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# echo "=== dcmi-related processes ==="
=== dcmi-related processes ===
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# ps aux | grep -E "dcmi|davinci|ascend" | grep -v grep
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# # 5. Check the status of Ascend-related services
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# echo "=== Ascend service status ==="
=== Ascend service status ===
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]# systemctl list-units | grep -i ascend
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-01 Qwen3-235B-A22B-Instruct-2507]#
Server 2:
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-02 imges]# cat /usr/local/Ascend/driver/version.info
Version=24.1.0.3
ascendhal_version=7.35.23
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C19SPC005B220
compatible_version=[V100R001C13],[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19]
compatible_version_fw=[7.0.0,7.6.99]
package_version=24.1.0.3
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-02 imges]# ls -la /usr/local/dcmi/
total 556
drwxr-xr-x 2 root root 52 Nov 12 18:38 .
drwxr-xr-x. 14 root root 157 Nov 12 18:38 ..
-r--r--r-- 1 root root 73771 Nov 12 18:38 dcmi_interface_api.h
-r--r--r-- 1 root root 489640 Nov 12 18:38 libdcmi.so
[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-02 imges]#

3.3 Claude's fix (confirmed to work)

The driver versions and dcmi files are identical on both servers, so a version mismatch is ruled out. Add the --privileged parameter and restart the container.

# --privileged fixes the dcmi -8005 error; --ulimit nproc addresses the OpenBLAS warning from 3.1
docker run --rm \
--name qwen3_235B \
--net=host \
--privileged \
--shm-size=512g \
--ulimit nproc=65535:65535 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/models/Qwen/Qwen3-235B-A22B-Instruct-2507:/root/models/Qwen/Qwen3-235B-A22B-Instruct-2507 \
-it quay.io/ascend/vllm-ascend:v0.11.0-openeuler bash
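
The eight --device /dev/davinciN lines differ only in the index, so they can be generated rather than typed. A small sketch follows; davinci_device_flags is a hypothetical helper that assumes the /dev/davinciN naming used above.

```shell
# Hypothetical helper: print one --device flag per NPU on a single line.
# $1 = number of NPUs on the host.
davinci_device_flags() (
    n=$1
    i=0
    flags=""
    while [ "$i" -lt "$n" ]; do
        flags="$flags --device /dev/davinci$i"
        i=$((i + 1))
    done
    # Drop the leading space added by the first iteration
    echo "${flags# }"
)

# Possible use (leave the substitution unquoted so it splits into words):
#   docker run --rm $(davinci_device_flags 8) --device /dev/davinci_manager ...
```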

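IV. Start the Ray service

4.1 Inspect the network configuration and confirm the NIC name

# === Run inside the server 1 container ===
# Method 1: ifconfig (-B1 also prints the preceding line, which carries the interface name)
ifconfig | grep -B1 "XX.XX.XX"
# Method 2: read the proc network information (always available; confirms the address is local but does not show the NIC name)
cat /proc/net/fib_trie | grep -A1 "XX.XX.XX.XX"
# Method 3: list all network interfaces
ls /sys/class/net/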
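
If the ip command is available in the container (an assumption; the methods above do not need it), the interface that carries a given address can be looked up directly. nic_for_ip is a hypothetical helper that parses the one-line-per-address output of ip -o -4 addr show.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# the interface that carries the IPv4 address given as $1.
# In that output, field 2 is the interface name and field 4 is "address/prefix".
nic_for_ip() {
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# Possible use (not run here):
#   ip -o -4 addr show | nic_for_ip XX.XX.XX.XX
```

The printed name is the value to export as NIC_NAME in the next steps.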

4.2 Node 1

export NIC_NAME=bond1.2001
export LOCAL_IP=XX.XX.XX.XX   # node 1 IP
export HCCL_IF_IP=${LOCAL_IP}
export GLOO_SOCKET_IFNAME=${NIC_NAME}
export TP_SOCKET_IFNAME=${NIC_NAME}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=7200
export OPENBLAS_NUM_THREADS=1

ray start --head \
  --port=6379 \
  --node-ip-address=${LOCAL_IP} \
  --num-gpus=8
  
ray status

4.3 Node 2

# === Inside the server 2 container; copy, paste and run ===
export NIC_NAME=bond1.2001
export LOCAL_IP=XX.XX.XX.XX   # node 2 IP
export HEAD_IP=XX.XX.XX.XX    # head node (node 1) IP
export HCCL_IF_IP=${LOCAL_IP}
export GLOO_SOCKET_IFNAME=${NIC_NAME}
export TP_SOCKET_IFNAME=${NIC_NAME}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=7200
export OPENBLAS_NUM_THREADS=1

ray start \
  --address="${HEAD_IP}:6379" \
  --node-ip-address=${LOCAL_IP} \
  --num-gpus=8
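
Once both nodes have joined, ray status on the head node should list the resources of both servers; with --num-gpus=8 on each, that is 16 in total. The sketch below pulls the total out of the status text; ray_total_resource is a hypothetical helper, and the "used/total NAME" line format it parses is an assumption about the ray status output.

```shell
# Hypothetical helper: read `ray status` text on stdin and print the total
# amount of the resource named in $1, taken from a "used/total NAME" line.
ray_total_resource() {
    awk -v name="$1" '$2 == name && $1 ~ /\// { split($1, a, "/"); print a[2]; exit }'
}

# Possible use on the head node (not run here):
#   ray status | ray_total_resource GPU
```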

4.4 Firewall configuration

Output on node 2:

[root@LNYK-AIC-E23C145-MDMZ-PM-OS01-AICUBE-02 workspace]# ray start \
--address="<node 1 IP>:6379" \
--node-ip-address=<node 2 IP> \
--num-gpus=8
Local node IP: XX.XX.XX.XX
[2026-03-16 23:20:56,879 W 587 587] gcs_rpc_client.h:151: Failed to connect to GCS at address XX.XX.XX.XX:6379 within 5 second

🔴 Cause: port 6379 is unreachable

# === Run on the server 1 host (not inside the container!) ===

# Option A: allow all traffic within the same subnet (recommended; safe on an internal network)
iptables -I INPUT -s XX.XX.XX.0/24 -j ACCEPT
iptables -I OUTPUT -d XX.XX.XX.0/24 -j ACCEPT

# Option B: open only the ports Ray needs
iptables -I INPUT -s XX.XX.XX.XX -p tcp --dport 6379 -j ACCEPT   # GCS
iptables -I INPUT -s XX.XX.XX.XX -p tcp --dport 8265 -j ACCEPT   # Dashboard
iptables -I INPUT -s XX.XX.XX.XX -p tcp --dport 10001 -j ACCEPT  # Client
iptables -I INPUT -s XX.XX.XX.XX -p tcp --dport 20000:30000 -j ACCEPT  # Worker ports

# Option C: if firewalld is in use
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="XX.XX.XX.0/24" accept'
firewall-cmd --reload

# === Run on the server 2 host ===
iptables -I INPUT -s XX.XX.XX.0/24 -j ACCEPT
iptables -I OUTPUT -d XX.XX.XX.0/24 -j ACCEPT

# === Server 2 container ===
# Verify connectivity again
nc -zv -w 5 xx.xx.xx.xx 6379
# Expected output: Connection to xx.xx.xx.xx 6379 port [tcp/*] succeeded!
# Rejoin the Ray cluster

ray start \
  --address="xx.xx.xx.xx:6379" \
  --node-ip-address=xx.xx.xx.xx \
  --num-gpus=8
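
If nc is not installed in the container, the same reachability check can be done with bash alone. This is a sketch; port_open is a hypothetical helper that relies on bash's /dev/tcp redirection and the coreutils timeout command, both assumed to be present.

```shell
# Hypothetical helper: succeed if a TCP connection to host $1, port $2 opens
# within $3 seconds (default 5).
port_open() {
    # Inside `bash -c`, $0 is the host and $1 is the port passed after the script
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Possible use from the server 2 container (not run here):
#   port_open xx.xx.xx.xx 6379 && echo "GCS port reachable"
```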

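V. Start the model service with vllm-ascend

Start the service on node 1 only.
I set the same environment variables on both node 1 and node 2; I have not verified whether setting them on node 1 alone would work.

# Step 1: set the environment variables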
export HCCL_SOCKET_IFNAME=bond1.2001
export HCCL_BUFFSIZE=1024
export HCCL_OP_EXPANSION_MODE=AIV
export OMP_NUM_THREADS=10
export OMP_PROC_BIND=false
export OPENBLAS_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export ASCEND_LAUNCH_BLOCKING=0
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=7200
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Step 2: start vllm
nohup vllm serve /root/models/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --host 0.0.0.0 \
  --port 1025 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --max-model-len 87000 \
  --trust-remote-code \
  --distributed-executor-backend ray \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 10 \
  --block-size 128 \
  --swap-space 0 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --disable-log-stats \
  > vllm_serve.log 2>&1 &
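
The parallel sizes have to match the hardware: vLLM needs tensor-parallel-size times pipeline-parallel-size workers in total, so 8 x 2 here corresponds to the 16 NPUs across the two servers. A tiny sanity check, with parallel_layout_ok as a hypothetical helper:

```shell
# Hypothetical helper: succeed if tensor-parallel size ($1) times
# pipeline-parallel size ($2) equals nodes ($3) times NPUs per node ($4).
parallel_layout_ok() {
    [ $(( $1 * $2 )) -eq $(( $3 * $4 )) ]
}

# The values used above: TP=8, PP=2 on 2 nodes with 8 NPUs each
parallel_layout_ok 8 2 2 8 && echo "layout matches 16 NPUs"
```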

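VI. Miscellaneous

6.1 Restart the model service

To restart the service, you only need to stop the service on node 1

# Inside the server 1 container (the PID file only exists if you saved it at launch, e.g. echo $! > /workspace/vllm.pid)
kill $(cat /workspace/vllm.pid)
# or
pkill -f "vllm serve"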
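
The two stop commands can be combined so that the PID file is used when it exists and pkill is the fallback. A sketch, with stop_by_pidfile as a hypothetical helper:

```shell
# Hypothetical helper: stop the process whose PID is stored in file $1;
# fall back to pkill on the pattern $2 when the file is missing, empty or stale.
stop_by_pidfile() (
    pidfile=$1
    pattern=$2
    if [ -s "$pidfile" ] && kill "$(cat "$pidfile")" 2>/dev/null; then
        rm -f "$pidfile"
    else
        pkill -f "$pattern"
    fi
)

# Possible use inside the server 1 container (not run here):
#   stop_by_pidfile /workspace/vllm.pid "vllm serve"
```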

6.2 Restart the container

Ray is started inside the container, so after the container restarts the Ray service has to be started again

6.3 Call the API

curl --location 'http://xx.xx.xx.xx:1025/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "/root/models/Qwen/Qwen3-235B-A22B-Instruct-2507",
    "messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
    "max_tokens": 85000,
    "temperature": 0.7,
    "stream": false
  }'
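
Note the token budget in this request: vLLM rejects a request whose prompt tokens plus max_tokens exceed --max-model-len, so with max_tokens at 85000 and a window of 87000 the prompt has to fit in 2000 tokens. A small check, with fits_context as a hypothetical helper:

```shell
# Hypothetical helper: succeed if prompt tokens ($1) plus max_tokens ($2)
# fit in the context window ($3, the --max-model-len value).
fits_context() {
    [ $(( $1 + $2 )) -le "$3" ]
}

# A 2000-token prompt with max_tokens=85000 just fits into 87000
fits_context 2000 85000 87000 && echo "request fits"
```

For longer prompts, lower max_tokens in the request accordingly.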
