
Deploying the New Qwen3.5 Models on Ascend with SGLang

2026-05-12 17:55:38

This post is a hands-on guide to deploying the open-source Qwen3.5 model series on the Ascend platform with the SGLang framework, walking you through the steps to get efficient inference running.

The tutorial covers the full Qwen3.5 line-up (397B-A17B, 122B-A10B, 35B-A3B, and 27B) and provides both the original BF16 weights and quantized weights to suit different development needs.

🔗 BF16 model weights:

➡️ https://modelers.cn/models/Qwen-AI/Qwen3.5-397B-A17B

➡️ https://modelers.cn/models/Qwen-AI/Qwen3.5-122B-A10B

➡️ https://modelers.cn/models/Qwen-AI/Qwen3.5-35B-A3B

➡️ https://modelers.cn/models/Qwen-AI/Qwen3.5-27B

🔗 Quantized model weights:

➡️ Qwen3.5-397B-w8a8 (without MTP)

https://modelers.cn/models/SGLangAscend/Qwen3.5-397B-A17B-w8a8-mtp

🔗 SGLang deployment tutorials:

The Modelers community (魔乐社区) provides a tailored SGLang deployment tutorial for each model size; see the corresponding pages for details:

➡️ https://modelers.cn/models/SGLangAscend/Qwen3.5-122B-A10B

➡️ https://modelers.cn/models/SGLangAscend/Qwen3.5-35B-A3B

➡️ https://modelers.cn/models/SGLangAscend/Qwen3.5-27B

The rest of this post walks through the deployment steps in detail using Qwen3.5-35B-A3B as the example; the other model sizes can be deployed by adapting the same procedure.

01 Environment Preparation

1. Model weights

Qwen3.5-35B-A3B (BF16 version):

https://modelers.cn/models/Qwen-AI/Qwen3.5-35B-A3B

2. Installation

To simplify setting up the NPU runtime environment, Ascend packages all dependencies required to run on NPUs into Docker images published on quay.io. Developers do not need to configure dependencies manually; simply pull the image for your hardware. The pull and container-start steps are as follows:

# Atlas 800 A3
docker pull quay.io/ascend/sglang:v0.5.9-cann8.5.0-a3
# Atlas 800 A2
docker pull quay.io/ascend/sglang:v0.5.9-cann8.5.0-910b

# start the container; set ${NAME} to a container name and ${tag} to the image tag pulled above
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0  \
--device=/dev/davinci1:/dev/davinci1  \
--device=/dev/davinci2:/dev/davinci2  \
--device=/dev/davinci3:/dev/davinci3  \
--device=/dev/davinci4:/dev/davinci4  \
--device=/dev/davinci5:/dev/davinci5  \
--device=/dev/davinci6:/dev/davinci6  \
--device=/dev/davinci7:/dev/davinci7  \
--device=/dev/davinci8:/dev/davinci8  \
--device=/dev/davinci9:/dev/davinci9  \
--device=/dev/davinci10:/dev/davinci10  \
--device=/dev/davinci11:/dev/davinci11  \
--device=/dev/davinci12:/dev/davinci12  \
--device=/dev/davinci13:/dev/davinci13  \
--device=/dev/davinci14:/dev/davinci14  \
--device=/dev/davinci15:/dev/davinci15  \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:${tag}
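
After the container is up, open a shell inside it to run the deployment steps below. A minimal sketch, assuming ${NAME} is the container name used in the docker run command above:

# enter the running container
docker exec -it ${NAME} bash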

02 Model Deployment

This walkthrough uses a single-node deployment: a script configures the environment parameters and starts the inference service, and a curl request example is provided to quickly verify that the deployment works.

Single-node deployment

Run the following script to start the online inference service.

# set the CPU frequency governor to performance
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind CPU affinity for SGLang
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# source the CANN environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

# HCCL and collective-communication settings
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

# MODEL_PATH should point to the downloaded Qwen3.5-35B-A3B weights
python3 -m sglang.launch_server \
        --model-path $MODEL_PATH \
        --attention-backend ascend \
        --device npu \
        --tp-size 2 --nnodes 1 --node-rank 0 \
        --chunked-prefill-size 4096 --max-prefill-tokens 280000 \
        --disable-radix-cache \
        --trust-remote-code \
        --host 127.0.0.1 \
        --mem-fraction-static 0.7 \
        --port 8000 \
        --cuda-graph-bs 16 \
        --quantization modelslim \
        --enable-multimodal \
        --mm-attention-backend ascend_attn \
        --dtype bfloat16
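
Once the server has finished loading the weights, you can do a quick sanity check before sending a real request. A minimal sketch, assuming the host and port configured above and that the image exposes the OpenAI-compatible model listing endpoint:

# list the models served by the OpenAI-compatible endpoint
curl http://127.0.0.1:8000/v1/models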

Run the following command to send a request to the model:

curl --location http://127.0.0.1:8000/v1/chat/completions --header 'Content-Type: application/json' --data '{
  "model": "qwen3.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": "/image_path/qwen.png"} 
        },
        {"type": "text", "text": "What is the text in the illustrate?"}
      ]
    }
  ]
}'

After the request completes, you should see a model response similar to the following:

{"id":"cdcd6d14645846e69cc486554f198154","object":"chat.completion","created":1772098465,"model":"qwen3.5","choices":[{"index":0,"message":{"role":"assistant","content":"The user is asking about the text present in the image. I will analyze the image to identify the text.\n</think>\n\nThe text in the image is \"TONGyi Qwen\".","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":248044}],"usage":{"prompt_tokens":98,"total_tokens":138,"completion_tokens":40,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

We look forward to your feedback!

The adapted models in this tutorial are currently an early-access preview, and performance optimization is still in progress. If you encounter any problem while using them (including but not limited to functional or compliance issues), please file an issue in the model's code repository; the developers will review and respond promptly.

🔗 https://modelers.cn/models/SGLangAscend/Qwen3.5-35B-A3B
