Ascend Infra

Basic configuration on an Ascend cluster: battling wits with Huawei, part one.

Environment Setup

Create the environment:

# conda environment (3.11 also works in practice, but verl depends on 3.10)
conda create -n ascend python=3.10 -y

CANN

Install CANN according to vllm's dependencies (from the vllm-ascend docs):

# Install required python packages.
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple attrs 'numpy<2.0.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions

# Download and install the CANN package.
wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C22B800TP052/Ascend-cann-toolkit_8.2.rc1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-toolkit_8.2.rc1_linux-"$(uname -i)".run
./Ascend-cann-toolkit_8.2.rc1_linux-"$(uname -i)".run --full
# https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C22B800TP052/Ascend-cann-kernels-910b_8.2.rc1_linux-aarch64.run

source /usr/local/Ascend/ascend-toolkit/set_env.sh

wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C22B800TP052/Ascend-cann-kernels-910b_8.2.rc1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-kernels-910b_8.2.rc1_linux-"$(uname -i)".run
./Ascend-cann-kernels-910b_8.2.rc1_linux-"$(uname -i)".run --install

wget --header="Referer: https://www.hiascend.com/" https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C22B800TP052/Ascend-cann-nnal_8.2.rc1_linux-"$(uname -i)".run
chmod +x ./Ascend-cann-nnal_8.2.rc1_linux-"$(uname -i)".run
./Ascend-cann-nnal_8.2.rc1_linux-"$(uname -i)".run --install

source /usr/local/Ascend/nnal/atb/set_env.sh

Both source commands can be added to ~/.bashrc so the environment is loaded automatically whenever a terminal starts.
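For example, the lines can be appended idempotently; paths assume the default install locations used in the steps above:

```shell
# Append the CANN environment scripts to ~/.bashrc; the grep guard keeps
# repeated runs from adding duplicate lines.
touch ~/.bashrc
for env_script in /usr/local/Ascend/ascend-toolkit/set_env.sh \
                  /usr/local/Ascend/nnal/atb/set_env.sh; do
  line="[ -f $env_script ] && source $env_script"
  grep -qxF "$line" ~/.bashrc || echo "$line" >> ~/.bashrc
done
```

The `[ -f ... ] &&` guard keeps login shells working on machines where CANN is not (yet) installed.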

vLLM

Install vllm and vllm-ascend:

# Install build dependencies
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq

# Configure mirror indexes, otherwise pip cannot find the right vllm-ascend version
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"

pip install vllm==0.10.0
pip install vllm-ascend==0.10.0rc1

Installing vllm-ascend forces the torch version to 2.7.1+cpu, but the NPU is still used normally at runtime.

If compilation fails with an error like Command '['cmake', '--build', '.', '-j=192', '--target=_C']' returned non-zero exit status 1, it is because the build uses all cores by default and crashes; limit it with MAX_JOBS=32 or fewer.
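A sketch of the retry, assuming the failure came from the vllm-ascend source build (the exact cap of 32 is an assumption; pick what fits your machine's memory):

```shell
# Limit compile parallelism (MAX_JOBS is read by the cmake-driven build),
# then re-run the failing install, e.g.:
#   pip install vllm-ascend==0.10.0rc1
export MAX_JOBS=32
```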

For reference, vllm's dependencies:

Software    Supported version       Note
CANN        >= 8.2.RC1              Required for vllm-ascend and torch-npu
torch-npu   >= 2.7.1.dev20250724    Required for vllm-ascend; no need to install manually, it is auto-installed in the steps above
torch       >= 2.7.1                Required for torch-npu and vllm
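A quick way to confirm the installed stack against this table; a small sketch using the pip distribution names (not part of any official docs):

```python
import importlib.metadata as md

def stack_versions(pkgs=("vllm", "vllm-ascend", "torch", "torch-npu")):
    """Report the installed version of each package, or None if missing."""
    versions = {}
    for pkg in pkgs:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

print(stack_versions())
```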

verl

With vllm and vllm-ascend already installed:

git clone https://github.com/volcengine/verl.git
cd verl
pip install -r requirements-npu.txt
pip install -e .

For reference, verl's dependencies; you can only aim for the largest possible intersection, since the CANN and torch versions are backward compatible.

Software     Version
Python       == 3.10
CANN         == 8.1.RC1
torch        == 2.5.1
torch_npu    == 2.5.1.RC1
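That "largest intersection" check can be done mechanically. A minimal sketch; the parser is a deliberate simplification that drops local suffixes like +cpu and reduces RC1 to its digits, so it ignores proper pre-release ordering:

```python
def version_tuple(v):
    """Parse '2.7.1+cpu' or '8.2.RC1' into a comparable tuple of ints."""
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

# verl pins torch == 2.5.1; the stack installed above is newer, which is
# acceptable here because the versions are backward compatible.
assert version_tuple("2.7.1+cpu") >= version_tuple("2.5.1")
assert version_tuple("8.2.RC1") >= version_tuple("8.1.RC1")
```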

Minimal Test

After installing verl, you can run a GRPO training test:

# Download the dataset
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k

set -x

export VLLM_ATTENTION_BACKEND=XFORMERS

# Training
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=128 \
data.max_prompt_length=512 \
data.max_response_length=128 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-7 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.entropy_coeff=0.001 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=20 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=1 \
trainer.device=npu $@

Troubleshooting

  • When using vllm through its Python API (common for inference with custom models, or in libraries such as verl that build on vllm), an error is raised at some from vllm import xxx:

    ValueError: infer_schema(func): Parameter block_size has unsupported type list[int]. The valid types are: dict_keys([<class 'torch.Tensor'>, typing.Optional[torch.Tensor], typing.Sequence[torch.Tensor], typing.List[torch.Tensor], typing.Sequence[typing.Optional[torch.Tensor]], typing.List[typing.Optional[torch.Tensor]], <class 'int'>, typing.Optional[int], typing.Sequence[int], typing.List[int], typing.Optional[typing.Sequence[int]], typing.Optional[typing.List[int]], <class 'float'>, typing.Optional[float], typing.Sequence[float], typing.List[float], typing.Optional[typing.Sequence[float]], typing.Optional[typing.List[float]], <class 'bool'>, typing.Optional[bool], typing.Sequence[bool], typing.List[bool], typing.Optional[typing.Sequence[bool]], typing.Optional[typing.List[bool]], <class 'str'>, typing.Optional[str], typing.Union[int, float, bool], typing.Union[int, float, bool, NoneType], typing.Sequence[typing.Union[int, float, bool]], typing.List[typing.Union[int, float, bool]], <class 'torch.dtype'>, typing.Optional[torch.dtype], <class 'torch.device'>, typing.Optional[torch.device]]). Got func with signature (input: torch.Tensor, weight: torch.Tensor, block_size: list[int], weight_scale: torch.Tensor, input_scale: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, cutlass_block_fp8_supported: bool = False, use_aiter_and_is_supported: bool = False) -> torch.Tensor)

    The cause is that list[int] should be List[int]; this compatibility fix is supposed to be applied automatically by vllm_ascend. Per issues#2564, add the following at the failing line (or before all vllm-related imports):

    from vllm_ascend.patch import platform
    from vllm_ascend.patch import worker

    After patching manually, everything works.

    Other possible causes: Issue #1048 · vllm-project/vllm-ascend

  • An error when importing vllm-related libraries: ValueError: 'aimv2' is already used by a Transformers config, pick another name

    This is a vllm bug; refer to this issue and the newer commit vllm-project/vllm@3fc9644, and modify vllm/transformers_utils/configs/ovis.py accordingly.

Author: Byter

Posted on: 2025-09-01

Licensed under