A General Approach to Benchmarking LLM Inference Performance with vLLM
1 Overview
vLLM is an open-source framework for fast LLM inference and serving; it exposes an OpenAI-compatible HTTP API, which is what we benchmark below.
2 Environment Setup
First, create a conda environment and install vLLM:
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm
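To confirm the installation succeeded, a quick sanity check is to print the installed version:
python3 -c "import vllm; print(vllm.__version__)"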
For reference, the complete package list of a working environment:
(vllm) qiql@d8fbcca68988:~$ pip list
Package Version
--------------------------------- -------------
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.6.2.post1
async-timeout 4.0.3
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
compressed-tensors 0.6.0
datasets 3.1.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
fastapi 0.115.4
filelock 3.16.1
frozenlist 1.5.0
fsspec 2024.9.0
gguf 0.10.0
h11 0.14.0
httpcore 1.0.6
httptools 0.6.4
httpx 0.27.2
huggingface-hub 0.26.2
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
Jinja2 3.1.4
jiter 0.7.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyterlab-language-pack-zh-CN 4.2.post3
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.6
MarkupSafe 3.0.2
mistral_common 1.4.4
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.2
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
openai 1.54.3
opencv-python-headless 4.10.0.84
outlines 0.0.46
packaging 24.2
pandas 2.2.3
partial-json-parser 0.2.1.1.post4
pillow 10.4.0
pip 24.2
prometheus_client 0.21.0
prometheus-fastapi-instrumentator 7.0.0
propcache 0.2.0
protobuf 5.28.3
psutil 6.1.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 18.0.0
pycountry 24.6.1
pydantic 2.9.2
pydantic_core 2.23.4
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.38.0
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
rpds-py 0.21.0
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 75.1.0
six 1.16.0
sniffio 1.3.1
starlette 0.41.2
sympy 1.13.3
tiktoken 0.7.0
tokenizers 0.20.3
torch 2.4.0
torchvision 0.19.0
tqdm 4.67.0
transformers 4.46.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.0
uvloop 0.21.0
vllm 0.6.3.post1
watchfiles 0.24.0
websockets 14.0
wheel 0.44.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.17.1
zipp 3.21.0
Upload the model files to the server.
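If the server has direct internet access, an alternative to uploading is to download the weights in place with huggingface_hub. A minimal sketch, where the repo id Qwen/Qwen1.5-7B-Chat and the target directory (chosen to match the --model path used below) are assumptions, not part of the original article:

from huggingface_hub import snapshot_download

# Fetch the model weights into the directory later passed to --model.
# repo_id and local_dir are assumptions matching the launch command below.
snapshot_download(
    repo_id="Qwen/Qwen1.5-7B-Chat",
    local_dir="/workspace/models/Qwen1.5-7B-Chat",
)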
3 Starting the Service
Start the OpenAI-compatible API server:
python3 -m vllm.entrypoints.openai.api_server \
    --port 9109 \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --trust-remote-code \
    --model /workspace/models/Qwen1.5-7B-Chat \
    --tensor-parallel-size 1 \
    --max-model-len 1024 \
    --served-model-name qwen1.5 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 32768
--model /workspace/models/Qwen1.5-7B-Chat is the model directory
--port 9109 is the port the server listens on
--tensor-parallel-size 1 is the tensor-parallel degree, i.e., the number of GPUs the model is sharded across
--served-model-name qwen1.5 is the model name clients must use in API requests
Once the server is up, it can be smoke-tested with a single request, as shown below.
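A minimal sketch of such a smoke test with the openai Python client; the api_key value is a placeholder, since vLLM does not validate it by default:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; point the client at it.
client = OpenAI(base_url="http://localhost:9109/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen1.5",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)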
4 Testing
Open a new terminal session, activate the conda environment created above, and download the test script:
wget https://mirrors.aheadai.cn/scripts/vllm-performance-stream-qwen1.5-long-4090-0.5.0.py
Download the test dataset:
wget https://mirrors.aheadai.cn/data/alpaca_gpt4_data_input_1k.json
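To check that the dataset downloaded intact, you can peek at it first; a small sketch, assuming the file is a JSON array of prompt records (the field layout is not documented by the script):

import json

# Load the benchmark prompts and show how many records there are.
with open("alpaca_gpt4_data_input_1k.json") as f:
    data = json.load(f)
print(len(data), "records; first record:", data[0])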
Run the test:
python3 ./vllm-performance-stream-qwen1.5-long-4090-0.5.0.py http://localhost:9109 > qwen1.5-7b-1gpu.log 2>&1
The test takes roughly an hour, and the results are saved in the log file.
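The downloaded script's internals are not reproduced here, but the core of a streaming benchmark like this is to time the token stream per request. A minimal illustrative sketch, where the model name, prompt, and the use of stream chunks as a rough token-count proxy are all assumptions:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9109/v1", api_key="EMPTY")

def measure(prompt, max_tokens=256):
    # Stream one completion; record time to first token and total duration.
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="qwen1.5",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()
            chunks += 1  # each content chunk is roughly one generated token
    total = time.perf_counter() - start
    ttft = (first - start) if first else float("nan")
    return ttft, chunks / total

ttft, rate = measure("Briefly explain what vLLM does.")
print(f"TTFT {ttft:.3f}s, ~{rate:.1f} chunks/s")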