1 Overview

vLLM is an open-source inference and serving framework for large language models.
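
In addition to the HTTP service deployed below, vLLM also exposes an offline Python API, which is handy for quick local checks. A minimal sketch, assuming the same model path used later in this guide:

from vllm import LLM, SamplingParams

# Load the model and run batched offline inference.
llm = LLM(model="/workspace/models/Qwen1.5-7B-Chat")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is vLLM?"], params)
for out in outputs:
    print(out.outputs[0].text)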

2 Environment Setup

First, create a conda environment and install vLLM:

conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm
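
To confirm the installation succeeded, a quick version check from Python (it should match the vllm entry in the list below):

import vllm

print(vllm.__version__)  # e.g. 0.6.3.post1 in the environment listed below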

For reference, here is the complete list of package versions in the resulting environment:

(vllm) qiql@d8fbcca68988:~$ pip list 
Package                           Version
--------------------------------- -------------
aiohappyeyeballs                  2.4.3
aiohttp                           3.10.10
aiosignal                         1.3.1
annotated-types                   0.7.0
anyio                             4.6.2.post1
async-timeout                     4.0.3
attrs                             24.2.0
certifi                           2024.8.30
charset-normalizer                3.4.0
click                             8.1.7
cloudpickle                       3.1.0
compressed-tensors                0.6.0
datasets                          3.1.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
einops                            0.8.0
exceptiongroup                    1.2.2
fastapi                           0.115.4
filelock                          3.16.1
frozenlist                        1.5.0
fsspec                            2024.9.0
gguf                              0.10.0
h11                               0.14.0
httpcore                          1.0.6
httptools                         0.6.4
httpx                             0.27.2
huggingface-hub                   0.26.2
idna                              3.10
importlib_metadata                8.5.0
interegular                       0.3.3
Jinja2                            3.1.4
jiter                             0.7.0
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
jupyterlab-language-pack-zh-CN    4.2.post3
lark                              1.2.2
llvmlite                          0.43.0
lm-format-enforcer                0.10.6
MarkupSafe                        3.0.2
mistral_common                    1.4.4
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.18.6
multidict                         6.1.0
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.4.2
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.77
nvidia-nvtx-cu12                  12.1.105
openai                            1.54.3
opencv-python-headless            4.10.0.84
outlines                          0.0.46
packaging                         24.2
pandas                            2.2.3
partial-json-parser               0.2.1.1.post4
pillow                            10.4.0
pip                               24.2
prometheus_client                 0.21.0
prometheus-fastapi-instrumentator 7.0.0
propcache                         0.2.0
protobuf                          5.28.3
psutil                            6.1.0
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           18.0.0
pycountry                         24.6.1
pydantic                          2.9.2
pydantic_core                     2.23.4
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
pytz                              2024.2
PyYAML                            6.0.2
pyzmq                             26.2.0
ray                               2.38.0
referencing                       0.35.1
regex                             2024.11.6
requests                          2.32.3
rpds-py                           0.21.0
safetensors                       0.4.5
sentencepiece                     0.2.0
setuptools                        75.1.0
six                               1.16.0
sniffio                           1.3.1
starlette                         0.41.2
sympy                             1.13.3
tiktoken                          0.7.0
tokenizers                        0.20.3
torch                             2.4.0
torchvision                       0.19.0
tqdm                              4.67.0
transformers                      4.46.2
triton                            3.0.0
typing_extensions                 4.12.2
tzdata                            2024.2
urllib3                           2.2.3
uvicorn                           0.32.0
uvloop                            0.21.0
vllm                              0.6.3.post1
watchfiles                        0.24.0
websockets                        14.0
wheel                             0.44.0
xformers                          0.0.27.post2
xxhash                            3.5.0
yarl                              1.17.1
zipp                              3.21.0

Upload the model files to the server.
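
Alternatively, if the server has direct internet access, you can download the model in place instead of uploading it. A sketch using huggingface_hub (installed above as a vLLM dependency), targeting the directory used by the launch command in the next section:

from huggingface_hub import snapshot_download

# Fetch Qwen1.5-7B-Chat into the directory the server will load from.
snapshot_download(
    repo_id="Qwen/Qwen1.5-7B-Chat",
    local_dir="/workspace/models/Qwen1.5-7B-Chat",
)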

3 Starting the Service

Launch the OpenAI-compatible API server:

python3 -m vllm.entrypoints.openai.api_server \
    --port 9109 \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --trust-remote-code \
    --model /workspace/models/Qwen1.5-7B-Chat \
    --tensor-parallel-size 1 \
    --max-model-len 1024 \
    --served-model-name qwen1.5 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 32768

--model /workspace/models/Qwen1.5-7B-Chat is the model directory.

--port 9109 is the port the server listens on.

--tensor-parallel-size 1 is the tensor-parallel degree, i.e. the number of GPUs the model is sharded across.

--served-model-name qwen1.5 is the model name clients pass in API requests, and --max-model-len 1024 caps the context length (prompt plus generated tokens); see the request sketch below.
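
Once the server reports that it is ready, you can send a quick request through its OpenAI-compatible endpoint. A minimal sketch using the openai 1.x client from the environment above:

from openai import OpenAI

# vLLM does not check the API key by default, but the client requires one.
client = OpenAI(base_url="http://localhost:9109/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="qwen1.5",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)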

4 Testing

Open a new terminal session and activate the conda environment created above (conda activate vllm).

Download the test script:

wget https://mirrors.aheadai.cn/scripts/vllm-performance-stream-qwen1.5-long-4090-0.5.0.py

Download the test dataset:

wget https://mirrors.aheadai.cn/data/alpaca_gpt4_data_input_1k.json
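
Before running the benchmark, it can help to peek at the data. The sketch below only assumes the file is a JSON array of records:

import json

with open("alpaca_gpt4_data_input_1k.json") as f:
    data = json.load(f)

print(len(data), "records")  # the filename suggests about 1k entries
print(data[0])               # print one sample record to see its fields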

Run the test:

python3 ./vllm-performance-stream-qwen1.5-long-4090-0.5.0.py http://localhost:9109 > qwen1.5-7b-1gpu.log 2>&1

The test takes roughly one hour to complete.

The results are saved in the log file specified above.
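
The script name suggests it measures streaming performance, so it can also be instructive to watch a single streamed response by hand. A sketch using the same openai client as before:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9109/v1", api_key="EMPTY")
# stream=True yields chunks as tokens are generated.
stream = client.chat.completions.create(
    model="qwen1.5",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()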
