1. Overview

Qwen is the series of large language models and large multimodal models developed by the Qwen team at Alibaba Group. The language models have now been upgraded to Qwen2.5. Both the language and multimodal models are pretrained on large-scale multilingual and multimodal data and then post-trained on high-quality data to align with human preferences. Qwen offers a broad range of capabilities, including natural language understanding, text generation, visual understanding, audio understanding, tool use, role-play, and interaction as an AI agent.

The latest release, Qwen2.5, has the following features:

  • Easy-to-use decoder-only dense language models in 7 sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, each available as a base model and an instruction-tuned model ("B" stands for billion, so 72B means 72 billion parameters)
  • Pretrained on our latest large-scale dataset of up to 18T tokens ("T" stands for trillion, so 18T means 18 trillion)
  • Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured output, especially JSON
  • More resilient to diverse system prompts, improving role-play and condition-setting for chatbots
  • Context length of up to 128K tokens, with generation of up to 8K tokens
  • Support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more

In this tutorial, we demonstrate how to run inference with the Qwen2.5-7B-Instruct model.

2. Quick Start

2.1 Environment Setup

Hardware and software environment:

Category           Details
CPU                16 vCPU Intel(R) Xeon(R) Platinum 8481C
GPU                RTX 4090D * 1
GPU memory         24 GB
CUDA version       12.1
Operating system   Ubuntu 22.04.3 LTS
Python version     3.10
PyTorch version    2.3.1

Create a conda environment named qwen2_5 with Python 3.10, then activate it:

conda create -n qwen2_5 python=3.10
conda activate qwen2_5   

2.2 Install PyTorch and Other Libraries

conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers modelscope accelerate
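
As a quick sanity check, you can confirm that the installed PyTorch build can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If the environment is set up correctly, this prints the PyTorch version and True.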

2.3 Download the Model Weights from ModelScope

cd autodl-tmp
# create the download.py file
touch download.py

Open download.py and enter the following code:

from modelscope import snapshot_download
# the official docs recommend downloading via ModelScope from within mainland China

model_dir = snapshot_download('qwen/Qwen2.5-7B-Instruct', cache_dir='/root/autodl-tmp', revision='master')

Run download.py; the download takes roughly 20 minutes:

python download.py
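
Once the download completes, the weights are stored under /root/autodl-tmp/qwen/Qwen2___5-7B-Instruct (ModelScope sanitizes the "." in the model name, so the local directory name differs slightly from the model ID); this is the model_name path used by the scripts below. You can verify it with:

ls /root/autodl-tmp/qwen/Qwen2___5-7B-Instruct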

3. Inference Tutorial

The simplest way to use Qwen2.5 is to chat with it through the transformers library (the modelscope library can be used as well).

3.1 Basic Usage

Create a basic.py file:

touch basic.py

Open basic.py and enter the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
# or use the modelscope library instead
# from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# enter your question here
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
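# apply_chat_template wraps the messages in Qwen's chat format;
# add_generation_prompt=True appends the assistant turn marker so the model starts its reply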
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
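# keep only the newly generated tokens by stripping the prompt tokens from each sequence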
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Run basic.py to get the output:

python basic.py

The output looks like the following (each run may produce different text because of sampling; a note on the sampling parameters follows the output):

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.17s/it]
Sure! A large language model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are typically based on deep learning techniques, particularly transformer architectures, and are trained on vast amounts of textual data from the internet, books, articles, and other sources.

Key characteristics of LLMs include:

1. **Scale**: They consist of millions or even billions of parameters, which allows them to capture complex patterns in language.

2. **Training Data**: LLMs are trained on extensive datasets, enabling them to understand a wide range of topics and contexts.

3. **Generative Capabilities**: Once trained, these models can generate coherent text, answer questions, translate languages, summarize documents, and perform various natural language processing tasks.

4. **Context Understanding**: They can maintain context over long passages of text, making them effective for tasks that require understanding the nuances of language.

5. **Applications**: LLMs have numerous applications, including chatbots, content generation, language translation, virtual assistants, and more.

Notable examples of large language models include those developed by companies like Alibaba Cloud, such as Qwen, as well as models like GPT-3 from OpenAI and BERT from Google.
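
The generate call above relies on the model's default sampling settings. A minimal sketch of how they can be tuned per call, with illustrative values rather than recommendations:

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,      # sample instead of decoding greedily
    temperature=0.7,     # illustrative value; lower values make output more deterministic
    top_p=0.8,           # illustrative nucleus-sampling threshold
)

Setting do_sample=False switches to greedy decoding, which makes the output reproducible across runs.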

3.2 Batching

The usual transformers workflow also supports batched inference.

Create a batching.py file:

touch batching.py

Open batching.py and enter the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
# or use the modelscope library instead
# from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
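# left padding is required for batched generation: it aligns the prompts at the right edge,
# so the newly generated tokens directly continue each prompt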
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

message_batch = [
    [{"role": "user", "content": "Give me a detailed introduction to large language model."}],
    [{"role": "user", "content": "Hello!"}],
]
text_batch = tokenizer.apply_chat_template(
    message_batch,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs_batch = tokenizer(text_batch, return_tensors="pt", padding=True).to(model.device)

generated_ids_batch = model.generate(
    **model_inputs_batch,
    max_new_tokens=512,
)
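# with padding=True every row is padded to the same length, so one slice removes the prompt tokens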
generated_ids_batch = generated_ids_batch[:, model_inputs_batch.input_ids.shape[1]:]
response_batch = tokenizer.batch_decode(generated_ids_batch, skip_special_tokens=True)
print(response_batch)

Run batching.py to get the output:

python batching.py

The output looks like the following (each run may produce different text because of sampling):

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.23it/s]
['Certainly! Large Language Models (LLMs) are a class of artificial intelligence models that have been trained on massive amounts of text data to perform various natural language processing tasks. These models are designed to understand and generate human-like text, making them highly versatile tools in the field of natural language processing (NLP). Here’s a detailed introduction to LLMs:\n\n### 1. **Definition and Purpose**\nA large language model is a type of machine learning model that has been trained on extensive text corpora to predict the next word or phrase in a sequence given the preceding context. The primary purpose of these models is to understand and generate human language, which can be used for a wide range of applications such as text completion, translation, question-answering, summarization, and more.\n\n### 2. **Architecture**\nLarge language models typically follow a transformer architecture, which was introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. The transformer architecture consists of multiple layers of self-attention mechanisms, which allow the model to weigh the importance of different words in a sentence when predicting the next token.\n\n#### Key Components:\n- **Embeddings**: Each word in the input sequence is converted into a dense vector representation.\n- **Self-Attention Mechanism**: This mechanism allows the model to focus on different parts of the input sequence when generating the next token.\n- **Feed-Forward Layers**: These layers process the output from the self-attention mechanism and help in capturing complex patterns in the data.\n- **Normalization Layers**: These layers normalize the outputs of the feed-forward layers to ensure stable training.\n- **Output Layer**: This layer generates the probability distribution over the vocabulary for the next token.\n\n### 3. **Training Process**\nLLMs are trained using massive datasets, often containing trillions of words. The training process involves optimizing the model parameters to minimize the difference between the predicted probabilities and the actual labels in the training data. This is typically done using techniques like stochastic gradient descent (SGD) or Adam optimization algorithms.\n\n#### Training Data:\n- **Text Corpora**: These include books, articles, websites, and other textual content.\n- **Annotations**: In some cases, the data might be annotated with additional information, such as part-of-speech tags or named entities.\n\n### 4. **Applications**\nLLMs have a wide range of applications across various domains:\n\n- **Text Completion**: Autocomplete features in search engines, chatbots, and writing assistants.\n- **Translation', 'Hello! Nice to meet you. How can I assist you today?']

3.3 Performance Testing

To evaluate the model's performance, we prepared a test script that runs a batch of 100 questions through the model and measures throughput, first-token latency, per-token latency, end-to-end latency, GPU memory utilization, and power consumption; a measurement sketch follows the sample log below.

Download and run the test script:

wget https://mirrors.aheadai.cn/data/test-qwen2_5.py
python test-qwen2_5.py

This produces a log file, Qwen2.5_Inference_Batching_AutoDL4090D_log.txt, which is also available under 公共资源 | AheadAI.

A sample of its contents is shown below (for reference only):

INFO:root:First Token Latency: 29.9870 seconds
INFO:root:Per Token Latency: 0.0586 seconds
INFO:root:End-to-End Latency: 29.9928 seconds
INFO:root:Throughput: 17.0708 tokens/s
INFO:root:Memory Usage: 59 %
INFO:root:Power Consumption: 364.86 W
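
The test script itself is downloaded above rather than reproduced here. As a rough illustration, the sketch below shows one way first-token latency, end-to-end latency, and throughput could be measured with transformers' TextIteratorStreamer, reusing the model path from the earlier scripts; GPU memory utilization and power draw would need separate tooling such as nvidia-smi.

import time
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# run generate() in a background thread and stream decoded text back through the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(target=model.generate, kwargs=dict(**model_inputs, max_new_tokens=512, streamer=streamer))

start = time.perf_counter()
thread.start()
first_token_latency = None
pieces = []
for piece in streamer:
    if first_token_latency is None:
        first_token_latency = time.perf_counter() - start  # time until the first decoded piece arrives
    pieces.append(piece)
thread.join()
end_to_end = time.perf_counter() - start

n_tokens = len(tokenizer("".join(pieces)).input_ids)  # approximate count of generated tokens
print(f"First Token Latency: {first_token_latency:.4f} seconds")
print(f"End-to-End Latency: {end_to_end:.4f} seconds")
print(f"Throughput: {n_tokens / end_to_end:.4f} tokens/s")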

Note: for more details, see the official Qwen documentation: Hugging Face transformers - Qwen

Appendix: packages installed in the Python environment (pip list):

Package Version
accelerate 1.1.1
Brotli 1.0.9
certifi 2024.8.30
charset-normalizer 3.3.2
filelock 3.13.1
fsspec 2024.10.0
gmpy2 2.1.2
huggingface-hub 0.26.3
idna 3.7
Jinja2 3.1.4
MarkupSafe 2.1.3
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
modelscope 1.20.1
mpmath 1.3.0
networkx 3.3
numpy 1.26.4
packaging 24.2
pillow 11.0.0
pip 24.2
psutil 6.1.0
PySocks 1.7.1
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
safetensors 0.4.5
setuptools 75.1.0
sympy 1.13.2
tokenizers 0.21.0
torch 2.3.1
torchaudio 2.3.1
torchvision 0.18.1
tqdm 4.67.1
transformers 4.47.0
triton 2.3.1
typing_extensions 4.11.0
urllib3 2.2.3
wheel 0.44.0