Inference Optimization Techniques for Large Models
1. Overview of Inference Optimization
1.1 Latency and Throughput
```mermaid
graph TB
A[Request] --> B[Prefill]
B --> D[First-token latency]
D --> C[Decode]
C --> E[Generated tokens]
style A fill:#90EE90
style C fill:#FFD700
```
| Phase | Characteristic | Optimization direction |
|---|---|---|
| Prefill | Compute-bound | Operator fusion, Tensor Parallel |
| Decode | Memory-bandwidth-bound | KV Cache, Continuous Batching |
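One way to see why the two phases differ is arithmetic intensity, the ratio of FLOPs performed to bytes moved. The sketch below estimates it for a single square FP16 linear layer. The function name `arithmetic_intensity` and the layer width 4096 are illustrative assumptions, and the model ignores attention and on-chip caching.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for y = x @ W, with x of shape (tokens, d_model)
    and W of shape (d_model, d_model)."""
    flops = 2 * tokens * d_model * d_model               # one multiply and one add per weight per token
    weight_bytes = d_model * d_model * bytes_per_elem    # W is read once
    act_bytes = 2 * tokens * d_model * bytes_per_elem    # read x, write y
    return flops / (weight_bytes + act_bytes)


# Prefill: many prompt tokens share one read of the weights.
prefill = arithmetic_intensity(tokens=2048, d_model=4096)
# Decode: a single new token still has to read every weight.
decode = arithmetic_intensity(tokens=1, d_model=4096)

print(prefill)  # 1024.0
print(decode)   # about 1.0
```

At roughly one FLOP per byte, decode is limited by memory bandwidth, which is why batching more sequences per weight read raises throughput.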
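1.2 Optimization Techniques
| Optimization dimension | Techniques |
|---|---|
| Latency | KV Cache, speculative decoding |
| Throughput | Continuous Batching, Prefill Batching |
| GPU memory | Quantization, PagedAttention |
| Distributed | Tensor Parallel, Pipeline Parallel |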
2. Continuous Batching
2.1 Problems with Traditional Batching
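```python
import time

# Traditional (static) batching waits for a full batch before running,
# and short requests wait for the longest request in the batch to finish.
class StaticBatching:
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def process_batch(self, requests):
        # Wait until the batch is full. `requests` is assumed to be filled
        # by another thread as requests arrive.
        while len(requests) < self.batch_size:
            time.sleep(0.1)
        # `_run_batch` (not shown) runs the whole batch to completion.
        return self._run_batch(requests)
```
2.2 How Continuous Batching Works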
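A toy simulation makes the difference concrete. It counts decode iterations only: under static batching each batch runs until its longest request ends, while under continuous batching a freed slot is refilled on the next iteration. The request lengths are made up, both function names are hypothetical, and prefill cost is ignored.

```python
def static_batching_steps(lengths, batch_size):
    """Decode iterations when every batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode iterations when finished requests are replaced immediately."""
    pending = list(lengths)   # remaining tokens per waiting request
    running = []              # remaining tokens per active request
    steps = 0
    while pending or running:
        # Fill any free slots from the queue.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        # One iteration: every active request emits one token.
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8, 2, 2]  # tokens to generate per request (assumed)
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

With 24 tokens to generate and two slots, 12 iterations is the minimum, so the continuous schedule wastes no slot in this example.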
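```python
class ContinuousBatching:
    """Enqueue each request as soon as it arrives."""

    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.pending = []
        self.running = []

    def add_request(self, request):
        self.pending.append(request)

    def step(self):
        # Finished requests return immediately and free their slots.
        finished = [r for r in self.running if r.is_done()]
        self.running = [r for r in self.running if not r.is_done()]

        # Dynamic scheduling: admit pending requests into the free slots.
        free_slots = self.max_batch - len(self.running)
        admitted = self.pending[:free_slots]
        self.pending = self.pending[free_slots:]
        for req in admitted:
            req.start()
            self.running.append(req)

        return [r.result() for r in finished]
```
2.3 vLLM Configuration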
```yaml
# vLLM configuration
engine_config:
  max_model_len: 4096
  tensor_parallel_size: 2
  gpu_memory_utilization: 0.9
  pipeline_parallel_size: 1
```
3. KV Cache
3.1 Cache Structure
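```python
import torch

MAX_SEQ_LEN = 4096  # example capacity (assumed value)
HEAD_DIM = 128      # example head dimension (assumed value)

class KVCache:
    """KV cache storage for a single attention head."""

    def __init__(self):
        self.k_cache = torch.zeros(MAX_SEQ_LEN, HEAD_DIM)
        self.v_cache = torch.zeros(MAX_SEQ_LEN, HEAD_DIM)

    def update(self, position, k, v):
        # Store the key and value vectors of the token at `position`.
        self.k_cache[position] = k
        self.v_cache[position] = v

    def get(self, start, end):
        # Return cached keys and values for positions [start, end).
        return self.k_cache[start:end], self.v_cache[start:end]
```
3.2 Paged Attention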
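The case for paging is easiest to see from a memory estimate. The sketch below computes the KV cache footprint per token and compares how many token slots are reserved when every request gets a contiguous region sized for the maximum length versus fixed-size blocks allocated on demand. The model shape (32 layers, 32 KV heads, head dimension 128, FP16) is a Llama-2-7B-like assumption, the sequence lengths are made up, and all function names are hypothetical.

```python
from math import ceil


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2: one key vector and one value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def reserved_tokens_contiguous(seq_lens, max_model_len):
    # Each request reserves room for the longest possible sequence.
    return len(seq_lens) * max_model_len


def reserved_tokens_paged(seq_lens, block_size=16):
    # Each request reserves whole blocks; only the last block can be partly empty.
    return sum(ceil(n / block_size) * block_size for n in seq_lens)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
seq_lens = [100, 700, 30]  # current lengths of three requests (assumed)

print(per_token)                                       # 524288 bytes, i.e. 512 KiB
print(reserved_tokens_contiguous(seq_lens, 4096))      # 12288
print(reserved_tokens_paged(seq_lens, block_size=16))  # 848
```

The three requests hold 830 tokens in total, so block-level allocation wastes at most one partial block per request, while contiguous reservation leaves most of the reserved memory idle.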
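```python
from math import ceil

# Paged cache management in the style of vLLM
class PagedKVCache:
    """Map each request's logical cache onto fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # request ID -> list of physical block IDs

    def allocate(self, request_id, seq_len):
        num_blocks = ceil(seq_len / self.block_size)
        if num_blocks > len(self.free_blocks):
            raise MemoryError("not enough free KV cache blocks")
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.block_table[request_id] = blocks
        return blocks
```
4. Quantization Techniques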
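As a rough guide to what quantization saves, weight memory scales linearly with bits per parameter. The sketch below assumes a model with 7 billion parameters and ignores the small overhead of scales and zero points; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # assumed parameter count

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions the weights take about 13.0 GiB in FP16, 6.5 GiB in INT8, and 3.3 GiB in INT4. Activations and the KV cache are not reduced by weight-only quantization.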
4.1 Quantization Principle
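```python
import torch

# INT8 quantization
class Int8Quantizer:
    def quantize(self, weights):
        # Symmetric per-tensor scale: the largest magnitude maps to 127.
        # The clamp avoids division by zero for an all-zero tensor.
        scale = weights.abs().max().clamp(min=1e-8) / 127
        quantized = (weights / scale).round().to(torch.int8)
        return quantized, scale

    def dequantize(self, quantized, scale):
        return quantized.float() * scale
```
4.2 GPTQ Quantization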
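GPTQ stores weights as low-bit integers with a scale and zero point per group, and picks the roundings so that the layer output error stays small. The sketch below shows only the storage format, using plain round-to-nearest on groups of weights; the error-compensating step that distinguishes GPTQ is not implemented. All function names, the group size of 4, and the sample weights are illustrative assumptions.

```python
def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0:
        scale = 1.0  # constant group: every code is 0 and `lo` restores the value
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    return [lo + c * scale for c in codes]


def quantize_groupwise(weights, group_size=4, bits=4):
    """Quantize a flat weight list group by group; each group keeps its own scale."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


weights = [0.0, 0.3, -0.6, 0.9, 2.0, 2.0, 2.0, 2.0]
for codes, scale, lo in quantize_groupwise(weights):
    print(codes, dequantize_group(codes, scale, lo))
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points.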
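```bash
# GPTQ quantization command
python gptq.py \
  --model meta-llama/Llama-2-7b \
  --bits 4 \
  --dataset wikitext2 \
  --desc_act  # act-order: quantize columns in order of decreasing activation size
```
4.3 AWQ Quantization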
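AWQ relies on a simple identity: multiplying a weight's input channel by a factor s while dividing the matching activation by s leaves the layer output unchanged. Channels that see large activations can therefore be scaled up before rounding so they lose less precision. The sketch below checks only that identity on a tiny example; the helper names and values are illustrative assumptions, and no quantization is performed.

```python
def matvec(x, W):
    """y[j] = sum_i x[i] * W[i][j], with W stored as rows of input channels."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def scale_input_channels(W, s):
    """Multiply every weight of input channel i by s[i]."""
    return [[w * s[i] for w in row] for i, row in enumerate(W)]


x = [4.0, 0.5]                  # channel 0 carries the large activation
W = [[1.0, 2.0], [3.0, 4.0]]
s = [2.0, 1.0]                  # scale up the salient channel only

x_scaled = [xi / si for xi, si in zip(x, s)]
W_scaled = scale_input_channels(W, s)

print(matvec(x, W))                # [5.5, 10.0]
print(matvec(x_scaled, W_scaled))  # [5.5, 10.0]
```

In a real pipeline the division by s is typically folded into the preceding operator, so the rescaling adds no cost at inference time.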
```python
# Activation-aware Weight Quantization (AWQ)
class AWQ:
    def find_scale(self, weights, activations):
        # Simplified heuristic for a scaling factor: the ratio of mean
        # activation magnitude to mean weight magnitude, one per layer.
        scales = {}
        for name, w in weights.items():
            act = activations[name]
            scales[name] = act.abs().mean() / w.abs().mean()
        return scales
```
5. Tensor Parallel
5.1 How Tensor Parallelism Works
```mermaid
graph TB
A[Input] --> T1[Tensor Slice 1]
A --> T2[Tensor Slice 2]
T1 --> O1[Output 1]
T2 --> O2[Output 2]
O1 --> F[AllReduce]
O2 --> F
```
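The flow in the diagram can be sketched with a row-parallel matrix product: each device holds a slice of the weight rows and the matching slice of the input, computes a partial output, and an AllReduce sums the partials. The version below simulates the devices in a single process with plain lists; the function names and values are illustrative assumptions.

```python
def matvec(x, W):
    """y[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def row_parallel_matvec(x, W, num_shards):
    """Split the input dimension across shards, then sum the partial outputs."""
    assert len(x) % num_shards == 0, "input dimension must divide evenly"
    shard = len(x) // num_shards
    partials = []
    for rank in range(num_shards):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each simulated device sees only its slice of x and of W's rows.
        partials.append(matvec(x[lo:hi], W[lo:hi]))
    # AllReduce(sum): combine the partial outputs element by element.
    return [sum(column) for column in zip(*partials)]


x = [1, 2, 3, 4]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]

print(matvec(x, W))                  # [12, 5]
print(row_parallel_matvec(x, W, 2))  # [12, 5]
```

Each device stores only 1/N of the weights, and the price is one AllReduce per sharded layer.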
5.2 Megatron-LM Configuration
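```yaml
# Tensor Parallel configuration (illustrative schema; Megatron-LM itself
# sets the degree with the --tensor-model-parallel-size flag)
tensor_parallel:
  enabled: true
  size: 8        # parallel across 8 GPUs
  method: ring   # or method: all-reduce
```
5.3 Communication Optimization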
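A ring AllReduce runs in two phases: a reduce-scatter, in which each worker ends up holding the full sum of one chunk, and an all-gather, which circulates those sums around the ring. The sketch below simulates both phases in one process, with one number standing in for each chunk. `ring_all_reduce` is a hypothetical name, and real implementations such as NCCL overlap these transfers rather than running them in a loop.

```python
def ring_all_reduce(buffers):
    """Simulate a sum AllReduce over n workers, each holding n chunks."""
    n = len(buffers)
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    # Phase 1, reduce-scatter: at each step worker i sends one chunk to
    # worker i+1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value
    # Now worker i holds the complete sum of chunk (i + 1) % n.

    # Phase 2, all-gather: pass the completed chunks around the ring,
    # overwriting the stale partial values.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value
    return data


workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(workers))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

Each worker sends 2(n-1) chunks of size 1/n of the buffer, so the traffic per worker stays close to twice the buffer size no matter how many workers join the ring.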
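```python
import torch.distributed as dist

# AllReduce communication optimization
class AllReduceOptimizer:
    def all_reduce(self, tensor, op=dist.ReduceOp.SUM):
        # Asynchronous communication: launch the collective without blocking
        # so that independent computation can overlap with it. Requires an
        # initialized process group; call handle.wait() before reading `tensor`.
        return dist.all_reduce(tensor, op=op, async_op=True)
```
6. Inference Service Deployment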
6.1 vLLM Deployment
```bash
# Start the vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 4096
```
6.2 TensorRT-LLM Deployment
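```bash
# Build the model engine
# Flag names differ between TensorRT-LLM releases; confirm with `trtllm-build --help`.
trtllm-build --model meta-llama/Llama-2-7b \
  --output /tmp/model.engine \
  --tp 2
```
6.3 OpenAI-Compatible API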
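An OpenAI-compatible server accepts a JSON POST at `/v1/chat/completions`, so the SDK is optional. The sketch below only builds such a request with the standard library and does not send it. `build_chat_request` is a hypothetical helper, and the URL and model name are assumptions about a local deployment.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=100):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",   # assumed local server address
    "meta-llama/Llama-2-7b",      # assumed served model name
    "Hello",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it once a server is listening at that address.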
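```python
# Request format
import openai

# The client requires a non-empty API key even when the local server does not check it.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b",  # must match the model name the server was started with
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
```
7. Summary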
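| Technique | Effect |
|---|---|
| Continuous Batching | Throughput gains that can exceed 10x, depending on workload |
| PagedAttention | Higher GPU memory utilization |
| INT8/INT4 quantization | Weight memory cut to roughly 1/2 (INT8) or 1/4 (INT4) of FP16 |
| Tensor Parallel | Multi-GPU speedup |
| FlashAttention | Faster attention computation |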