LLM Inference Optimization Techniques
2025-04-01

1. Inference Optimization Overview

1.1 Latency and Throughput

graph TB
    A[Request] --> B[Prefill]
    B --> D[First-token latency]
    D --> C[Decode]
    C --> E[Generated tokens]
    style A fill:#90EE90
    style C fill:#FFD700
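| Phase | Characteristic | Optimization direction |
| --- | --- | --- |
| Prefill | Compute-bound | Operator fusion, Tensor Parallel |
| Decode | Memory-bandwidth-bound | KV Cache, Continuous Batching |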
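To make the compute-bound versus memory-bound split concrete, here is a rough sketch of arithmetic intensity (FLOPs per byte of weights read) for a single linear layer. The 4096 hidden size, the 512-token prompt, FP16 weights, and the function name are illustrative assumptions, and activation traffic is ignored.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for a [num_tokens, d_in] x [d_in, d_out] matmul."""
    flops = 2 * num_tokens * d_in * d_out           # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_weight  # weights are read once per forward pass
    return flops / weight_bytes


# Prefill: the whole 512-token prompt goes through the layer in one pass.
prefill = arithmetic_intensity(num_tokens=512, d_in=4096, d_out=4096)
# Decode: a single request contributes one new token per step.
decode = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)
print(prefill, decode)
```

Batching more decode requests into one step raises `num_tokens` for the same weight read, which is the lever that batching techniques pull on.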

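1.2 Optimization Techniques

| Dimension | Techniques |
| --- | --- |
| Latency | KV Cache, speculative decoding |
| Throughput | Continuous Batching, Prefill Batching |
| GPU memory | Quantization, PagedAttention |
| Distributed | Tensor Parallel, Pipeline Parallel |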
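Speculative decoding is only named in the table, so here is a minimal greedy sketch of the idea: a cheap draft model proposes `k` tokens and the target model verifies them. The two `*_next` callables are hypothetical stand-ins for real models, and production systems verify all positions in one batched forward pass and use a probabilistic acceptance rule rather than exact matching.

```python
def speculative_decode(target_next, draft_next, prompt, num_new, k=4):
    """Greedy speculative decoding; the result matches plain greedy decoding with the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_new:
        # The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # The target model checks each proposed position in order.
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new]


# Toy "models": the target counts upward mod 10; the draft is wrong right after a 4.
target = lambda t: (t[-1] + 1) % 10
draft = lambda t: 0 if t[-1] == 4 else (t[-1] + 1) % 10
print(speculative_decode(target, draft, prompt=[0], num_new=6))
```

Every loop iteration makes progress of at least one token, so a poor draft model costs speed but never changes the output.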

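2. Continuous Batching

2.1 The Problem with Traditional Batching

import time

# Static batching waits for a full batch before running anything,
# so short requests end up waiting for the longest request to finish.
class StaticBatching:
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def process_batch(self, requests):
        # Block until the batch is full (requests is filled by another thread)
        while len(requests) < self.batch_size:
            time.sleep(0.1)
        # _run_batch stands for the model forward loop and is not shown here
        return self._run_batch(requests)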

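2.2 How Continuous Batching Works

class ContinuousBatching:
    """Requests are queued as soon as they arrive."""

    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.pending = []
        self.running = []

    def add_request(self, request):
        self.pending.append(request)

    def step(self):
        # Dynamic scheduling: retire finished requests to free their slots
        finished = [r for r in self.running if r.is_done()]
        self.running = [r for r in self.running if not r.is_done()]
        # Fill the free slots from the pending queue
        free = self.max_batch - len(self.running)
        admitted, self.pending = self.pending[:free], self.pending[free:]
        for req in admitted:
            req.start()
            self.running.append(req)
        # Finished requests return immediately while new ones join the batch
        return [r.result() for r in finished]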
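A small simulation shows why refilling slots matters. It counts decode steps for the same workload under both policies; the output lengths are made up, and one step is assumed to produce one token for every running request.

```python
def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a finished request's slot is refilled right away."""
    pending = list(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while pending or running:
        # Admit new requests into any free slots.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        # One decode step yields one token per running request.
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]  # output lengths in tokens (made-up workload)
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

The gap widens as output lengths become more uneven, since static batching pays for the longest request in every batch.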

2.3 vLLM Configuration

# vLLM configuration
engine_config:
  max_model_len: 4096
  tensor_parallel_size: 2
  gpu_memory_utilization: 0.9
  pipeline_parallel_size: 1
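As a sketch of how these settings could be passed to vLLM's offline `LLM` class, the same keys can be used as keyword arguments. The `build_engine` and `gpus_required` helpers are hypothetical names, and actually constructing the engine needs vLLM installed plus the GPUs the config asks for.

```python
# Keyword arguments mirroring the config above; key names follow vLLM's engine arguments.
ENGINE_CONFIG = {
    "max_model_len": 4096,
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": 0.9,
    "pipeline_parallel_size": 1,
}


def gpus_required(config):
    """Number of GPUs the engine occupies: tensor-parallel times pipeline-parallel degree."""
    return config["tensor_parallel_size"] * config["pipeline_parallel_size"]


def build_engine(model_name, config=ENGINE_CONFIG):
    """Create an offline vLLM engine (hypothetical helper, not called here)."""
    from vllm import LLM  # imported lazily so this file loads without vLLM installed
    return LLM(model=model_name, **config)


print(gpus_required(ENGINE_CONFIG))
```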

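3. KV Cache

3.1 Cache Structure

import torch

class KVCache:
    """KV cache storage for a single attention head."""

    def __init__(self, max_seq_len, head_dim):
        # Preallocate one row per position up to max_seq_len
        self.k_cache = torch.zeros(max_seq_len, head_dim)
        self.v_cache = torch.zeros(max_seq_len, head_dim)

    def update(self, position, k, v):
        # Store the key and value computed for the token at this position
        self.k_cache[position] = k
        self.v_cache[position] = v

    def get(self, start, end):
        return self.k_cache[start:end], self.v_cache[start:end]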
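Preallocating up to the maximum sequence length gets expensive quickly. A back-of-the-envelope sketch of the cache size, using the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) and assuming an FP16 cache; the function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # leading 2 = K and V


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_request = per_token * 4096  # one request filling a 4096-token context
print(per_token / 2**20, "MiB per token")
print(per_request / 2**30, "GiB per 4096-token request")
```

At this size a handful of full-length requests already competes with the model weights for GPU memory, which is what motivates allocating the cache in blocks instead of per-request maximums.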

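3.2 PagedAttention

from math import ceil

# Paged cache management as used by vLLM
class PagedKVCache:
    """A virtual cache backed by fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # request ID -> physical blocks

    def allocate(self, request_id, seq_len):
        # Round up: a partially filled block still occupies a whole block
        num_blocks = ceil(seq_len / self.block_size)
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.block_table[request_id] = blocks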
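During decoding the sequence grows one token at a time, so a paged cache would typically take a new block only when the last one fills up, and give blocks back when a request finishes. The following is a toy sketch of that bookkeeping; the class and method names are hypothetical and not vLLM's actual API.

```python
class BlockAllocator:
    """Toy block allocator: blocks are handed out on demand and returned on free."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # request ID -> list of physical block IDs
        self.seq_lens = {}     # request ID -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new block only when needed."""
        blocks = self.block_table.setdefault(request_id, [])
        seq_len = self.seq_lens.get(request_id, 0)
        if seq_len == len(blocks) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            blocks.append(self.free_blocks.pop())
        self.seq_lens[request_id] = seq_len + 1

    def free(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_table.pop(request_id, []))
        self.seq_lens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):  # 17 tokens need 2 blocks of 16
    alloc.append_token("req-a")
print(alloc.block_table["req-a"], len(alloc.free_blocks))
alloc.free("req-a")
print(len(alloc.free_blocks))
```

With this scheme the memory wasted per request is bounded by one partially filled block, rather than by the gap between the actual and the maximum sequence length.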

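4. Quantization

4.1 How Quantization Works

import torch

# INT8 quantization
class Int8Quantizer:
    def quantize(self, weights):
        # Symmetric per-tensor scale: the largest magnitude maps to 127
        scale = weights.abs().max() / 127
        quantized = (weights / scale).round().clamp(-127, 127).to(torch.int8)
        return quantized, scale

    def dequantize(self, quantized, scale):
        # Recover approximate floating-point weights
        return quantized.float() * scale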
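The same absmax scheme can be traced by hand on a few numbers. This dependency-free sketch uses plain lists instead of tensors, with made-up weights, and checks how far the round trip drifts from the originals.

```python
def quantize_absmax(values, bits=8):
    """Symmetric absmax quantization of a list of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:            # all-zero input: any scale works
        scale = 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 1.27, 0.0]  # made-up values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Rounding to the nearest integer bounds the per-weight error by half the scale, so a single large outlier in the tensor inflates the error for every other weight. That is the usual argument for per-channel or per-group scales.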

4.2 GPTQ Quantization

# GPTQ quantization command
python gptq.py \
    --model meta-llama/Llama-2-7b \
    --bits 4 \
    --dataset wikitext2 \
    --desc_act  # act-order: quantize columns in order of decreasing activation magnitude

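4.3 AWQ Quantization

# Activation-aware Weight Quantization (AWQ)
class AWQ:
    def find_scale(self, weights, activations):
        # Heuristic per-layer scaling factor: mean activation magnitude over
        # mean weight magnitude (the full method searches for the best scale)
        scales = {}
        for name, w in weights.items():
            act = activations[name]  # layer name -> calibration activations
            scales[name] = act.abs().mean() / w.abs().mean()
        return scales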
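Why scaling helps can be shown on a two-channel toy example: multiplying a salient weight by `s` before quantization and dividing its activation by `s` leaves the exact product unchanged but spends more of the integer range on that weight. The numbers and the hand-picked scale below are illustrative assumptions, and `fake_quantize` is a simplified stand-in for a real INT4 kernel.

```python
def fake_quantize(weights, bits=4):
    """Quantize with one absmax scale, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))


x = [10.0, 0.1]  # channel 0 sees large activations
w = [0.1, 1.0]   # but its weight is small, so INT4 rounding hits it hard
s = [4.0, 1.0]   # hand-picked per-channel scale (AWQ searches for this)

exact = dot(x, w)
plain = dot(x, fake_quantize(w))
# Scale weights up by s before quantizing and activations down by s.
scaled = dot([xi / si for xi, si in zip(x, s)],
             fake_quantize([wi * si for wi, si in zip(w, s)]))
print(abs(plain - exact), abs(scaled - exact))
```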

5. Tensor Parallel

5.1 How Tensor Parallelism Works

graph TB
    A[Input] --> T1[Tensor Slice 1]
    A --> T2[Tensor Slice 2]
    T1 --> O1[Output 1]
    T2 --> O2[Output 2]
    O1 --> F[AllReduce]
    O2 --> F
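The diagram corresponds to a row-parallel split: each device holds a slice of the weight rows, computes a partial output, and an AllReduce sums the partials. A minimal single-process sketch with plain lists; the function names are illustrative, and the inner dimension is assumed to divide evenly across shards.

```python
def matvec(x, W):
    """y = x @ W for a vector x (length k) and a matrix W (k rows, n columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def row_parallel_matvec(x, W, num_shards):
    """Split W by rows across shards; each shard produces a partial sum of y."""
    shard = len(x) // num_shards  # assumes len(x) is divisible by num_shards
    partials = []
    for rank in range(num_shards):
        lo, hi = rank * shard, (rank + 1) * shard
        partials.append(matvec(x[lo:hi], W[lo:hi]))  # local compute on one device
    # AllReduce(sum): every rank ends up with the element-wise sum of the partials.
    return [sum(vals) for vals in zip(*partials)]


x = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print(matvec(x, W), row_parallel_matvec(x, W, num_shards=2))
```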

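5.2 Megatron-LM Configuration

# Tensor parallel configuration (schematic; Megatron-LM itself sets the
# degree with --tensor-model-parallel-size)
tensor_parallel:
  enabled: true
  size: 8  # parallel across 8 GPUs
  method: ring
  # or method: all-reduce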

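5.3 Communication Optimization

import torch.distributed as dist

# AllReduce communication optimization
class AllReduceOptimizer:
    def all_reduce(self, tensor, op=dist.ReduceOp.SUM):
        # Asynchronous communication: launch the collective without blocking so
        # that independent compute can overlap with it. The process group must
        # already be initialized, and the caller must call handle.wait()
        # before reading tensor.
        return dist.all_reduce(tensor, op=op, async_op=True)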
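For intuition about what a ring AllReduce does under the hood, here is a single-process simulation of its two phases, reduce-scatter followed by all-gather. It is a teaching sketch, not how NCCL is implemented: each rank's vector is assumed to have exactly one element per rank, so that a "chunk" is a single number.

```python
def ring_all_reduce(buffers):
    """Sum-all-reduce over a simulated ring; buffers[r] is rank r's vector."""
    n = len(buffers)
    data = [list(b) for b in buffers]  # work on copies
    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends "simultaneously" within a step.
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each rank sends 2(n-1) chunks of 1/n of the vector, so the traffic per rank stays roughly constant as the number of devices grows.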

6. Deploying an Inference Service

6.1 Deploying with vLLM

# Start the vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 4096

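6.2 Deploying with TensorRT-LLM

# Build the engine
# The flags here are schematic: trtllm-build itself takes --checkpoint_dir and
# --output_dir, and the tensor-parallel degree is set when converting the checkpoint.
trtllm-build --model meta-llama/Llama-2-7b \
    --output /tmp/model.engine \
    --tp 2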

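6.3 OpenAI-Compatible API

# Request format
import openai

# The client requires an API key; the server ignores it unless started with --api-key
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b",  # must match the model name the server was started with
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
print(response.choices[0].message.content)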
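To tie the API back to the latency metrics from section 1.1, first-token latency and decode rate can be measured on any streamed response. The helper below is a sketch with hypothetical names. It times a generic iterator of chunks, and `fake_stream` stands in for a real server so the example runs on its own.

```python
import time


def measure_stream(chunks):
    """Consume streamed chunks; return (ttft_seconds, chunks_per_second, chunk_count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first chunk marks the end of prefill
        count += 1
    end = time.perf_counter()
    if first is None:                    # empty stream
        return None, 0.0, 0
    decode_time = end - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, rate, count


def fake_stream():
    """Stand-in for a streaming response; the sleeps mimic prefill and decode latency."""
    time.sleep(0.05)                     # "prefill"
    for token in ["Hel", "lo", "!"]:
        yield token
        time.sleep(0.01)                 # one "decode" step


ttft, rate, count = measure_stream(fake_stream())
print(f"TTFT {ttft:.3f}s, {rate:.1f} chunks/s over {count} chunks")
```

With the OpenAI client, passing `stream=True` to `client.chat.completions.create(...)` returns an iterator of chunks that can be fed to `measure_stream` directly; chunks only approximate tokens.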

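7. Summary

| Technique | Effect |
| --- | --- |
| Continuous Batching | Throughput up to 10x+ over static batching, depending on workload |
| PagedAttention | Higher GPU memory utilization |
| INT8/INT4 quantization | Weight memory roughly 1/2 (INT8) or 1/4 (INT4) of FP16 |
| Tensor Parallel | Multi-GPU acceleration |
| FlashAttention | Faster attention computation |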


LLM Inference Optimization Techniques
https://blog.souloss.com/posts/ai-engineering/llm-inference-optimization/
Author: Souloss
Published: 2025-04-01
License: CC BY-NC-SA 4.0
