Fine-tuning and Model Adaptation Techniques
2025-04-05

1. Why Fine-tuning Is Needed

1.1 Pre-training vs. Fine-tuning

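flowchart LR
    A[Large-scale pre-training] --> B[General capabilities]
    B --> C[Domain-specific fine-tuning]
    C --> D[Specialized capabilities]

| Stage | Goal | Resource requirements |
| --- | --- | --- |
| Pre-training | Learn general knowledge | GPU clusters on the order of ten thousand cards |
| SFT (supervised fine-tuning) | Instruction following | One to several GPUs |
| RLHF | Align with human preferences | Human-annotated data |
| DPO/PO | Direct preference optimization | Preference data |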

1.2 Full Fine-tuning vs. Parameter-Efficient Fine-tuning

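from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Full fine-tuning: every parameter is updated
# (the "-hf" repository holds the Transformers-format weights)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
for param in model.parameters():
    param.requires_grad = True

# LoRA: freeze the base weights and train only the low-rank matrices
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
for param in base_model.parameters():
    param.requires_grad = False  # get_peft_model also freezes the base weights itself

# Only the LoRA parameters are trained
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)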
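To get a feel for how small the trainable part is, the sketch below counts the parameters that a LoRA configuration like the one above would add. It assumes the published Llama-2-7B shape (32 decoder layers, hidden size 4096, square `q_proj` and `v_proj` projections) and a base model of roughly 6.74 billion parameters. None of these numbers come from the article, and the helper `lora_param_count` is illustrative rather than part of PEFT.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Assumed Llama-2-7B dimensions (not stated in the article itself).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES = ["q_proj", "v_proj"]  # both assumed to be 4096 x 4096
BASE_PARAMS = 6_738_415_616  # approximate size of the frozen base model

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, rank=8)
trainable = per_module * len(TARGET_MODULES) * NUM_LAYERS
fraction = trainable / (BASE_PARAMS + trainable)

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of all parameters:   {fraction:.4%}")
```

Under these assumptions the adapter holds about 4.2 million parameters, well below 0.1% of the model. The share rises when more modules are targeted or the rank is increased.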

2. How LoRA Works

2.1 Low-Rank Decomposition

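import torch
import torch.nn as nn

# Core idea of LoRA
# W ∈ R^(d×k): original weight, kept frozen
# W' = W + ΔW = W + BA
# B ∈ R^(d×r), A ∈ R^(r×k), r << min(d, k)
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.original = nn.Linear(in_features, out_features, bias=False)
        self.original.weight.requires_grad = False  # the pretrained weight is not trained
        # LoRA matrices, stored transposed relative to the notation above so that
        # x @ lora_A @ lora_B equals x (BA)^T; lora_B starts at zero, so ΔW = 0 initially
        self.lora_A = nn.Parameter(torch.randn(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        return self.original(x) + x @ self.lora_A @ self.lora_B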
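Because the update is just the matrix product BA, it can be folded into W once training is done, which is why LoRA adds no cost at inference time. The sketch below checks this on a toy example using the W' = W + BA notation from the comments above. It is written in plain Python (the `matmul` and `add` helpers are illustrative) so that it runs without any dependencies; real code would use tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: d = 2 outputs, k = 3 inputs, rank r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]      # frozen weight, shape (d, k)
B = [[0.5], [-0.5]]         # shape (d, r)
A = [[1.0, 2.0, 3.0]]       # shape (r, k)
x = [[1.0], [1.0], [1.0]]   # one input column, shape (k, 1)

# Training-time path: base output plus the low-rank correction.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Deployment path: fold BA into W once, then use a single matmul.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(unmerged)
print(merged)
```

Both paths produce the same output, so a merged model can be served exactly like the original one.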

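2.2 Training Efficiency Comparison

| Method | Trainable parameters | GPU memory | Training time |
| --- | --- | --- | --- |
| Full fine-tuning, 7B | 7B | 80 GB+ | Days |
| LoRA, 7B | Tens of MB of adapter weights | ~20 GB | Hours |
| QLoRA, 7B | Tens of MB of adapter weights | ~8 GB | Hours |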
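The memory column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision Adam for full fine-tuning (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments, 16 bytes per parameter in total) and ignores activations, the KV cache and framework overhead, so the results are lower bounds, not measurements.

```python
GB = 1e9  # decimal gigabytes


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights at a given precision."""
    return num_params * bits_per_param / 8 / GB


def full_finetune_memory_gb(num_params: float) -> float:
    """fp16 weights + fp16 grads + fp32 master weights + two fp32 Adam moments."""
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return num_params * bytes_per_param / GB


params = 7e9
print(f"fp16 weights (LoRA base):   {weight_memory_gb(params, 16):.1f} GB")
print(f"4-bit weights (QLoRA base): {weight_memory_gb(params, 4):.1f} GB")
print(f"full fine-tuning state:     {full_finetune_memory_gb(params):.1f} GB")
```

The estimate shows why full fine-tuning of a 7B model does not fit on one 80 GB card without sharding or offloading, while a 4-bit base model leaves room for adapters and activations on a much smaller GPU.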

3. QLoRA Quantization

3.1 How Quantization Works

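import torch

# The 16 NF4 levels, rounded to 4 decimals
NF4_LEVELS = torch.tensor([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def quantize_to_nf4(normalized):
    # Map every value to the index of its nearest NF4 level
    return (normalized.unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1).to(torch.uint8)

# 4-bit NormalFloat quantization
# (one absmax for the whole tensor here; QLoRA computes it per block of weights)
class NF4Quant:
    def __init__(self, bits=4):
        self.bits = bits

    def quantize(self, weights):
        # 1. Compute absmax
        absmax = weights.abs().max()
        # 2. Normalize to [-1, 1]
        normalized = weights / absmax
        # 3. Quantize to 4 bits
        qweight = quantize_to_nf4(normalized)
        return qweight, absmax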
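The class above scales the whole tensor by a single absmax. QLoRA instead quantizes in small blocks, each with its own absmax, so that one outlier does not wipe out the resolution of every other weight. The dependency-free sketch below illustrates the difference. The 16 levels are the NF4 constants reported for QLoRA, rounded to four decimals and best treated as an assumption here; the block size of 4 is chosen only to keep the example readable, and all function names are illustrative.

```python
# NF4 levels rounded to 4 decimals (assumed values, see the note above).
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(block):
    """Return (4-bit indices, absmax scale) for one block of floats."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(v / absmax - NF4_LEVELS[i]))
        for v in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax):
    """Look each index up in the codebook and undo the scaling."""
    return [NF4_LEVELS[i] * absmax for i in indices]


def quantize_blockwise(weights, block_size=4):
    """Quantize consecutive blocks independently."""
    return [quantize_block(weights[s:s + block_size])
            for s in range(0, len(weights), block_size)]


weights = [0.02, -0.01, 0.005, 0.03, 2.0, -1.5, 0.4, 0.0]

blocks = quantize_blockwise(weights, block_size=4)
restored = [v for idx, scale in blocks for v in dequantize_block(idx, scale)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Same weights with one scale for everything (block covers the whole list).
per_tensor = quantize_blockwise(weights, block_size=8)
restored_per_tensor = dequantize_block(*per_tensor[0])

print("block-wise:", [round(v, 4) for v in restored])
print("per-tensor:", [round(v, 4) for v in restored_per_tensor])
```

With a single scale the four small weights all collapse to zero, because the outlier 2.0 stretches the grid. With per-block scales they keep their sign and rough magnitude.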

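3.2 QLoRA Training

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

# Quantize the base model with bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
)
# PEFT helper that readies a quantized model for training
model = prepare_model_for_kbit_training(model)
# Add LoRA on top of the frozen 4-bit weights (lora_config as defined in section 1.2)
model = get_peft_model(model, lora_config)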

4. Adapter Tuning

4.1 Adapter Architecture

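import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project up, add the residual."""

    def __init__(self, d_model, adapter_size=64):
        super().__init__()
        self.down_proj = nn.Linear(d_model, adapter_size)
        self.up_proj = nn.Linear(adapter_size, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # The residual connection keeps the layer close to identity at the start of training
        return x + self.up_proj(self.act(self.down_proj(x)))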
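The cost of an adapter follows directly from its two linear layers. The sketch below counts the weights and biases for the module above. The model shape (hidden size 4096, 32 layers) and the placement of two adapters per layer are assumptions made for illustration, not details given in the article.

```python
def adapter_param_count(d_model: int, adapter_size: int = 64) -> int:
    """Weights and biases of the down- and up-projection layers."""
    down = d_model * adapter_size + adapter_size  # nn.Linear(d_model, adapter_size)
    up = adapter_size * d_model + d_model         # nn.Linear(adapter_size, d_model)
    return down + up


# Assumed setup: hidden size 4096, 32 layers, two adapters per layer.
D_MODEL, NUM_LAYERS, ADAPTERS_PER_LAYER = 4096, 32, 2

per_adapter = adapter_param_count(D_MODEL)
total = per_adapter * NUM_LAYERS * ADAPTERS_PER_LAYER

print(f"parameters per adapter: {per_adapter:,}")
print(f"parameters in total:    {total:,}")
```

Widening the bottleneck (`adapter_size`) scales the count almost linearly, which makes it the main knob for trading capacity against cost.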

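4.2 Mixing Multiple Adapters

# Train a separate adapter for each task
# (legal_adapter, medical_adapter and code_adapter are assumed to have been trained elsewhere)
adapters = {
    "legal": legal_adapter,
    "medical": medical_adapter,
    "code": code_adapter,
}

def use_adapter(model, task, inputs):
    # Activate the adapter registered under this task name, then generate
    model.set_adapter(task)
    return model.generate(**inputs)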

5. Prefix Tuning

5.1 Prefix Tuning

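import torch
import torch.nn as nn

# Learnable prefix vectors
class PrefixTuning(nn.Module):
    def __init__(self, num_layers, hidden_size, prefix_len=10):
        super().__init__()
        # One learnable key prefix and one value prefix per layer
        self.prefix = nn.Parameter(
            torch.randn(num_layers, 2, prefix_len, hidden_size)
        )

    def forward(self, input_ids):
        batch_size = input_ids.shape[0]
        # Broadcast the prefix across the batch:
        # (batch, num_layers, 2, prefix_len, hidden_size)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1, -1, -1)
        # Each layer prepends its prefix to the attention keys and values
        return prefix, input_ids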

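5.2 Prefix Tuning vs. LoRA

| Feature | Prefix Tuning | LoRA |
| --- | --- | --- |
| Trainable parameters | Fewer | Moderate |
| Quality | Slightly worse | On par with full fine-tuning |
| Inference overhead | Longer sequence length | No extra overhead once the weights are merged |
| Implementation complexity | | |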
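The first and third rows of the comparison can be made concrete with a little arithmetic. The sketch below assumes a 7B-class model (32 layers, hidden size 4096), one key prefix and one value prefix per layer with 10 prefix tokens, and LoRA with rank 8 on two square projections per layer. All helper names and numbers are illustrative.

```python
def prefix_param_count(num_layers: int, hidden_size: int, prefix_len: int = 10) -> int:
    """Key prefix plus value prefix in every layer."""
    return num_layers * 2 * prefix_len * hidden_size


def lora_param_count(num_layers: int, hidden_size: int, rank: int = 8,
                     num_targets: int = 2) -> int:
    """Rank-r A and B matrices on `num_targets` square projections per layer."""
    return num_layers * num_targets * rank * (hidden_size + hidden_size)


def attention_length_ratio(prompt_len: int, prefix_len: int = 10) -> float:
    """How much longer each attention context gets once the prefix is prepended."""
    return (prompt_len + prefix_len) / prompt_len


NUM_LAYERS, HIDDEN_SIZE = 32, 4096  # assumed 7B-class model shape

prefix_params = prefix_param_count(NUM_LAYERS, HIDDEN_SIZE)
lora_params = lora_param_count(NUM_LAYERS, HIDDEN_SIZE)

print(f"prefix tuning parameters: {prefix_params:,}")
print(f"LoRA parameters:          {lora_params:,}")
print(f"attention length ratio for a 100-token prompt: {attention_length_ratio(100):.2f}")
```

The prefix stays in the attention context for every request, so its cost is paid at inference time, whereas merged LoRA weights leave the sequence length unchanged.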

6. Training in Practice

6.1 Data Preparation

# Instruction-tuning data format
{
    "instruction": "Translate the following sentence into English",
    "input": "你好,世界",
    "output": "Hello, world"
}
# Alpaca format
{
    "instruction": "...",
    "input": "...",
    "output": "..."
}
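Before training, each record has to be rendered into a single prompt string plus a target. The sketch below shows one way to do that, using the prompt wording popularized by the Stanford Alpaca project as best I can reproduce it. Treat the exact template text as an assumption and check it against whichever template the training framework expects; `build_example` is an illustrative helper.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(record: dict) -> tuple[str, str]:
    """Render one record into (prompt, target); the loss is usually taken on the target only."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    prompt = template.format(**record)
    return prompt, record["output"]


record = {
    "instruction": "Translate the following sentence into English",
    "input": "你好,世界",
    "output": "Hello, world",
}
prompt, target = build_example(record)
print(prompt + target)
```

Records with an empty or missing `input` field fall back to the shorter template, so the same function covers both kinds of sample.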

6.2 Training Configuration

# Training configuration for a DeepSpeed run (keys follow Hugging Face TrainingArguments naming)
trainer:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  fp16: false
  bf16: true
  max_steps: 1000
  warmup_steps: 100
  logging_steps: 10
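Two numbers worth deriving from this configuration are the effective batch size and the amount of data the run will consume. The sketch below computes both. The number of devices is not part of the configuration above, so it is an explicit assumption, and the `TrainSchedule` class is illustrative.

```python
from dataclasses import dataclass


@dataclass
class TrainSchedule:
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    max_steps: int = 1000
    warmup_steps: int = 100
    num_devices: int = 1  # assumption: not specified in the config

    @property
    def effective_batch_size(self) -> int:
        """Samples that contribute to each optimizer step."""
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps
                * self.num_devices)

    @property
    def samples_seen(self) -> int:
        """Total samples consumed over the whole run."""
        return self.effective_batch_size * self.max_steps

    @property
    def warmup_fraction(self) -> float:
        """Share of the run spent warming up the learning rate."""
        return self.warmup_steps / self.max_steps


schedule = TrainSchedule()
print(schedule.effective_batch_size, schedule.samples_seen, schedule.warmup_fraction)
```

If the dataset holds fewer samples than `samples_seen`, the run covers it more than once, which is worth knowing when watching for overfitting on small instruction sets.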

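7. Summary

| Method | Trainable parameters | Quality | Best suited for |
| --- | --- | --- | --- |
| Full fine-tuning | All | Best | Ample resources |
| LoRA | ~1% | Close to full fine-tuning | Limited resources |
| QLoRA | ~0.5% | Good | Single-GPU training |
| Adapter | ~1-5% | Good | Switching between tasks |
| Prefix | <1% | Fair | Very low budgets |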


Fine-tuning and Model Adaptation Techniques
https://blog.souloss.com/posts/ai-engineering/ai-engineering-fine-tuning/
Author: Souloss
Published: 2025-04-05
License: CC BY-NC-SA 4.0
