Fine-tuning and Model Fine-Tuning Techniques

2025-04-05

AI / Fine-tuning

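1. Why Fine-tuning Is Needed

1.1 Pre-training vs Fine-tuning

flowchart LR
    A[Large-scale pre-training] --> B[General capabilities]
    B --> C[Domain-specific fine-tuning]
    C --> D[Specialized capabilities]

Stage	Goal	Resource requirements
Pre-training	Learn general knowledge	GPU clusters on the order of 10,000 cards
SFT (supervised fine-tuning)	Instruction following	One to several GPUs
RLHF	Alignment with human preferences	Human-annotated data
DPO/PO optimization	Direct preference optimization	Preference data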

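1.2 Full fine-tuning vs parameter-efficient fine-tuning

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Full fine-tuning: update every parameter
# (the -hf repository holds the Transformers-format weights)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
for param in model.parameters():
    param.requires_grad = True

# LoRA: update only the low-rank matrices
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
for param in model.parameters():
    param.requires_grad = False  # freeze the base weights

# Train only the LoRA parameters
lora_config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(model, lora_config)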
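To get a feel for how small the LoRA update is, the parameter arithmetic can be sketched as follows. The dimensions (hidden size 4096, 32 layers, square `q_proj`/`v_proj` matrices) are assumptions about a Llama-2-7B-style model and are not stated in this post; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight: B (d x r) plus A (r x k)."""
    return r * (d + k)

# Assumed dimensions for a Llama-2-7B-style model (not stated in this post).
HIDDEN = 4096
NUM_LAYERS = 32
TARGETS_PER_LAYER = 2  # q_proj and v_proj, as in the LoraConfig above

per_matrix = lora_param_count(HIDDEN, HIDDEN, r=8)
total = per_matrix * TARGETS_PER_LAYER * NUM_LAYERS
full_matrix = HIDDEN * HIDDEN

print(f"LoRA params per matrix: {per_matrix:,}")   # 65,536
print(f"Full params per matrix: {full_matrix:,}")  # 16,777,216
print(f"Total LoRA params:      {total:,}")        # 4,194,304
```

Under these assumptions each adapted matrix trains 65,536 values instead of roughly 16.8 million, which is why the optimizer state for LoRA is so small.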

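2. How LoRA Works

2.1 Low-rank matrix decomposition

import torch
import torch.nn as nn

# Core idea of LoRA
# W ∈ R^(d×k): original (frozen) weight
# W' = W + ΔW = W + BA
# B ∈ R^(d×r), A ∈ R^(r×k), r << min(d, k)

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.original = nn.Linear(in_features, out_features, bias=False)
        self.original.weight.requires_grad = False  # base weight stays frozen
        # LoRA matrices, stored transposed relative to the formula above
        # so that x @ lora_A @ lora_B has the output shape.
        # lora_B starts at zero, so the layer initially matches the original.
        self.lora_A = nn.Parameter(torch.randn(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        # Base output plus the low-rank update
        return self.original(x) + x @ self.lora_A @ self.lora_B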
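Because the update is just a matrix added to W, it can be folded into the base weight after training. The following is a minimal pure-Python sketch with toy integer matrices (all values are made up for illustration) showing that the two-path forward pass and the merged weight give the same output. The toy W is laid out as in_features x out_features so that y = x @ W.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy sizes: in_features=3, out_features=2, rank=1 (illustrative values only).
W = [[1, 0], [0, 1], [2, -1]]   # frozen base weight, 3 x 2
lora_A = [[1], [2], [0]]        # 3 x 1
lora_B = [[3, -1]]              # 1 x 2, non-zero as if after training
x = [[1, 1, 1]]                 # one input row, 1 x 3

# Training-time path: base output plus the low-rank branch.
y_two_paths = add(matmul(x, W), matmul(matmul(x, lora_A), lora_B))

# Inference-time path: fold the update into the weight once, then one matmul.
W_merged = add(W, matmul(lora_A, lora_B))
y_merged = matmul(x, W_merged)

assert y_two_paths == y_merged
print(y_merged)  # [[12, -3]]
```

After merging, inference uses a single matrix of the original shape, so the adapter adds no extra latency.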

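2.2 Training efficiency comparison

Method	Trainable parameters	GPU memory	Training time
Full fine-tuning (7B)	7B	80GB+	Days
LoRA (7B)	Tens of MB (adapter size)	20GB	Hours
QLoRA (7B)	Tens of MB (adapter size)	8GB	Hours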
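The memory figures in the table can be roughly sanity-checked with back-of-the-envelope arithmetic. This is only a sketch: it ignores activations, the KV cache and framework overhead, and the 16-bytes-per-parameter figure for mixed-precision Adam is a common rule of thumb assumed here, not something measured in this post.

```python
GB = 1e9  # decimal gigabytes, good enough for rough estimates

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights at a given precision."""
    return num_params * bits_per_param / 8 / GB

NUM_PARAMS = 7e9  # nominal 7B model

bf16_weights = weight_memory_gb(NUM_PARAMS, 16)  # LoRA keeps the base model in 16-bit
nf4_weights = weight_memory_gb(NUM_PARAMS, 4)    # QLoRA stores the base model in 4-bit

# Assumption: about 16 bytes per trainable parameter for mixed-precision Adam
# (2 weights + 2 gradients + 4 fp32 master copy + 4 + 4 optimizer moments).
full_ft_state = NUM_PARAMS * 16 / GB

print(f"16-bit weights: {bf16_weights:.1f} GB")
print(f"4-bit weights:  {nf4_weights:.1f} GB")
print(f"Full fine-tuning weights + optimizer state: {full_ft_state:.0f} GB")
```

The gap between the weight-only numbers and the table is taken up by activations, gradients for the adapter, and runtime overhead.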

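3. QLoRA Quantization

3.1 How quantization works

# 4-bit NormalFloat (NF4) quantization; weights is expected to be a torch tensor
class NF4Quant:
    def __init__(self, bits=4):
        self.bits = bits

    def quantize(self, weights):
        # 1. Compute absmax, which serves as the scale factor
        absmax = weights.abs().max()
        # 2. Normalize to [-1, 1]
        normalized = weights / absmax
        # 3. Quantize to 4 bits. quantize_to_nf4 is a helper this post does not
        #    define; it maps each value to the nearest of the 16 NF4 levels.
        qweight = quantize_to_nf4(normalized)
        return qweight, absmax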
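The snippet above leaves `quantize_to_nf4` undefined. One possible way to sketch it, together with the matching dequantization step, is shown below on plain Python lists instead of tensors. The level table is an approximation of the NF4 codebook rounded to four decimals and should be treated as illustrative; `dequantize` is a hypothetical helper, and the sketch assumes at least one non-zero weight.

```python
# Approximate NF4 levels, rounded to 4 decimals (illustrative, not authoritative).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_to_nf4(normalized):
    """Map each value in [-1, 1] to the index (0..15) of the nearest level."""
    return [
        min(range(len(NF4_LEVELS)), key=lambda i: abs(NF4_LEVELS[i] - v))
        for v in normalized
    ]

def quantize(weights):
    """Absmax-normalize, then snap to the codebook. Returns (codes, scale)."""
    absmax = max(abs(w) for w in weights)  # assumes a non-zero weight exists
    codes = quantize_to_nf4([w / absmax for w in weights])
    return codes, absmax

def dequantize(codes, absmax):
    """Look each code up in the table and rescale by absmax."""
    return [NF4_LEVELS[c] * absmax for c in codes]

weights = [0.5, -2.0, 0.0, 1.0]
codes, absmax = quantize(weights)
restored = dequantize(codes, absmax)

print(codes)     # [10, 0, 7, 12]
print(absmax)    # 2.0
print(restored)  # close to the original weights, with rounding error
```

Only the 4-bit codes and one scale per group of weights need to be stored; the rounding error visible in `restored` is the price paid for the smaller footprint.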

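3.2 QLoRA training

# Quantize with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
)
# Recommended PEFT step before training a quantized model
model = prepare_model_for_kbit_training(model)

# Add LoRA (same settings as in section 1.2)
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)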

4. Adapter Tuning

4.1 Adapter structure

import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model, adapter_size=64):
        super().__init__()
        self.down_proj = nn.Linear(d_model, adapter_size)  # project down to the bottleneck
        self.up_proj = nn.Linear(adapter_size, d_model)    # project back up
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual connection: the adapter only learns a correction to x
        return x + self.up_proj(self.act(self.down_proj(x)))
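The residual form means an adapter whose up-projection is zero leaves its input untouched. Here is a minimal pure-Python sketch of the same computation on a single vector. The zero initialization is an assumption made for illustration (the class above uses the default `nn.Linear` initialization), and all numbers are toy values.

```python
def linear(x, weight, bias):
    """y = W x + b for a single vector x; weight is a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter_forward(x, down_w, down_b, up_w, up_b):
    """Same computation as Adapter.forward above: x + up(relu(down(x)))."""
    hidden = relu(linear(x, down_w, down_b))
    update = linear(hidden, up_w, up_b)
    return [xi + ui for xi, ui in zip(x, update)]

# Toy sizes: d_model=3, adapter_size=2.
x = [1.0, -2.0, 0.5]
down_w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # 2 x 3
down_b = [0.0, 0.0]

# Assumption: the up-projection starts at zero, so the adapter is a no-op at first.
zero_up_w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 x 2
zero_up_b = [0.0, 0.0, 0.0]
assert adapter_forward(x, down_w, down_b, zero_up_w, zero_up_b) == x

# After training, a non-zero up-projection adds a correction on top of x.
trained_up_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
print(adapter_forward(x, down_w, down_b, trained_up_w, zero_up_b))  # [3.0, -2.0, 1.5]
```

Starting from an identity mapping is one way to keep the pretrained model's behavior intact at the beginning of training.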

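4.2 Mixing multiple adapters

# Train a separate adapter for each task.
# legal_adapter, medical_adapter and code_adapter are placeholders
# for adapters that have been trained elsewhere.
adapters = {
    "legal": legal_adapter,
    "medical": medical_adapter,
    "code": code_adapter,
}

def use_adapter(model, task, inputs):
    # Activate the adapter registered under this task name, then generate
    model.set_adapter(task)
    return model.generate(**inputs)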

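5. Prefix Tuning

5.1 Prefix tuning

import torch
import torch.nn as nn

# Learnable prefix vectors
class PrefixTuning(nn.Module):
    def __init__(self, num_layers, hidden_size, prefix_len=10):
        super().__init__()
        # One learnable key prefix and one value prefix (the 2) for every layer
        self.prefix = nn.Parameter(
            torch.randn(num_layers, 2, prefix_len, hidden_size)
        )

    def forward(self, input_ids):
        batch_size = input_ids.shape[0]
        # Share the same prefix across the batch:
        # (batch, num_layers, 2, prefix_len, hidden_size)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1, -1, -1)
        return prefix, input_ids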
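To illustrate what the prefix does inside a layer, here is a minimal pure-Python sketch of single-head attention for one query, with and without prefix key/value vectors prepended. It is a toy illustration with made-up numbers, not the exact mechanism of any particular library; `attention` is a hypothetical helper.

```python
import math

def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Keys and values computed from the real input tokens by the frozen model.
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]

# Learnable prefix key/value vectors: the only trainable part.
prefix_keys = [[1.0, 1.0]]
prefix_values = [[5.0, 5.0]]

query = [1.0, 0.0]
plain = attention(query, token_keys, token_values)
with_prefix = attention(query, prefix_keys + token_keys, prefix_values + token_values)

print(len(token_keys), "keys without prefix,", len(prefix_keys + token_keys), "with prefix")
print(plain)
print(with_prefix)
```

The query now attends over one extra position, which steers the output without touching any model weight; the longer key/value sequence is also where the added inference cost comes from.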

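5.2 Prefix Tuning vs LoRA

Property	Prefix Tuning	LoRA
Trainable parameters	Fewer	Moderate
Quality	Slightly worse	On par with full fine-tuning
Inference overhead	Longer sequence length	None once the weights are merged
Implementation complexity	High	Low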

6. Training in Practice

6.1 Data preparation

# Instruction fine-tuning data format
{
    "instruction": "Translate the following sentence into English",
    "input": "你好，世界",
    "output": "Hello, world"
}

# Alpaca format
{
    "instruction": "...",
    "input": "...",
    "output": "..."
}
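Before training, each record is usually flattened into a single prompt string. A minimal sketch follows; the `### Instruction:` / `### Input:` / `### Response:` markers are an assumed Alpaca-style template and `build_prompt` is a hypothetical helper, so adjust both to whatever template your base model expects.

```python
def build_prompt(example: dict) -> str:
    """Turn one instruction record into a single training prompt string."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    # The input field is optional; skip the section when it is empty.
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)

record = {
    "instruction": "Translate the following sentence into English",
    "input": "你好，世界",
    "output": "Hello, world",
}
print(build_prompt(record))
```

Keeping the template in one function makes it easier to use exactly the same format at inference time, where the response section is left empty for the model to fill in.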

6.2 Training configuration

# Training configuration for a DeepSpeed run
# (the keys follow Hugging Face TrainingArguments names)
trainer:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  fp16: false
  bf16: true
  max_steps: 1000
  warmup_steps: 100
  logging_steps: 10
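Two quantities implied by this configuration are worth working out by hand: the effective batch size and the learning rate during warmup. The sketch below assumes a single GPU, a linear warmup, and a base learning rate of 2e-4; none of these is specified in the configuration above, and both helper functions are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then constant (any decay is omitted)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

batch = effective_batch_size(per_device=4, grad_accum=4)  # single GPU assumed
samples_seen = batch * 1000                               # max_steps: 1000

print(batch)         # 16
print(samples_seen)  # 16000
print(warmup_lr(50, base_lr=2e-4, warmup_steps=100))   # halfway through warmup
print(warmup_lr(500, base_lr=2e-4, warmup_steps=100))  # past warmup
```

Gradient accumulation trades wall-clock time for memory: each optimizer step still sees 16 samples, but only 4 are held on the device at once.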

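7. Summary

Method	Trainable parameters	Quality	Best suited for
Full fine-tuning	All	Best	Ample resources
LoRA	~1%	Close to full fine-tuning	Limited resources
QLoRA	~0.5%	Good	Works on a single GPU
Adapter	~1-5%	Good	Switching between tasks
Prefix	<1%	Fair	Very low cost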