RAG: Principles of Retrieval-Augmented Generation

Souloss


2025-02-07

AI / RAG / LLM

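1. RAG Overview

1.1 Why RAG Is Needed

graph TB
    subgraph LLM["LLM limitations"]
        direction TB
        A["Knowledge cutoff: training data goes stale"]
        B["Hallucination: fabricates facts that do not exist"]
        C["No private knowledge: cannot access internal enterprise data"]
    end
    subgraph RAG["RAG solutions"]
        direction TB
        D["External knowledge retrieval: fetch up-to-date information at query time"]
        E["Context injection: generate from retrieved facts"]
        F["Citations at generation time: traceable sources"]
    end
    A -.->|addressed by| D
    B -.->|addressed by| E
    C -.->|addressed by| F
    style LLM fill:#ffcccc,stroke:#ff6666
    style RAG fill:#ccffcc,stroke:#66cc66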

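graph TB
    subgraph Problem["Problem scenario"]
        P1["User asks: What are the company's latest earnings figures?"]
        P2["LLM: I don't know..."]
        P1 --> P2
        P2 --> P3["Cannot answer"]
    end
    subgraph Solution["RAG approach"]
        S1["User asks: What are the company's latest earnings figures?"]
        S1 --> S2["Search the knowledge base"]
        S2 --> S3["Fetch the earnings report"]
        S3 --> S4["LLM + Context"]
        S4 --> S5["Fact-based answer"]
    end
    Problem --> Solution
    style Problem fill:#ffe6e6
    style Solution fill:#e6ffe6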

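LLM problem	RAG remedy
Stale knowledge	Retrieve up-to-date content at query time
Hallucinated answers	Generate from retrieved facts
Missing private knowledge	Connect to your own knowledge base
No source citations	Provide a traceable Context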

1.2 The Complete RAG Architecture

flowchart TB
    subgraph Offline["Offline processing"]
        direction LR
        O1["Raw documents"] --> O2["Document splitting (Chunk Splitter)"]
        O2 --> O3["Embedding model"]
        O3 --> O4["Vector database (Milvus/Qdrant)"]
    end
    subgraph Online["Online query flow"]
        direction TB
        Q1["User Query"] --> Q2["Query rewriting"]
        Q2 --> Q3["Query Embedding"]
        Q3 --> Q4["Vector search (Top-K recall)"]
        Q4 --> Q5{"Hybrid search?"}
        Q5 -->|yes| Q6["RRF fusion ranking"]
        Q5 -->|no| Q7["Candidate documents"]
        Q6 --> Q7
        Q7 --> Q8["Reranker"]
        Q8 --> Q9["Context assembly"]
    end
    subgraph Generation["LLM generation"]
        direction LR
        G1["Context"] --> G2["LLM"]
        Q1 --> G2
        G2 --> G3["Answer output"]
        G3 --> G4["Citation annotation"]
    end
    O4 -.->|index| Q4
    Q9 --> G1
    style Offline fill:#e3f2fd,stroke:#1976d2
    style Online fill:#fff3e0,stroke:#f57c00
    style Generation fill:#e8f5e9,stroke:#388e3c

flowchart LR
    subgraph QueryFlow["User query flow"]
        A["User Query"] --> B["Query rewriting"]
        B --> C["Vector search"]
        C --> D["Context assembly"]
        D --> E["LLM generation"]
        E --> F["Citation annotation"]
    end
    subgraph IndexFlow["Offline indexing flow"]
        G["Document splitting"] --> H["Embedding"]
        H --> I["Vector store insertion"]
    end
    style QueryFlow fill:#f3e5f5
    style IndexFlow fill:#e0f7fa
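
To make the two flows concrete, here is a minimal end-to-end sketch. It is a toy under stated assumptions, not a production design: the "embedding" is a bag-of-words count vector rather than a real model, the "vector database" is a Python list, and the LLM call is left out, so the function stops at the assembled prompt. All function names (`embed`, `cosine`, `build_index`, `build_prompt`) are made up for this illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector (stand-in for a real model).
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Offline flow: split -> embed -> store
def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    chunks = [c.strip() for doc in documents for c in doc.split("\n\n") if c.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


# Online flow: embed the query -> Top-K search -> assemble the prompt
def build_prompt(query: str, index: list[tuple[str, Counter]], top_k: int = 2) -> str:
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n".join(
        f"[{i + 1}] {chunk}" for i, (chunk, _) in enumerate(ranked[:top_k])
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer based on the Context and cite [n]:"


index = build_index([
    "Q3 revenue grew 12 percent.\n\nHeadcount stayed flat.",
    "The cafeteria menu changes weekly.",
])
prompt = build_prompt("How much did revenue grow in Q3?", index, top_k=1)
print(prompt)
```

The numbered `[n]` markers in the Context are one simple way to support the citation step: the model can be asked to cite them, and each marker maps back to a stored chunk.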

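2. Document Processing

2.1 Document Splitting

class TextSplitter:
    # normalize, split_by_paragraph, and merge_small_chunks are helper
    # methods that are not shown here and must be supplied by the implementer.
    def split(self, text: str, chunk_size: int = 500) -> list[str]:
        # 1. Normalize the text
        text = self.normalize(text)

        # 2. Split on paragraph/sentence boundaries
        chunks = self.split_by_paragraph(text)

        # 3. Merge fragments that are too small
        chunks = self.merge_small_chunks(chunks, chunk_size)

        return chunks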
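
The three helpers are left undefined above. One possible way to fill them in is sketched below; the helper bodies are my assumptions (whitespace cleanup, blank-line paragraph boundaries, greedy merging by character count), not the only reasonable choices, and the class is named `SimpleTextSplitter` to keep it distinct from the skeleton.

```python
import re


class SimpleTextSplitter:
    """Illustrative splitter; the three helper bodies are assumptions."""

    def split(self, text: str, chunk_size: int = 500) -> list[str]:
        text = self.normalize(text)
        chunks = self.split_by_paragraph(text)
        return self.merge_small_chunks(chunks, chunk_size)

    def normalize(self, text: str) -> str:
        # Unify line endings and collapse runs of spaces/tabs.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        return re.sub(r"[ \t]+", " ", text).strip()

    def split_by_paragraph(self, text: str) -> list[str]:
        # A paragraph boundary is one or more blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    def merge_small_chunks(self, chunks: list[str], chunk_size: int) -> list[str]:
        # Greedily append a paragraph to the previous chunk while the
        # combined length stays within chunk_size characters.
        merged: list[str] = []
        for chunk in chunks:
            if merged and len(merged[-1]) + 1 + len(chunk) <= chunk_size:
                merged[-1] = merged[-1] + "\n" + chunk
            else:
                merged.append(chunk)
        return merged


doc = "Title\r\n\r\nFirst   short paragraph.\n\n" + "x" * 30
chunks = SimpleTextSplitter().split(doc, chunk_size=40)
print(chunks)
```

Here the title and the first paragraph fit together under 40 characters and are merged, while the 30-character paragraph would push the chunk over the limit and so starts a new one.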

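2.2 Splitting Strategies

import re

# Fixed-size splitting with overlap between adjacent chunks
def fixed_split(text, chunk_size=500, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # avoid a trailing chunk that only repeats the overlap
        start = end - overlap
    return chunks

# Simple regex sentence splitter (an assumption; the original leaves it undefined)
def split_by_sentence(text):
    return [s for s in re.split(r"(?<=[.!?。！？])\s*", text) if s]

# Semantic splitting (keeps sentences intact)
def semantic_split(text, chunk_size=500):
    sentences = split_by_sentence(text)
    chunks = []
    current = []
    for sent in sentences:
        current.append(sent)
        # Close the chunk once it overflows, unless it holds a single sentence
        if len(' '.join(current)) > chunk_size and len(current) > 1:
            chunks.append(' '.join(current[:-1]))
            current = [sent]
    if current:
        chunks.append(' '.join(current))  # flush the last chunk
    return chunks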
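
A quick way to sanity-check fixed-size splitting is to work out how many chunks to expect. With a stride of `chunk_size - overlap`, a text of `n` characters longer than one chunk needs `1 + ceil((n - chunk_size) / (chunk_size - overlap))` windows. The sketch below checks that arithmetic against a self-contained sliding-window splitter; both function names are made up for this example.

```python
import math


def expected_chunk_count(n_chars: int, chunk_size: int, overlap: int) -> int:
    """Number of windows needed to cover n_chars with the given overlap."""
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((n_chars - chunk_size) / stride)


def sliding_windows(text: str, chunk_size: int, overlap: int) -> list[str]:
    stride = chunk_size - overlap  # how far each window advances
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    windows = []
    start = 0
    while True:
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += stride
    return windows


text = "".join(chr(97 + i % 26) for i in range(1200))
windows = sliding_windows(text, chunk_size=500, overlap=50)
print(len(windows), expected_chunk_count(1200, 500, 50))
```

For 1200 characters with `chunk_size=500` and `overlap=50` the stride is 450, so the windows start at 0, 450, and 900, and each adjacent pair shares exactly 50 characters.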

3. Retrieval Flow

3.0 Retrieval Architecture Overview

flowchart TB
    subgraph Input["Input"]
        Q["User query"]
    end
    subgraph Rewrite["Query processing"]
        R1["Synonym expansion"]
        R2["Intent detection"]
        R3["Query rewriting"]
    end
    subgraph Retrieve["Retrieval layer"]
        direction TB
        V1["Vector search: semantic similarity"]
        K1["Keyword search: BM25"]
        H1["Hybrid search: RRF fusion"]
    end
    subgraph Rerank["Reranking"]
        R4["Cross-Encoder: precise scoring"]
        R5["Top-K selection"]
    end
    subgraph Output["Output"]
        O["Relevant documents"]
    end
    Q --> R1 --> R2 --> R3
    R3 --> V1
    R3 --> K1
    V1 --> H1
    K1 --> H1
    H1 --> R4 --> R5 --> O
    style Input fill:#e1f5fe
    style Rewrite fill:#fff8e1
    style Retrieve fill:#f3e5f5
    style Rerank fill:#e8f5e9
    style Output fill:#fce4ec

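3.1 Query Rewriting

# Query optimization
class QueryRewriter:
    # expand_abbreviations and expand_synonyms are helper methods that are
    # not shown here and must be supplied by the implementer.
    def rewrite(self, query: str) -> str:
        # 1. Expand abbreviations and implicit phrasing
        query = self.expand_abbreviations(query)

        # 2. Expand synonyms
        query = self.expand_synonyms(query)

        # 3. Inject a hint based on the question type
        if "how to" in query.lower():
            query += " Please provide concrete steps"

        return query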
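
As a rough illustration of the three rewriting steps, the sketch below backs the two expansion helpers with small lookup tables. The tables, the class name `DictQueryRewriter`, and the output format (synonyms appended in parentheses) are all assumptions made for the example; a real system might use an LLM or a domain thesaurus instead.

```python
import re


class DictQueryRewriter:
    # Hypothetical lookup tables; a real system would load domain-specific ones.
    ABBREVIATIONS = {"rag": "retrieval-augmented generation (RAG)"}
    SYNONYMS = {"optimize": ["improve", "tune"]}

    def expand_abbreviations(self, query: str) -> str:
        def replace_word(match: re.Match) -> str:
            word = match.group(0)
            return self.ABBREVIATIONS.get(word.lower(), word)

        return re.sub(r"[A-Za-z]+", replace_word, query)

    def expand_synonyms(self, query: str) -> str:
        extras: list[str] = []
        for word, alternatives in self.SYNONYMS.items():
            if re.search(rf"\b{re.escape(word)}\b", query, flags=re.IGNORECASE):
                extras.extend(alternatives)
        return f"{query} ({' '.join(extras)})" if extras else query

    def rewrite(self, query: str) -> str:
        query = self.expand_abbreviations(query)
        query = self.expand_synonyms(query)
        if "how to" in query.lower():
            query += " Please provide concrete steps"
        return query


rewritten = DictQueryRewriter().rewrite("How to optimize RAG search?")
print(rewritten)
```

Appending synonyms rather than substituting them keeps the user's original wording intact, which matters for keyword search where exact terms carry weight.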

3.2 Hybrid Retrieval

flowchart TB
    subgraph Query["User query"]
        Q["'How can RAG retrieval quality be improved?'"]
    end
    subgraph VectorSearch["Vector search"]
        direction TB
        V1["Query Embedding"]
        V2["Vector similarity (Cosine/L2)"]
        V3["Semantically related documents"]
        V1 --> V2 --> V3
    end
    subgraph KeywordSearch["Keyword search"]
        direction TB
        K1["Tokenization"]
        K2["BM25 scoring"]
        K3["Keyword-matched documents"]
        K1 --> K2 --> K3
    end
    subgraph Fusion["RRF fusion"]
        direction TB
        F1["Reciprocal Rank Fusion (RRF Score)"]
        F2["Combined ranking"]
        F3["Unified candidate set"]
        F1 --> F2 --> F3
    end
    subgraph Result["Retrieval result"]
        R["Top-K relevant documents"]
    end
    Q --> V1
    Q --> K1
    V3 --> F1
    K3 --> F1
    F3 --> R
    style Query fill:#e8eaf6
    style VectorSearch fill:#e3f2fd
    style KeywordSearch fill:#fff3e0
    style Fusion fill:#f3e5f5
    style Result fill:#e8f5e9

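# Hybrid retrieval = keyword + vector
class HybridRetriever:
    def __init__(self, vector_index, bm25_index, embed):
        # Constructor added for completeness; both indexes expose search(query, n)
        self.vector_index = vector_index
        self.bm25_index = bm25_index
        self.embed = embed

    def retrieve(self, query: str, top_k: int = 10):
        # Vector search (over-fetch so that fusion has enough candidates)
        vector_results = self.vector_index.search(
            self.embed(query), top_k * 2
        )

        # Keyword search
        bm25_results = self.bm25_index.search(query, top_k * 2)

        # RRF fusion: (doc_id, score) pairs, best first, cut to top_k
        fused = rrf([vector_results, bm25_results])[:top_k]

        return fused

# RRF (Reciprocal Rank Fusion): a document earns 1 / (k + rank) per result list
def rrf(results_list, k=60):
    scores = {}
    for results in results_list:
        for i, doc in enumerate(results):  # i is 0-based, so rank = i + 1
            doc_id = doc.id
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    return sorted(scores.items(), key=lambda x: -x[1])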
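
A small worked example helps show why RRF rewards documents that appear in both lists. Each document's fused score is the sum of `1 / (k + rank)` over the lists that contain it, with ranks starting at 1. The sketch below applies that to two made-up rankings of document ids; `rrf_ids` is a variant written for plain id lists so that it runs on its own.

```python
def rrf_ids(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["a", "b", "c"]  # hypothetical vector-search order
bm25_ranking = ["b", "d", "a"]    # hypothetical BM25 order
fused = rrf_ids([vector_ranking, bm25_ranking])
order = [doc_id for doc_id, _ in fused]
print(order)
```

Document `b` scores 1/62 + 1/61 and `a` scores 1/61 + 1/63, so both outrank `d` (1/62) and `c` (1/63), which each appear in only one list. Because only ranks are used, the raw cosine and BM25 scores never need to be put on a common scale.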

4. Context Assembly

4.0 Context Assembly Flow

flowchart LR
    subgraph Input["Input"]
        Q["User question"]
        D["Retrieved documents (Top-K)"]
    end
    subgraph Build["Assembly steps"]
        direction TB
        B1["Token counting"]
        B2["Document ordering: most relevant first"]
        B3["Window truncation (Max Tokens)"]
        B4["Formatting and concatenation"]
        B1 --> B2 --> B3 --> B4
    end
    subgraph Template["Prompt template"]
        T["System: You are an assistant<br/>Context: {documents}<br/>Question: {query}<br/>Answer based on the Context:"]
    end
    subgraph Output["Output"]
        P["Complete Prompt"]
    end
    Q --> Build
    D --> Build
    Build --> Template
    Template --> P
    style Input fill:#e3f2fd
    style Build fill:#fff8e1
    style Template fill:#f3e5f5
    style Output fill:#e8f5e9

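4.1 Context Window Management

class ContextBuilder:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens

    def count_tokens(self, text: str) -> int:
        # Placeholder estimate added so the class runs (not in the original);
        # replace it with the tokenizer that matches your LLM.
        return len(text.split())

    def build(self, query: str, retrieved_docs: list) -> str:
        context_parts = [f"Question: {query}\n"]
        used_tokens = self.count_tokens(context_parts[0])

        # retrieved_docs is expected to be ordered by relevance, best first
        for doc in retrieved_docs:
            doc_text = f"Reference document:\n{doc.content}"
            doc_tokens = self.count_tokens(doc_text)

            # Stop at the first document that would exceed the token budget
            if used_tokens + doc_tokens > self.max_tokens:
                break

            context_parts.append(doc_text)
            used_tokens += doc_tokens

        return '\n\n'.join(context_parts)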
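
To see the token budget in action, here is a self-contained walk-through with a deliberately tiny limit. The whitespace-based `count_tokens` is a crude stand-in for a real tokenizer, and `Doc` and `pack_context` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    content: str


def count_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(text.split())


def pack_context(query: str, docs: list[Doc], max_tokens: int) -> tuple[str, int]:
    parts = [f"Question: {query}"]
    used = count_tokens(parts[0])
    for doc in docs:  # docs are assumed to be sorted by relevance already
        text = f"Reference document:\n{doc.content}"
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first document that would overflow the budget
        parts.append(text)
        used += cost
    return "\n\n".join(parts), used


docs = [Doc("alpha beta gamma"), Doc("delta epsilon zeta eta theta"), Doc("iota")]
prompt, used = pack_context("what is rag", docs, max_tokens=12)
print(used)
print(prompt)
```

The question costs 4 tokens and the first document 5, for a total of 9. The second document would cost 7 and overflow the budget of 12, so packing stops there. Note the trade-off in using `break`: the third document would have fit (3 tokens), but skipping ahead with `continue` would put a less relevant document into the prompt in place of a more relevant one.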

4.2 Reranking

flowchart TB
    subgraph Input["Candidate documents"]
        D1["Doc 1: relevance 0.72"]
        D2["Doc 2: relevance 0.68"]
        D3["Doc 3: relevance 0.65"]
        D4["Doc 4: relevance 0.61"]
        D5["Doc 5: relevance 0.58"]
    end
    subgraph Reranker["Cross-Encoder reranking"]
        direction TB
        R1["Joint encoding of Query + Doc"]
        R2["Attention interaction"]
        R3["Precise relevance scoring"]
        R1 --> R2 --> R3
    end
    subgraph Output["Reranked result"]
        O1["Doc 3: 0.95"]
        O2["Doc 1: 0.89"]
        O3["Doc 5: 0.82"]
        O4["Doc 2: 0.45 ⬇"]
        O5["Doc 4: 0.32 ⬇"]
    end
    Input --> Reranker
    Reranker --> Output
    style Input fill:#e3f2fd
    style Reranker fill:#fff8e1
    style Output fill:#e8f5e9

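class Reranker:
    def __init__(self, cross_encoder):
        # Constructor added for completeness: any scorer that exposes
        # predict(pairs) and returns one score per (query, text) pair
        self.cross_encoder = cross_encoder

    def rerank(self, query: str, docs: list, top_k: int = 5):
        # Rerank with a cross-encoder
        pairs = [(query, doc.content) for doc in docs]
        scores = self.cross_encoder.predict(pairs)

        ranked = sorted(
            zip(docs, scores),
            key=lambda x: -x[1]
        )[:top_k]

        return [doc for doc, _ in ranked]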
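
The reranking step only depends on a scorer with a `predict(pairs)` method, so it can be exercised without downloading a model. In the sketch below, `OverlapScorer` is a hypothetical stand-in that scores by shared words; it is not a cross-encoder and is there only to make the control flow runnable.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    content: str


class OverlapScorer:
    """Hypothetical stand-in for a cross-encoder: scores by shared words."""

    def predict(self, pairs: list[tuple[str, str]]) -> list[float]:
        scores = []
        for query, text in pairs:
            q_words = set(query.lower().split())
            d_words = set(text.lower().split())
            scores.append(len(q_words & d_words) / max(len(q_words), 1))
        return scores


def rerank(scorer, query: str, docs: list[Doc], top_k: int = 5) -> list[Doc]:
    pairs = [(query, doc.content) for doc in docs]
    scores = scorer.predict(pairs)
    # sorted() is stable, so tied documents keep their incoming order.
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


docs = [
    Doc("vector databases store embeddings"),
    Doc("rerank retrieved documents with a cross encoder"),
    Doc("cooking pasta at home"),
]
top = rerank(OverlapScorer(), "how to rerank retrieved documents", docs, top_k=2)
print([d.content for d in top])
```

In practice the scorer would be a trained model; for instance, the `CrossEncoder` class in the sentence-transformers library also exposes a `predict` method that accepts a list of text pairs, so it should slot into the same position.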

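5. Engineering Practice

5.1 Evaluation Metrics

# RAG evaluation metrics
metrics = {
    # Retrieval quality
    "retrieval_recall": "share of relevant documents that were retrieved",
    "retrieval_precision": "share of retrieved documents that are relevant",

    # Generation quality
    "answer_relevance": "how relevant the answer is to the question",
    "faithfulness": "how faithful the answer is to the Context",
    "answer_correctness": "whether the answer is correct"
}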
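
The two retrieval metrics can be computed directly once you have a labeled set of relevant document ids per query. The sketch below is one straightforward way to do it at a cutoff `k`; the function name and the sample ids are made up. The generation metrics usually need human or model-based judgment and are not covered here.

```python
def retrieval_metrics(
    retrieved_ids: list[str], relevant_ids: set[str], k: int
) -> dict[str, float]:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Recall: how many of the relevant documents made it into the top k.
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    # Precision: how many of the top-k documents are actually relevant.
    precision = hits / len(top_k) if top_k else 0.0
    return {"retrieval_recall": recall, "retrieval_precision": precision}


result = retrieval_metrics(["d1", "d7", "d3", "d9"], {"d1", "d3", "d5"}, k=4)
print(result)
```

With two hits (`d1`, `d3`) out of three relevant documents and four retrieved, recall is 2/3 and precision is 0.5.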

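5.2 Common Problems

Problem	Cause	Fix
No relevant content retrieved	Poor splitting or a low-quality Embedding	Tune chunk_size; use a stronger Embedding model
Context exceeds the length limit	Too many or overly long documents	Reranking, summary compression
Inaccurate citations	Splitting breaks document integrity	Add metadata; use parent-child documents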

5.3 Production Deployment

# RAG system architecture
rag_pipeline:
  document_store: milvus
  embedder: openai_embedder
  llm: gpt-4-turbo

  retrieval:
    top_k: 10
    rerank: true
    hybrid_search: true

  generation:
    max_tokens: 2000
    temperature: 0.1
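
A config like this is easier to trust in production if it is validated at startup. The sketch below mirrors the `retrieval` and `generation` sections as dataclasses and rejects obviously bad values. It assumes the YAML has already been parsed into a plain dict by whatever loader you use; the class names, the `load_pipeline_config` function, and the specific checks are illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    top_k: int = 10
    rerank: bool = True
    hybrid_search: bool = True

    def __post_init__(self):
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")


@dataclass(frozen=True)
class GenerationConfig:
    max_tokens: int = 2000
    temperature: float = 0.1

    def __post_init__(self):
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")


def load_pipeline_config(raw: dict) -> tuple[RetrievalConfig, GenerationConfig]:
    section = raw["rag_pipeline"]
    return (
        RetrievalConfig(**section.get("retrieval", {})),
        GenerationConfig(**section.get("generation", {})),
    )


raw = {
    "rag_pipeline": {
        "retrieval": {"top_k": 10, "rerank": True, "hybrid_search": True},
        "generation": {"max_tokens": 2000, "temperature": 0.1},
    }
}
retrieval, generation = load_pipeline_config(raw)
print(retrieval, generation)
```

Unknown keys in either section raise a `TypeError` from the dataclass constructor, which catches typos in the config file early.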

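6. Summary

mindmap
  root((RAG system))
    Document processing
      Smart splitting
        Fixed-size
        Semantic splitting
      Parent-child documents
      Metadata management
    Retrieval
      Vector search
        HNSW/IVF
      Keyword search
        BM25
      Hybrid search
        RRF fusion
    Reranking
      Cross-Encoder
      Precise scoring
    Generation
      Context compression
      Citation annotation
      Multi-turn dialogue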

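Module	Key techniques
Document processing	Smart splitting, parent-child documents
Retrieval	Vector search, keyword search, hybrid search
Reranking	Cross-encoder
Generation	Context compression, citation annotation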