mobile wallpaper 1mobile wallpaper 2mobile wallpaper 3mobile wallpaper 4
1189 字
3 分钟
综合实战:构建一个迷你编程语言
2026-03-22

前 18 章从词法分析、语法分析、语义分析、中间表示、优化、寄存器分配、指令选择、代码生成、LLVM、JIT、GC、V8、Go 编译器、Rust 编译器、WASM、AI 优化一路走来。现在到了综合运用的时候——从零构建一个迷你编程语言 MiniLang

flowchart LR SRC["源码<br/>MiniLang"] --> LEX["词法分析<br/>Ch2"] --> PAR["语法分析<br/>Ch3"] --> SEM["语义分析<br/>Ch4"] SEM --> IR["IR 生成<br/>Ch5"] --> OPT["优化<br/>Ch6"] --> CG["代码生成<br/>Ch10"] CG --> EXEC["解释执行"] style SRC fill:#e3f2fd,stroke:#1565c0 style OPT fill:#e8f5e9,stroke:#2e7d32 style EXEC fill:#fff3e0,stroke:#e65100

一、语言设计#

1.1 MiniLang 特性#

MiniLang:一个静态类型的命令式语言
特性:
- 整数和布尔类型
- 变量声明和赋值
- 算术和比较运算
- if/else 条件
- while 循环
- 函数定义和调用
- 返回值
示例程序:
fn factorial(n: int) -> int {
if n <= 1 {
return 1;
}
return n * factorial(n - 1);
}
fn main() -> int {
let result: int = factorial(10);
return result;
}

1.2 语法规则(简化 BNF)#

program → function*
function → "fn" IDENT "(" params? ")" "->" type block
paramsparam ("," param)*
paramIDENT ":" type
type → "int" | "bool"
block → "{" statement* "}"
statement → let_stmt | assign_stmt | if_stmt | while_stmt | return_stmt | expr_stmt
let_stmt → "let" IDENT ":" type "=" expr ";"
assign_stmt → IDENT "=" expr ";"
if_stmt → "if" expr block ("else" block)?
while_stmt → "while" expr block
return_stmt → "return" expr? ";"
expr → comparison
comparison → addition (("==" | "!=" | "<" | ">" | "<=" | ">=") addition)?
addition → multiplication (("+" | "-") multiplication)*
multiplication → unary (("*" | "/") unary)*
unary → ("!" | "-") unary | call
call → primary ("(" args? ")")?
primary → INT | "true" | "false" | IDENT | "(" expr ")"

二、词法分析器#

flowchart TB SRC["源码字符串"] --> SKIP["跳过空白符"] SKIP --> CH{"当前字符"} CH -->|数字| NUM["read_number()<br/>INT_LIT"] CH -->|字母/_| ID["read_identifier()<br/>关键字/IDENT"] CH -->|运算符| OP["read_operator()<br/>+,-,*,/,==,..."] CH -->|分隔符| PUNCT["read_punctuation()<br/>(,),{,},..."] NUM --> LOOP{"还有字符?"} ID --> LOOP OP --> LOOP PUNCT --> LOOP LOOP -->|是| SKIP LOOP -->|否| EOF["添加 EOF Token"] style SRC fill:#e3f2fd,stroke:#1565c0 style EOF fill:#e8f5e9,stroke:#2e7d32

2.1 Token 定义#

from enum import Enum, auto
from dataclasses import dataclass
class TokenType(Enum):
# 关键字
FN = auto(); LET = auto(); IF = auto(); ELSE = auto()
WHILE = auto(); RETURN = auto(); TRUE = auto(); FALSE = auto()
INT_TYPE = auto(); BOOL_TYPE = auto()
# 运算符
PLUS = auto(); MINUS = auto(); STAR = auto(); SLASH = auto()
EQ = auto(); NEQ = auto(); LT = auto(); GT = auto()
LTE = auto(); GTE = auto(); ASSIGN = auto(); BANG = auto()
# 分隔符
LPAREN = auto(); RPAREN = auto(); LBRACE = auto(); RBRACE = auto()
COMMA = auto(); SEMICOLON = auto(); COLON = auto(); ARROW = auto()
# 字面量
INT_LIT = auto(); IDENT = auto()
# 特殊
EOF = auto()
@dataclass
class Token:
type: TokenType
value: str
line: int
col: int

2.2 Lexer 实现#

class Lexer:
KEYWORDS = {
'fn': TokenType.FN, 'let': TokenType.LET,
'if': TokenType.IF, 'else': TokenType.ELSE,
'while': TokenType.WHILE, 'return': TokenType.RETURN,
'true': TokenType.TRUE, 'false': TokenType.FALSE,
'int': TokenType.INT_TYPE, 'bool': TokenType.BOOL_TYPE,
}
def __init__(self, source: str):
self.source = source
self.pos = 0
self.line = 1
self.col = 1
def tokenize(self) -> list[Token]:
tokens = []
while self.pos < len(self.source):
self.skip_whitespace()
if self.pos >= len(self.source):
break
ch = self.source[self.pos]
if ch.isdigit():
tokens.append(self.read_number())
elif ch.isalpha() or ch == '_':
tokens.append(self.read_identifier())
elif ch == '-' and self.peek_next() == '>':
tokens.append(self.make_token(TokenType.ARROW, '->'))
elif ch in '+-*/=!<>':
tokens.append(self.read_operator())
elif ch in '(){}:,;':
tokens.append(self.read_punctuation())
else:
raise SyntaxError(f"Unexpected character '{ch}' at {self.line}:{self.col}")
tokens.append(Token(TokenType.EOF, '', self.line, self.col))
return tokens
def read_number(self) -> Token:
start = self.pos
while self.pos < len(self.source) and self.source[self.pos].isdigit():
self.advance()
return Token(TokenType.INT_LIT, self.source[start:self.pos], self.line, self.col)
def read_identifier(self) -> Token:
start = self.pos
while self.pos < len(self.source) and (self.source[self.pos].isalnum() or self.source[self.pos] == '_'):
self.advance()
word = self.source[start:self.pos]
token_type = self.KEYWORDS.get(word, TokenType.IDENT)
return Token(token_type, word, self.line, self.col)

三、语法分析器#

3.1 AST 节点#

from dataclasses import dataclass
from typing import Optional
@dataclass
class Expr: pass
@dataclass
class IntLit(Expr):
value: int
@dataclass
class BoolLit(Expr):
value: bool
@dataclass
class VarRef(Expr):
name: str
@dataclass
class BinaryOp(Expr):
op: str; left: Expr; right: Expr
@dataclass
class UnaryOp(Expr):
op: str; operand: Expr
@dataclass
class CallExpr(Expr):
name: str; args: list[Expr]
@dataclass
class Stmt: pass
@dataclass
class LetStmt(Stmt):
name: str; type_name: str; init: Expr
@dataclass
class AssignStmt(Stmt):
name: str; value: Expr
@dataclass
class IfStmt(Stmt):
condition: Expr; then_block: list[Stmt]; else_block: Optional[list[Stmt]]
@dataclass
class WhileStmt(Stmt):
condition: Expr; body: list[Stmt]
@dataclass
class ReturnStmt(Stmt):
value: Optional[Expr]
@dataclass
class ExprStmt(Stmt):
expr: Expr
@dataclass
class Function:
name: str; params: list[tuple[str, str]]; return_type: str; body: list[Stmt]
@dataclass
class Program:
functions: list[Function]

3.2 递归下降解析器#

class Parser:
def __init__(self, tokens: list[Token]):
self.tokens = tokens
self.pos = 0
def parse(self) -> Program:
functions = []
while not self.at_end():
functions.append(self.parse_function())
return Program(functions)
def parse_function(self) -> Function:
self.expect(TokenType.FN)
name = self.expect(TokenType.IDENT).value
self.expect(TokenType.LPAREN)
params = self.parse_params()
self.expect(TokenType.RPAREN)
self.expect(TokenType.ARROW)
return_type = self.parse_type()
body = self.parse_block()
return Function(name, params, return_type, body)
def parse_expr(self) -> Expr:
return self.parse_comparison()
def parse_comparison(self) -> Expr:
left = self.parse_addition()
while self.match(TokenType.EQ, TokenType.NEQ, TokenType.LT,
TokenType.GT, TokenType.LTE, TokenType.GTE):
op = self.previous().value
right = self.parse_addition()
left = BinaryOp(op, left, right)
return left
def parse_addition(self) -> Expr:
left = self.parse_multiplication()
while self.match(TokenType.PLUS, TokenType.MINUS):
op = self.previous().value
right = self.parse_multiplication()
left = BinaryOp(op, left, right)
return left
def parse_multiplication(self) -> Expr:
left = self.parse_unary()
while self.match(TokenType.STAR, TokenType.SLASH):
op = self.previous().value
right = self.parse_unary()
left = BinaryOp(op, left, right)
return left
def parse_primary(self) -> Expr:
if self.match(TokenType.INT_LIT):
return IntLit(int(self.previous().value))
if self.match(TokenType.TRUE):
return BoolLit(True)
if self.match(TokenType.FALSE):
return BoolLit(False)
if self.match(TokenType.IDENT):
name = self.previous().value
if self.match(TokenType.LPAREN):
args = self.parse_args()
self.expect(TokenType.RPAREN)
return CallExpr(name, args)
return VarRef(name)
if self.match(TokenType.LPAREN):
expr = self.parse_expr()
self.expect(TokenType.RPAREN)
return expr
raise SyntaxError(f"Unexpected token: {self.peek()}")

四、语义分析#

flowchart TB PROG["Program"] --> REG["注册所有函数签名"] REG --> CHECK["逐函数检查"] CHECK --> SCOPE["进入新作用域"] SCOPE --> PARAMS["声明参数类型"] PARAMS --> STMTS["逐语句检查"] STMTS --> EXPR_CHECK["表达式类型推导"] EXPR_CHECK --> TYPE_MATCH{"类型匹配?"} TYPE_MATCH -->|否| ERR["TypeError"] TYPE_MATCH -->|是| OK["返回推导类型"] OK --> SCOPE_EXIT["退出作用域"] style ERR fill:#ffcdd2,stroke:#c62828 style OK fill:#c8e6c9,stroke:#2e7d32
Warning

MiniLang 的类型检查是单向的——只做类型验证,不做类型推导。这意味着每个变量声明必须显式标注类型(let x: int = 5)。真实的编译器(如 Rust、Haskell)会做双向类型检查和 Hindley-Milner 推导,但那需要更复杂的统一算法。

4.1 类型检查#

class TypeChecker:
def __init__(self):
self.scopes: list[dict[str, str]] = [{}] # 符号表栈
self.functions: dict[str, Function] = {}
def check_program(self, program: Program):
for func in program.functions:
self.functions[func.name] = func
for func in program.functions:
self.check_function(func)
def check_function(self, func: Function):
self.enter_scope()
for name, type_name in func.params:
self.declare(name, type_name)
for stmt in func.body:
self.check_stmt(stmt, func.return_type)
self.exit_scope()
def check_expr(self, expr: Expr) -> str:
if isinstance(expr, IntLit): return 'int'
if isinstance(expr, BoolLit): return 'bool'
if isinstance(expr, VarRef):
return self.lookup(expr.name)
if isinstance(expr, BinaryOp):
left_type = self.check_expr(expr.left)
right_type = self.check_expr(expr.right)
if left_type != right_type:
raise TypeError(f"Type mismatch: {left_type} vs {right_type}")
if expr.op in ('+', '-', '*', '/'):
if left_type != 'int':
raise TypeError(f"Arithmetic on non-int: {left_type}")
return 'int'
if expr.op in ('==', '!=', '<', '>', '<=', '>='):
return 'bool'
if isinstance(expr, CallExpr):
func = self.functions.get(expr.name)
if not func:
raise NameError(f"Undefined function: {expr.name}")
if len(expr.args) != len(func.params):
raise TypeError(f"Wrong number of arguments for {expr.name}")
for (arg, (pname, ptype)) in zip(expr.args, func.params):
arg_type = self.check_expr(arg)
if arg_type != ptype:
raise TypeError(f"Argument type mismatch: expected {ptype}, got {arg_type}")
return func.return_type
raise ValueError(f"Unknown expression type: {type(expr)}")

五、IR 生成#

MiniLang 的编译管线将源码逐步变换为可执行的目标代码,每个阶段都有明确的输入/输出格式:

编译阶段输入输出核心数据结构关键操作
词法分析源码字符串Token 流Token(type, value, line, col)字符分类、关键字识别、运算符合并
语法分析Token 流ASTProgram → Function → Stmt/Expr递归下降、运算符优先级、错误恢复
语义分析AST带类型的 AST符号表栈 list[dict[str, str]]类型检查、作用域解析、函数签名验证
IR 生成带类型的 AST三地址码IRInstr(op, dest, arg1, arg2)临时变量分配、标签生成、表达式展开
优化三地址码优化后三地址码常量表 dict[str, int] + 使用集常量折叠、死代码消除
代码生成优化后三地址码栈机器指令PUSH/LOAD/STORE/OP/CALL栈机器映射、函数调用约定
解释执行栈机器指令执行结果list[int] + 变量表指令派发、递归调用、返回值传递
flowchart LR SRC["源码字符串"] --> LEX["词法分析<br/>Token 流"] --> PAR["语法分析<br/>AST"] --> SEM["语义分析<br/>带类型的 AST"] SEM --> IRG["IR 生成<br/>三地址码"] --> OPT["优化<br/>常量折叠 + DCE"] --> CG["代码生成<br/>栈机器指令"] CG --> INTERP["解释执行<br/>栈机器"] style SRC fill:#e3f2fd,stroke:#1565c0 style OPT fill:#e8f5e9,stroke:#2e7d32 style INTERP fill:#fff3e0,stroke:#e65100

5.1 三地址码#

@dataclass
class IRInstr:
op: str # 操作码
dest: str = '' # 目标变量
arg1: str = '' # 操作数1
arg2: str = '' # 操作数2
label: str = '' # 标签(跳转目标)
class IRGenerator:
def __init__(self):
self.temp_count = 0
self.label_count = 0
self.instructions: list[IRInstr] = []
def new_temp(self) -> str:
self.temp_count += 1
return f"t{self.temp_count}"
def new_label(self) -> str:
self.label_count += 1
return f"L{self.label_count}"
def gen_function(self, func: Function) -> list[IRInstr]:
self.instructions = []
self.emit('func', label=func.name)
for name, _ in func.params:
self.emit('param', dest=name)
for stmt in func.body:
self.gen_stmt(stmt)
self.emit('endfunc')
return self.instructions
def gen_expr(self, expr: Expr) -> str:
if isinstance(expr, IntLit):
return str(expr.value)
if isinstance(expr, VarRef):
return expr.name
if isinstance(expr, BinaryOp):
left = self.gen_expr(expr.left)
right = self.gen_expr(expr.right)
dest = self.new_temp()
self.emit(expr.op, dest=dest, arg1=left, arg2=right)
return dest
if isinstance(expr, CallExpr):
args = [self.gen_expr(a) for a in expr.args]
for arg in args:
self.emit('push', arg1=arg)
dest = self.new_temp()
self.emit('call', dest=dest, arg1=expr.name, arg2=str(len(args)))
return dest
raise ValueError(f"Unknown expr: {type(expr)}")

六、优化#

编译器优化的核心思路是:在保证语义等价的前提下,减少运行时计算量。以下两种优化是最基础的,也是几乎所有编译器都会实现的:

优化技术目标实现方式效果
常量折叠消除编译期可确定的计算识别操作数全为常量的表达式,直接计算结果3 + 58,减少运行时指令
死代码消除删除不会被执行的代码分析变量使用关系,删除未被引用的赋值let x = 1; (x 未使用) → 删除
常量传播将常量值替换到使用处跟踪常量赋值,在引用处替换为字面值let x = 3; y = x + 1y = 4

6.1 常量折叠#

class ConstantFolder:
def optimize(self, instructions: list[IRInstr]) -> list[IRInstr]:
constants: dict[str, int] = {}
result = []
for instr in instructions:
if instr.op == 'assign' and instr.arg1.isdigit():
constants[instr.dest] = int(instr.arg1)
result.append(instr)
elif instr.op in ('+', '-', '*', '/'):
left_val = constants.get(instr.arg1)
right_val = constants.get(instr.arg2)
if left_val is not None and right_val is not None:
# 编译期计算
folded = self.eval_op(instr.op, left_val, right_val)
constants[instr.dest] = folded
result.append(IRInstr('assign', dest=instr.dest, arg1=str(folded)))
else:
result.append(instr)
else:
result.append(instr)
return result
def eval_op(self, op: str, a: int, b: int) -> int:
if op == '+': return a + b
if op == '-': return a - b
if op == '*': return a * b
if op == '/': return a // b
raise ValueError(f"Unknown op: {op}")

6.2 死代码消除#

class DeadCodeEliminator:
def optimize(self, instructions: list[IRInstr]) -> list[IRInstr]:
used: set[str] = set()
# 第一遍:收集使用的变量
for instr in instructions:
if instr.arg1 and not instr.arg1.isdigit():
used.add(instr.arg1)
if instr.arg2 and not instr.arg2.isdigit():
used.add(instr.arg2)
# 第二遍:删除未使用的赋值
result = []
for instr in instructions:
if instr.op in ('assign', '+', '-', '*', '/', '/') and instr.dest not in used:
continue # 死代码,跳过
result.append(instr)
return result

七、代码生成#

7.1 栈机器目标#

class StackCodeGenerator:
def generate(self, instructions: list[IRInstr]) -> list[str]:
code = []
for instr in instructions:
if instr.op == 'func':
code.append(f"FUNC {instr.label}")
elif instr.op == 'endfunc':
code.append("ENDFUNC")
elif instr.op == 'param':
code.append(f"LOAD_PARAM {instr.dest}")
elif instr.op == 'assign':
code.append(f"PUSH {instr.arg1}")
code.append(f"STORE {instr.dest}")
elif instr.op in ('+', '-', '*', '/'):
code.append(f"LOAD {instr.arg1}")
code.append(f"LOAD {instr.arg2}")
code.append(f"OP {instr.op}")
code.append(f"STORE {instr.dest}")
elif instr.op == 'call':
code.append(f"CALL {instr.arg1} {instr.arg2}")
code.append(f"STORE {instr.dest}")
elif instr.op == 'return':
if instr.arg1:
code.append(f"LOAD {instr.arg1}")
code.append("RET")
return code

7.2 解释器#

class Interpreter:
def __init__(self):
self.stack: list[int] = []
self.variables: dict[str, int] = {}
self.functions: dict[str, list[str]] = {}
def run(self, code: list[str]) -> int:
# 加载函数
self.load_functions(code)
# 调用 main
return self.call_function('main', [])
def call_function(self, name: str, args: list[int]) -> int:
func_code = self.functions[name]
local_vars = {}
# 绑定参数
for i, instr in enumerate(func_code):
if instr.startswith('LOAD_PARAM'):
var_name = instr.split()[1]
local_vars[var_name] = args.pop(0)
# 执行
ip = 0
while ip < len(func_code):
instr = func_code[ip]
parts = instr.split()
op = parts[0]
if op == 'PUSH':
self.stack.append(int(parts[1]))
elif op == 'LOAD':
self.stack.append(local_vars.get(parts[1], self.variables.get(parts[1], 0)))
elif op == 'STORE':
local_vars[parts[1]] = self.stack.pop()
elif op == 'OP':
b, a = self.stack.pop(), self.stack.pop()
if parts[1] == '+': self.stack.append(a + b)
elif parts[1] == '-': self.stack.append(a - b)
elif parts[1] == '*': self.stack.append(a * b)
elif parts[1] == '/': self.stack.append(a // b)
elif op == 'RET':
return self.stack.pop() if self.stack else 0
elif op == 'CALL':
result = self.call_function(parts[1], [self.stack.pop() for _ in range(int(parts[2]))][::-1])
self.stack.append(result)
ip += 1
return 0

八、端到端测试#

8.1 测试用例#

def test_factorial():
source = """
fn factorial(n: int) -> int {
if n <= 1 {
return 1;
}
return n * factorial(n - 1);
}
fn main() -> int {
return factorial(10);
}
"""
# 1. 词法分析
tokens = Lexer(source).tokenize()
# 2. 语法分析
ast = Parser(tokens).parse()
# 3. 语义分析
TypeChecker().check_program(ast)
# 4. IR 生成
ir = IRGenerator().gen_program(ast)
# 5. 优化
ir = ConstantFolder().optimize(ir)
ir = DeadCodeEliminator().optimize(ir)
# 6. 代码生成
code = StackCodeGenerator().generate(ir)
# 7. 执行
result = Interpreter().run(code)
assert result == 3628800 # 10!
print(f"factorial(10) = {result}")

8.2 性能对比#

实现factorial(10) 时间说明
Python 解释器~1 μs基准
MiniLang 解释器~10 μs栈机器开销
MiniLang → LLVM~0.1 μs编译为原生代码

九、总结#

上一章中,我们看到了 MLGO 和 AutoTVM 如何用机器学习替代启发式规则,让编译器从数据中学习优化决策。现在,把前 18 章的知识融会贯通——从零构建一个迷你编程语言 MiniLang,亲手实现词法分析、语法分析、语义分析、IR 生成、优化和代码生成的完整流水线。

编译阶段实现方式对应章节
词法分析手写 LexerCh2 词法分析
语法分析递归下降 ParserCh3 语法分析
语义分析类型检查 + 符号表Ch4 语义分析
IR 生成三地址码Ch5 中间表示
优化常量折叠 + 死代码消除Ch6 优化基础
代码生成栈机器Ch10 代码生成
解释执行栈机器解释器
Tip

MiniLang 虽然简单,但它涵盖了编译器的完整流程:词法分析→语法分析→语义分析→IR生成→优化→代码生成→执行。理解了 MiniLang,你就理解了编译器的核心骨架。接下来可以尝试:

  • 添加闭包支持(需要捕获环境)

  • 添加模式匹配(需要更复杂的 IR)

  • 编译到 LLVM IR(使用 llvmlite)

  • 添加 JIT 执行(参考 Ch12 JIT编译

本系列到此完结。从编译器全景到本实战章,希望你已经从”编译器是黑盒”进阶到”能从零构建编译器”。

支持与分享

如果这篇文章对你有帮助,欢迎支持作者或分享给更多人

综合实战:构建一个迷你编程语言
https://blog.souloss.com/posts/compiler/compiler-hands-on-practice/
作者
Souloss
发布于
2026-03-22
许可协议
CC BY-NC-SA 4.0

部分信息可能已经过时