
7 · Compiler internals

Welcome, contributor. This chapter is the architecture reference for people hacking on amc itself. The compiler is a single straight pipeline: source → lexer → parser → resolver → typechecker → cgen → C.

If you came here because the tests broke after your change, jump straight to the "Adding a feature" recipes near the end.

Pipeline shape

foo.am ─▶ Lexer ─▶ tokens ─▶ Parser ─▶ AST ─▶ Resolver ─▶ TypeChecker ─▶ CGen ─▶ foo.c
                                       │           │            │           │
                                       │           │            │           └─ src/generator/c_gen.am
                                       │           │            └────────────  src/typechecker.am
                                       │           └─────────────────────────  src/resolver/{symbol,resolver}.am
                                       └─────────────────────────────────────  src/parser/{ast,parser}.am
                                                                               src/lexer/{token,lexer}.am

src/main.am (AmalgameCompiler.Run) drives the pipeline. Each phase is a separate class in its own file.

Lexer (src/lexer/)

Adding a new token:

  1. Add the variant to TokenType in src/lexer/token.am.
  2. Recognise it in lexer.am — usually inside the symbol-reading block (else if (c == "@") { ... }) or the keyword lookup (if (word == "guard") { ... }).
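As a rough illustration of step 2, the sketch below shows the two places a new token usually lands: a branch in the symbol-reading block and an entry in the keyword lookup. It is written in Python for brevity rather than Amalgame, and the token kind names (KEYWORD, IDENT, AT, SYMBOL) and the keyword set are assumptions made for the example, not the contents of token.am.

```python
# Hypothetical keyword set; adding "guard" here mirrors the keyword lookup.
KEYWORDS = {"if", "else", "return", "guard"}


def lex(src):
    """Split src into (kind, text) pairs. A toy lexer, not lexer.am."""
    tokens = []
    i = 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            # Read a whole word, then decide keyword vs identifier.
            start = i
            while i < len(src) and (src[i].isalnum() or src[i] == "_"):
                i += 1
            word = src[start:i]
            kind = "KEYWORD" if word in KEYWORDS else "IDENT"
            tokens.append((kind, word))
        elif c == "@":
            # Symbol-reading block: one branch per punctuation token.
            tokens.append(("AT", c))
            i += 1
        else:
            tokens.append(("SYMBOL", c))
            i += 1
    return tokens
```

Because the keyword check runs on the complete word, an identifier that merely starts with a keyword (guardian, for instance) stays an identifier.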

Parser (src/parser/)

Each construct has a dedicated parser function (ParseDecl, ParseClass, ParseMethod, ParseStmt, ParseExpr, ParseUnary, ParsePrimary, ParsePostfix, ParseCallArgs, ParseMatch, …).

Adding a new statement (guard, for example):

  1. Add the keyword token in the lexer.
  2. In ParseStmt, dispatch on the keyword: if (v == "guard") { return this.ParseGuard() }.
  3. Implement ParseGuard() — usually building a normal IF_STMT with a transformed condition (so the rest of the pipeline doesn't need to learn the new construct).
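The desugaring idea in step 3 can be sketched as follows. This is a Python illustration, not the real ParseGuard: the Node class, the UNARY and BLOCK kinds, and the assumption that guard means "run the block when the condition is false" are all hypothetical. Only IF_STMT comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                     # e.g. "IF_STMT", "UNARY", "IDENT", "BLOCK"
    value: str = ""
    children: list = field(default_factory=list)


def desugar_guard(cond, else_block):
    """Lower an assumed `guard cond else { ... }` to `if (!cond) { ... }`.

    Later phases only ever see an ordinary IF_STMT.
    """
    negated = Node("UNARY", "!", [cond])          # the transformed condition
    return Node("IF_STMT", "", [negated, else_block])
```

The payoff of this shape is that the resolver, typechecker, and generator need no new branch: they already know how to handle an if statement and a negation.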

Resolver (src/resolver/)

Two passes:

  1. CollectDecl — registers every top-level type (class, enum) in the global scope so forward references work; builds the MemberTable mapping ClassName.MemberName → typeName.
  2. ResolveDecl — walks the AST, opens/closes scopes for methods/blocks/for-in/match-arm, registers locals on declaration, reports Unknown symbol 'x' for unresolved identifiers.
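To see why two passes make forward references work, here is a deliberately small Python sketch. The input shape (a list of class names with the type names each one mentions) is an assumption for illustration; the real passes walk the AST and also build the MemberTable.

```python
def resolve(decls):
    """decls: list of (type_name, [type names it references]).

    Returns the list of error strings, empty when everything resolves.
    """
    global_scope = set()
    errors = []

    # Pass 1 (CollectDecl in the text): register every top-level type
    # before looking at any body.
    for name, _refs in decls:
        global_scope.add(name)

    # Pass 2 (ResolveDecl in the text): every reference is checked against
    # the complete scope, so declaration order no longer matters.
    for _name, refs in decls:
        for ref in refs:
            if ref not in global_scope:
                errors.append(f"Unknown symbol '{ref}'")
    return errors
```

With a single pass, a class A that mentions B before B is declared would be reported as an error; collecting first removes that ordering constraint.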

Local scope is a flat array of names with a stack of start-indices (ScopeStarts). PushScope records the current count, PopScope truncates entries declared since.
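A minimal Python sketch of that data structure, assuming nothing beyond what the paragraph states (the class and method names are illustrative, not the resolver's):

```python
class LocalScopes:
    """Flat list of names plus a stack of start indices."""

    def __init__(self):
        self.names = []         # every local currently in scope, oldest first
        self.scope_starts = []  # len(names) at the moment each scope opened

    def push_scope(self):
        # Record the current count so pop_scope knows where to cut.
        self.scope_starts.append(len(self.names))

    def declare(self, name):
        self.names.append(name)

    def pop_scope(self):
        # Truncate everything declared since the matching push_scope.
        start = self.scope_starts.pop()
        del self.names[start:]

    def is_declared(self, name):
        return name in self.names
```

The design keeps lookup and scope exit trivial: there is no per-scope container to allocate, and leaving a scope is a single truncation.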

The resolver also owns the SourceMap — that's what powers the rustc-style snippets in error messages.

Adding a new builtin (e.g. a new String_* runtime helper):

  1. Add the C declaration to the right header in runtime/.
  2. In RegisterBuiltins() in src/resolver/resolver.am, declare it as a global with its return type:
    this.DeclareGlobal("String_DamerauLevenshtein", "int", false)
    
  3. (Optional, but recommended) Add a return-type entry to BuiltinCallReturnType() and InferTypeFromExpr() in src/generator/c_gen.am so interpolation and type inference know about it.

TypeChecker (src/typechecker.am)

Errors carry their source snippet (loaded into Sources: SourceMap by main.am), which is rendered by TypeError.ToString().

CGen (src/generator/c_gen.am)

The biggest single file (~2000 lines). Two-pass:

Statement and expression emission is split into many small functions (EmitStmt, EmitBlock, EmitExprStr, EmitMatch, …). Use this.Out.EmitLine / Indent_ / Dedent to maintain indentation.

The Emitter has a Streaming flag — when set, EmitLine writes directly to a file via File.StreamLine instead of accumulating in a List<string>. Used by gen_test.am's gen6 to write multi-MB amc_lib.c quickly.
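The buffering-versus-streaming switch can be pictured with the Python sketch below. It is an illustration only: the class layout, the four-space indent, and the use of a file-like sink in place of File.StreamLine are assumptions.

```python
import io


class Emitter:
    """Collects indented lines in memory, or streams them to a sink."""

    def __init__(self, sink=None):
        self.lines = []                    # in-memory mode: accumulated output
        self.sink = sink                   # streaming mode: object with write()
        self.streaming = sink is not None  # stands in for the Streaming flag
        self.depth = 0

    def indent(self):
        self.depth += 1

    def dedent(self):
        self.depth -= 1

    def emit_line(self, text):
        line = "    " * self.depth + text
        if self.streaming:
            self.sink.write(line + "\n")   # nothing is retained in memory
        else:
            self.lines.append(line)


# Usage: callers make the same calls in both modes.
buf = io.StringIO()                        # stands in for the output file
out = Emitter(sink=buf)
out.emit_line("int main(void) {")
out.indent()
out.emit_line("return 0;")
out.dedent()
out.emit_line("}")
```

Since only emit_line knows about the mode, the emission functions do not change when streaming is turned on.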

Adding a feature in CGen:

  1. Decide the AST shape — is it a new NodeKind, or a flag on an existing one (e.g. ?. reuses MEMBER with Flag = true)?
  2. Add a branch in EmitStmt / EmitExprStr for the new shape.
  3. If the construct uses statements that the parent block won't see (e.g. a binder declaration that needs to be in scope for a guard), reach for GCC compound expressions: ({ stmt; expr; }).
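For step 3, the sketch below shows how a generator might assemble the text of a statement expression, which is a GNU C extension accepted by GCC and Clang. It is a Python illustration; emit_stmt_expr is a hypothetical helper, not a function in c_gen.am, and the C fragments in the usage are made up.

```python
def emit_stmt_expr(stmts, result_expr):
    """Return C text of the form ({ stmt; ...; result_expr; }).

    The value of the whole construct is the value of result_expr.
    """
    parts = [s if s.endswith(";") else s + ";" for s in stmts]
    parts.append(result_expr + ";")
    return "({ " + " ".join(parts) + " })"


# Usage: declare a temporary and test it inside one C expression.
cond = emit_stmt_expr(["int tmp = parse_int(s)"], "tmp != 0")
```

Here cond holds the string ({ int tmp = parse_int(s); tmp != 0; }), which can be dropped anywhere C expects an expression, such as an if condition.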

Formatter (src/formatter/formatter.am)

amc fmt file.am re-emits source from the AST. The formatter walks the same AstNode tree the rest of the compiler builds, so anything expressible by the parser round-trips. Comments survive because the lexer emits them as COMMENT tokens (rather than skipping them as whitespace), and the parser collects them into Parser.Comments without putting them in the AST. Formatter.Sync(line) re-injects them at their original source line.
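One way to picture the re-injection is the Python sketch below. It assumes comments are stored as (source line, text) pairs and that every emitted statement knows its original line; the class is illustrative and the flushing rule (emit comments from strictly earlier lines first) is a guess at the behaviour, not a description of Formatter.Sync.

```python
class CommentSync:
    """Interleaves collected comments with re-emitted statements."""

    def __init__(self, comments):
        # comments: list of (source_line, text), as gathered by the parser.
        self.pending = sorted(comments)
        self.out = []

    def sync(self, line):
        # Flush every comment that appeared before `line` in the source.
        while self.pending and self.pending[0][0] < line:
            self.out.append(self.pending.pop(0)[1])

    def emit(self, line, text):
        self.sync(line)
        self.out.append(text)

    def finish(self):
        # Comments after the last statement still need to be written.
        while self.pending:
            self.out.append(self.pending.pop(0)[1])
```

Keeping comments out of the AST means the tree stays the same shape for every other phase, while the formatter can still place each comment by line number.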

A few pieces collaborate to keep round-tripping idempotent:

tests/fmt/fmt_test.am checks idempotence + semantic equivalence on a small fixture (amc test ./tests/fmt/); the regression sweep that runs amc fmt on every compiler source must stay green.

Linter (src/linter.am)

amc --lint file.am runs a static-analysis pass on top of the parsed AST and emits non-fatal warnings. The Linter shares no state with the typechecker — it walks the AST top-down and collects LintWarning records into linter.Warnings.

Coverage today:

The unused/shadow checks rely on a small in-linter scope stack (parallel LocalNames / ScopeStarts arrays, same shape as the resolver). Use marking is recorded into an append-only UsedNames list; each local stores a UsedNames.Count() snapshot at declaration so PopScope can answer "did this name appear in UsedNames after I was declared?" without needing List<T>.Set. UsedNames is reset to a fresh list at each method entry to keep the snapshot indices meaningful per-method and bound memory.
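The snapshot trick is easier to follow in code. The Python sketch below mirrors the description (parallel arrays, an append-only use log, a per-declaration snapshot); the class name, method names, and warning text are assumptions.

```python
class UnusedTracker:
    def __init__(self):
        self.local_names = []   # flat list of declared locals
        self.snapshots = []     # len(used_names) at each declaration
        self.scope_starts = []  # same role as in the resolver
        self.used_names = []    # append-only log of every name use
        self.warnings = []

    def enter_method(self):
        # A fresh log per method keeps snapshot indices meaningful
        # and bounds memory.
        self.used_names = []

    def push_scope(self):
        self.scope_starts.append(len(self.local_names))

    def declare(self, name):
        self.local_names.append(name)
        self.snapshots.append(len(self.used_names))

    def use(self, name):
        self.used_names.append(name)

    def pop_scope(self):
        start = self.scope_starts.pop()
        for name, snap in zip(self.local_names[start:], self.snapshots[start:]):
            # Only uses logged after the declaration count for this local.
            if name not in self.used_names[snap:]:
                self.warnings.append(f"unused local '{name}'")
        del self.local_names[start:]
        del self.snapshots[start:]
```

Because the log is only ever appended to, no entry has to be updated in place, which is what lets the real linter avoid List<T>.Set.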

Method and lambda params are tracked in scope so they participate in shadow detection, but they're never warned about as unused (forcing a _param-style rename would be unergonomic).

The skeleton is set up to grow more checks (suspicious patterns, implicit fallthrough in match, etc.) by extending LintStmt / LintExpr without touching the rest of the pipeline. Warnings always carry the per-program filename (populated from prog.Str2) so multi-file invocations report the right paths.

The CLI flag (--lint) is wired in main.am, after the typechecker pass and before code generation. Warnings don't bump ExitCode: amc --lint -o foo file.am still produces output.

Test runner (amc test)

amc test [<dir>] discovers *_test.am under <dir> (default .), compiles + runs each, and aggregates [PASS] <name>, [FAIL] <name>: <msg>, and [SKIP] <name> lines from each child's stdout. The runner lives in Program.RunTest in main.am; the subcommand is dispatched from Program.Main next to the fmt subcommand.

Pipeline per test file:

  1. Pre-flight (once per run) — resolve amcRuntime from $AMC_RUNTIME, else <dirname(amc)>/runtime, else fall back to ./runtime. Then load PackageRegistry (manifest + lock) and, for every package declaring [stdlib].sources, Program.PreCompilePackageSources compiles each .c once with gcc -O2 -I<amcRuntime> -w -c … to a cached /tmp/amc-pkg-<class>-<leaf>.o. The .o paths feed into the per-test link step below.
  2. Discover — shells out to find <dir> -name '*_test.am' -type f via Process.RunCapture. This works on both POSIX systems and Windows under MSYS2 (the environment the Windows CI uses).
  3. Compile to C — invokes the running amc binary on the file (path from Args_Get(0)) with -o /tmp/amc_test_<idx> and --quiet. Emits <tmp>.c.
  4. Compile to native: gcc -O2 -Iruntime -I'<amcRuntime>' <tmp>.c <pkg-objs…> -lgc -lm -ldl -lpthread -o <tmp>. The pre-compiled package .o files from the pre-flight step are spliced in so vendoring backends (SQLite, future DuckDB) link cleanly with no user intervention. -ldl / -lpthread are unconditional: harmless when no package needs them, required by SQLite.
  5. Run + parse — runs the test binary via Process.RunCapture and scans its stdout for tag-prefixed lines. Anything else is ignored. A non-zero exit with no tag lines is reported as [FAIL] <crash> exit=N so silent crashes still register.

The runner exits non-zero if any case FAILs or any file fails to compile; otherwise zero. The convention deliberately stays framework-free for v1 — a richer Assert module + test_<name> auto-discovery is a possible v2.
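The parsing and exit rules can be summarised in a few lines. The Python sketch below is an approximation of the behaviour described above, not the code in Program.RunTest; the function name and the tuple shape of the results are assumptions.

```python
TAGS = ("[PASS]", "[FAIL]", "[SKIP]")


def summarize(stdout, exit_code):
    """Return (results, failed) for one test binary.

    results is a list of (kind, rest_of_line); failed is True when the
    file should make the whole run exit non-zero.
    """
    results = []
    for line in stdout.splitlines():
        for tag in TAGS:
            if line.startswith(tag):
                results.append((tag[1:-1], line[len(tag):].strip()))
                break
        # Lines without a tag prefix are ignored.

    # A crash that printed no tag lines still has to register as a failure.
    if exit_code != 0 and not results:
        results.append(("FAIL", f"<crash> exit={exit_code}"))

    failed = any(kind == "FAIL" for kind, _ in results)
    return results, failed
```

Keeping the protocol to three line prefixes is what allows a test file to be any program that prints to stdout, with no assertion library required.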

LSP server (amc lsp, src/lsp.am)

amc lsp runs a minimal LSP 3.x server speaking JSON-RPC 2.0 over stdio with the standard Content-Length: N\r\n\r\n<N bytes> framing. v1 implements:

Hover, completion, and goto-definition are out of scope for v1 and will land in a follow-up.

JSON handling is ad-hoc rather than a real parser: the JsonStr(body, key) and JsonInt(body, key) static helpers find "<key>", skip to the value, and read until the appropriate terminator (handling backslash escapes for strings). LSP messages don't have ambiguous keys at the depth we extract (method, id, uri, text), so this trades correctness on arbitrary JSON for ~50 lines of code instead of a full tagged union + recursive-descent parser. If hover or completion need deeper extraction, promote the codec to a proper stdlib/Json module.
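For a feel of what such helpers look like, here is a Python sketch of the find-the-key-then-scan approach. It follows the description above but is not the code in src/lsp.am: the return values for a missing key ("" and a caller-supplied default) and the set of escapes handled are assumptions.

```python
def _value_start(body, key):
    """Index of the first character of the value after "key", or -1."""
    pos = body.find('"' + key + '"')
    if pos < 0:
        return -1
    pos = body.find(":", pos + len(key) + 2)
    if pos < 0:
        return -1
    pos += 1
    while pos < len(body) and body[pos] in " \t\r\n":
        pos += 1
    return pos


def json_str(body, key):
    pos = _value_start(body, key)
    if pos < 0 or pos >= len(body) or body[pos] != '"':
        return ""
    pos += 1
    out = []
    escapes = {"n": "\n", "t": "\t", "r": "\r"}
    while pos < len(body) and body[pos] != '"':
        if body[pos] == "\\" and pos + 1 < len(body):
            pos += 1                       # take the escaped character
            out.append(escapes.get(body[pos], body[pos]))
        else:
            out.append(body[pos])
        pos += 1
    return "".join(out)


def json_int(body, key, default=0):
    pos = _value_start(body, key)
    if pos < 0:
        return default
    end = pos
    if end < len(body) and body[end] == "-":
        end += 1
    while end < len(body) and body[end] in "0123456789":
        end += 1
    digits = body[pos:end]
    return int(digits) if digits not in ("", "-") else default
```

The sketch shares the limitation the paragraph names: it returns the first textual match for a key, so it is only safe when keys are unambiguous at the depth being read.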

Two new runtime helpers shipped to support the framing: Console_ReadBytes(n) reads exactly n bytes from stdin (the LSP body after parsing Content-Length), and Console_Flush() drains the stdout buffer so the client doesn't block waiting on buffered replies.
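The role of the two helpers is easiest to see in a framing loop. The Python sketch below reads and writes one Content-Length framed message over a binary stream; read_message and write_message are illustrative names, with stream.read standing in for Console_ReadBytes and stream.flush for Console_Flush.

```python
import io


def read_message(stream):
    """Read one framed message body, or return None on EOF or bad framing."""
    length = -1
    while True:
        line = stream.readline()
        if not line:
            return None                    # EOF before the header ended
        if line in (b"\r\n", b"\n"):
            break                          # blank line closes the header block
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length < 0:
        return None
    body = stream.read(length)             # exactly `length` bytes expected
    if len(body) != length:
        return None
    return body.decode("utf-8")


def write_message(stream, body):
    data = body.encode("utf-8")            # the length counts bytes, not chars
    stream.write(b"Content-Length: %d\r\n\r\n" % len(data))
    stream.write(data)
    stream.flush()                         # otherwise the client may wait


# Usage: round-trip one message through an in-memory stream.
buf = io.BytesIO()
write_message(buf, '{"id":1}')
buf.seek(0)
reply = read_message(buf)
```

Reading by byte count rather than by line matters because the JSON body is not newline-terminated and may itself contain newlines.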

The resolver gained a parallel RawErrors: List<ResolverError> that's kept in lock-step with the formatted Errors: List<string>. ResolverError is a tiny local mirror of TypeError — they duplicate fields rather than share, because resolver.am is compiled before typechecker.am in the bundle and we want the source-file dependency graph to stay one-way.

Snapshot bootstrap (snapshot/, tools/save-snapshot.sh)

build_amc.sh has a 2-rung bootstrap chain:

  1. ./amc — the freshly-built self-hosted compiler.
  2. ./snapshot/amc — last known-good amc, captured by tools/save-snapshot.sh after a green test run. The portable snapshot/amc_lib.c is committed; the binary is rebuilt by gcc on each platform that needs it.

The snapshot rung exists so we can introduce new syntax without losing the bootstrap. If ./amc breaks mid-development, build_amc.sh falls through to ./snapshot/amc, which still understands every syntax shipped at the time of the last tools/save-snapshot.sh run. From a clean clone, recompile snapshot/amc from the tracked snapshot/amc_lib.c with one gcc invocation — see snapshot/INFO.md.

When introducing new syntax, take a snapshot before using the new construct in src/*.am. That way, if the implementation has a regression, the snapshot still works.

main.am

Glue:

  1. Parses CLI args (Args.Count, Args.Get).
  2. Reads each input file, runs Lexer + Parser (Pass 1 of CGen).
  3. Builds the FullResolver, feeds all programs, runs both passes.
  4. Builds the TypeChecker, runs on the first program (multi-file typechecking is partial today).
  5. Runs CGen Pass 2 over each program.
  6. Emits the final int main() wrapper unless --lib or no Program.Main was found.

gen_test.am

src/generator/gen_test.am is the "build the build" — when run, it parses every .am file in the compiler, drives a single CGen across all of them, and writes the result to src/amc_lib.c. This is the canonical self-host artefact.

It also runs in streaming mode (SetStreaming(true)), bypassing the in-memory line list and writing each line straight to disk through File_StreamLine.

Tests

When you add a feature:

  1. Drop a sample in tests/samples/montest.am (one feature, one observable).
  2. Add an assertion to the appropriate bundle, e.g. in core_bundle/core_test.am:
    g_montest.Add(new CoreCase("my feature", "expected output"))
    Program.RunGroup("./tests/samples/montest.am", "montest", g_montest)
    
  3. Verify: ./amc test ./tests/core_bundle/.

For tooling tests (LSP, lint, --check, --lib, multi-file, external), use the specialised helpers already defined in core_test.am: RunLspCheck, RunLintCheck, RunCheckFail, RunCCheck, RunLibTest, RunMultiFile, RunExternalTest, RunCmdGrep, etc.

Common gotchas

Where to ask