Compiler Fuzzing in EK9

EK9 includes a built-in grammar-based fuzzer that generates random EK9 programs, compiles them, and tracks statistics including error code coverage, phase penetration, and compiler crashes. This is used for compiler quality assurance — verifying that the compiler handles every possible input gracefully.

This feature is built into the shipped compiler, so anyone can use it to test the compiler.
If you find any crashes or issues, let us know and supply the 'offending' EK9 source that caused them so that we can address the problem. We build new versions of the compiler weekly, so a fix won't take long.

Looking for test generation? If you want to generate tests for your code, see Test Generation instead. This page documents the compiler fuzzer that tests the compiler itself.

What is Compiler Fuzzing?

The EK9 compiler has a 22-phase compilation pipeline, over 300 distinct error codes, and supports 29 construct types with complex interactions. Grammar-based fuzzing systematically exercises this complexity by generating random-but-structurally-plausible EK9 programs and compiling them.

The fuzzer answers three questions:

Running the Fuzzer

The fuzzer runs for a specified number of minutes, generating and compiling programs continuously:

  • $ ek9 -fuzz 30 // 30 minutes, human-readable output
  • $ ek9 -fuzz0 30 // Terse CI pass/fail (one line)
  • $ ek9 -fuzz2 1440 // 24 hours, JSON for pipelines
  • $ ek9 -fuzz6 60 // 1 hour, HTML dashboard

The format suffix convention (-fuzz0, -fuzz2, -fuzz6) mirrors the test format convention (-t0, -t2, -t6) — each action owns its format suffixes.

Crash-triggering source files are always saved to ./fuzz-crashes/ regardless of output format. The HTML dashboard is written to ./fuzz-report/.

Five-Strand Generation Strategy

The fuzzer uses five complementary generation strategies to maximise the diversity of generated programs. Each strand produces different error distributions, and together they exercise the compiler more thoroughly than any single strategy could.

Strand 1: Template-Based ATN Generation (~30%)

The primary strand uses 25+ built-in templates for common EK9 patterns (classes, functions, traits, records, etc.) and fills them using ANTLR4 Augmented Transition Network (ATN) walks of the EK9 grammar. At each grammar decision point, the generator makes a random choice, producing structurally plausible code. This strand produces the highest density of syntactically correct programs.
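The idea of making a random choice at each grammar decision point can be illustrated with a much smaller model. The sketch below is not the EK9 generator and does not use the ANTLR4 ATN API; it walks a hypothetical toy grammar (the `GRAMMAR` and `TERMINALS` tables are invented for illustration) and shows why the output is structurally plausible by construction.

```python
import random

# Toy grammar (hypothetical, far smaller than the real EK9 grammar).
# Each non-terminal maps to a list of alternatives; each alternative is a
# sequence of terminals and non-terminals.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"], ["term", "*", "expr"]],
    "term": [["NUM"], ["ID"], ["(", "expr", ")"]],
}
TERMINALS = {"NUM": ["0", "1", "42"], "ID": ["a", "b", "total"]}


def walk(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol`, making a random choice at every decision point."""
    if symbol in TERMINALS:
        return [rng.choice(TERMINALS[symbol])]
    if symbol not in GRAMMAR:
        return [symbol]  # literal token such as "+" or "("
    alternatives = GRAMMAR[symbol]
    # Past max_depth, take the first (shortest) alternative so the walk ends.
    chosen = alternatives[0] if depth >= max_depth else rng.choice(alternatives)
    tokens = []
    for part in chosen:
        tokens.extend(walk(part, rng, depth + 1, max_depth))
    return tokens


def generate(seed):
    """Generate one expression; the same seed always gives the same text."""
    return " ".join(walk("expr", random.Random(seed)))
```

Seeding the random source makes every generated program reproducible, which is the same reason the fuzzer's human-readable output reports a seed.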

Strand 2: Compiler-Aware Injection (~20%)

This strand harvests real symbols from previously compiled Q&A example files (505 templates) and injects them into generated programs. By using real type names, method signatures, and module structures, these programs exercise deeper semantic analysis phases that pure random generation rarely reaches.

Strand 3: Template Mutation (~20%)

This strand takes working Q&A example files and applies single-point mutations: dropping modifiers, swapping types, changing operators, altering indentation, duplicating lines, injecting boolean literals, stripping guards, and swapping adjacent statements. These targeted mutations exercise specific error detection paths (E08010, E08030, E08081, E11050, etc.) that random generation is unlikely to trigger.
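A single-point mutation can be sketched in a few lines. The example below is an illustration only: it implements just two of the mutation kinds listed above (swapping adjacent lines and duplicating a line), and the function names are hypothetical rather than taken from the fuzzer.

```python
import random


def swap_adjacent(lines, i):
    """Return a copy of `lines` with line i and line i + 1 exchanged."""
    out = list(lines)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def duplicate_line(lines, i):
    """Return a copy of `lines` with line i repeated directly after itself."""
    return lines[: i + 1] + [lines[i]] + lines[i + 1:]


def mutate_once(source, rng):
    """Apply exactly one randomly chosen mutation at one random position."""
    lines = source.splitlines()
    if len(lines) < 2:
        return source  # nothing to swap or duplicate meaningfully
    operator = rng.choice([swap_adjacent, duplicate_line])
    index = rng.randrange(len(lines) - 1)
    return "\n".join(operator(lines, index))
```

Because only one change is made to a program that previously compiled, any new compiler error can be attributed to that change.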

Strand 4: Targeted Mutation (~15%)

This strand applies known-incorrect code substitutions from the Q&A system's typicalErrors entries. Each mutation replaces a correct snippet with its known-incorrect counterpart and verifies that the compiler detects the expected error code. Mismatches are saved to fuzz-crashes/strand4-mismatches/ for investigation.
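The substitute-then-verify loop can be sketched as follows. This is a minimal model under stated assumptions: the `KnownError` record, the `check_mutation` function, and the `compile_fn` callback (which stands in for the compiler and returns the set of reported error codes) are all hypothetical names, not the fuzzer's own API.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    correct: str        # snippet present in the working source
    incorrect: str      # known-incorrect replacement for that snippet
    expected_code: str  # error code the compiler is expected to report


def check_mutation(source, entry, compile_fn):
    """Return a mismatch record, or None when there is nothing to report.

    None is returned both when the entry does not apply to `source` and when
    the compiler reports the expected code.
    """
    if entry.correct not in source:
        return None
    mutated = source.replace(entry.correct, entry.incorrect, 1)
    reported = compile_fn(mutated)  # assumed to return a set of error codes
    if entry.expected_code in reported:
        return None
    return {
        "expected": entry.expected_code,
        "reported": sorted(reported),
        "source": mutated,
    }
```

A non-None result corresponds to the kind of mismatch that the fuzzer saves for investigation.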

Strand 5: Template Enrichment (~15%)

This strand combines 1–4 working Q&A templates into a single module, extracts their type and function definitions, then generates a program body with deeply nested control flow (if/else, switch, for, while, try/catch, streams) using those real types. This bridges the gap between Strand 1 (random programs that die at FULL_RESOLUTION) and Strand 2 (flat probe calls): programs use semantically valid types in complex nested structures, reaching deeper compiler phases like IR_GENERATION and CODE_GENERATION. 15% of enriched programs receive an additional random mutation for error path coverage.

Strand                   Share  Strength
1. Template-Based ATN    ~30%   High volume, broad grammar coverage, many parse errors for parser robustness
2. Compiler-Aware        ~20%   Deeper phase penetration, exercises type resolution and semantic checks
3. Template Mutation     ~20%   Targeted error code coverage, exercises specific detection logic
4. Targeted Mutation     ~15%   Known-error validation, mismatch diagnostics, typicalErrors coverage
5. Template Enrichment   ~15%   Cross-type interaction, nested control flow, deep phase penetration

Multi-Strand Filtering and Ablation Studies

By default all five strands run with adaptive weight distribution. The -s1 through -s5 flags can be combined to run any subset of strands:

  • $ ek9 -fuzz6 30 -s1 // S1 only (grammar baseline)
  • $ ek9 -fuzz6 30 -s1 -s2 // S1 + S2 (grammar + symbol oracle)
  • $ ek9 -fuzz6 30 -s1 -s2 -s3 // S1 + S2 + S3 (add mutation)
  • $ ek9 -fuzz6 30 -s1 -s2 -s3 -s4 // S1 + S2 + S3 + S4 (add targeted)

When multiple strands are active, the adaptive weight system distributes iterations proportionally among them — strands that discover more error codes receive higher dispatch weights. Single-strand runs route 100% of iterations to that strand.
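One plausible way to turn discovery counts into dispatch weights is shown below. The exact formula used by the fuzzer is not documented here, so this is a sketch under assumptions: weights are simply proportional to recent discoveries, and the `floor` parameter (hypothetical) keeps an unproductive strand from being starved entirely.

```python
def dispatch_weights(discoveries, active, floor=0.05):
    """Share iterations among active strands in proportion to discoveries.

    discoveries: mapping of strand name to recently discovered error codes.
    active: list of strand names selected for this run.
    floor: assumed minimum raw share for any active strand.
    """
    if len(active) == 1:
        return {active[0]: 1.0}  # a single-strand run gets every iteration
    total = sum(discoveries.get(s, 0) for s in active)
    if total == 0:
        return {s: 1.0 / len(active) for s in active}  # no signal yet
    raw = {s: max(discoveries.get(s, 0) / total, floor) for s in active}
    norm = sum(raw.values())
    return {s: w / norm for s, w in raw.items()}
```

For example, `dispatch_weights({"S1": 30, "S2": 10}, ["S1", "S2"])` gives S1 three times the share of S2, while a run restricted to one strand always returns a weight of 1.0 for it.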

This is particularly useful for ablation studies: comparing S1-only error coverage against S1+S2, S1+S2+S3, etc. quantifies each strand's marginal contribution. The HTML dashboard includes strand effectiveness comparisons, adaptive weight timelines, and phase penetration charts that visualise these differences (see Strand Effectiveness and Adaptive Weight Timeline in the dashboard section below).
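The marginal contribution of each strand can be computed from the sets of error codes triggered in successive runs. The sketch below assumes those sets have already been collected for each configuration; the function name and the data layout are illustrative, and the error codes in the usage note are placeholders.

```python
def marginal_contributions(runs):
    """Count the error codes each successive configuration adds.

    runs: ordered list of (label, set_of_error_codes) pairs, for example the
    results of an S1-only run followed by an S1+S2 run.
    """
    seen = set()
    result = []
    for label, codes in runs:
        new_codes = set(codes) - seen  # codes not triggered by earlier runs
        result.append((label, len(new_codes)))
        seen |= new_codes
    return result
```

Given `[("S1", {"E08010"}), ("S1+S2", {"E08010", "E08030", "E11050"})]`, the function reports one code for S1 and two additional codes once S2 is enabled.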

Output Formats

Human-Readable (-fuzz)

The default format prints terminal histograms, phase penetration charts, and a summary to stdout. Suitable for interactive monitoring during development:

EK9 Fuzzer: 30 minutes, seed 1709312456789
Programs: 14,832 | Parse: 72.4% | Crashes: 631 | Errors: 156/307 (50.8%)

Phase Distribution:
  READING                        ████░░░░░░░░░░░░░░░░  5.8%
  SYMBOL_DEFINITION              █████████░░░░░░░░░░░░ 23.9%
  FULL_RESOLUTION                ██████████░░░░░░░░░░░ 26.0%
  CODE_GENERATION_AGGREGATES     ████████████░░░░░░░░░ 32.3%

Terse CI (-fuzz0)

One-line pass/fail for CI gates. Returns exit code 0 if no new crashes were found:

FUZZ OK: 14832 programs, 0 new crashes, 156/307 errors (50.8%) in 30m
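A CI job that wants the numbers as well as the exit code could parse the line with a regular expression. The pattern below is inferred from the single sample line above, so it is an assumption that other runs (including failing ones) follow the same layout; `parse_terse` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one documented sample line (an assumption).
TERSE = re.compile(
    r"FUZZ (?P<status>\w+): (?P<programs>\d+) programs, "
    r"(?P<new_crashes>\d+) new crashes, "
    r"(?P<triggered>\d+)/(?P<total>\d+) errors "
    r"\((?P<pct>[\d.]+)%\) in (?P<minutes>\d+)m"
)


def parse_terse(line):
    """Extract the fields of a terse fuzz summary line into a dict."""
    match = TERSE.search(line)
    if match is None:
        raise ValueError("unrecognised fuzz summary line: " + line)
    fields = match.groupdict()
    return {
        "status": fields["status"],
        "programs": int(fields["programs"]),
        "new_crashes": int(fields["new_crashes"]),
        "triggered": int(fields["triggered"]),
        "total": int(fields["total"]),
        "coverage_pct": float(fields["pct"]),
        "minutes": int(fields["minutes"]),
    }
```

The exit code remains the authoritative pass/fail signal; parsing the line is only useful for recording trends such as error coverage over time.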

JSON (-fuzz2)

Produces two files for programmatic analysis:

{
  "duration": "PT30M",
  "programs": 14832,
  "parseRate": 0.724,
  "crashes": 631,
  "errorCoverage": { "triggered": 156, "total": 307 },
  "phases": { "READING": 860, "SYMBOL_DEFINITION": 3546, ... },
  "constructs": { "class": 4231, "function": 3892, ... }
}
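A pipeline step might derive a few headline figures from this JSON. The sketch below embeds a trimmed copy of the sample above (the elided entries are simply left out) instead of reading a file, because the output file names are not given here; `summarise` is a hypothetical helper.

```python
import json

# Trimmed copy of the documented sample; elided entries are omitted.
SAMPLE = """
{
  "duration": "PT30M",
  "programs": 14832,
  "parseRate": 0.724,
  "crashes": 631,
  "errorCoverage": { "triggered": 156, "total": 307 },
  "phases": { "READING": 860, "SYMBOL_DEFINITION": 3546 }
}
"""


def summarise(text):
    """Compute headline figures from a fuzz JSON summary."""
    data = json.loads(text)
    coverage = data["errorCoverage"]
    phases = data["phases"]
    return {
        "coverage_pct": round(100.0 * coverage["triggered"] / coverage["total"], 1),
        "crashes_per_1000": round(1000.0 * data["crashes"] / data["programs"], 1),
        # Phase names ordered from most to fewest programs.
        "phases_by_count": sorted(phases, key=phases.get, reverse=True),
    }
```

Calling `summarise(SAMPLE)` reproduces the 50.8% error coverage shown in the human-readable output for the same run.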

HTML Dashboard (-fuzz6)

Generates an interactive dashboard at ./fuzz-report/index.html with charts, heatmaps, and drill-down details. This is the richest output format and the recommended way to review fuzzing results.

Reading the Dashboard

The HTML dashboard (-fuzz6) provides twelve visualisation sections. Each is described below with the key metrics to watch.

Status Banner and KPI Cards

Fuzzer dashboard status banner with duration, throughput, and four KPI donut charts

The status banner shows duration, programs generated, throughput, crash count, and corrections. The border colour indicates overall health: green (no crashes), amber (few crashes), or red (significant crashes).

Four KPI donut charts provide at-a-glance metrics:

Timing Breakdown and Source Statistics

Timing breakdown donuts showing generation, parse check, and compilation time split

Source statistics grid showing min/avg/max lines, total bytes, files generated, and compile rate

Three mini-donuts show where time is spent: generation, parse checking, and compilation. Source statistics show min/avg/max lines per program, total bytes generated, file counts, and compile rate. If compilation dominates, programs are reaching deep phases (good). If parse checking dominates, most programs fail early (consider adjusting generation strategy).

Phase Distribution

Stacked bar chart showing program counts per compilation phase

Horizontal bars show how far programs penetrate the 22-phase compilation pipeline. Each bar represents a phase where programs were rejected — programs that pass a phase move to the next bar. A healthy distribution shows programs spread across all phases, not clustered at the front.

The phase penetration funnel provides an alternative view of the same data, showing bar width proportional to the fraction of programs surviving to each phase, with dropout percentages annotated between phases:

Phase penetration funnel showing program survival rates through each compiler phase with dropout percentages

The three largest dropouts — Parsing, FULL_RESOLUTION, and SYMBOL_DEFINITION — identify where generation strategies are most often rejected. A steep drop at FULL_RESOLUTION (the single largest dropout point) reveals that type resolution is the primary barrier to deep testing.
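The relationship between per-phase rejection counts and the funnel can be worked through directly. The sketch below is an illustration with a hypothetical `funnel` function: it takes the number of programs rejected at each phase, in pipeline order, and derives how many programs entered each phase and what percentage of them dropped out there.

```python
def funnel(total, rejected_in_order):
    """Derive funnel rows from per-phase rejection counts.

    total: number of programs generated.
    rejected_in_order: list of (phase, rejected_count) in pipeline order.
    Returns a list of (phase, programs_entering, dropout_percent).
    """
    rows = []
    entering = total
    for phase, rejected in rejected_in_order:
        dropout = 100.0 * rejected / entering if entering else 0.0
        rows.append((phase, entering, round(dropout, 1)))
        entering -= rejected  # survivors move on to the next phase
    return rows
```

With 1000 programs, 100 rejected at the first phase and 450 at the second, the dropouts are 10.0% and 50.0%: the second percentage is relative to the 900 survivors, not to the original 1000.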

Error Code Coverage Heatmap

Interactive heatmap grid of 307 error codes grouped by category

The largest dashboard section shows all ~307 compiler error codes as a searchable, filterable grid. Error codes are grouped by category (E01xxx Lexer/Parser, E05xxx Hierarchy, E06xxx Resolution, etc.).

Use the search box to find specific error codes, or the filter buttons (All / Triggered / Untriggered) to focus on gaps.

Construct Coverage

Heatmap of 29 EK9 construct types with crash indicators

A heatmap of all 29 EK9 construct types: class, function, record, trait, service, component, program, enumeration, generic-type, dynamic-class, dynamic-function, and more. Colour intensity indicates frequency of generation. Red-bordered cells with a pulse animation indicate constructs that have caused compiler crashes.

The goal is uniform coverage across all construct types. If some constructs are underrepresented, the generation templates may need adjustment.

Control Flow Coverage

Grouped bar charts for control flow types with statement, guard, and expression sub-bars

Grouped horizontal bars for each control flow type (for-in, do-while, for-range, switch, if, while, throw, try-catch, stream, etc.). Each type has three sub-bars: statement, guard, and expression.

Crash badges on specific control flow types highlight where the compiler is most vulnerable. Stream operations and deeply nested constructs often reveal the most interesting bugs.

Argument Count Distribution

Heatmap grid showing frequency of 0 to 25+ argument counts

Shows the frequency distribution of argument counts (0-25+) in generated functions and methods. A realistic distribution has most functions with 0-3 parameters, with decreasing frequency for higher counts. Edge cases at 15+ parameters stress the compiler's parameter handling.

Template Usage

Template utilisation heatmap showing usage across 426 Q&A example files

Shows utilisation of the Q&A example templates used by Strand 2 (compiler-aware injection), Strand 3 (template mutation), and Strand 5 (template enrichment). Identifies underused templates that may need attention to ensure comprehensive coverage.

Per-Strand Phase Distribution

Per-strand phase distribution showing stacked bars coloured by strand at each compiler phase

Stacked horizontal bars show how each of the five strands contributes to programs at every compilation phase. Each bar is coloured by strand (S1 red, S2 green, S3 amber, S4 blue, S5 purple), making it easy to see which generation strategy dominates at each phase depth.

Below the bars, the S5 Harvest Summary shows statistics for template enrichment:

Strand Effectiveness Comparison

Strand effectiveness comparison table showing programs, dispatch percentage, first discoveries, average phase depth, and pipeline penetration per strand

Per-strand metrics side by side: program count, dispatch percentage, first discoveries, average phase depth, and pipeline penetration rates. Depth and breadth of discovery are genuinely orthogonal — the deepest-penetrating strand is not the most productive at discovering new error codes.

Adaptive Weight Timeline

Adaptive weight timeline showing strand dispatch percentages shifting over 30 minutes

Strand dispatch percentages over time. The adaptive system adjusts weights every 30 seconds based on recent error code discovery rates and pipeline penetration depth. Strands that discover more error codes receive higher dispatch weights; strands that plateau lose budget to more productive strategies.

Strand 4: Targeted Mutation Statistics

This section provides detailed statistics on Strand 4’s known-error mutation testing. Each card shows a specific aspect of mutation effectiveness:

Error Code Convergence

Error code convergence curve showing cumulative discovery and discovery rate over time with per-strand timeline

The convergence curve plots cumulative unique error codes triggered over time (green line) against the total error code ceiling (red dashed line), with a discovery rate overlay (orange dashed line). The bottom strip shows per-strand discovery events colour-coded by strand.

When the discovery rate approaches zero — as it typically does after 5–10 minutes — the developer knows that continued fuzzing at current settings will not improve error code coverage. The response is not to fuzz longer at the same settings but to change strategy: improve generators, add new strand types, or target specific untriggered codes. However, error code discovery saturation does not mean bug discovery saturation — the combination space within each error code category remains vast, and longer runs explore more of it.
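Detecting saturation from the cumulative curve amounts to checking whether recent samples have added any new error codes. The sketch below assumes the curve is available as a list of cumulative counts taken at regular intervals; the `window` and `threshold` parameters and the function name are hypothetical choices, not dashboard settings.

```python
def saturated_at(cumulative, window=3, threshold=0):
    """Return the first sample index at which discovery has stalled.

    cumulative: cumulative unique error codes at each regular sample.
    window: number of trailing samples to compare across.
    threshold: maximum number of new codes in the window that still counts
    as stalled.
    Returns None if discovery never stalls within the data.
    """
    for i in range(window, len(cumulative)):
        if cumulative[i] - cumulative[i - window] <= threshold:
            return i
    return None
```

For the illustrative series `[10, 40, 80, 120, 150, 155, 156, 156, 156, 156]` the function returns index 9, the first point at which three consecutive samples added nothing new.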

Best Practices

See Also