GPU kernels are written in Helix itself
Every kernel in the telemetry is an @kernel function in the Helix language,
compiled to PTX by kovc. No CUDA C does the tensor math; no vendor ML library sits
underneath. This one is verbatim and complete:
@kernel
fn vector_add(a: f32, b: f32, c: f32, n: i32) {
let i = block_idx() * block_dim() + thread_idx();
c[i] = a[i] + b[i]
}
what this enables
Click any kernel name in the live panels and read the exact source that produced the
PTX running on the GPU — the whole math stack is auditable in one language.
where it lives: helixc/examples/vector_add_kernel.hx (+13 more bundled on this page) · docs/HELIX_V1_LANGUAGE_SPEC.md (@kernel → PTX) · docs/HELIX_V1_DEFINITION_OF_DONE.md #3 (GPU corpus green on hardware)
Autodiff is a language built-in
grad / grad_rev / grad_rev_all are built into the language — source-level
forward- and reverse-mode differentiation with chain rules across user functions.
Below (excerpt): a committed program doing real gradient descent inside a
Helix-emitted binary.
@pure
fn loss(w: f32) -> f32 {
let d = w - 7.0;
d * d
}
// ... five steps of gradient descent:
let g = grad_rev(loss)(w_now as f32);
let w_new_f = (w_now as f32) - lr * g;
what this enables
Training loops where the compiler derives the gradients — and the reverse-AD boundary
compile-time-rejects impure bodies (the purity scan in helixc/bootstrap/parser.hx).
Honesty note: CPU autodiff is corpus-proven; the GPU capstone’s backward pass used
hand-written kernels verified against finite differences — a stronger check than
auto-emitted, unverified code (grad-emits-GPU is post-v1.0).
where it lives: helixc/examples/dogfood_01_one_param.hx · README.md “What works today” · docs/HELIX_V1_DEFINITION_OF_DONE.md #4 (autodiff GREEN)
The toolchain rebuilds from 299 hand-typed bytes
stage0/hex0/hex0.bin is literally 299 bytes on disk — a hand-authored ELF whose
annotated source is hex0.hex (excerpt below). The ladder hex0 → hex1 → M0
→ M1 → M2-Planet → seed rebuilds the whole compiler from it. The hero
line of this page is this file.
7F 45 4C 46 ; 0x00 e_ident[EI_MAG] = "\x7fELF"
02 ; 0x04 EI_CLASS = ELFCLASS64
01 ; 0x05 EI_DATA = ELFDATA2LSB
...
02 00 ; 0x10 e_type = ET_EXEC (2)
3E 00 ; 0x12 e_machine = EM_X86_64 (0x3E)
what this enables
No binary blob has to be trusted: every byte from the seed up is auditable, and the
seed is small enough to check by hand.
where it lives: stage0/hex0/hex0.bin (299 B) · stage0/hex0/hex0.hex (annotated byte-by-byte) · README.md
Self-hosting, proven byte-identical — and Python-free
kovc is written in Helix and compiles itself to a byte-identical fixpoint
starting from the raw-binary seed; an independent gcc-built lineage corroborates it
(a Wheeler diverse-double-compile). The live toolchain contains zero Python —
the only .py left is the fenced external oracle the gates compare against.
Schematic of the recorded result (real numbers from the DoD):
seed.bin (62 KB, from the hex0 ladder)
compiles kovc.hx -> K1
K1 -> K2 -> K3 -> K4
K2 == K3 == K4 # byte-identical, 601707 B,
# sha b52d7d6f... (DoD #1)
what this enables
The “trusting trust” answer: the compiler you audited is provably the
compiler that compiled the kernels you are watching.
where it lives: scripts/selfhost_fixpoint_rawbinary.sh · docs/HELIX_V1_DEFINITION_OF_DONE.md #1 / #5 / #6 · docs/K_DDC_POST_K4.md · verification/oracle/ (the fenced oracle)
Models ship only through fail-closed parity gates
A model appears in this demo only after its served output matched an independent
oracle — GPT-2 token-for-token via the serve gate; the Llama-arch leg (SmolLM2)
through its own full-model gate with a negative control. Verbatim from the committed
gate script:
# FAIL-CLOSED legs (all must pass for LLAMA_MODEL_GATE_PASS):
# [4] G-L1: GPU post-layer-0 residual vs oracle (compare-block0, tol 2e-3)
# [5] G-L2a: full-model last-row logits vs oracle (argmax EXACT + max-abs < 5e-2)
what this enables
The continuation you read was produced by kernels that already proved they reproduce
the reference model exactly — verification is the product, not a footnote.
where it lives: scripts/helix_serve_gate.sh (GPT-2, token-for-token) · scripts/llama_model_gate.sh (SmolLM2) · demo/dashboard.html (the recorded numbers)
Optimised, then re-proven — never trusted on speed
Standing property of this stack: an optimisation ships only after it re-passes the
token-for-token gate. Right now the KV-cache fast-decode kernels are in
gating — authored, on disk, and claiming nothing until the gate is green. Their
own header says so, verbatim:
// Mathematically the SAME dot products as tiled_matmul_abt -- the token-for-token
// gate must prove the KV path emits IDENTICAL ids to the full re-forward.
what this enables
Speed comes from caching and better kernels, never from approximation — and this
page’s “fast” mode only unlocks once the gated backend advertises it.
where it lives: helixc/examples/gpu_gemv_abt_kernel.hx / gpu_gemv_ab_kernel.hx / gpu_softmax_row_kernel.hx (in gating now) · scripts/gate_kovc.sh · demo/dashboard.html
Every gate proves it can fail — the negative control
A green checkmark means nothing if the check cannot go red. So every model gate runs a
negative control: it deliberately corrupts the weights and the same parity leg
MUST fail, or the whole gate fails. Verbatim from the committed evidence log of
the latest sampling re-gate (scripts/_gate_sampling4.log):
=== [7] NEGATIVE CONTROL: corrupted weights must FAIL the logits leg ===
NEG-CONTROL OK: corrupted weights correctly FAILED
...
NEG control : PASS
LLAMA_MODEL_GATE_PASS
what this enables
Confidence that the token-for-token checks are load-bearing, not decorative: tampered
weights are caught by construction. Honesty note: this is committed offline gate
evidence — no tamper test runs live in your browser, and this page never claims one did.
where it lives: scripts/_gate_sampling4.log (lines above, verbatim) · scripts/llama_model_gate.sh (the gate) · demo/dashboard.html (all recorded verdicts)