Helix × GPT-2
RECORDED RESULTS: real gate outputs, not live
One verified stack. Pick your model and your trust boundary.
Every model here earned its place the same way: its output matched an independent referee, token for token,
running on the same 8 Helix-built kernels. Click a card for the full recorded result.
Key: anything with this symbol is powered by Helix — compiled from Helix source by kovc.
THE MODELS
Four models, one stack: all verified 25/25
From the 124M starter to the 1.5B flagship, the GPT-2 family runs on the exact same 8 kernels — only the dimensions change. SmolLM2 brings 2024's Llama architecture onto the same stack with just 3 extra kernels.
TWO PATHS
Choose how far your trust has to reach
Both paths run the same model and pass the same token-for-token gates — they differ only in what you must trust beneath the auditable code.
WHY IT MATTERS
Same kernels at every size
124M → 774M → 1.5B ran through the identical 8 kovc-emitted kernels — zero new operations at scale, only dimension changes read from each model's config. That's the integration pitch in one line: bring your weights; the verified stack underneath doesn't change. (44,019 bytes of PTX, 8 entry points.)
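As a rough illustration of "only dimension changes read from each model's config", the sketch below derives parameter counts from a small config table. The layer counts and widths are the standard published GPT-2 hyperparameters, which is an assumption here: the page itself states only the sizes and some layer counts. The names GPT2_CONFIGS and param_count are hypothetical, not part of Helix.

```python
# Standard published GPT-2 hyperparameters (assumed; not taken from this page).
GPT2_CONFIGS = {
    "124M": {"n_layer": 12, "n_embd": 768},
    "774M": {"n_layer": 36, "n_embd": 1280},
    "1.5B": {"n_layer": 48, "n_embd": 1600},
}
VOCAB_SIZE = 50257  # GPT-2 BPE vocabulary size
N_CTX = 1024        # GPT-2 context length


def param_count(cfg):
    """Count GPT-2 parameters from the config alone (tied output embedding)."""
    d = cfg["n_embd"]
    embeddings = VOCAB_SIZE * d + N_CTX * d
    # Per block: attention (4d^2 + 4d), MLP (8d^2 + 5d), two LayerNorms (4d).
    per_block = 12 * d * d + 13 * d
    final_norm = 2 * d
    return embeddings + cfg["n_layer"] * per_block + final_norm


for name, cfg in GPT2_CONFIGS.items():
    print(name, param_count(cfg))
```

The same function covers all three sizes; nothing but the two config numbers changes between them, which mirrors the claim that the kernels stay fixed while the dimensions vary.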
Honest residuals: fp32 · verified to PTX, not SASS · single GPU (sm_86) · base models, not assistants · the oracle shares the model's spec. Every number on this page is a recorded, committed result.
GPT-2 124M is the original demo model: small enough to generate in seconds when warm, real enough to prove the stack. It is a 2019 base completion model: it continues text rather than chatting.
Recorded gate result
argmax id 262 — EXACT · max-abs logit diff 2.59e-04
token-for-token 25/25 vs the independent numpy oracle
"The capital of France is the capital of the French
Republic, and the capital of the French Republic is
the capital of the French" (greedy — repetitive is honest)
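The three numbers in a recorded gate result (argmax match, max-abs logit diff, token-for-token count) can be sketched as a single comparison function. This is a minimal illustration, not the project's actual gate: the function name gate, its arguments, and the returned field names are all hypothetical, and no tolerance threshold is applied because the page does not state one.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def gate(kernel_logits, oracle_logits, kernel_tokens, oracle_tokens):
    """Compare one run of the kernel stack against an independent oracle."""
    max_abs_diff = max(abs(a - b) for a, b in zip(kernel_logits, oracle_logits))
    matched = sum(a == b for a, b in zip(kernel_tokens, oracle_tokens))
    argmax_exact = argmax(kernel_logits) == argmax(oracle_logits)
    return {
        "argmax_exact": argmax_exact,
        "max_abs_logit_diff": max_abs_diff,
        "tokens_matched": matched,
        "tokens_total": len(oracle_tokens),
        # Pass only if the top token agrees and every generated token agrees.
        "passed": argmax_exact and list(kernel_tokens) == list(oracle_tokens),
    }


# Toy data for illustration only; real logits span the full vocabulary.
result = gate([0.1, 2.0002, -1.0], [0.1, 2.0, -1.0], [262, 3139], [262, 3139])
print(result["passed"], result["tokens_matched"], "/", result["tokens_total"])
```

A small nonzero logit difference with an exact token match is the expected shape of a pass: fp32 arithmetic on different hardware rarely agrees to the last bit, but the chosen tokens must.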
GPT-2 774M has three times the layers of 124M (36 versus 12) and six times the parameters, yet needs not one new kernel. The same 8 Helix-built kernels simply ran with bigger dimensions.
Recorded gate result
argmax id 262 — EXACT · max-abs logit diff 3.8e-05
token-for-token 25/25 · same 8 kernels, zero new ops
GPT-2 1.5B is the model the guided run replays: 1.5 billion parameters, 48 layers, running at fp32 on one 8 GB consumer GPU via per-layer weight streaming. Live pacing is ≈10 s per token; the demo is about verifiability, not speed, and says so.
Recorded gate result
argmax id 262 — EXACT · max-abs logit diff 4.4e-05
token-for-token 25/25 · 195.5 s / 20 tokens (0.102 tok/s)
"The capital of France is the city of Paris. It is the
capital of France and the largest city in France. It is"
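The arithmetic behind these figures can be checked in a few lines. The sketch assumes the standard GPT-2 1.5B width of 1600, which the page does not state; the 48 layers, fp32 precision, and the 195.5 s / 20 tokens timing come from the text above. It estimates only the size of one transformer block's weights, the unit a per-layer streaming scheme would move, and says nothing about how the actual launcher schedules transfers.

```python
BYTES_PER_FP32 = 4
D_MODEL = 1600   # assumed: standard GPT-2 1.5B hidden width
N_LAYER = 48     # from the page


def block_params(d):
    """Parameters in one GPT-2 block: attention, MLP, and two LayerNorms."""
    return 12 * d * d + 13 * d


layer_bytes = block_params(D_MODEL) * BYTES_PER_FP32
all_layers_bytes = layer_bytes * N_LAYER

seconds_per_token = 195.5 / 20
tokens_per_second = 20 / 195.5

print(f"one block:  {layer_bytes / 1e6:.1f} MB")
print(f"all blocks: {all_layers_bytes / 1e9:.2f} GB")
print(f"{seconds_per_token:.3f} s/token, {tokens_per_second:.3f} tok/s")
```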
SmolLM2 is a 2024-architecture model (the family behind today's open models): grouped-query attention, RoPE, SwiGLU, RMSNorm, 30 layers. It cost exactly 3 new Helix kernels; the compiler itself didn't change at all.
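For readers unfamiliar with these Llama-style building blocks, here is a plain-Python reference sketch of three of them in their textbook form. It is not the Helix kernel source, and the page does not say which operations the 3 new kernels implement; the function names and the head counts in the usage lines are illustrative assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: scale by the root-mean-square; no mean subtraction, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish): v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def swiglu(gate, up):
    """Elementwise core of a SwiGLU MLP: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Grouped-query attention: which shared K/V head a query head reads."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Hypothetical head counts, chosen only to show the grouping.
print([kv_head_for(h, 4, 2) for h in range(4)])
print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```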
Your prompt runs through GPU kernels written in the Helix language and compiled by the from-raw kovc into PTX — the last layer a person can read and audit. Turning PTX into the chip's actual instructions is done by NVIDIA's closed assembler: that single step is trusted once, and the demo says so out loud.
What's auditable vs trusted
auditable : hex0 (299 B) → seed → kovc → 8 kernels → PTX (44,019 B)
trusted : ptxas (PTX→SASS) · CUDA driver · GPU silicon
· the small C launcher (committed, readable)
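Because PTX is text, an auditor can check figures such as the byte count and the number of entry points without any NVIDIA tooling. The sketch below counts `.entry` directives, which is how PTX declares a kernel. The helper name summarize_ptx and the sample kernel names are hypothetical, and the regex is deliberately simple: it would also count an `.entry` that appears inside a comment.

```python
import re

# A PTX kernel is declared with the .entry directive followed by its name.
ENTRY_RE = re.compile(r"\.entry\s+([A-Za-z_$%][A-Za-z0-9_$]*)")


def summarize_ptx(ptx_text):
    """Return the byte size and the list of kernel entry-point names."""
    return {
        "bytes": len(ptx_text.encode("utf-8")),
        "entries": ENTRY_RE.findall(ptx_text),
    }


# Tiny stand-in for a real PTX file; kernel names are made up.
SAMPLE_PTX = """\
.version 7.0
.target sm_86
.address_size 64
.visible .entry add_one(.param .u64 p) { ret; }
.visible .entry scale(.param .u64 p) { ret; }
"""

summary = summarize_ptx(SAMPLE_PTX)
print(summary["bytes"], len(summary["entries"]), summary["entries"])
```

Run against the committed PTX, the same check would be expected to report the 44,019 bytes and 8 entry points quoted on this page.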
For auditors who want zero closed tools in the loop: the same model forward runs entirely on the CPU, where every instruction traces back to the 299-byte root. No GPU assembler, no driver — just slower.
The trade
trust : nothing closed between hex0 and the output
(the shared host TCB is disclosed in the residuals)
speed : far slower than the GPU path — built to verify,
not to serve