Helix × GPT-2 ▶ RECORDED RESULTS: real gate outputs, not live
The proof — what was checked, and how you can re-check it.
Four independent checks stand behind the demo. Each card below is one check: a plain-language sentence, the
real recorded number, and — one click deeper — exactly what ran and where it lives in the repo. Nothing here is live;
these are the committed results of real runs, and one command reproduces the core on your machine in about a minute.
Key: anything with this symbol is powered by Helix — compiled from Helix source by kovc.
EXHIBIT A
The output under test
GPT-2-XL was given a five-word prompt and asked for twenty more tokens, running entirely on Helix-built kernels. This exact sentence is what every check below judges:
The capital of France is the city of Paris. It is the capital of France and the largest city in France. It is
THE FOUR CHECKS
Click any card for the full story
Each one attacks the question “did this really run correctly?” from a different angle. All four passed, fail-closed — a single mismatch turns the whole run red.
THE FOUNDATION
The ladder under it all
Every tool was built only by the tool before it — no pre-built compiler is ever trusted. Hover any rung; the sizes are the real committed binaries.
Reproduce it yourself — no GPU, no weights, about a minute:
# clean checkout · CPU-only · fail-closed (exits red on any mismatch)
git clone https://github.com/Questeria/helix && cd helix
bash scripts/reproduce_trust.sh # asserts: seed 9837db12 · fixpoint 0992dddd · DDC K1 84363adb
EVERY MODEL
Nothing ships without a gate
A model appears in this demo only after its output matched the independent oracle. The recorded results, per model:
model                                        argmax           max score diff        tokens
GPT-2 124M · 12 layers                       id 262 (exact)   2.59e-04              25 / 25
GPT-2-Large 774M · 36 layers                 id 262 (exact)   3.8e-05               25 / 25
GPT-2-XL 1.5B · 48 layers                    id 262 (exact)   4.4e-05               25 / 25
SmolLM2-135M · 30 layers · 2024 Llama arch   id 260 (exact)   4.9e-05 over 49,152   25 / 25
THE EDGES
Honest residuals — said before you ask
Complete to PTX, not SASS. Below the PTX layer, NVIDIA's closed assembler, driver and the C launcher are trusted-once.
fp32 only, single GPU (sm_86). One RTX 3070-class card; no other precision or hardware is claimed.
The oracle shares the spec. It's an independent implementation of GPT-2, not an independent specification — it catches implementation bugs, not shared misunderstandings.
~10 s/token live, by design. The recorded run took 195.5 s for 20 tokens (0.102 tok/s). Trust first; speed is roadmap.
The attestation's full hashes appear on a live run. This page shows the recorded prefixes (e.g. model file sha 248dfc39…, 548,105,171 bytes).
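The speed figures in the list above follow directly from the recorded timing. A short check of that arithmetic, with illustrative variable names:

```python
# Recorded figures quoted on this page.
elapsed_s = 195.5   # wall time of the recorded run, in seconds
tokens = 20         # generated tokens in that run

tok_per_s = tokens / elapsed_s   # throughput
s_per_tok = elapsed_s / tokens   # latency per token

print(f"{tok_per_s:.3f} tok/s, {s_per_tok:.1f} s/token")
# prints: 0.102 tok/s, 9.8 s/token
```

The per-token latency rounds to the "~10 s/token" quoted in the list.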
Honest residuals: fp32 · verified to PTX, not SASS · single GPU (sm_86) · base models, not assistants · the oracle shares the model's spec. Every number on this page is a recorded, committed result.
Before the model ran, the demo deleted every pre-built tool and rebuilt the whole compiler ladder from the 299 hand-typed bytes — then checked three fingerprints. If even one byte had drifted, the run would have stopped red right there.
The three anchors (recorded)
REPRODUCE_TRUST: PASS
seed sha 9837db12… the C-subset bootstrap compiler
fixpoint sha 0992dddd… kovc compiles itself: K2 == K3 == K4
gcc-DDC sha 84363adb… an independent compiler (gcc) built the
same step byte-identically — the
“trusting trust” defense
The kernels that then ran GPT-2 were emitted by that freshly-rebuilt compiler — so the chain from hand-typed bytes to GPU output is unbroken.
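The fail-closed fingerprint check described in this card can be sketched as follows. This is a minimal illustration and not the contents of scripts/reproduce_trust.sh: the function names, the `AnchorMismatch` exception and the stand-in byte strings are hypothetical, and SHA-256 is an assumption, since the page does not name the hash function. Only the idea comes from the text: compare each rebuilt artifact's digest against a recorded prefix and stop at the first mismatch.

```python
import hashlib


class AnchorMismatch(Exception):
    """Raised when a rebuilt artifact does not match its recorded fingerprint."""


def digest_prefix(data: bytes, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


def check_anchors(artifacts: dict[str, bytes], expected: dict[str, str]) -> None:
    """Fail closed: raise on the first missing or mismatching artifact."""
    for name, want in expected.items():
        if name not in artifacts:
            raise AnchorMismatch(f"{name}: artifact missing")
        got = digest_prefix(artifacts[name], len(want))
        if got != want:
            raise AnchorMismatch(f"{name}: got {got}, expected {want}")


# Stand-in bytes (assumption); a real run would read the rebuilt binaries.
artifacts = {"seed": b"seed-bytes", "fixpoint": b"compiler-bytes"}
expected = {name: digest_prefix(blob) for name, blob in artifacts.items()}

check_anchors(artifacts, expected)      # every prefix matches, no exception
artifacts["seed"] = b"seed-bytez"       # a single byte drifts
try:
    check_anchors(artifacts, expected)
except AnchorMismatch as err:
    print("red:", err)                  # the run stops here
```

Raising on the first mismatch, rather than collecting warnings, is what makes the check fail-closed: nothing downstream runs on an artifact that drifted.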
A completely separate program — plain numpy, reading the original public weights, sharing no code with Helix — ran the same prompt. Its 20 chosen tokens were compared to Helix's, id by id. All 25 ids (5 prompt + 20 generated) matched.
Recorded figures (GPT-2 124M leg)
GPT2_LOGITS_PARITY_PASS last-token argmax id 262 — EXACT
max-abs logit diff 2.59e-04 on scores of magnitude ~130
GPT2_GENERATE_MATCH_PASS token-for-token, 25/25 ids
The scale legs repeat this at 774M (3.8e-05) and 1.5B (4.4e-05) through the same 8 Helix kernels — zero new ops, only dimensions change. The SmolLM2 gate adds a corrupted-weights negative control that correctly fails, proving the comparator has teeth.
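The comparisons reported above (last-token argmax, max-abs logit difference, token-for-token id equality, and a negative control) can be sketched in plain Python. This is an illustration under stated assumptions and not the repo's gate: the function names, the tolerance and the toy numbers are hypothetical, and the real oracle is described as a numpy program rather than pure Python.

```python
def argmax(scores: list[float]) -> int:
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference between two score vectors."""
    if len(a) != len(b):
        raise ValueError("score vectors differ in length")
    return max(abs(x - y) for x, y in zip(a, b))


def logits_parity(helix: list[float], oracle: list[float], tol: float) -> bool:
    """Pass only if the argmax is identical and every score is within `tol`."""
    return argmax(helix) == argmax(oracle) and max_abs_diff(helix, oracle) <= tol


def tokens_match(helix_ids: list[int], oracle_ids: list[int]) -> bool:
    """Token-for-token comparison, id by id, including length."""
    return list(helix_ids) == list(oracle_ids)


# Toy values (assumption), not recorded data.
oracle = [1.0, 130.0, -3.5]
helix = [1.0001, 129.9998, -3.5]
TOL = 1e-3  # hypothetical threshold; the page does not state one

print(logits_parity(helix, oracle, TOL))      # prints: True

# Negative control: a deliberately wrong result must fail the comparator.
corrupted = [130.0, 1.0, -3.5]
print(logits_parity(corrupted, oracle, TOL))  # prints: False
```

The negative control is the useful part of the design: a comparator that has never been seen to fail gives no evidence that a pass means anything.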
Flaky systems can be right once by luck. So the demo runs the same request twice and hashes the generated token ids — both runs must produce the identical hash. They did: 8a2595cd…, byte-for-byte.
Why it matters
Determinism means anyone re-running the demo with the same inputs can expect the same outputs — reproducibility isn't a promise, it's an asserted property of every green run. (fp32, greedy decoding, fixed kernels.)
check it: scripts/gpt2_demo_attest.sh (leg C: the byte-identical rerun)
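The rerun check in this card reduces to hashing the generated token ids from two runs and comparing the digests. A minimal sketch, assuming SHA-256 and a little-endian 32-bit serialization of the ids (the page specifies neither); `hash_token_ids`, `is_deterministic` and `toy_generate` are hypothetical names, not functions from the repo.

```python
import hashlib
import struct


def hash_token_ids(ids: list[int]) -> str:
    """Serialize ids as little-endian uint32 values and return the SHA-256 hex digest."""
    payload = struct.pack(f"<{len(ids)}I", *ids)
    return hashlib.sha256(payload).hexdigest()


def is_deterministic(generate, prompt: str) -> bool:
    """Run the same request twice; both runs must hash identically."""
    first = hash_token_ids(generate(prompt))
    second = hash_token_ids(generate(prompt))
    return first == second


def toy_generate(prompt: str) -> list[int]:
    # Hypothetical stand-in for a greedy, fixed-kernel generation call.
    return [10, 20, 30]


print(is_deterministic(toy_generate, "any prompt"))  # prints: True
```

Hashing the ids rather than the decoded text keeps the comparison independent of any detokenization step.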
One command runs everything above in sequence — rebuild from raw, parity vs the oracle, byte-identical rerun — and only if every leg passes writes a signed record: DEMO_ATTEST_PASS.
What the record names
the generated sentence (verbatim)
the three from-raw anchors 9837db12 · 0992dddd · 84363adb
the live model file model.safetensors
sha 248dfc39… · 548,105,171 bytes
the two equal run hashes 8a2595cd… == 8a2595cd…
the honest-residuals card (included in the attestation itself)
Attestations are regenerated per run — the recorded prefixes shown here come from the committed green-run transcript; a live run writes the full values.
check it: scripts/gpt2_demo_attest.sh · docs/HELIX_GPT2_DEMO_RUNBOOK.md §3 (the captured transcript)
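The sequencing described in this card (run each leg in order, write the record only if every leg passes) can be sketched as below. This is an illustration, not scripts/gpt2_demo_attest.sh: `run_attestation`, the leg callables and the record layout are hypothetical, and the sketch omits signing, which the page mentions but does not describe.

```python
import json


def run_attestation(legs, record_fields: dict):
    """Run (name, check) pairs in order; return a JSON record only if all pass.

    Returns None at the first failing leg, so later legs never run
    and no record is produced (fail-closed).
    """
    for name, check in legs:
        if not check():
            print(f"FAIL: {name}")
            return None
    record = {"status": "DEMO_ATTEST_PASS", **record_fields}
    return json.dumps(record, indent=2, sort_keys=True)


# Stand-in legs (assumption); real legs would invoke the actual checks.
legs = [
    ("rebuild from raw", lambda: True),
    ("parity vs the oracle", lambda: True),
    ("byte-identical rerun", lambda: True),
]
record = run_attestation(legs, {"anchors": ["9837db12", "0992dddd", "84363adb"]})
print(record is not None)  # prints: True
```

Building the record only after the loop completes means a partial run can never leave behind something that looks like a pass.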