LLM Comprehension Test¶
Date: March 19, 2026
Objective¶
Test whether the 27-line AXL Rosetta specification is sufficient for large language models to decode and generate valid AXL packets - with no prior training, no fine-tuning, and no context beyond the spec itself.
If LLMs can parse AXL from a cold start, the protocol is self-documenting. Any LLM-backed agent can be handed the Rosetta and immediately participate on an AXL bus.
Setup¶
| Parameter | Value |
|---|---|
| Specification | 27-line Rosetta |
| Models tested | 4 |
| Prior context | None (new conversation per model) |
| Rounds | 2 |
Models Under Test¶
| Model | Provider | Notes |
|---|---|---|
| Grok 3 | xAI | Large-scale reasoning model |
| GPT-4.5 | OpenAI | Latest GPT series |
| Qwen 3.5 35B | Alibaba | Same model used in Battleground experiments |
| Llama 4 | Meta | Open-weight model |
Each model was given the 27-line Rosetta in a fresh conversation with no prior AXL context. No examples, no few-shot prompts, no system instructions beyond the spec itself.
Round 1: Decode and Generate¶
Each model was tested on 9 tasks: 8 decode tasks (parse existing AXL packets and extract structured information) and 1 generate task (produce a valid AXL packet from a natural language description).
Decode Tasks¶
Models were presented with AXL packets and asked to extract specific fields:
```
# Example decode task
Input: S:PAY.3|AXL-5|AXL-2|TRADE|amount:#1200|T:1710892800
Task: "Who is paying whom, how much, and in what domain?"
Expected: AXL-5 pays AXL-2, 1200 units, TRADE domain
```
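The field extraction the models performed can also be done mechanically. A minimal Python sketch, assuming the layout inferred from the example packets in this report (header, sender, receiver, domain, `key:value` payload, then an optional trailing `T:` timestamp or priority flag); the authoritative definition is the 27-line Rosetta itself:

```python
def parse_axl(packet: str) -> dict:
    """Parse an AXL packet into named fields.

    Field layout is inferred from the example packets in this report,
    not taken from official tooling.
    """
    parts = packet.split("|")
    head, sender, receiver, domain, payload = parts[:5]
    key, _, value = payload.partition(":")
    fields = {
        "type": head,          # e.g. "S:PAY.3"
        "sender": sender,      # e.g. "AXL-5"
        "receiver": receiver,  # e.g. "AXL-2"
        "domain": domain,      # e.g. "TRADE"
        "payload": {key: value.lstrip("#")},  # '#' marks numeric values
    }
    for extra in parts[5:]:    # trailing fields: timestamp or priority flag
        if extra.startswith("T:"):
            fields["timestamp"] = int(extra[2:])
        else:
            fields["priority"] = extra
    return fields

pkt = parse_axl("S:PAY.3|AXL-5|AXL-2|TRADE|amount:#1200|T:1710892800")
# pkt["sender"] == "AXL-5", pkt["payload"]["amount"] == "1200"
```

That a correct parser fits in a dozen lines is the same property the decode tasks probe: every field boundary is explicit in the packet itself.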
Generate Task¶
Models were given a natural language description and asked to produce a valid packet:
```
# Example generate task
Input: "Agent AXL-9 sends a critical security alert to all agents
reporting that AXL-4 is compromised"
Expected: S:COMM.1|AXL-9|AXL-ALL|SECURE|alert:AXL-4_compromised|CRIT
```
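Generation is the inverse operation: a single pipe-join. A sketch with a hypothetical helper (`build_axl` is illustrative, not part of any AXL tooling; field order follows the examples in this report):

```python
def build_axl(ptype, sender, receiver, domain, key, value,
              numeric=False, tail=None):
    """Assemble a pipe-delimited AXL packet.

    Illustrative only. '#' marks numeric payload values; `tail` carries
    a trailing T:<timestamp> or a priority flag such as CRIT.
    """
    payload = f"{key}:{'#' if numeric else ''}{value}"
    parts = [ptype, sender, receiver, domain, payload]
    if tail:
        parts.append(tail)
    return "|".join(parts)

alert = build_axl("S:COMM.1", "AXL-9", "AXL-ALL", "SECURE",
                  "alert", "AXL-4_compromised", tail="CRIT")
# alert == "S:COMM.1|AXL-9|AXL-ALL|SECURE|alert:AXL-4_compromised|CRIT"
```

The generate task asks the model to do exactly this mapping, from free-form English to slotted fields, which is why every model succeeded on the first attempt.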
Round 1 Results¶
| Model | Decode (8) | Generate (1) | Total (9) | Accuracy |
|---|---|---|---|---|
| Grok 3 | 8/8 | 1/1 | 9/9 | 100% |
| GPT-4.5 | 7/8 | 1/1 | 8/9 | 88.9% |
| Qwen 3.5 35B | 7/8 | 1/1 | 8/9 | 88.9% |
| Llama 4 | 7/8 | 1/1 | 8/9 | 88.9% |
| Average | | | | 93.3% |
All four models successfully generated valid AXL packets from natural language. Decode errors were minor field-extraction mistakes, not structural failures.
Round 2: Wormhole Test¶
The Wormhole test increased difficulty. Models were given multi-packet sequences and asked to trace causal chains across packets - identifying which events caused which responses, across different agents and domains.
```
# Example Wormhole sequence
S:DATA.1|AXL-3|AXL-ALL|TRADE|ticker:BTC=#67291|T:1710892800
S:DATA.2|AXL-7|AXL-ALL|TRADE|analysis:bearish_divergence|T:1710892812
S:PAY.1|AXL-5|AXL-3|TRADE|amount:#500|T:1710892815
S:COMM.1|AXL-SENTINEL|AXL-ALL|SECURE|alert:unusual_payment_pattern|T:1710892820
Task: "Trace the causal chain. What triggered the alert?"
Expected: Ticker -> Analysis -> Payment -> Alert
```
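The temporal ordering underlying a Wormhole task can be recovered mechanically from the `T:` fields. A minimal Python sketch, assuming the field layout from the examples above; note it only reconstructs the sequence, while judging which event actually caused which is the reasoning step the models were tested on:

```python
def trace_sequence(packets):
    """Order a multi-packet sequence by its T: timestamps and label each
    step as <paradigm>:<payload key>, e.g. "PAY:amount".

    Sketch only: recovers temporal order, not causal attribution.
    """
    def timestamp(pkt):
        for field in pkt.split("|"):
            if field.startswith("T:"):
                return int(field[2:])
        return 0  # packets without a timestamp sort first

    chain = []
    for pkt in sorted(packets, key=timestamp):
        parts = pkt.split("|")
        paradigm = parts[0].split(":")[1].split(".")[0]  # "S:DATA.1" -> "DATA"
        key = parts[4].split(":")[0]                     # payload key
        chain.append(f"{paradigm}:{key}")
    return " -> ".join(chain)
```

On the sequence above this yields `DATA:ticker -> DATA:analysis -> PAY:amount -> COMM:alert`, the same chain the models were expected to trace.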
Round 2 Results¶
| Model | Wormhole Score | Accuracy |
|---|---|---|
| Grok 3 | Perfect | 100% |
| GPT-4.5 | Near perfect | 97.2% |
| Qwen 3.5 35B | Near perfect | 98.6% |
| Llama 4 | Near perfect | 98.6% |
| Average | | 98.6% |
All four models correctly decoded cross-paradigm causation chains - tracing events from data packets through payment packets to security alerts, across multiple agents and domains.
Combined Results¶
| Round | Task Type | Average Accuracy |
|---|---|---|
| Round 1 | Decode + Generate | 93.3% |
| Round 2 | Wormhole (causal chains) | 98.6% |
| Combined | All tasks | 95.8% |
Key Observations¶
- **Cold-start comprehension:** All four models parsed AXL packets correctly from a 27-line spec with zero prior exposure. The Rosetta is self-documenting.
- **Generation validity:** Every model produced a structurally valid AXL packet on the first attempt. The pipe-delimited format with typed fields is LLM-native.
- **Causal chain tracing:** All four models correctly traced cross-paradigm causation - following events across `DATA`, `PAY`, and `COMM` domains, through different agents, using timestamps for ordering. This validates the Time bridge design.
- **Higher accuracy on harder tasks:** Round 2 (98.6%) scored higher than Round 1 (93.3%). Models performed better on complex multi-packet analysis than on individual field extraction - suggesting the protocol's structure aids comprehension as context increases.
- **Cross-model consistency:** No model-specific failures. The protocol is not optimized for any particular LLM architecture; it works across transformer variants from four different providers.
Implications¶
A 27-line specification achieving 95.8% comprehension across four LLMs with zero prior context means:
- Any LLM-backed agent can join an AXL bus by reading the Rosetta. No training pipeline required.
- Protocol updates propagate instantly - update the Rosetta, and every LLM agent can parse the new format on next inference.
- The Rosetta is the onboarding mechanism. No SDK, no client library, no integration guide. Twenty-seven lines of text.