LLM Comprehension Test¶

Date: March 19, 2026

Objective¶

Test whether the 27-line AXL Rosetta specification is sufficient for large language models to decode and generate valid AXL packets - with no prior training, no fine-tuning, and no context beyond the spec itself.

If LLMs can parse AXL from a cold start, the protocol is self-documenting. Any LLM-backed agent can be handed the Rosetta and immediately participate on an AXL bus.

Setup¶

Parameter	Value
Specification	27-line Rosetta
Models tested	4
Prior context	None (new conversation per model)
Rounds	2

Models Under Test¶

Model	Provider	Notes
Grok 3	xAI	Large-scale reasoning model
GPT-4.5	OpenAI	Latest GPT series
Qwen 3.5 35B	Alibaba	Same model used in Battleground experiments
Llama 4	Meta	Open-weight model

Each model was given the 27-line Rosetta in a fresh conversation with no prior AXL context. No examples, no few-shot prompts, no system instructions beyond the spec itself.

Round 1: Decode and Generate¶

Each model was tested on 9 tasks: 8 decode tasks (parse existing AXL packets and extract structured information) and 1 generate task (produce a valid AXL packet from a natural language description).

Decode Tasks¶

Models were presented with AXL packets and asked to extract specific fields:

# Example decode task
Input:  S:PAY.3|AXL-5|AXL-2|TRADE|amount:#1200|T:1710892800
Task:   "Who is paying whom, how much, and in what domain?"
Expected: AXL-5 pays AXL-2, 1200 units, TRADE domain

Generate Task¶

Models were given a natural language description and asked to produce a valid packet:

# Example generate task
Input:  "Agent AXL-9 sends a critical security alert to all agents
         reporting that AXL-4 is compromised"
Expected: S:COMM.1|AXL-9|AXL-ALL|SECURE|alert:AXL-4_compromised|CRIT

Round 1 Results¶

Model	Decode (8)	Generate (1)	Total (9)	Accuracy
Grok 3	8/8	1/1	9/9	100%
GPT-4.5	7/8	1/1	8/9	88.9%
Qwen 3.5 35B	7/8	1/1	8/9	88.9%
Llama 4	7/8	1/1	8/9	88.9%
Average				93.3%

All four models successfully generated valid AXL packets from natural language. Decode errors were minor field-extraction mistakes, not structural failures.

Round 2: Wormhole Test¶

The Wormhole test increased difficulty. Models were given multi-packet sequences and asked to trace causal chains across packets - identifying which events caused which responses, across different agents and domains.

# Example Wormhole sequence
S:DATA.1|AXL-3|AXL-ALL|TRADE|ticker:BTC=#67291|T:1710892800
S:DATA.2|AXL-7|AXL-ALL|TRADE|analysis:bearish_divergence|T:1710892812
S:PAY.1|AXL-5|AXL-3|TRADE|amount:#500|T:1710892815
S:COMM.1|AXL-SENTINEL|AXL-ALL|SECURE|alert:unusual_payment_pattern|T:1710892820

Task: "Trace the causal chain. What triggered the alert?"
Expected: Ticker -> Analysis -> Payment -> Alert

Round 2 Results¶

Model	Wormhole Score	Accuracy
Grok 3	Near perfect	100%
GPT-4.5	Near perfect	97.2%
Qwen 3.5 35B	Near perfect	98.6%
Llama 4	Near perfect	98.6%
Average		98.6%

All four models correctly decoded cross-paradigm causation chains - tracing events from data packets through payment packets to security alerts, across multiple agents and domains.

Combined Results¶

Round	Task Type	Average Accuracy
Round 1	Decode + Generate	93.3%
Round 2	Wormhole (causal chains)	98.6%
Combined	All tasks	95.8%

Key Observations¶

Cold-start comprehension: All four models parsed AXL packets correctly from a 27-line spec with zero prior exposure. The Rosetta is self-documenting.
Generation validity: Every model produced structurally valid AXL packets on the first attempt. The pipe-delimited format with typed fields is LLM-native.
Causal chain tracing: All four models correctly traced cross-paradigm causation - following events across DATA, PAY, and COMM domains, through different agents, using timestamps for ordering. This validates the Time bridge design.
Higher accuracy on harder tasks: Round 2 (98.6%) scored higher than Round 1 (93.3%). Models performed better on complex multi-packet analysis than on individual field extraction - suggesting the protocol's structure aids comprehension as context increases.
Cross-model consistency: No model-specific failures. The protocol is not optimized for any particular LLM architecture. It works across transformer variants from four different providers.

Implications¶

A 27-line specification achieving 95.8% comprehension across four LLMs with zero prior context means:

Any LLM-backed agent can join an AXL bus by reading the Rosetta. No training pipeline required.
Protocol updates propagate instantly - update the Rosetta, and every LLM agent can parse the new format on next inference.
The Rosetta is the onboarding mechanism. No SDK, no client library, no integration guide. Twenty-seven lines of text.