Skip to content

LLM Comprehension Test

Date: March 19, 2026

Objective

Test whether the 27-line AXL Rosetta specification is sufficient for large language models to decode and generate valid AXL packets - with no prior training, no fine-tuning, and no context beyond the spec itself.

If LLMs can parse AXL from a cold start, the protocol is self-documenting. Any LLM-backed agent can be handed the Rosetta and immediately participate on an AXL bus.

Setup

Parameter Value
Specification 27-line Rosetta
Models tested 4
Prior context None (new conversation per model)
Rounds 2

Models Under Test

Model Provider Notes
Grok 3 xAI Large-scale reasoning model
GPT-4.5 OpenAI Latest GPT series
Qwen 3.5 35B Alibaba Same model used in Battleground experiments
Llama 4 Meta Open-weight model

Each model was given the 27-line Rosetta in a fresh conversation with no prior AXL context. No examples, no few-shot prompts, no system instructions beyond the spec itself.

Round 1: Decode and Generate

Each model was tested on 9 tasks: 8 decode tasks (parse existing AXL packets and extract structured information) and 1 generate task (produce a valid AXL packet from a natural language description).

Decode Tasks

Models were presented with AXL packets and asked to extract specific fields:

# Example decode task
Input:  S:PAY.3|AXL-5|AXL-2|TRADE|amount:#1200|T:1710892800
Task:   "Who is paying whom, how much, and in what domain?"
Expected: AXL-5 pays AXL-2, 1200 units, TRADE domain

Generate Task

Models were given a natural language description and asked to produce a valid packet:

# Example generate task
Input:  "Agent AXL-9 sends a critical security alert to all agents
         reporting that AXL-4 is compromised"
Expected: S:COMM.1|AXL-9|AXL-ALL|SECURE|alert:AXL-4_compromised|CRIT

Round 1 Results

Model Decode (8) Generate (1) Total (9) Accuracy
Grok 3 8/8 1/1 9/9 100%
GPT-4.5 7/8 1/1 8/9 88.9%
Qwen 3.5 35B 7/8 1/1 8/9 88.9%
Llama 4 7/8 1/1 8/9 88.9%
Average 93.3%

All four models successfully generated valid AXL packets from natural language. Decode errors were minor field-extraction mistakes, not structural failures.

Round 2: Wormhole Test

The Wormhole test increased difficulty. Models were given multi-packet sequences and asked to trace causal chains across packets - identifying which events caused which responses, across different agents and domains.

# Example Wormhole sequence
S:DATA.1|AXL-3|AXL-ALL|TRADE|ticker:BTC=#67291|T:1710892800
S:DATA.2|AXL-7|AXL-ALL|TRADE|analysis:bearish_divergence|T:1710892812
S:PAY.1|AXL-5|AXL-3|TRADE|amount:#500|T:1710892815
S:COMM.1|AXL-SENTINEL|AXL-ALL|SECURE|alert:unusual_payment_pattern|T:1710892820

Task: "Trace the causal chain. What triggered the alert?"
Expected: Ticker -> Analysis -> Payment -> Alert

Round 2 Results

Model Wormhole Score Accuracy
Grok 3 Near perfect 100%
GPT-4.5 Near perfect 97.2%
Qwen 3.5 35B Near perfect 98.6%
Llama 4 Near perfect 98.6%
Average 98.6%

All four models correctly decoded cross-paradigm causation chains - tracing events from data packets through payment packets to security alerts, across multiple agents and domains.

Combined Results

Round Task Type Average Accuracy
Round 1 Decode + Generate 93.3%
Round 2 Wormhole (causal chains) 98.6%
Combined All tasks 95.8%

Key Observations

  1. Cold-start comprehension: All four models parsed AXL packets correctly from a 27-line spec with zero prior exposure. The Rosetta is self-documenting.

  2. Generation validity: Every model produced structurally valid AXL packets on the first attempt. The pipe-delimited format with typed fields is LLM-native.

  3. Causal chain tracing: All four models correctly traced cross-paradigm causation - following events across DATA, PAY, and COMM domains, through different agents, using timestamps for ordering. This validates the Time bridge design.

  4. Higher accuracy on harder tasks: Round 2 (98.6%) scored higher than Round 1 (93.3%). Models performed better on complex multi-packet analysis than on individual field extraction - suggesting the protocol's structure aids comprehension as context increases.

  5. Cross-model consistency: No model-specific failures. The protocol is not optimized for any particular LLM architecture. It works across transformer variants from four different providers.

Implications

A 27-line specification achieving 95.8% comprehension across four LLMs with zero prior context means:

  • Any LLM-backed agent can join an AXL bus by reading the Rosetta. No training pipeline required.
  • Protocol updates propagate instantly - update the Rosetta, and every LLM agent can parse the new format on next inference.
  • The Rosetta is the onboarding mechanism. No SDK, no client library, no integration guide. Twenty-seven lines of text.