Agent Trace Language Experiments & AAAI Submission
A Trace-Language Framework for Agent Verification
A Trace-Language Framework for Agent Verification
We’ve finalized the experiments and used docling to generate the HTML outputs, alongside the camera-ready double-blind PDF for our upcoming AAAI conference submission: “A Trace-Language Framework for Agent Verification”.
The core thesis of our research demonstrates that agent behavior can be modeled as a trace language. By restricting a highly expressive generative model $\mathcal{G}$ (like an LLM) with a regular (Type-3) verifier $\mathcal{V}$ (like an Aho-Corasick DFA), we can achieve rigorous constraints over the agent’s operations while keeping the system computationally tractable.
Formal Verification Model
We establish the agent as an automaton that emits a string of symbols at each operational step. Formally, a run of an agent $\mathcal{A}$ is a sequence:
The trace of that run is the concatenation of all emitted symbols over its alphabet $\Sigma_\mathcal{A}$:
The trace language $L(\mathcal{A})$ is the set of all possible accepting traces:
From this, we define trace equivalence ($\mathcal{A}1 \equiv\tau \mathcal{A}_2 \iff L(\mathcal{A}_1) = L(\mathcal{A}_2)$), positing that an agent is its trace language, regardless of its internal neural or hard-coded architecture.
The mathematical pivot of the paper is Verifier Closure (Theorem 2). When a generator $\mathcal{G}$ in Chomsky class $\mathcal{L}_i$ is bound by a Type-3 regular verifier $\mathcal{V}$, the resulting verified composition is:
This proves that the verification budget remains linear ($O(n)$ time, $O(1)$ space) without inflating the complexity class of the original generator.
The Six Trace Language Experiments
To prove our framework’s viability, we conducted 6 structured experiments spanning controlled ablation, Chomsky classification, and live competitive deployments:
- Verifier Ablation Study: We proved that a deterministic finite automaton (DFA) verifier improves task success from 35.2% to 57.8% across ML generation tasks, intercepting zero-day trace faults faster and more consistently than an LLM-based judge.
- Chomsky Class Compression: By analyzing ReAct, Tree Search, and Planner-Executor architectures, we demonstrated that projecting agent execution traces onto a shared alphabet forces diverse architectures to collapse into structurally equivalent trace languages.
- Emergent Sub-Agent Discovery: K-Means clustering on transition-count matrices across 100 simulations revealed the spontaneous emergence of sub-agent roles—without the need for hardcoded instructions.
- Trace-Language Verification Analyzer: We developed an execution trace analyzer that extracts alphabet size, stack nesting depth, and cycle counts to heuristically classify the agent’s Chomsky hierarchy class.
- Safety-Critical Verification: A custom safety DFA effectively intercepted 100% of malicious or forbidden sequences (
drop_table,send_money) before execution. - NeuroGolf 2026 (Live Competition Deployment): We embedded our DFA verifier directly into the NeuroGolf 2026 Kaggle Championship notebook generation pipeline. By analyzing the execution pipelines of top-tier submissions, our DFA enforced strict architectural routing for baseline optimization—yielding a 50% improvement in our competition score (climbing from 2739 to 4127).
Resources & Documentation
We have published the full open-source Trace-Language Verification (TLV) framework on GitHub, and the paper has been compiled to HTML using Docling. You can explore the methodology, experiment scripts, and mathematical proofs via our GitHub Pages documentation:
🔗 Agent Trace Language Documentation Site
Competition Links
Our live-testing environment spans international AI conferences and data science platforms:
- AAAI 2026 Conference Competitions: The premier international AI conference platform where our Trace Language architecture is slated for presentation.
- IJCAI 2026 Competitions: The official host of the Abstraction and Reasoning Corpus (ARC-AGI) competitions.
- Kaggle: The 2026 NeuroGolf Championship: The live leaderboard where our DFA verifier pushed our solution into the top decile of submissions.
The double-blind main.pdf is now compiled and prepped in the repository for submission. Good luck on the AAAI review cycle!