How Structured Reasoning Makes AI Agents Smarter and More Efficient

This article is also available as a research paper. If you'd like a PDF version to read offline or share with your team, you can download it here.

The problem with how AI models reason today

When you ask a large language model (LLM) to solve a complex problem, it does what humans do when thinking out loud: it generates a long stream of text, working through the problem step by step in natural language before arriving at an answer.

This approach — known as Chain-of-Thought (CoT) prompting — has dramatically improved AI accuracy. But it comes with a catch. The longer and more verbose the model's reasoning, the more tokens it uses. And tokens cost money. Worse, that reasoning is often noisy: the model can drift off-topic, repeat itself, or circle back unnecessarily. You end up paying for verbosity that doesn't improve the answer.

For companies deploying AI agents at scale — running thousands of queries per day — this is a real operational problem. The most powerful models are expensive, and the cheapest models often lack the reasoning depth needed for complex tasks. It has felt like a forced trade-off: accuracy vs. cost.

This paper argues that the trade-off is a false choice.

What BRAID does differently

BRAID (Bounded Reasoning for Autonomous Inference and Decisions) replaces free-form natural language reasoning with a structured, symbolic representation: a Mermaid flowchart.

Instead of letting a model "think aloud" through a problem, BRAID first encodes the logical reasoning path as a compact diagram — a set of nodes and directed edges that map out exactly what steps are needed, in what order, with what conditions. This diagram is then passed to the model as a structured instruction, telling it precisely how to approach the problem rather than leaving it to improvise.
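
To make the idea concrete, here is a minimal sketch of what a reasoning graph and the resulting solver prompt might look like. The flowchart content, the `build_solver_prompt` helper, and the prompt wording are all illustrative assumptions, not the graphs or prompts used in the research.

```python
# Hypothetical reasoning graph for a word problem, written in Mermaid syntax.
# Node names and labels are illustrative, not taken from the paper.
REASONING_GRAPH = """flowchart TD
    A[Extract quantities from problem] --> B[Identify the unknown]
    B --> C{Units consistent?}
    C -- No --> D[Convert to common units]
    D --> E[Set up equation]
    C -- Yes --> E
    E --> F[Solve for unknown]
    F --> G{Answer satisfies all constraints?}
    G -- No --> E
    G -- Yes --> H[State final answer]
"""


def build_solver_prompt(graph: str, question: str) -> str:
    """Combine a reasoning graph and a question into one solver prompt."""
    return (
        "Follow the flowchart below step by step. "
        "Do not add steps that are not in the flowchart.\n\n"
        f"{graph}\n"
        f"Question: {question}\n"
    )


prompt = build_solver_prompt(
    REASONING_GRAPH,
    "A tank holds 2.5 m^3 of water. How many litres is that?",
)
print(prompt)
```

The graph is passed as raw Mermaid text, so the solver sees the nodes, the edge order, and the branch conditions in a compact form rather than as paragraphs of instructions.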

The key insight: structure is a substitute for raw model power. When the reasoning path is pre-defined and deterministic, a smaller, cheaper model can execute it accurately — because it doesn't need to figure out how to think, only what to do at each step.

BRAID turns reasoning into two separable tasks:

Generation: A capable model creates the reasoning graph once.
Solving: A smaller, faster, cheaper model follows that graph to produce answers at scale.

Because the graph can be cached and reused across many queries, the cost of generation is amortized. In practice, the solving step — which runs every time — becomes dramatically cheaper.
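
The generate-once, solve-many split can be sketched as a small cache in front of the generator. Everything here is a hypothetical stand-in: `generate_graph` and `solve` replace real model calls, `GraphCache` is an illustrative name, and the prices passed to `amortized_cost` are made up to show the arithmetic.

```python
from typing import Callable, Dict


# Stand-ins for real model calls. Both are hypothetical placeholders:
# in practice each would call an LLM API.
def generate_graph(task_type: str) -> str:
    """Expensive step: a capable model would write the reasoning graph."""
    return f"flowchart TD\n    A[Parse {task_type} input] --> B[Produce answer]\n"


def solve(graph: str, query: str) -> str:
    """Cheap step: a small model would follow the graph for one query."""
    return f"answer to {query!r} using a graph of {len(graph.splitlines())} lines"


class GraphCache:
    """Caches one reasoning graph per task type so generation runs once."""

    def __init__(self, generator: Callable[[str], str]) -> None:
        self._generator = generator
        self._graphs: Dict[str, str] = {}
        self.generator_calls = 0

    def get(self, task_type: str) -> str:
        if task_type not in self._graphs:
            self._graphs[task_type] = self._generator(task_type)
            self.generator_calls += 1
        return self._graphs[task_type]


def amortized_cost(generation_cost: float, solve_cost: float, n_queries: int) -> float:
    """Average cost per query when one graph serves n_queries queries."""
    return generation_cost / n_queries + solve_cost


cache = GraphCache(generate_graph)
answers = [solve(cache.get("invoice"), q) for q in ["q1", "q2", "q3"]]
print(cache.generator_calls)  # the generator ran once for three queries

# Illustrative prices in cents, not measured values from the paper.
print(amortized_cost(generation_cost=5.0, solve_cost=0.01, n_queries=1000))
```

As the number of queries per graph grows, the generation term shrinks toward zero and the per-query cost approaches the cost of the solver alone.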

How we tested it

We evaluated BRAID across three rigorous benchmark datasets:

GSM-Hard — challenging mathematical word problems
SCALE MultiChallenge — complex multi-step reasoning tasks (newly released, minimising risk of model memorisation)
AdvancedIF — instruction-following tasks requiring precise constraint adherence

We tested multiple combinations of GPT model tiers as generators and solvers, tracking accuracy, cost per query (in US cents), and a metric we defined called Performance-per-Dollar (PPD) — accuracy achieved relative to cost incurred, normalised against a GPT-5 Medium baseline.
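
One plausible reading of that description is accuracy per unit cost, divided by the same ratio for the baseline. The sketch below assumes that reading; the exact formula is defined in the paper, and the function name, arguments, and numbers here are illustrative only.

```python
def performance_per_dollar(
    accuracy: float,
    cost_cents: float,
    baseline_accuracy: float,
    baseline_cost_cents: float,
) -> float:
    """Accuracy per unit cost, divided by the same ratio for the baseline.

    Assumed reading of the PPD description; the baseline scores 1.0 by construction.
    """
    if cost_cents <= 0 or baseline_cost_cents <= 0:
        raise ValueError("costs must be positive")
    if baseline_accuracy <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (accuracy / cost_cents) / (baseline_accuracy / baseline_cost_cents)


# Made-up numbers for illustration only, not results from the benchmarks.
baseline = performance_per_dollar(0.90, 2.0, 0.90, 2.0)
candidate = performance_per_dollar(0.90, 0.1, 0.90, 2.0)
print(baseline, candidate)
```

Under this reading, a configuration that matches the baseline's accuracy at one twentieth of the cost scores roughly 20, and any score above 1 means more accuracy per cent spent than the baseline.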

All results and raw data are available at benchmark.openserv.ai.

What we found

Smaller models can match larger ones — with the right structure

On GSM-Hard, GPT-5 Nano Minimal with BRAID improved from 94% accuracy (without BRAID) to 98% — matching the performance of GPT-5 Medium at a fraction of the cost.

On SCALE MultiChallenge, the gains were even more striking. GPT-4o jumped from 19.9% accuracy to 53.7% with BRAID. GPT-5 Nano Minimal went from 23.9% to 45.2%, outperforming the larger GPT-5 Minimal running without structure (40.4%).

On AdvancedIF, GPT-5 Nano Minimal more than doubled its accuracy: from 18% to 40%.

This pattern — smaller model with BRAID matching or beating a larger model without it — appeared consistently across all three datasets. We call it the BRAID Parity Effect.

The economics are transformative

The PPD results are where the economic case becomes clear.

On GSM-Hard, pairing GPT-4.1 as the generator with GPT-5 Nano Minimal as the solver produced a PPD of 74.06 — meaning this configuration was over 74 times more cost-efficient than the GPT-5 Medium baseline, at equal or better accuracy.

On SCALE MultiChallenge, using GPT-5 Medium to generate and GPT-5 Nano Medium to solve achieved a PPD of 30.31, while hitting 59.2% accuracy (comparable to the best monolithic results).

On AdvancedIF, the optimal production configuration (GPT-5 Medium generating, GPT-5 Nano Medium solving) delivered a 16x cost reduction while reaching 57% accuracy, against the baseline's 63%. For many production deployments, a six-point accuracy drop is an acceptable price for that saving.

The "Golden Quadrant"

Our results consistently pointed to one optimal architecture: use a high-capability model to generate the reasoning graph once, and a low-cost model to execute it repeatedly.

The combination of a capable generator (e.g., GPT-4.1 or GPT-5 Medium) with a nano-tier solver achieves efficiency gains of 30–74x over monolithic deployments, depending on the task type. This isn't just an engineering curiosity — it changes what's economically viable for companies building AI agents at scale.

Why this works: what structured prompting does to a model

Several mechanisms explain why structure improves performance:

It eliminates reasoning drift. Free-form CoT allows models to go off-topic, repeat themselves, or generate tokens that carry no semantic value. A bounded graph prevents this by constraining every reasoning step to a defined node.

It separates planning from execution. BRAID makes explicit that figuring out how to solve a problem is different from solving it. By decoupling these, each can be handled by the model tier best suited for it.

It activates different layers. Research in mechanistic interpretability shows that LLMs process structured input differently: later transformer layers engage more strongly with logical structure than with natural-language reasoning, improving signal fidelity.

It makes constraints explicit. For tasks requiring adherence to rules (instruction following, legal or policy compliance, constrained generation), encoding constraints as flowchart nodes with verification loops forces the model to check its output — something free-form generation often skips.
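
The control flow of a verification loop can be sketched in plain Python. In BRAID the checks are flowchart nodes that the model itself executes; this sketch moves them into Python predicates purely to show the loop. The `draft_stub` and `revise_stub` functions are hypothetical stand-ins for model calls, and the two constraints are invented for the example.

```python
from typing import Callable, List, Tuple

Constraint = Tuple[str, Callable[[str], bool]]


def run_with_verification(
    draft: Callable[[], str],
    revise: Callable[[str, List[str]], str],
    constraints: List[Constraint],
    max_revisions: int = 3,
) -> Tuple[str, List[str]]:
    """Draft once, then loop: check every constraint, revise on failure.

    Returns the final text and the names of constraints still failing.
    """
    text = draft()
    failed = [name for name, check in constraints if not check(text)]
    revisions = 0
    while failed and revisions < max_revisions:
        text = revise(text, failed)
        failed = [name for name, check in constraints if not check(text)]
        revisions += 1
    return text, failed


# Hypothetical stand-ins for model calls.
def draft_stub() -> str:
    return "thanks everyone for the great quarter and the hard work you put in"


def revise_stub(text: str, failed: List[str]) -> str:
    if "max 8 words" in failed:
        text = " ".join(text.split()[:8])
    if "starts with capital" in failed:
        text = text[:1].upper() + text[1:]
    return text


constraints: List[Constraint] = [
    ("max 8 words", lambda t: len(t.split()) <= 8),
    ("starts with capital", lambda t: t[:1].isupper()),
]

final_text, still_failing = run_with_verification(draft_stub, revise_stub, constraints)
print(final_text, still_failing)
```

The `max_revisions` cap is a design choice of this sketch: it keeps a failing check from looping forever and surfaces the unmet constraints to the caller instead.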

Principles for building effective BRAID graphs

Through empirical testing, we identified four design principles that determine whether a reasoning graph works well:

1. Node atomicity. Each node should represent a single, discrete reasoning step — ideally fewer than 15 tokens. Nodes that combine observation, analysis, and conclusion reintroduce the noise of unstructured prompting.

2. Scaffolding, not leaking. Nodes should encode how to approach a problem, not pre-generate the answer. A node that reads "Draft introduction: Dear Team, I regret to inform you..." defeats the purpose. A node that reads "Draft introduction: Acknowledge recent success → Pivot to financial news → Maintain regretful but professional tone" works correctly.

3. Deterministic branching. Edges between nodes should be labelled with explicit conditions ("If text > 300 words → B"), not open-ended transitions. This turns the solver's path through the graph into a deterministic traversal instead of leaving each transition to open-ended token prediction.

4. Terminal verification loops. Effective graphs end with a "critic" phase: nodes that check the output against constraints before finalising. If a check fails, the graph routes back to a revision node. This gives smaller models a mechanism for self-correction they don't naturally possess.
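
Three of these principles can be checked mechanically before a graph is deployed. The sketch below is a hypothetical linter, not tooling from the research: the `Edge` type, the `lint_graph` function, and the use of whitespace-separated words as a stand-in for tokens are all assumptions. Principle 2 (scaffolding, not leaking) is left out because it needs a judgment about content rather than structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Edge:
    source: str
    target: str
    condition: Optional[str] = None  # e.g. "text > 300 words"


def lint_graph(nodes: Dict[str, str], edges: List[Edge], max_tokens: int = 15) -> List[str]:
    """Return warnings for nodes and edges that break the design principles."""
    warnings: List[str] = []

    # Principle 1: node atomicity. Whitespace-separated words are a rough
    # proxy for tokens; a real tokenizer would give different counts.
    for node_id, label in nodes.items():
        if len(label.split()) >= max_tokens:
            warnings.append(f"{node_id}: label too long")

    # Principle 3: deterministic branching. Any node with several outgoing
    # edges needs an explicit condition on each of them.
    outgoing: Dict[str, List[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge.source, []).append(edge)
    for node_id, out in outgoing.items():
        if len(out) > 1 and any(not e.condition for e in out):
            warnings.append(f"{node_id}: branch without condition")

    # Principle 4: terminal verification loop. Look for at least one edge
    # that points back to an earlier node (by insertion order of `nodes`).
    order = {node_id: i for i, node_id in enumerate(nodes)}
    if not any(order[e.target] < order[e.source] for e in edges):
        warnings.append("graph: no verification loop")

    return warnings


nodes = {
    "A": "Acknowledge recent success",
    "B": "Pivot to financial news",
    "C": "Check: text under 300 words?",
    "D": "Finalise",
}
edges = [
    Edge("A", "B"),
    Edge("B", "C"),
    Edge("C", "B", "text > 300 words"),
    Edge("C", "D", "text <= 300 words"),
]
print(lint_graph(nodes, edges))  # no warnings for this graph
```

A graph whose branches carry no conditions, or which never routes back to a revision node, would produce warnings from the same function.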

Looking ahead

Several extensions to BRAID are planned:

Specialist Architect models. Rather than using a general-purpose model to generate graphs, a fine-tuned model trained specifically on reasoning graph construction could produce higher-quality structures at lower cost.

Dynamic re-planning. The current framework treats graphs as static. Future versions will allow the solver to signal a topology error and trigger targeted re-generation of affected subgraphs, enabling adaptation to unexpected inputs.

Visual graph ingestion. With the rise of vision-language models, we plan to explore feeding rendered visual representations of Mermaid diagrams rather than raw code — potentially leveraging spatial reasoning capabilities in next-generation multimodal models.

What this means for Neol

BRAID was developed by Armağan Amcalar, Neol's Chief Technology Officer, in collaboration with Eyup Cinar at Eskisehir Osmangazi University. The research was conducted at OpenServ Labs, and the paper specifically acknowledges Neol for testing the BRAID framework in industrial settings and for providing feedback on real-world deployment.

At Neol, this research informs how we architect the intelligence layer inside Neol Hub. The platform handles complex people-data reasoning at scale — matching skills, interpreting natural-language queries, enriching sparse profiles, and surfacing relevant connections across large networks. These tasks require high reasoning accuracy, but they also need to run efficiently across millions of queries.

BRAID's split-architecture approach directly shapes how Neol builds and deploys its AI agents: using structured reasoning graphs to maintain precision while keeping inference costs viable for enterprise-scale deployment.