← Back to blog

Multi-Agent RAG solution for Natural Language to SQL pipeline

April 7, 2026

ragnl-to-sqlagentsarchitecture

Ask an AI to write SQL and it will usually produce something that looks right. The problem is that "looks right" and "is right" are not the same thing. The wrong foreign key, the wrong regional table, a missing LIMIT on a query that will scan fifty million rows. These are quiet failures, and they happen because the model is guessing.

The standard fix is to include the database schema in the prompt. That helps, but it doesn't solve the deeper problem: the model still has to reason from the schema each time, and reasoning introduces error. The real question is whether there's a way to skip that reasoning step entirely for questions that have already been answered correctly.

There is, and it's the foundation of how this system works.

Validated queries change the math on hallucinations

The key insight is that most business questions are not unique. "What were total orders last month by region?" and "Show me monthly orders broken down by region for last month" are the same question in different words. If you already have a correct, validated SQL query for one of them, you don't need the model to generate a new one for the other.

This project stores every validated query in a vector database alongside the natural language question it answers. When a new question comes in, the system searches for the closest match before it does anything else. If the match score is high enough, the stored SQL goes straight to validation and execution, bypassing generation entirely. The model never has to write a single line.

The effect on hallucinations is significant. A generated query can fail in dozens of ways. A curated, previously-executed query that survived validation and returned correct results cannot hallucinate. It either matches the new question closely enough to be reused, or it doesn't and the system falls back to full generation. The match threshold is explicit and tunable: queries scoring above 88% similarity are marked as trusted, queries between 70 and 87% get flagged for review, and anything below that triggers the full generation pipeline.

This trust model matters as much as the RAG itself. Users see whether their result came from a trusted curated query, a semi-trusted match, or AI-generated SQL. That transparency changes how they interpret results and how much they verify before acting on them.

Why multiple agents instead of one prompt

A single LLM call that takes a natural language question and returns SQL sounds simpler than a pipeline of agents. In practice it's more fragile. Packing temporal reasoning, schema retrieval, query generation, and result interpretation into one prompt means any weakness in any part degrades the whole response. Agents isolate each concern, which makes failures easier to locate and fix, and makes each step replaceable without touching the rest.

The pipeline in this project runs in two distinct phases before the graph executes.

What this architecture gets right

The pipeline is slower than a single prompt call. That tradeoff is deliberate. The speed gain from the fast path (matched validated queries) offsets the latency of the full pipeline for the cases that need it. And because each agent has a single, narrow responsibility.

The vector database store grows more valuable over time. Every validated query that gets added is a question the system no longer needs to generate. The more domain coverage it has, the more questions hit the fast path, and the fewer opportunities there are for the model to produce something wrong. The collection doesn't just store examples. It stores institutional knowledge about how this specific database should be queried, encoded in SQL that has already been proven to work.