NL2SQL is a platform that translates natural language business questions into semantically correct SQL queries. Unlike generic AI tools, it uses a RAG pipeline with an ontology layer and curated query matching to understand what your data actually means.

How is NL2SQL different from ChatGPT or other AI SQL generators?

Generic AI tools translate words into SQL syntax. NL2SQL translates business intent — it knows that "revenue" means L_EXTENDEDPRICE × (1 − L_DISCOUNT), that "last quarter" must be dynamically anchored to today, and which tables to join. This domain knowledge comes from a vector knowledge base of schema metadata, business terms, and curated query examples.

What databases does NL2SQL support?

NL2SQL currently targets Snowflake data warehouses. The demo uses a TPC-H commerce schema (DBT.DWH) with tables like LINEITEM, ORDERS, CUSTOMER, PART, SUPPLIER, NATION, and REGION.

What are the confidence levels in NL2SQL results?

Every query result includes a confidence signal: Trusted (≥88% match to a curated query), Semi-Trusted (70–87% match, review recommended), or Untrusted (AI-generated or validation failed). This helps users know when to rely on results directly and when to review the SQL.

Multi-Agent RAG solution for Natural Language to SQL pipeline

Ask an AI to write SQL and it will usually produce something that looks right. The problem is that "looks right" and "is right" are not the same thing. The wrong foreign key, the wrong regional table, a missing LIMIT on a query that will scan fifty million rows. These are quiet failures, and they happen because the model is guessing.

The standard fix is to include the database schema in the prompt. That helps, but it doesn't solve the deeper problem: the model still has to reason from the schema each time, and reasoning introduces error. The real question is whether there's a way to skip that reasoning step entirely for questions that have already been answered correctly.

There is, and it's the foundation of how this system works.

Validated queries change the math on hallucinations

The key insight is that most business questions are not unique. "What were total orders last month by region?" and "Show me monthly orders broken down by region for last month" are the same question in different words. If you already have a correct, validated SQL query for one of them, you don't need the model to generate a new one for the other.

This project stores every validated query in a vector database alongside the natural language question it answers. When a new question comes in, the system searches for the closest match before it does anything else. If the match score is high enough, the stored SQL goes straight to validation and execution, bypassing generation entirely. The model never has to write a single line.

The effect on hallucinations is significant. A generated query can fail in dozens of ways. A curated, previously-executed query that survived validation and returned correct results cannot hallucinate. It either matches the new question closely enough to be reused, or it doesn't and the system falls back to full generation. The match threshold is explicit and tunable: queries scoring at or above 92% similarity are marked as trusted, queries between 70 and 91% get flagged for review, and anything below that triggers the full generation pipeline.

This trust model matters as much as the RAG itself. Users see whether their result came from a trusted curated query, a semi-trusted match, or AI-generated SQL. That transparency changes how they interpret results and how much they verify before acting on them.

Why multiple agents instead of one prompt

A single LLM call that takes a natural language question and returns SQL sounds simpler than a pipeline of agents. In practice it's more fragile. Packing temporal reasoning, schema retrieval, query generation, and result interpretation into one prompt means any weakness in any part degrades the whole response. Agents isolate each concern, which makes failures easier to locate and fix, and makes each step replaceable without touching the rest.

The pipeline in this project runs in two distinct phases before the graph executes.

Phase one: analysis before generation

Before the LangGraph pipeline runs, two analyses happen in parallel. A temporal analysis agent sends the user's question to an LLM to detect any date or time references, then asks the user to confirm what those dates resolve to. "Last quarter" on a question asked in April is different from the same question asked in January, and the system makes that explicit rather than guessing.

At the same time, the question matching step searches the validated query store in a vector database and returns the closest SQL match along with its similarity score. The user sees the matched query before any SQL runs and can choose to use it or ask for a fresh query instead. This is not just a performance optimization. It gives users direct control over when they trust a cached answer versus when they want the model to reason through the question from scratch.

Geography analysis also runs here, detecting whether the question implies a specific regional filter. Questions that reference country-level data often need to be routed to a regional table, and catching that before generation saves a retry cycle later.

Phase two: the agent pipeline

When there's no strong match, or the user requests a fresh query, the full pipeline runs through a series of specialized agents managed by LangGraph.

The question parser reads the natural language input and extracts structured intent. This structured representation is what every downstream agent works from. Passing raw text through the whole pipeline would mean each agent doing its own interpretation, which produces inconsistent results.

The schema retrieval agent takes that structured intent and searches the vector database for relevant information, and example queries. This is where the context for generation comes from. The retrieval is targeted, not a full schema dump. That keeps the generation prompt focused and reduces the surface area for mistakes.

The temporal logic agent converts whatever confirmed temporal information came out of phase one into the right expressions. Date handling is one of the most error-prone parts of SQL generation.

The SQL generator takes the structured intent, retrieved schema context, temporal information, and original question, then produces a SQL query using few-shot examples pulled from the same vector database store. The examples are matched to the current question type. Few-shot prompting with domain-specific examples produces more reliable SQL than zero-shot generation against a schema alone.

The validator checks the generated SQL. If validation fails, the graph routes back to the generator for another attempt, up to few retries. This retry loop catches common generation errors.

The executor runs the validated query against Snowflake database. Results come back as a structured dataset.

The result interpreter takes the question, the SQL, and the dataset and writes a natural language summary of what the data shows. The user gets a plain English answer alongside the query and results, not just a raw table they have to interpret themselves.

What this architecture gets right

The pipeline is slower than a single prompt call. That tradeoff is deliberate. The speed gain from the fast path (matched validated queries) offsets the latency of the full pipeline for the cases that need it. And because each agent has a single, narrow responsibility.

The vector database store grows more valuable over time. Every validated query that gets added is a question the system no longer needs to generate. The more domain coverage it has, the more questions hit the fast path, and the fewer opportunities there are for the model to produce something wrong. The collection doesn't just store examples. It stores institutional knowledge about how this specific database should be queried, encoded in SQL that has already been proven to work.