Improving SQL Queries With Fine-Tuning?

Written by Ajay Patel
Apr 24, 2026
6 Min Read

AI-powered SQL generators perform well on simple prompts but often break on queries that involve grouped calculations, aggregation boundaries, or multi-table relationships. These failures are subtle, easy to miss, and costly in production.

This article examines one deceptively simple question: "What is the percentage of orders that have more than one product?" It walks through why two popular open-source models failed to answer it correctly, and how fine-tuning a model on a targeted dataset produced a correct, executable result.

Why SQL Generation Breaks on Aggregation Queries

Answering a percentage-based aggregation question in SQL requires more than translating words into syntax. It requires understanding grouping semantics, aggregation order, and execution constraints. Generic SQL generation models are trained on broad datasets that do not always encode these rules precisely enough for edge cases.

The result is queries that look correct at a glance but fail the moment they hit a database engine.

The Prompt Being Tested

What is the percentage of orders that have more than one product?

This prompt tests whether a model understands how to group orders by ID, count line items per order, filter for orders with more than one item, and then calculate a percentage across the full order table. It requires a subquery or HAVING clause. It cannot be solved with a flat aggregation.
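
To make the task concrete, here is a minimal reproduction of the setup, assuming a simplified two-table schema (the real Shopify schema has many more columns; the table and column names below match those used in the queries later in the article). The ground truth is computed in plain Python so we know what a correct SQL answer must return:

```python
import sqlite3

# Assumed minimal schema: one order row, many line-item rows per order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (
    id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES shopify_order(id),
    product_id INTEGER
);
INSERT INTO shopify_order (id) VALUES (1), (2), (3), (4);
-- order 1: two items, order 2: one, order 3: three, order 4: one
INSERT INTO shopify_orderlineitem (order_id, product_id) VALUES
    (1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (3, 12), (4, 12);
""")

# Ground truth computed in Python: 2 of 4 orders have >1 item -> 50%.
rows = conn.execute(
    "SELECT order_id, COUNT(*) FROM shopify_orderlineitem GROUP BY order_id"
).fetchall()
multi = sum(1 for _, n in rows if n > 1)
total = conn.execute("SELECT COUNT(*) FROM shopify_order").fetchone()[0]
print(multi / total * 100)  # 50.0
```

Any generated SQL for this prompt must reproduce that number: first a per-order grouping, then a percentage over the full order table.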

SQLCoder2: What It Generated and Why It Failed

SQLCoder2 produced this query:

```sql
SELECT CAST(COUNT(DISTINCT CASE WHEN COUNT(shopify_orderlineitem.id) > 1 
THEN shopify_order.id END) AS FLOAT) / NULLIF(COUNT(DISTINCT shopify_order.id), 0) AS proportion 
FROM shopify_order 
JOIN shopify_orderlineitem ON shopify_order.id = shopify_orderlineitem.order_id;
```

Execution error:

```
SQL Error [42803]: ERROR: aggregate function calls cannot be nested
```

The problem is structural. SQL does not allow aggregate functions like COUNT() to be nested directly inside another aggregate without a grouping boundary. To identify orders with multiple line items, you need a grouped subquery or a HAVING clause. This query provides neither.
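
A sketch of the missing grouping boundary, run against an assumed minimal version of the schema (table names match the queries above; the data is illustrative). The inner query aggregates per order first; only then does the outer query aggregate over those per-order counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (
    id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO shopify_order (id) VALUES (1), (2), (3);
INSERT INTO shopify_orderlineitem (order_id) VALUES (1), (1), (2), (3);
""")

# The grouping boundary SQLCoder2 skipped: aggregate per order in a
# subquery, then aggregate over those per-order counts in the outer query.
fixed = """
SELECT CAST(SUM(CASE WHEN n > 1 THEN 1 ELSE 0 END) AS FLOAT)
       / COUNT(*) AS proportion
FROM (SELECT order_id, COUNT(*) AS n
      FROM shopify_orderlineitem
      GROUP BY order_id) per_order;
"""
print(conn.execute(fixed).fetchone()[0])  # 1 of 3 orders -> ~0.333
```

Note this variant only counts orders that have at least one line item; a percentage over all orders needs the full order table in the denominator, as the fine-tuned query later in this article does.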

SQLCoder-34B: What It Generated and Why It Failed

SQLCoder-34B produced this query:

```sql
SELECT CAST(COUNT(DISTINCT o.id) filter (
    WHERE COUNT(DISTINCT oi.product_id) > 1) AS FLOAT) 
    / NULLIF(COUNT(DISTINCT o.id), 0) AS proportion
FROM shopify_order o
JOIN shopify_orderlineitem oi ON o.id = oi.order_id
GROUP BY o.id;
```

Execution error:

```
SQL Error [42803]: ERROR: aggregate functions are not allowed in FILTER
```

This query places an aggregate condition inside a FILTER clause, which SQL does not permit. Conditions like COUNT(*) > 1 must be evaluated in a grouped context using HAVING, not inside a filter predicate. The larger model made a different structural mistake, but the result was the same: a query that cannot execute.
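
Here is what SQLCoder-34B was reaching for, expressed legally: the aggregate condition `COUNT(DISTINCT product_id) > 1` moves into a HAVING clause inside a grouped subquery. This runs against an assumed minimal schema with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (
    id INTEGER PRIMARY KEY, order_id INTEGER, product_id INTEGER);
INSERT INTO shopify_order (id) VALUES (1), (2);
INSERT INTO shopify_orderlineitem (order_id, product_id)
VALUES (1, 10), (1, 11), (2, 10);
""")

# The aggregate condition belongs in HAVING, inside a grouped subquery,
# not inside a FILTER predicate.
query = """
SELECT CAST((SELECT COUNT(*) FROM (
            SELECT order_id FROM shopify_orderlineitem
            GROUP BY order_id
            HAVING COUNT(DISTINCT product_id) > 1)) AS FLOAT)
       / (SELECT COUNT(*) FROM shopify_order) AS proportion;
"""
print(conn.execute(query).fetchone()[0])  # 0.5
```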

Improving SQL Generation through Fine-Tuning
Understand how fine-tuning enhances LLM performance for structured query generation. Includes dataset preparation and evaluation.
Murtuza Kutub, Co-Founder, F22 Labs

Walk away with actionable insights on AI adoption. Limited seats available!

Saturday, 2 May 2026, 10PM IST (60 mins)

What These Failures Have in Common

Both models produced syntactically plausible SQL that failed at runtime. The core issues were:

  • Neither model correctly handled nested aggregation, which requires a subquery or HAVING clause
  • Both reached for complex structures like CASE statements and FILTER clauses when simpler patterns would have worked
  • Neither validated output against execution rules before returning it
  • Both misidentified where the aggregation boundary needed to be placed

These are not random errors. They reflect systematic gaps in how generic SQL generation models handle multi-step aggregation logic.

The Fine-Tuning Approach

Improving SQL queries with fine-tuning starts with identifying the exact failure patterns. After seeing consistent structural errors in both models, a custom fine-tuning approach was used to explicitly teach correct aggregation patterns for this class of queries.

Building the Dataset

A custom dataset was built around the original prompt and variations involving grouped counts, percentage calculations, and multi-table joins. Each entry paired a natural-language question with an executable SQL query that followed correct aggregation rules.

The dataset contained 196 samples split as follows:

  • Training set: 80% (approximately 157 entries)
  • Validation set: 20% (approximately 39 entries)
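
A minimal sketch of that split, assuming a simple question/SQL record shape (the article does not show the actual dataset format, so the field names here are hypothetical). Rounding 80% of 196 gives the 157/39 split quoted above:

```python
import random

# Hypothetical entry shape: each sample pairs a natural-language question
# with an execution-verified SQL query.
samples = [
    {"question": "What is the percentage of orders that have "
                 "more than one product?",
     "sql": "SELECT ... HAVING COUNT(*) > 1 ..."},  # placeholder query text
] * 196  # stand-in so the split math matches the article's sample count

random.seed(0)
random.shuffle(samples)
cut = round(len(samples) * 0.8)        # 80/20 split: round(156.8) = 157
train, val = samples[:cut], samples[cut:]
print(len(train), len(val))  # 157 39
```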

Training Process

Fine-tuning completed in 43 minutes using a transfer-learning approach with Hugging Face Transformers, PEFT, and LoRA. LoRA reduced the number of trainable parameters while preserving the base model's general knowledge.

Training and validation loss curves remained closely aligned throughout, indicating stable learning without overfitting. The training loss decreased steadily, and the validation loss followed the same trend without significant divergence. This confirmed the model was generalizing to new examples rather than memorizing the training set.
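
A sketch of how such a run is typically wired together with PEFT and Transformers. The model name is a placeholder and the hyperparameters (rank, alpha, target modules) are illustrative defaults, not values taken from this experiment:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model id; the article does not name the base checkpoint.
base = AutoModelForCausalLM.from_pretrained("base-sql-model")

lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the base architecture
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # small fraction of total parameters
```

The wrapped model is then passed to a standard Transformers `Trainer` loop; only the LoRA matrices receive gradient updates.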

What the Fine-Tuned Model Generated

```sql
select
    (COUNT(*) * 100.0 / (
    select
        COUNT(*)
    from
        shopify_order))
from
    shopify_order
where
    id in (
    select
        order_id
    from
        shopify_orderlineitem
    group by
        order_id
    having
        COUNT(*) > 1);
```

This query executes correctly.

The subquery in the WHERE clause first identifies orders that have more than one line item using GROUP BY and HAVING. The outer query then counts how many orders meet that condition and divides by the total order count to produce a percentage. The * 100.0 ensures the result is returned as a percentage rather than a decimal ratio.

Why This Query Works Where the Others Failed

The fine-tuned model avoided nested aggregations entirely. Instead of trying to count inside a count, it separated the two concerns: identifying qualifying orders in a subquery, then calculating the proportion in the outer query.

This is the correct pattern for this class of problem, and it is exactly what the custom training dataset was designed to teach. For very large datasets, replacing the IN clause with EXISTS or a JOIN would improve performance, but the logic is correct as written.
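
For illustration, here is the JOIN-based variant executing against an assumed minimal schema with sample data: the qualifying orders come from a joined derived table instead of an IN list, but the logic and result are the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (
    id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO shopify_order (id) VALUES (1), (2), (3), (4);
INSERT INTO shopify_orderlineitem (order_id)
VALUES (1), (1), (2), (3), (3), (4);
""")

# Same logic as the fine-tuned model's query, restructured as a JOIN:
# orders 1 and 3 have two line items each -> 2 of 4 orders -> 50%.
join_query = """
SELECT COUNT(*) * 100.0 / (SELECT COUNT(*) FROM shopify_order) AS pct
FROM shopify_order o
JOIN (SELECT order_id FROM shopify_orderlineitem
      GROUP BY order_id
      HAVING COUNT(*) > 1) multi ON multi.order_id = o.id;
"""
print(conn.execute(join_query).fetchone()[0])  # 50.0
```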

Technical Stack

The fine-tuning pipeline used the following tools:

  • PyTorch: primary deep learning framework for model training
  • Hugging Face Transformers: access to pre-trained models and fine-tuning utilities
  • PEFT: parameter-efficient fine-tuning methods including LoRA
  • Datasets: data handling and preprocessing
  • Accelerate: distributed training and mixed precision support

About LoRA

LoRA (Low-Rank Adaptation) is a fine-tuning technique developed by Microsoft in 2021. It adds pairs of low-rank matrices to the weights of the original model. Only these matrices are updated during training, while the base model weights remain frozen.

This significantly reduces the number of trainable parameters without sacrificing task-specific performance. It is what made a 43-minute fine-tuning run practical rather than requiring days of GPU compute.
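
The parameter arithmetic explains why. For a weight matrix of size d_out x d_in, LoRA trains a pair B (d_out x r) and A (r x d_in) instead of the full matrix; the layer dimensions below are illustrative, not the actual base model's:

```python
# LoRA parameter arithmetic for one weight matrix, with illustrative
# transformer layer sizes (not the actual base model's dimensions).
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in           # frozen base weights
lora_params = d_out * r + r * d_in   # trainable low-rank pair B and A

print(full_params)                # 16777216
print(lora_params)                # 131072
print(lora_params / full_params)  # 0.0078125 -> under 1% of the layer
```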

What This Approach Gets Right

Fine-tuning a pre-trained model rather than training from scratch offers clear advantages. Training time was 43 minutes instead of days or weeks.


Computational requirements were a fraction of full training. The model retained general SQL knowledge from its pre-training while learning the specific aggregation patterns the dataset introduced.

Targeted datasets improve SQL generation accuracy in ways that prompt engineering alone cannot achieve. A model that has seen hundreds of examples of correct HAVING clause usage will apply that pattern reliably. A model that has only seen it occasionally will not.

Limitations of This Approach

The fine-tuned model is dataset-specific. It performs well on queries that match the patterns in its training data but may not generalize cleanly to significantly different query structures without additional training.

As database schemas evolve, the model may need to be retrained or updated, which requires ongoing effort. The quality of the output is also influenced by the quality of the base model. If the base model has systematic weaknesses, fine-tuning cannot fully correct them without a dataset large enough to override those patterns.

Conclusion

Improving SQL queries with fine-tuning is more effective than relying on prompt engineering or generic models alone. Off-the-shelf SQL generation models often fail on aggregation-heavy queries despite producing syntactically convincing output. Fine-tuning with a focused, execution-verified dataset enables models to learn correct aggregation patterns and produce reliable SQL for specific use cases.

The investment is small: 196 samples, 43 minutes of training, and a working query where two generic models failed. For teams relying on AI-assisted SQL generation in production, targeted fine-tuning is a more reliable path than prompt engineering alone.

Frequently Asked Questions

Why do AI SQL generators fail on aggregation queries?

Generic models are trained on broad datasets that do not always encode SQL execution rules precisely. Queries involving nested counts, grouped filters, or percentage calculations require specific structural patterns that models must be explicitly taught through targeted training data.

What is LoRA and why is it used for fine-tuning?

LoRA is a parameter-efficient fine-tuning technique that adds low-rank matrices to the model's existing weights. Only these matrices are updated during training, which significantly reduces compute requirements while still adapting the model to new tasks effectively.

How much data is needed to fine-tune an SQL model?

This case used 196 samples and produced a correct, executable result. For narrow, well-defined query types, a small high-quality dataset is often more effective than a large noisy one.

Can this approach work for other SQL query types?

Yes. The same fine-tuning methodology can be applied to query optimization, schema-specific generation, data cleaning queries, and other structured SQL tasks by building a targeted dataset around the relevant patterns.

What tools are used for SQL fine-tuning?

Common tools include PyTorch, Hugging Face Transformers, PEFT for parameter-efficient methods like LoRA, and the Datasets library for data preprocessing and loading.

Author: Ajay Patel

Hi, I am an AI engineer with 3.5 years of experience passionate about building intelligent systems that solve real-world problems through cutting-edge technology and innovative solutions.

