
Improving SQL Queries with Fine-Tuning

Written by Ajay Patel
Feb 13, 2026
8 Min Read

Generating correct SQL queries is a foundational requirement in data analysis and database-driven systems. I’m writing this because AI-powered SQL generators often perform well on simple prompts but fail when queries involve grouped calculations, aggregation boundaries, or multi-table relationships. These failures are subtle, easy to miss, and costly in production.

This article examines a deceptively simple question:
“What is the percentage of orders that have more than one product?”

Although the intent is clear, answering it correctly requires understanding SQL aggregation rules, grouping semantics, and execution constraints. This post evaluates how two popular open-source models—SQLCoder2 and SQLCoder-34B—approached the problem, why their outputs failed at execution time, and how fine-tuning a model with a targeted dataset produced a correct, executable result. The goal is to show where generic SQL generation breaks down and how fine-tuning improves reliability for structured queries.

Initial Attempts with Open-Source SQL Generators

To evaluate how existing AI SQL generators handle aggregation logic, the same natural-language prompt was tested against two widely used open-source models: SQLCoder2 and SQLCoder-34B. Both are designed to translate natural language into SQL and are commonly used for analytics workflows.

Prompt used:

What is the percentage of orders that have more than one product?

This prompt tests whether the model understands grouping, aggregation order, and percentage calculation across related tables.

SQLCoder2 generated the following SQL query:

SELECT CAST(COUNT(DISTINCT CASE WHEN COUNT(shopify_orderlineitem.id) > 1 THEN shopify_order.id END) AS FLOAT) / NULLIF(COUNT(DISTINCT shopify_order.id), 0) AS proportion 
FROM shopify_order 
JOIN shopify_orderlineitem ON shopify_order.id = shopify_orderlineitem.order_id;

Execution error:

SQL Error [42803]: ERROR: aggregate function calls cannot be nested

This failure occurred because SQL does not allow aggregate functions such as COUNT() to be nested directly inside another aggregate without a grouping boundary. Identifying orders with multiple line items requires a grouped subquery or HAVING clause, which this query does not provide.
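This rule is easy to reproduce locally. The sketch below uses Python's built-in sqlite3 module and a toy version of the two tables (table and column names taken from the article's queries, everything else assumed); SQLite rejects the same nested-aggregate shape, though with different wording than PostgreSQL's 42803 error:

```python
import sqlite3

# Toy schema mirroring the tables used in the article's queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (
    id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES shopify_order(id)
);
""")

# The same illegal shape as SQLCoder2's output: COUNT() nested inside COUNT().
nested = """
SELECT COUNT(CASE WHEN COUNT(shopify_orderlineitem.id) > 1
             THEN shopify_order.id END)
FROM shopify_order
JOIN shopify_orderlineitem ON shopify_order.id = shopify_orderlineitem.order_id;
"""

try:
    conn.execute(nested)
except sqlite3.OperationalError as err:
    # SQLite enforces the same no-nesting rule at prepare time.
    print("rejected:", err)
```

The point is that this is a structural constraint of SQL aggregation, not a quirk of one database engine.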

SQLCoder-34B produced the following query:

SELECT CAST(COUNT(DISTINCT o.id) FILTER (WHERE COUNT(DISTINCT oi.product_id) > 1) AS FLOAT)
       / NULLIF(COUNT(DISTINCT o.id), 0) AS proportion
FROM shopify_order o
JOIN shopify_orderlineitem oi ON o.id = oi.order_id
GROUP BY o.id;

Execution error:

SQL Error [42803]: ERROR: aggregate functions are not allowed in FILTER

This query incorrectly places an aggregate condition inside a FILTER clause. SQL requires aggregation conditions such as COUNT(*) > 1 to be evaluated in a grouped context using HAVING, not inside a filter predicate.

Discussion of limitations of these models

The failures from both SQLCoder2 and SQLCoder-34B highlight common weaknesses in generic SQL generation models:

  1. Weak grasp of execution constraints: Both models produced constructs (a nested aggregate, an aggregate inside FILTER) that are invalid in PostgreSQL and in standard SQL generally, showing that fluent-looking output does not guarantee executable SQL.
  2. Incorrect handling of nested aggregations: The models struggled with the concept of counting orders with multiple products, resulting in invalid nested aggregations.
  3. Overcomplication of the query: Both attempts used more complex structures (like CASE statements and FILTER clauses) than necessary, indicating a potential overfit to certain query patterns.
  4. Misinterpretation of the task: The models seemed to misunderstand the core requirement of identifying orders with more than one product, leading to overly complex and incorrect solutions.
  5. Lack of error checking: Neither model was able to validate its output against common SQL syntax rules, resulting in queries that would fail upon execution.

These limitations explain why syntactically plausible SQL can still fail at runtime and why domain-specific training is often required.

Custom Training Approach

After identifying consistent structural failures in existing models, a custom fine-tuning approach was used to explicitly teach aggregation patterns and execution-safe SQL construction for this class of queries.

Preparing a Custom Dataset

A custom dataset was built around the original prompt and variations involving grouped counts, percentage calculations, and multi-table joins. Each entry paired a natural-language question with an executable SQL query that followed correct aggregation rules.

The dataset contained 196 samples, split into training and validation sets to balance learning and generalization.

  • Training set: 80% of the data (157 entries)
  • Validation set: 20% of the data (39 entries)

This split provides enough data for learning while reserving a meaningful portion for validation, helping ensure the model generalizes to new, unseen examples.
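As an illustration, an 80/20 split of this kind can be done with a seeded shuffle. The sketch below is not the project's actual pipeline; the 196-sample count comes from the article, and the helper and placeholder data are assumptions:

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then slice off the held-out portion."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# Stand-in for the 196 natural-language-question / SQL pairs.
dataset = [f"example_{i}" for i in range(196)]
train, val = train_test_split(dataset)
print(len(train), len(val))  # → 157 39
```

Seeding the shuffle keeps the split reproducible across training runs, which matters when comparing checkpoints.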

Training Process

Fine-tuning completed in 43 minutes, demonstrating the efficiency of adapting an existing model rather than training from scratch. Training and validation loss curves remained aligned, indicating stable learning without overfitting.

To assess training quality, we plotted training loss against validation loss over the course of training. The curves indicate a good fit:

  • The training loss decreased steadily over time, indicating that the model was learning from the training data.
  • The validation loss also decreased and closely followed the trend of the training loss, suggesting that the model was generalizing well to unseen data.
  • There was no significant divergence between training and validation loss, which would have indicated overfitting.

This indicates the fine-tuned model learned the target SQL patterns without simply memorizing the dataset.
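The "no significant divergence" criterion can also be made mechanical. The helper below is an illustrative sketch (the loss values are invented, not our actual training logs): it flags overfitting when validation loss both exceeds training loss by a margin and has started rising again:

```python
def diverging(train_losses, val_losses, tolerance=0.1):
    """Flag overfitting: validation loss pulling away from training loss
    while having bottomed out and started to climb."""
    gap = val_losses[-1] - train_losses[-1]
    val_rising = val_losses[-1] > min(val_losses)
    return gap > tolerance and val_rising

# Invented curves shaped like the behavior described above:
# both losses fall together, so no divergence is flagged.
train_losses = [2.1, 1.4, 0.9, 0.6, 0.45]
val_losses   = [2.2, 1.5, 1.0, 0.7, 0.52]
print(diverging(train_losses, val_losses))  # → False
```

A check like this can run after each evaluation step to trigger early stopping instead of relying on eyeballing the plot.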


By taking this custom training approach, we were able to address the limitations of existing models and create a solution tailored to our specific SQL generation needs. The next section will reveal the results of this training and the SQL query our model was able to produce.

Post-Training Results

After our custom training process, we achieved a significant improvement in the model's ability to generate the correct SQL query for our specific problem. Let's examine the successful query and analyze its structure and functionality.

Presentation of the successful SQL query

The fine-tuned model generated the following executable SQL:

select
    (COUNT(*) * 100.0 / (
    select
        COUNT(*)
    from
        shopify_order))
from
    shopify_order
where
    id in (
    select
        order_id
    from
        shopify_orderlineitem
    group by
        order_id
    having
        COUNT(*) > 1);

This query correctly isolates orders with multiple products before calculating the overall percentage.
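We can sanity-check this query on a small in-memory database. The schema and rows below are toy assumptions, not the article's real data; the query itself is the model's output, lightly reflowed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO shopify_order VALUES (1), (2), (3), (4), (5);
-- Orders 1 and 2 have more than one line item; 3, 4, 5 have one each.
INSERT INTO shopify_orderlineitem (order_id) VALUES (1),(1),(2),(2),(3),(4),(5);
""")

model_query = """
SELECT (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM shopify_order))
FROM shopify_order
WHERE id IN (
    SELECT order_id
    FROM shopify_orderlineitem
    GROUP BY order_id
    HAVING COUNT(*) > 1
);
"""

pct = conn.execute(model_query).fetchone()[0]
print(pct)  # → 40.0 (2 of 5 orders have multiple products)
```

The grouped subquery resolves first, so aggregation happens exactly once per level, which is why this version executes where the earlier attempts failed.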

Analysis of the query's structure and functionality

1. Overall Structure:

    • The main query calculates the percentage
    • Subquery in WHERE clause identifies orders with multiple products

2. Key Features:

    • Uses COUNT(*) for efficient counting
    • Converts ratio to percentage with * 100.0
    • Avoids nested aggregations that caused previous errors

3. Functionality:

    • Correctly identifies orders with more than one product
    • Calculates the percentage in one pass (note: the denominator subquery returns zero on an empty orders table, so a NULLIF guard would make it fully division-safe)
    • Solves the original problem accurately

4. Improvements:

    • Overcomes issues from SQLCoder2 and SQLCoder-34B attempts
    • Demonstrates proper handling of subqueries and aggregations

5. Potential Optimization:

    • For very large datasets, consider replacing the IN clause with EXISTS or a JOIN for better performance

This query successfully solves the problem, showing the model's improved SQL comprehension and ability to handle complex queries without common errors.
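The optimization noted in point 5 can be sketched as follows. This is an illustrative rewrite on toy data, not a benchmarked claim: it replaces the IN list with a correlated EXISTS so the outer query checks each order directly instead of materializing the full list of qualifying order IDs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
CREATE TABLE shopify_orderlineitem (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO shopify_order VALUES (1), (2), (3), (4);
-- Orders 1 and 2 have two line items; 3 and 4 have one each.
INSERT INTO shopify_orderlineitem (order_id) VALUES (1),(1),(2),(2),(3),(4);
""")

# EXISTS rewrite of the IN-based subquery from the model's output.
exists_query = """
SELECT (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM shopify_order))
FROM shopify_order o
WHERE EXISTS (
    SELECT 1
    FROM shopify_orderlineitem oi
    WHERE oi.order_id = o.id
    GROUP BY oi.order_id
    HAVING COUNT(*) > 1
);
"""
pct = conn.execute(exists_query).fetchone()[0]
print(pct)  # → 50.0 (2 of 4 orders have multiple products)
```

Both forms return the same result; whether EXISTS actually wins depends on indexes and the planner, so it is worth comparing query plans before committing to either.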

Technical Details

Overview of training methodology

A transfer-learning approach was used, combining Hugging Face Transformers, PEFT, and LoRA to adapt the model efficiently. LoRA reduced trainable parameters while preserving the base model’s general knowledge.

Tools and libraries used

We utilized a robust stack of modern machine-learning tools and libraries for this project:

1. PyTorch: As our primary deep learning framework, PyTorch provided the foundation for model manipulation and training.

2. Hugging Face Transformers: This library offered easy access to state-of-the-art pre-trained models and tools for fine-tuning.

3. PEFT (Parameter-Efficient Fine-Tuning): A library from Hugging Face that implements various efficient fine-tuning methods, including LoRA.

4. Datasets: Another Hugging Face library used for efficient data handling and preprocessing.

5. Accelerate: This library was used to enable distributed training and mixed precision, optimizing our training process.

Discussion of the LoRA (Low-Rank Adaptation) technique

LoRA is an efficient fine-tuning technique introduced in 2021 by Microsoft researchers. It addresses the challenge of adapting large language models to specific tasks without the need to retrain all parameters, which can be computationally expensive and prone to overfitting.

Key Concepts:

1. Low-Rank Matrices: LoRA adds pairs of rank-decomposition matrices alongside the weights of the original model. One matrix in each pair is initialized to zero (the other to small random values), so their product starts at zero and training begins exactly from the pre-trained weights.

2. Trainable Parameters: During fine-tuning, only these low-rank matrices are updated, while the original model weights remain frozen. This significantly reduces the number of trainable parameters.

3. Adaptability: Despite the reduced parameter count, LoRA can adapt the model to new tasks effectively, often matching or exceeding the performance of full fine-tuning.

For more information, see the original paper, “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., 2021).
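The mechanics in points 1 and 2 can be shown in miniature. The pure-Python sketch below is a toy illustration, not how the PEFT library implements LoRA internally: the effective weight is W + (alpha/r)·BA, and because B starts at zero, the adapted model is initially identical to the pre-trained one while only the small A and B matrices would receive gradients:

```python
import random

def matmul(X, Y):
    """Tiny dense matrix multiply for the illustration."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 16, 2, 4          # model dim, LoRA rank (r << d), scaling factor
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weight
A = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # trainable, random init
B = [[0.0] * r for _ in range(d)]                               # trainable, zero init

def effective_weight(W, B, A):
    """W_eff = W + (alpha / r) * B @ A, the LoRA-adapted weight."""
    delta = matmul(B, A)  # d x d low-rank update
    return [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
            for i in range(d)]

# Zero-initialized B means the adapted weight starts exactly at W ...
print(effective_weight(W, B, A) == W)  # → True
# ... while the trainable parameter count drops from d*d to 2*d*r.
print(d * d, d * r + r * d)            # → 256 64
```

At realistic model dimensions the reduction is far more dramatic than this 4x toy case, which is what makes LoRA fine-tuning fit on modest hardware.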

Lessons Learned

Importance of custom datasets for specific use cases

Targeted datasets significantly improve SQL generation accuracy by teaching execution-safe patterns. Fine-tuning allows models to internalize aggregation boundaries that generic models often misapply.

Benefits of fine-tuning existing models 

Fine-tuning pre-trained models, rather than training from scratch, offers several advantages:

1. Reduced training time (our process took only 43 minutes)

2. Lower computational resource requirements

3. Ability to leverage the general knowledge of large language models while adapting to specific tasks

4. Improved performance on niche problems that general models might struggle with

These lessons underscore the value of customization in AI development, especially for specialized applications like SQL query generation. By combining the power of existing models with targeted fine-tuning, we can create more effective and efficient AI tools for specific data analysis needs.

Suggested read: How To Use Open Source LLMs (Large Language Models)

Future Improvements and Research Directions

As we reflect on our journey of fine-tuning an AI model for SQL query generation, it's clear that while we've made significant progress, there's still room for improvement and expansion. This section explores potential enhancements to our model and discusses how our approach could be applied to other SQL-related tasks.


Potential ways to enhance the model further

1. Expanding the training dataset:

    • Incorporate a wider variety of SQL queries, covering more complex scenarios and edge cases.
    • Include queries from different database systems (e.g., MySQL, Oracle) to improve versatility.

2. Fine-tuning hyperparameters:

    • Experiment with different learning rates, batch sizes, and epochs to optimize model performance.

3. Incorporating user feedback:

    • Develop a system to collect and integrate user feedback on generated queries, creating a continuous improvement loop.

Applying the approach to other SQL-related tasks

1. Query optimization:

  • Adapt the model to suggest performance improvements for existing queries.
  • Train on execution plans to recommend optimal indexing strategies.

2. Schema design:

  • Extend the model's capabilities to suggest database schema improvements based on query patterns.

3. SQL to natural language:

  • Train the model to generate human-readable explanations of complex SQL queries.

4. Data cleaning and preprocessing:

  • Adapt the model to generate SQL for common data cleaning tasks, such as handling null values or standardizing formats.

5. Temporal and spatial data queries:

  • Expand the model's capabilities to handle specialized queries for time-series or geographic data.

6. Security and access control:

  • Train the model to generate SQL that adheres to specific security policies or role-based access controls.

By pursuing these improvements and exploring new applications, we can continue to push the boundaries of AI-assisted SQL query generation. 

Limitations of Our Approach

While effective for this query class, the approach remains dataset-specific, requires maintenance as schemas evolve, and depends on the quality of the base model.

1. Dataset Specificity: The model was trained on a custom dataset tailored to this particular problem. This means it may not generalize well to significantly different SQL query tasks without additional training.

2. Time and Resource Intensive: Creating a custom dataset and training the model required significant time and computational resources. This approach may not be feasible for every SQL query challenge, especially in time-sensitive situations.

3. Limited Scope: Our solution addresses a specific type of SQL query. It may not be equipped to handle a wide range of SQL operations or complex database schemas without further training.

4. Maintenance and Updates: As database structures or requirements change, the model may need to be retrained or fine-tuned, which requires ongoing effort and expertise.

5. Dependency on Initial Models: Our approach is built upon existing models (SQLCoder2). The quality and limitations of these base models influence the final output, even after custom training.

Conclusion

This work demonstrates that off-the-shelf SQL generation models often fail on aggregation-heavy queries despite producing syntactically convincing output. Fine-tuning with a focused dataset enables models to learn correct execution patterns and produce reliable SQL for specific use cases.

AI-assisted SQL generation benefits most when paired with domain knowledge, validation, and targeted training, making fine-tuning a practical path to production-ready accuracy rather than relying solely on generic models.

Author: Ajay Patel

Hi, I am an AI engineer with 3.5 years of experience passionate about building intelligent systems that solve real-world problems through cutting-edge technology and innovative solutions.
