
AI-powered SQL generators perform well on simple prompts but often break on queries that involve grouped calculations, aggregation boundaries, or multi-table relationships. These failures are subtle, easy to miss, and costly in production.
This article examines one deceptively simple question: "What is the percentage of orders that have more than one product?" It walks through why two popular open-source models failed to answer it correctly, and how fine-tuning a model on a targeted dataset produced a correct, executable result.
Answering a percentage-based aggregation question in SQL requires more than translating words into syntax. It requires understanding grouping semantics, aggregation order, and execution constraints. Generic SQL generation models are trained on broad datasets that do not always encode these rules precisely enough for edge cases.
The result is queries that look correct at a glance but fail the moment they hit a database engine.
"What is the percentage of orders that have more than one product?"

This prompt tests whether a model understands how to group orders by ID, count line items per order, filter for orders with more than one item, and then calculate a percentage across the full order table. It requires a subquery or HAVING clause; it cannot be solved with a flat aggregation.
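The grouping-and-filtering step can be sketched on a toy version of the schema (table and column names assumed from the queries in this article), using SQLite as a stand-in engine:

```python
import sqlite3

# In-memory stand-in for the assumed Shopify schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
    CREATE TABLE shopify_orderlineitem (
        id INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES shopify_order(id)
    );
    INSERT INTO shopify_order (id) VALUES (1), (2), (3);
    -- Order 1 has two line items; orders 2 and 3 have one each.
    INSERT INTO shopify_orderlineitem (id, order_id)
        VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

# Step one of the required logic: group line items by order
# and keep only orders with more than one item.
multi_item_orders = conn.execute("""
    SELECT order_id
    FROM shopify_orderlineitem
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()
print(multi_item_orders)  # [(1,)]
```

The percentage step then counts these qualifying orders against the full order table, which is exactly where both models below go wrong.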
SQLCoder2 produced this query:
```sql
SELECT CAST(COUNT(DISTINCT CASE WHEN COUNT(shopify_orderlineitem.id) > 1
THEN shopify_order.id END) AS FLOAT) / NULLIF(COUNT(DISTINCT shopify_order.id), 0) AS proportion
FROM shopify_order
JOIN shopify_orderlineitem ON shopify_order.id = shopify_orderlineitem.order_id;
```

Execution error:

```
SQL Error [42803]: ERROR: aggregate function calls cannot be nested
```

The problem is structural. SQL does not allow aggregate functions like COUNT() to be nested directly inside another aggregate without a grouping boundary. To identify orders with multiple line items, you need a grouped subquery or a HAVING clause. This query provides neither.
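This structural rule is easy to reproduce locally. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the error wording differs between engines, but a directly nested aggregate is likewise rejected at prepare time, before any rows are read:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
    CREATE TABLE shopify_orderlineitem (id INTEGER PRIMARY KEY, order_id INTEGER);
""")

# A simplified form of the generated query, with one COUNT nested
# directly inside another and no grouping boundary between them.
nested_aggregate = """
    SELECT COUNT(CASE WHEN COUNT(oi.id) > 1 THEN o.id END)
    FROM shopify_order o
    JOIN shopify_orderlineitem oi ON o.id = oi.order_id
"""
try:
    conn.execute(nested_aggregate)
    outcome = "accepted"
except sqlite3.OperationalError as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```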
SQLCoder-34B produced this query:
```sql
SELECT CAST(COUNT(DISTINCT o.id) filter (
WHERE COUNT(DISTINCT oi.product_id) > 1) AS FLOAT)
/ NULLIF(COUNT(DISTINCT o.id), 0) AS proportion
FROM shopify_order o
JOIN shopify_orderlineitem oi ON o.id = oi.order_id
GROUP BY o.id;
```

Execution error:

```
SQL Error [42803]: ERROR: aggregate functions are not allowed in FILTER
```

This query places an aggregate condition inside a FILTER clause, which SQL does not permit. Conditions like COUNT(*) > 1 must be evaluated in a grouped context using HAVING, not inside a filter predicate. The larger model made a different structural mistake, but the result was the same: a query that cannot execute.
Both models produced syntactically plausible SQL that failed at runtime. The core issues were:
- Nesting aggregate functions without the grouping boundary a subquery or HAVING clause provides
- Reaching for CASE statements and FILTER clauses when simpler patterns would have worked

These are not random errors. They reflect systematic gaps in how generic SQL generation models handle multi-step aggregation logic.
Improving SQL queries with fine-tuning starts with identifying the exact failure patterns. After seeing consistent structural errors in both models, a custom fine-tuning approach was used to explicitly teach correct aggregation patterns for this class of queries.
A custom dataset was built around the original prompt and variations involving grouped counts, percentage calculations, and multi-table joins. Each entry paired a natural-language question with an executable SQL query that followed correct aggregation rules.
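The article does not show the record format, but a training entry of this kind is typically a question/SQL pair. A hypothetical JSON-lines record (format assumed, not taken from the original dataset) might look like:

```python
import json

# Hypothetical training record: a natural-language question paired with
# an execution-verified SQL answer following correct aggregation rules.
record = {
    "question": "What is the percentage of orders that have more than one product?",
    "sql": (
        "SELECT (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM shopify_order)) "
        "FROM shopify_order WHERE id IN (SELECT order_id "
        "FROM shopify_orderlineitem GROUP BY order_id HAVING COUNT(*) > 1);"
    ),
}
line = json.dumps(record)          # one line per sample in a .jsonl file
parsed = json.loads(line)
print(sorted(parsed.keys()))       # ['question', 'sql']
```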
The dataset contained 196 samples, split into training and validation sets.
Fine-tuning completed in 43 minutes using a transfer-learning approach with Hugging Face Transformers, PEFT, and LoRA. LoRA reduced the number of trainable parameters while preserving the base model's general knowledge.
Training and validation loss curves remained closely aligned throughout, indicating stable learning without overfitting. The training loss decreased steadily, and the validation loss followed the same trend without significant divergence. This confirmed the model was generalizing to new examples rather than memorizing the training set.
The fine-tuned model produced this query:

```sql
select
  (COUNT(*) * 100.0 / (
    select
      COUNT(*)
    from
      shopify_order))
from
  shopify_order
where
  id in (
    select
      order_id
    from
      shopify_orderlineitem
    group by
      order_id
    having
      COUNT(*) > 1);
```

This query executes correctly.
The subquery in the WHERE clause first identifies orders that have more than one line item using GROUP BY and HAVING. The outer query then counts how many orders meet that condition and divides by the total order count to produce a percentage. The * 100.0 ensures the result is returned as a percentage rather than a decimal ratio.
The fine-tuned model avoided nested aggregations entirely. Instead of trying to count inside a count, it separated the two concerns: identifying qualifying orders in a subquery, then calculating the proportion in the outer query.
This is the correct pattern for this class of problem, and it is exactly what the custom training dataset was designed to teach. For very large datasets, replacing the IN clause with EXISTS or a JOIN would improve performance, but the logic is correct as written.
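As a sanity check, the fine-tuned model's query can be executed against a small in-memory SQLite database (toy data; the schema is assumed from the queries above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shopify_order (id INTEGER PRIMARY KEY);
    CREATE TABLE shopify_orderlineitem (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO shopify_order (id) VALUES (1), (2), (3), (4);
    -- Order 1 has two line items; the other three orders have one each.
    INSERT INTO shopify_orderlineitem (id, order_id)
        VALUES (10, 1), (11, 1), (12, 2), (13, 3), (14, 4);
""")

# The fine-tuned model's pattern: a grouped subquery identifies
# qualifying orders, and the outer query computes the percentage.
query = """
    SELECT (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM shopify_order))
    FROM shopify_order
    WHERE id IN (
        SELECT order_id
        FROM shopify_orderlineitem
        GROUP BY order_id
        HAVING COUNT(*) > 1
    )
"""
result = conn.execute(query).fetchone()[0]
print(result)  # 25.0 — one of four orders has more than one line item
```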
The fine-tuning pipeline was built on Hugging Face Transformers, PEFT, and LoRA.
LoRA (Low-Rank Adaptation) is a fine-tuning technique developed by Microsoft in 2021. It adds pairs of low-rank matrices to the weights of the original model. Only these matrices are updated during training, while the base model weights remain frozen.
This significantly reduces the number of trainable parameters without sacrificing task-specific performance. It is what made a 43-minute fine-tuning run practical rather than requiring days of GPU compute.
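A back-of-the-envelope sketch (illustrative dimensions, not the actual model's) shows why the reduction is so large. A frozen d × k weight matrix W is adapted as W + BA, where B is d × r, A is r × k, and only B and A are trained:

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in the low-rank pair: B (d x r) plus A (r x k)."""
    return d * r + r * k

# Illustrative sizes: one square 4096 x 4096 projection, LoRA rank 8.
d = k = 4096
r = 8
full = d * k                          # parameters in the frozen weight W
lora = lora_trainable_params(d, k, r)  # parameters actually updated
print(full, lora, round(100 * lora / full, 2))  # 16777216 65536 0.39
```

For this single matrix, LoRA trains well under 1% of the parameters a full update would touch, which is what makes a 43-minute run feasible on modest hardware.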
Fine-tuning a pre-trained model rather than training from scratch offers clear advantages. Training time was 43 minutes instead of days or weeks.
Computational requirements were a fraction of full training. The model retained general SQL knowledge from its pre-training while learning the specific aggregation patterns the dataset introduced.
Targeted datasets improve SQL generation accuracy in ways that prompt engineering alone cannot achieve. A model that has seen hundreds of examples of correct HAVING clause usage will apply that pattern reliably. A model that has only seen it occasionally will not.
The fine-tuned model is dataset-specific. It performs well on queries that match the patterns in its training data but may not generalize cleanly to significantly different query structures without additional training.
As database schemas evolve, the model may need to be retrained or updated, which requires ongoing effort. The quality of the output is also influenced by the quality of the base model. If the base model has systematic weaknesses, fine-tuning cannot fully correct them without a dataset large enough to override those patterns.
Improving SQL queries with fine-tuning is more effective than relying on prompt engineering or generic models alone. Off-the-shelf SQL generation models often fail on aggregation-heavy queries despite producing syntactically convincing output. Fine-tuning with a focused, execution-verified dataset enables models to learn correct aggregation patterns and produce reliable SQL for specific use cases.
The investment is small: 196 samples, 43 minutes of training, and a working query where two generic models failed. For teams relying on AI-assisted SQL generation in production, targeted fine-tuning is a more reliable path than prompt engineering alone.
Generic models are trained on broad datasets that do not always encode SQL execution rules precisely. Queries involving nested counts, grouped filters, or percentage calculations require specific structural patterns that models must be explicitly taught through targeted training data.
LoRA is a parameter-efficient fine-tuning technique that adds low-rank matrices to the model's existing weights. Only these matrices are updated during training, which significantly reduces compute requirements while still adapting the model to new tasks effectively.
This case used 196 samples and produced a correct, executable result. For narrow, well-defined query types, a small high-quality dataset is often more effective than a large noisy one.
The same fine-tuning methodology can be applied to query optimization, schema-specific generation, data cleaning queries, and other structured SQL tasks by building a targeted dataset around the relevant patterns.
Common tools include PyTorch, Hugging Face Transformers, PEFT for parameter-efficient methods like LoRA, and the Datasets library for data preprocessing and loading.