
Chain of draft: Thinking faster by writing less

Mar 17, 2025 · 12 min read
Written by Rabbani Shaik

Chain of Draft is a concise reasoning prompting strategy inspired by human cognitive processes.

Paper: https://arxiv.org/html/2502.18600v1#abstract

Key Points

- The research paper suggests that the Chain of Draft (CoD) prompting strategy improves LLM reasoning efficiency by using fewer tokens while maintaining accuracy.

- CoD performs nearly as well as Chain of Thought (CoT) prompting in many tasks.

- The evidence shows that CoD reduces latency and cost, making it practical for real-world applications.

Overview

The paper "Chain of Draft: Thinking Faster by Writing Less" explores a new prompting method called Chain of Draft (CoD) for Large Language Models (LLMs). CoD aims to make reasoning tasks more efficient by encouraging concise intermediate thoughts, similar to how humans think. This approach is compared to the traditional Chain of Thought (CoT) method, which is more verbose.

Methodology

CoD prompts LLMs to think step by step but keeps each step short, ideally around 5 words. For example, for a problem like "Jason had 20 lollipops, gave some to Denny, and now has 12, how many did he give?", CoD might reason as "20 - x = 12; x = 20 - 12 = 8," using fewer words than CoT. 

Detailed Analysis

The paper "Chain of Draft: Thinking Faster by Writing Less" by Silei Xu, Wenhao Xi, Lingxiao Zhao, and Pengcheng He, introduces a novel prompting strategy for Large Language Models (LLMs) called Chain of Draft (CoD). This method aims to enhance the efficiency of reasoning tasks by reducing verbosity while maintaining or improving accuracy, drawing inspiration from human cognitive processes. Below, we delve into the methodology, experimental findings, and implications, providing a comprehensive overview for readers interested in LLM advancements

1. The Research Paper Suggests That the Chain of Draft (CoD) Prompting Strategy Improves LLM Reasoning Efficiency by Using Fewer Tokens While Maintaining Accuracy

The paper introduces Chain of Draft (CoD) as a novel prompting strategy designed to enhance the reasoning efficiency of Large Language Models (LLMs). Unlike traditional methods that encourage verbose outputs, CoD instructs models to produce concise intermediate reasoning steps, typically limited to around five words per step. 

This approach is inspired by human cognition, where people often draft minimal mental notes to solve problems efficiently rather than elaborating every thought in detail. The research demonstrates that this brevity reduces the number of tokens (the units of text processed by LLMs) required to arrive at a solution.

For example, in the GSM8k arithmetic task, CoD used approximately 43.9 tokens with GPT-4o and 39.8 tokens with Claude 3.5 Sonnet, compared to CoT's 205.1 and 190.0 tokens, respectively, a reduction of about 80%. Despite this significant decrease in token usage, CoD maintains high accuracy levels, achieving 91.1%-91.4% on GSM8k compared to CoT's 95.4%-95.8%.

This suggests that CoD preserves the essential reasoning process while eliminating unnecessary verbosity, making it a more efficient alternative. The paper argues that this token efficiency not only speeds up processing but also lowers computational resource demands, a critical factor for scaling LLM applications.
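
As a quick sanity check on the roughly 80% figure, here is a minimal Python sketch that recomputes the reduction from the GSM8k token counts quoted above (the numbers come from the paper; the dictionary keys are just labels):

```python
# Average completion tokens on GSM8k as reported in the paper.
cot_tokens = {"GPT-4o": 205.1, "Claude 3.5 Sonnet": 190.0}
cod_tokens = {"GPT-4o": 43.9, "Claude 3.5 Sonnet": 39.8}

for model, cot in cot_tokens.items():
    reduction = 1 - cod_tokens[model] / cot
    print(f"{model}: {reduction:.1%} fewer tokens")
# GPT-4o: 78.6% fewer tokens
# Claude 3.5 Sonnet: 79.1% fewer tokens
```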

2. CoD Performs Nearly as Well as Chain of Thought (CoT) Prompting in Many Tasks

The study compares CoD directly with Chain of Thought (CoT) prompting, a widely used method that prompts LLMs to generate detailed, step-by-step reasoning traces. Across a variety of tasks—including arithmetic (GSM8k), commonsense reasoning (date and sports understanding from BIG-bench), and symbolic reasoning (coin flip)—CoD demonstrates performance that is closely competitive with, and in some cases surpasses, CoT. 


For instance, in the sports understanding task, CoD achieved higher accuracy than CoT (98.3% vs. 95.9% for GPT-4o and 97.3% vs. 93.2% for Claude 3.5 Sonnet), showing that concise reasoning can sometimes enhance clarity and correctness. In the coin flip task, both methods reached 100% accuracy, but CoD did so with far fewer tokens (16.8-18.9 vs. 52.4-135.3). 

Even in arithmetic tasks like GSM8k, where CoD’s accuracy (91.1%-91.4%) is slightly lower than CoT’s (95.4%-95.8%), the gap is small enough to suggest that CoD retains most of CoT’s problem-solving capability. The paper posits that this near-equivalence stems from CoD’s ability to distill reasoning into its most critical components, mirroring how humans prioritize key insights over exhaustive explanations. 

This adaptability across task types highlights CoD’s versatility, positioning it as a viable alternative to CoT without significant trade-offs in effectiveness.

3. The Evidence Shows That CoD Reduces Latency and Cost, Making It Practical for Real-World Applications

One of CoD’s standout advantages is its ability to reduce both latency (the time taken to generate a response) and computational cost, which are crucial for deploying LLMs in practical settings. 

The experimental results provide concrete evidence: on GSM8k, CoD cut latency from 4.2 seconds (CoT) to 1.0 second with GPT-4o and from 3.1 seconds to 1.6 seconds with Claude 3.5 Sonnet—a reduction of 76.2% and 48.4%, respectively. Similar trends appear in other tasks, such as the coin flip problem, where latency dropped from 1.4 seconds (CoT) to 0.8 seconds (CoD) with GPT-4o. 
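
The reported percentages follow directly from those latency figures; here is a minimal sketch of the arithmetic (figures as quoted from the paper):

```python
# Average GSM8k latency in seconds as reported in the paper.
cot_latency = {"GPT-4o": 4.2, "Claude 3.5 Sonnet": 3.1}
cod_latency = {"GPT-4o": 1.0, "Claude 3.5 Sonnet": 1.6}

for model, cot in cot_latency.items():
    reduction = 1 - cod_latency[model] / cot
    print(f"{model}: {reduction:.1%} lower latency")
# GPT-4o: 76.2% lower latency
# Claude 3.5 Sonnet: 48.4% lower latency
```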

This speed improvement is directly tied to the lower token count, as fewer tokens mean less processing time and reduced API call expenses in cloud-based LLM systems. For real-world applications like chatbots, educational tools, or real-time decision-making systems where rapid responses and cost efficiency are paramount, CoD offers a compelling advantage. 

The paper emphasizes that these savings do not come at the expense of transparency, as CoD still provides a clear reasoning trace, unlike standard prompting, which often skips steps and risks errors. 

By balancing performance with practicality, CoD emerges as a method that could broaden the accessibility and scalability of LLM technologies, particularly in resource-constrained or latency-sensitive environments.

Experimental Results

The study tested CoD on tasks like arithmetic, commonsense reasoning, and symbolic problems using models like GPT-4o and Claude 3.5 Sonnet. Results show CoD uses far fewer tokens and reduces latency while often matching or exceeding CoT's accuracy. 

For instance, on the GSM8k arithmetic task, CoD used about 20% of CoT's tokens with slightly lower but still high accuracy.


Implications

This efficiency makes CoD promising for applications where speed and cost matter, like real-time systems. The paper also suggests future research to combine CoD with other methods for even better results.

Background and Motivation

The paper begins by highlighting recent advances in reasoning models, such as OpenAI o1 and DeepSeek R1, which rely on structured reasoning frameworks like Chain of Thought (CoT). CoT prompting encourages LLMs to generate verbose, step-by-step reasoning, which, while effective, demands significant computational resources and increases latency. The authors note that humans typically employ a more efficient strategy, drafting concise intermediate thoughts that capture essential information. This observation motivates the development of CoD, which seeks to mimic this human approach to reduce cost and latency without sacrificing performance.

Methodology: Chain of Draft Prompting

CoD is described as a paradigm where LLMs generate minimalistic yet informative intermediate reasoning outputs. The prompting strategy instructs the model to "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####." This guideline, while not strictly enforced, aims to limit each reasoning step to approximately 5 words, ensuring conciseness.
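
To make this concrete, here is a minimal sketch of how the paper's CoD instruction could be used as a system prompt with the OpenAI Python SDK. The system prompt text is the one quoted above; the model name and example question are illustrative, not prescribed by the paper:

```python
from openai import OpenAI

# The CoD instruction quoted from the paper, used as a system prompt.
COD_SYSTEM_PROMPT = (
    "Think step by step, but only keep a minimum draft for each thinking step, "
    "with 5 words at most. Return the answer at the end of the response "
    "after a separator ####."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": COD_SYSTEM_PROMPT},
        {"role": "user", "content": (
            "Jason had 20 lollipops. He gave Denny some lollipops. "
            "Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?"
        )},
    ],
)
print(response.choices[0].message.content)
# A typical CoD-style response: "20 - x = 12; x = 20 - 12 = 8 #### 8"
```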

An example provided in the paper illustrates this approach. For the problem "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?", the responses are:

- **Standard Prompting:** Directly answers "8," lacking transparency and prone to hallucination.

- **CoT Prompting:** Verbose, e.g., "Initially, Jason had 20 lollipops... 20 - 12 = 8," using around 200 tokens, with higher latency.

- **CoD Prompting:** Concise, e.g., "20 - x = 12; x = 20 - 12 = 8," using about 40 tokens, maintaining transparency and correctness.

This example underscores CoD's ability to balance brevity with clarity, reducing the token count significantly compared to CoT while still showing the reasoning process.
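
Because CoD places the final answer after a separator, extracting it for automated scoring is straightforward. A small helper (assuming the #### separator from the prompt above) might look like this:

```python
def extract_final_answer(response: str, separator: str = "####") -> str:
    """Return the text after the last separator, or the whole response if it is absent."""
    if separator in response:
        return response.rsplit(separator, 1)[1].strip()
    return response.strip()

print(extract_final_answer("20 - x = 12; x = 20 - 12 = 8 #### 8"))  # -> "8"
```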

Experimental Evaluation

The experiments evaluate CoD, CoT, and standard prompting across three categories of reasoning tasks: arithmetic (GSM8k dataset, 8,500 problems), commonsense (date and sports understanding from BIG-bench), and symbolic reasoning (coin flip task). The evaluation used two models: GPT-4o (gpt-4o-2024-08-06) from OpenAI and Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) from Anthropic.
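
For readers who want to reproduce this kind of comparison, a minimal evaluation harness could look like the sketch below. This is not the paper's code: `ask_llm` is a hypothetical wrapper around whichever API you use (returning the answer text and completion token count), and `dataset` is a list of (question, gold_answer) pairs such as GSM8k items:

```python
import time

def evaluate(system_prompt, dataset, ask_llm):
    """Measure accuracy, average tokens, and average latency for one prompting strategy."""
    correct, total_tokens, total_latency = 0, 0, 0.0
    for question, gold_answer in dataset:
        start = time.time()
        answer, completion_tokens = ask_llm(system_prompt, question)
        total_latency += time.time() - start
        total_tokens += completion_tokens
        correct += int(gold_answer in answer)
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "avg_tokens": total_tokens / n,
        "avg_latency_s": total_latency / n,
    }
```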


The results, summarized in the following tables, highlight CoD's performance in terms of accuracy, token usage, and latency:

Table 1: GSM8k (Arithmetic Reasoning)

Table 2: Date Understanding (Commonsense Reasoning)

Table 3: Sports Understanding (Commonsense Reasoning)

Table 4: Coin Flip (Symbolic Reasoning)

Source: arxiv.org

These tables reveal that CoD consistently uses fewer tokens and has lower latency compared to CoT. For instance, on the GSM8k task, CoD reduces token usage by approximately 80% (e.g., 43.9 tokens for GPT-4o vs. 205.1 for CoT) and latency by 76.2% for GPT-4o and 48.4% for Claude 3.5 Sonnet. Accuracy is generally comparable, with CoD achieving 91.1%-91.4% on GSM8k compared to CoT's 95.4%-95.8%, a slight decrease. However, in tasks like sports understanding, CoD outperforms CoT, with accuracies of 98.3% for GPT-4o and 97.3% for Claude 3.5 Sonnet, compared to CoT's 95.9% and 93.2%, respectively. For the coin flip task, both CoD and CoT achieve 100% accuracy, but CoD uses far fewer tokens.

Below are some of our experiments:

Chain of Thought (CoT):

Prompt:
Find the sum of the given polynomials in the given polynomial ring. f(x) = 4x - 5, g(x) = 2x^2 - 4x + 2 in Z_8[x]. [ "2x^2 + 5", "6x^2 + 4x + 6", "0", "x^2 + 1" ] give me the answer in option, a, b, c, d


Thoughts:
I need to add two polynomials f(x) and g(x) in the polynomial ring Z_8[x].
f(x) = 4x - 5
 g(x) = 2x^2 - 4x + 2
To add polynomials, we simply add the coefficients of corresponding terms. Since we're working in Z_8[x], all the coefficients and arithmetical operations are done modulo 8.
Let's add the polynomials term by term:
First, the constant terms:
 (-5) + 2 = -3 ≡ 5 (mod 8)
Next, the coefficients of x:
 4 + (-4) = 0
Finally, the coefficient of x^2:
 0 + 2 = 2
So, (f + g)(x) = 2x^2 + 0x + 5 = 2x^2 + 5
Now, let's check this against the given options:
a) 2x^2 + 5
 b) 6x^2 + 4x + 6
 c) 0
 d) x^2 + 1
Our result is 2x^2 + 5, which matches option a).
Actually let me double-check this by working carefully in Z_8[x].
f(x) = 4x - 5
 First, I need to convert -5 to its representation in Z_8.
 -5 ≡ -5 + 8 ≡ 3 (mod 8)
 So f(x) = 4x + 3 in Z_8[x]
g(x) = 2x^2 - 4x + 2
 Let's verify that all coefficients are already in the proper form for Z_8:
 2 is already in the range [0, 7]
 -4 ≡ -4 + 8 ≡ 4 (mod 8) (actually this is not right; -4 ≡ 4 (mod 8))
 2 is already in the range [0, 7]
So g(x) = 2x^2 + 4x + 2 in Z_8[x] (I made a mistake earlier)
Now, let's add:
 (f + g)(x) = (4x + 3) + (2x^2 + 4x + 2) = 2x^2 + (4 + 4)x + (3 + 2) = 2x^2 + 8x + 5
In Z_8, 8 ≡ 0 (mod 8), so this simplifies to:
 (f + g)(x) = 2x^2 + 0x + 5 = 2x^2 + 5
So the answer is option a) 2x^2 + 5.
Wait, let me double-check my calculation of -5 in Z_8:
 -5 ≡ 3 (mod 8) is correct because:
 -5 + 8 = 3
 and 3 is in the range [0, 7]
And let's double-check -4 in Z_8:
 -4 ≡ 4 (mod 8) is correct because:
 -4 + 8 = 4
 and 4 is in the range [0, 7]
So f(x) = 4x + 3 and g(x) = 2x^2 + 4x + 2 in Z_8[x].
And their sum is:
 (f + g)(x) = 2x^2 + (4 + 4)x + (3 + 2) = 2x^2 + 8x + 5 = 2x^2 + 0x + 5 = 2x^2 + 5
The answer is option a) 2x^2 + 5.
Actually, let me triple-check this.
f(x) = 4x - 5
 Since we are working in Z_8, we need to represent -5 as an element of Z_8.
 In Z_8, -5 ≡ 3 (mod 8), because -5 + 8 = 3 and 3 is in the range [0, 7].
 So f(x) = 4x + 3 in Z_8[x].
g(x) = 2x^2 - 4x + 2
 Similarly, we need to represent -4 as an element of Z_8.
 In Z_8, -4 ≡ 4 (mod 8), because -4 + 8 = 4 and 4 is in the range [0, 7].
 So g(x) = 2x^2 + 4x + 2 in Z_8[x].
Now, we add the polynomials:
 (f + g)(x) = (4x + 3) + (2x^2 + 4x + 2)
 = 2x^2 + (4 + 4)x + (3 + 2)
 = 2x^2 + 8x + 5
In Z_8, 8 ≡ 0 (mod 8), so:
 (f + g)(x) = 2x^2 + 0x + 5
 = 2x^2 + 5
So the answer is indeed option a) 2x^2 + 5.
Answer:
To find the sum of the polynomials in Z₈[x], I need to add the corresponding terms and reduce the coefficients modulo 8.
Given:
f(x) = 4x - 5
g(x) = 2x² - 4x + 2
Solution
First, I'll convert the polynomials to their proper form in Z₈[x]:
For f(x) = 4x - 5:
The coefficient -5 ≡ 3 (mod 8) since -5 + 8 = 3
So f(x) = 4x + 3 in Z₈[x]
For g(x) = 2x² - 4x + 2:
The coefficient -4 ≡ 4 (mod 8) since -4 + 8 = 4
So g(x) = 2x² + 4x + 2 in Z₈[x]
Now I'll add these polynomials:
 (f + g)(x) = (4x + 3) + (2x² + 4x + 2)
 = 2x² + (4 + 4)x + (3 + 2)
 = 2x² + 8x + 5
In Z₈, 8 ≡ 0 (mod 8), so:
 (f + g)(x) = 2x² + 0 + 5 = 2x² + 5
The answer is a) 2x² + 5

Chain of Draft (CoD):

Prompt: 
Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Find the sum of the given polynomials in the given polynomial ring. f(x) = 4x - 5, g(x) = 2x^2 - 4x + 2 in Z_8[x]. [ "2x^2 + 5", "6x^2 + 4x + 6", "0", "x^2 + 1" ] give me the answer in option, a, b, c, d

Thoughts:
I'll work with polynomials f(x) = 4x - 5 and g(x) = 2x^2 - 4x + 2 in Z_8[x]. This means all coefficients are in Z_8, which is the integers modulo 8.
Let me find the sum f(x) + g(x):
 f(x) + g(x) = (4x - 5) + (2x^2 - 4x + 2)
 = 2x^2 + (4x - 4x) + (-5 + 2)
 = 2x^2 + 0 + (-3)
 = 2x^2 - 3
But in Z_8, -3 is equivalent to 5 (because -3 + 8 = 5, or more directly, 8 - 3 = 5).
 So in Z_8[x], f(x) + g(x) = 2x^2 + 5
Now let me check which option this corresponds to:
 "2x^2 + 5" - This is our answer
 "6x^2 + 4x + 6" - Different from our answer
 "0" - Different from our answer
 "x^2 + 1" - Different from our answer
So the correct option is the first one, which would be option (a).
Actually, let me double-check my work:
f(x) = 4x - 5
 g(x) = 2x^2 - 4x + 2
f(x) + g(x) = 4x - 5 + 2x^2 - 4x + 2
 = 2x^2 + (4x - 4x) + (-5 + 2)
 = 2x^2 + 0 + (-3)
 = 2x^2 - 3
In Z_8, -3 ≡ 5 (mod 8) because 5 ≡ -3 + 8 (mod 8)
So f(x) + g(x) = 2x^2 + 5 in Z_8[x]
This confirms that the answer is option (a) "2x^2 + 5".
Answer:
To find the sum of polynomials in Z₈[x], I'll add the coefficients of like terms, working modulo 8.
Given:
f(x) = 4x - 5
g(x) = 2x² - 4x + 2
Step 1: Add the polynomials
 f(x) + g(x) = (4x - 5) + (2x² - 4x + 2)
 = 2x² + (4x - 4x) + (-5 + 2)
 = 2x² + 0 + (-3)
 = 2x² - 3
Step 2: Convert to modulo 8
 In Z₈, -3 ≡ 5 (mod 8) since -3 + 8 = 5
 Therefore, 2x² - 3 ≡ 2x² + 5 (mod 8)
Step 3: Match with the given options
 a) 2x² + 5
 b) 6x² + 4x + 6
 c) 0
 d) x² + 1
The sum f(x) + g(x) = 2x² + 5 in Z₈[x], which corresponds to option a.

As you can see, the output (especially the thinking process) is much shorter, which means lower token usage with the same level of accuracy.
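
If you want to quantify that difference, you can count tokens in the two thinking traces with a tokenizer such as tiktoken. This is only an approximation here, since our experiment above used Claude rather than an OpenAI model, and the trace strings below are placeholders to paste into:

```python
import tiktoken

# cl100k_base is an OpenAI tokenizer, so counts for Claude outputs are approximate.
encoder = tiktoken.get_encoding("cl100k_base")

cot_trace = "..."  # paste the full Chain of Thought thinking trace here
cod_trace = "..."  # paste the full Chain of Draft thinking trace here

print("CoT tokens:", len(encoder.encode(cot_trace)))
print("CoD tokens:", len(encoder.encode(cod_trace)))
```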

Conclusion

In summary, CoD represents a promising advancement in LLM prompting strategies, offering a balance between accuracy and efficiency. Its ability to reduce token usage by up to 80% and latency by significant margins, while often matching CoT's performance, positions it as a valuable tool for real-world deployments. The paper's findings, supported by empirical evaluations, suggest that CoD could reshape how LLMs are used in latency-sensitive and cost-sensitive environments, with potential for further enhancements through future research.

Author: Rabbani Shaik

AI enthusiast who loves building cool stuff by leveraging AI. I explore new tools, experiment with ideas, and share what I learn along the way. Always curious, always building!

