Chain-of-Thought in Action

We have already seen that LLMs are next-token predictors, yet we also expect them to think. How is that possible?
Large Language Models (LLMs) can generate poetry, code websites, and answer trivia with astonishing fluency. But ask one to solve a 3-step math word problem, and it often stumbles. Why? Because most LLMs don’t think; they predict the next word based on training data.
To move beyond parroting patterns, we need to teach models to reason. That’s where Chain-of-Thought (CoT) and related techniques come in. These methods let models generate and use intermediate reasoning steps, like scratch work on a notepad before giving the final answer.
In this post, we’ll explore what it means for an LLM to “think,” why base models fail at reasoning, and how researchers and practitioners are upgrading them into thinking models.
Humans rarely jump straight to the answer. We work things out step by step.
Imagine:
Apples cost $2 each. If I buy 3 apples, how much do I pay?
A non-thinking LLM might blurt out “7”, a confident but wrong guess.
A thinking model would instead reason:
Each apple = $2
3 apples = 2 × 3 = 6
Final answer: $6
This intermediate reasoning process is called Chain-of-Thought (CoT). In practice, it looks like:
Reasoning: Each apple is $2. For 3 apples, 2 × 3 = 6.
Final Answer: 6
Base LLMs are excellent at:
Fluency – producing grammatically perfect sentences.
Recall – pulling up facts seen in training.
But they struggle with:
Compositionality – combining multiple facts correctly.
Multi-hop reasoning – answering questions that require connecting dots.
Math and logic – precise symbolic reasoning.
They’re like students who can recite definitions but panic when asked to show their work.
Different approaches have been proposed to give models structured reasoning abilities:
Chain-of-Thought (CoT) : simple linear scratchpad.
Least-to-Most : break a problem into smaller subproblems.
ReAct : interleave reasoning with tool use (like calculators or search).
Tree-of-Thoughts : explore multiple possible reasoning paths in parallel.
Self-Consistency : generate multiple scratchpads and vote on the best.
Reflexion : let the model critique and revise its own reasoning.
Think of these as styles of thinking: some linear, some branching, some collaborative with tools. In this post, we’ll use CoT.
Chain-of-Thought (CoT) is exactly what it sounds like: asking the model to generate a sequence of reasoning steps before giving an answer.
Question: What is 3 + 4 × 10 – 4 × 3?
Reasoning: First, solve multiplications. 4 × 10 = 40, 4 × 3 = 12.
Equation becomes 3 + 40 – 12 = 31.
Final Answer: 31
The model isn’t just guessing anymore; it’s showing its work. This has three huge benefits:
Accuracy : avoids shallow mistakes by enforcing logical steps.
Interpretability : we can see why the model reached a conclusion.
Debuggability : if it fails, we can inspect the faulty step.
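The "Reasoning / Final Answer" layout shown above is also easy to consume from code. Here is a minimal sketch, assuming the model really does reply in that two-part layout. The names buildCotPrompt and parseCotResponse are hypothetical helpers invented for this illustration, not part of any library.

```typescript
// Hypothetical helpers for illustration; the names are not from any library.
interface CotResult {
  reasoning: string;
  finalAnswer: string;
}

// Wrap a question in a prompt that asks for the two-part layout.
function buildCotPrompt(question: string): string {
  return [
    `Question: ${question}`,
    "Work through the problem step by step.",
    "Reply in exactly this layout:",
    "Reasoning: <your steps>",
    "Final Answer: <answer only>",
  ].join("\n");
}

// Split a completion into its reasoning and its final answer.
// Returns null when the "Final Answer:" marker is missing.
function parseCotResponse(text: string): CotResult | null {
  const marker = "Final Answer:";
  const idx = text.lastIndexOf(marker);
  if (idx === -1) return null;
  const reasoning = text
    .slice(0, idx)
    .replace(/^\s*Reasoning:\s*/, "")
    .trim();
  const finalAnswer = text.slice(idx + marker.length).trim();
  return { reasoning, finalAnswer };
}

const sample =
  "Reasoning: First, solve multiplications. 4 × 10 = 40, 4 × 3 = 12.\n" +
  "Equation becomes 3 + 40 – 12 = 31.\n" +
  "Final Answer: 31";
const parsed = parseCotResponse(sample);
console.log(parsed ? parsed.finalAnswer : "no final answer found"); // 31
```

Keeping the reasoning and the answer in separate fields is what makes the interpretability and debuggability benefits practical: the answer can be compared against a reference while the reasoning is logged for inspection.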
Here’s a small coding experiment I built using the OpenAI API. The idea is to force the assistant into a structured reasoning loop:
Always begin with START (state the problem).
Continue with multiple THINK steps (reasoning).
End with OUTPUT (final answer).
import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const client = new OpenAI();

async function main() {
  const SYSTEM_PROMPT = `
You are an intelligent assistant that breaks down any problem and solves STEP BY STEP.
The steps you take are START, THINK and OUTPUT.
You should always keep thinking before providing the actual output.
Before giving an output, make sure everything is correct.
Rules:
- Strictly follow the JSON Format.
- Always follow the same sequence START, THINK and OUTPUT.
- Perform only one step at a time and wait for the next.
- Always do multiple THINK steps before OUTPUT.
Output JSON Format:
{ "step": "START | THINK | OUTPUT", "content": "string" }
Example:
User: Can you solve 3 + 4 * 10 - 4 * 3
ASSISTANT: { "step": "START", "content": "The user wants me to solve 3 + 4 * 10 - 4 * 3" }
ASSISTANT: { "step": "THINK", "content": "First solve multiplications: 4 * 10 = 40" }
ASSISTANT: { "step": "THINK", "content": "Next multiplication: 4 * 3 = 12" }
ASSISTANT: { "step": "THINK", "content": "Equation becomes 3 + 40 - 12" }
ASSISTANT: { "step": "OUTPUT", "content": "31" }
`;

  const messages = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: "What is 334 + 99 / 22 - 43 * 99" },
  ];

  // Cap the loop so an unexpected reply cannot keep it running forever.
  const MAX_STEPS = 20;
  for (let i = 0; i < MAX_STEPS; i++) {
    const response = await client.chat.completions.create({
      model: "gpt-4.1-mini",
      messages: messages as any,
    });

    const rawContent = response.choices[0]?.message.content;
    if (!rawContent) throw new Error("Model returned an empty message");

    // The prompt asks for JSON, but nothing guarantees it, so parse defensively.
    let parsedContent: { step: string; content: string };
    try {
      parsedContent = JSON.parse(rawContent);
    } catch {
      throw new Error(`Model returned non-JSON content: ${rawContent}`);
    }

    // Feed the step back so the next call continues from it.
    messages.push({
      role: "assistant",
      content: JSON.stringify(parsedContent),
    });

    if (parsedContent.step === "START") {
      console.log(`🆕 ${parsedContent.content}`);
      continue;
    }
    if (parsedContent.step === "THINK") {
      console.log(`🧠 ${parsedContent.content}`);
      continue;
    }
    if (parsedContent.step === "OUTPUT") {
      console.log(`🏁 ${parsedContent.content}`);
      break;
    }
  }
}

main().catch(console.error);

Output:
🆕 The user wants me to solve 334 + 99 / 22 - 43 * 99
🧠 Let's start by solving the division: 99 / 22 = 4.5
🧠 Now the equation becomes 334 + 4.5 - 43 * 99
🧠 Next, solve the multiplication: 43 * 99 = 4257
🧠 Equation becomes 334 + 4.5 - 4257
🧠 Add first two terms: 334 + 4.5 = 338.5
🧠 Now equation is 338.5 - 4257 = -3918.5
🏁 Final Answer: -3918.5
Notice how the assistant never skips ahead; it thinks out loud until it reaches the conclusion.
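Because every THINK step states its arithmetic explicitly, the trace can be spot-checked by recomputing each "a op b = c" claim. The sketch below is a simple rule-based check written for this post, with hypothetical names (evalFlat, checkArithmetic). It assumes ASCII operators (+ - * /), no parentheses, and no unary minus on the left-hand side, and it skips anything it cannot read.

```typescript
// Evaluate a flat expression such as "334 + 4.5 - 4257" with the usual
// precedence (* and / before + and -). Returns null for anything it
// cannot read, including parentheses and unary minus.
function evalFlat(expr: string): number | null {
  const tokens = expr.match(/\d+(?:\.\d+)?|[+\-*\/]/g);
  if (!tokens || tokens.length < 3 || tokens.length % 2 === 0) return null;
  let total = 0; // sum of the terms that are already complete
  let sign = 1; // sign of the term being built
  let current = Number(tokens[0]); // running product/quotient
  if (isNaN(current)) return null;
  for (let i = 1; i < tokens.length; i += 2) {
    const op = tokens[i];
    const next = Number(tokens[i + 1]);
    if (isNaN(next)) return null;
    if (op === "*") current *= next;
    else if (op === "/") current /= next;
    else if (op === "+" || op === "-") {
      total += sign * current;
      sign = op === "+" ? 1 : -1;
      current = next;
    } else return null;
  }
  return total + sign * current;
}

interface Claim {
  expression: string;
  claimed: number;
  actual: number;
  ok: boolean;
}

// Find every "<expression> = <number>" claim in a step and recompute it.
function checkArithmetic(content: string): Claim[] {
  const claims: Claim[] = [];
  const pattern = /([\d.\s+\-*\/]+)=\s*(-?\d+(?:\.\d+)?)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(content)) !== null) {
    const expression = String(m[1]).trim();
    const claimed = Number(m[2]);
    const actual = evalFlat(expression);
    if (actual === null) continue; // not a readable arithmetic claim
    claims.push({ expression, claimed, actual, ok: Math.abs(actual - claimed) < 1e-9 });
  }
  return claims;
}

// The THINK steps from the run above that contain an explicit result.
const trace = [
  "Let's start by solving the division: 99 / 22 = 4.5",
  "Next, solve the multiplication: 43 * 99 = 4257",
  "Add first two terms: 334 + 4.5 = 338.5",
  "Now equation is 338.5 - 4257 = -3918.5",
];
trace.forEach((step) => {
  checkArithmetic(step).forEach((c) => {
    console.log(`${c.ok ? "ok " : "BAD"} ${c.expression} = ${c.claimed}`);
  });
});
```

A check like this only catches arithmetic slips. It says nothing about whether the model chose the right steps in the first place.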
This tiny demo shows some key principles of Chain-of-Thought:
Structured output : JSON keeps reasoning predictable and machine-readable.
Multiple THINK steps : ensures careful breakdown.
Verification before OUTPUT : discourages sloppy answers.
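These principles only hold if the replies are actually checked. The prompt requests the format, but the model may still drift from it. A minimal sketch of such a check follows, assuming the { step, content } shape from the system prompt. The helper names parseStep and sequenceError are hypothetical, and the threshold of two THINK steps is one reading of "multiple".

```typescript
type StepName = "START" | "THINK" | "OUTPUT";

interface Step {
  step: StepName;
  content: string;
}

function isStepName(x: unknown): x is StepName {
  return x === "START" || x === "THINK" || x === "OUTPUT";
}

// Parse one raw model reply. Returns null if it is not valid JSON
// or does not match the { step, content } shape.
function parseStep(raw: string): Step | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const obj = value as { step?: unknown; content?: unknown };
  if (!isStepName(obj.step) || typeof obj.content !== "string") return null;
  return { step: obj.step, content: obj.content };
}

// Check the START -> THINK... -> OUTPUT ordering of the steps seen so far.
// Returns a description of the first violation, or null if none is found.
// A valid but unfinished prefix (no OUTPUT yet) also returns null.
function sequenceError(steps: Step[], minThink: number = 2): string | null {
  let thinkCount = 0;
  let finished = false;
  let index = 0;
  for (const s of steps) {
    if (finished) return "steps found after OUTPUT";
    if (index === 0 && s.step !== "START") return "first step must be START";
    if (index > 0 && s.step === "START") return "START may appear only once";
    if (s.step === "THINK") thinkCount++;
    if (s.step === "OUTPUT") {
      if (thinkCount < minThink) return `OUTPUT after only ${thinkCount} THINK step(s)`;
      finished = true;
    }
    index++;
  }
  return null;
}

const replies = [
  '{ "step": "START", "content": "The user wants me to solve 3 + 4 * 10 - 4 * 3" }',
  '{ "step": "THINK", "content": "First solve multiplications: 4 * 10 = 40" }',
  '{ "step": "OUTPUT", "content": "31" }',
];
const steps: Step[] = [];
replies.forEach((r) => {
  const s = parseStep(r);
  if (s !== null) steps.push(s);
});
console.log(sequenceError(steps)); // OUTPUT after only 1 THINK step(s)
```

In the loop from the experiment, a non-null result could be turned into a follow-up message asking the model to continue thinking rather than accepting the early answer.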
This is a microcosm of how CoT is implemented in research:
Fine-tuning on step-by-step traces.
Using self-consistency (vote among multiple runs).
Training verifiers that check each step.
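Of the three, self-consistency is the easiest to try without any training: run the same question several times, keep only each run's final answer, and take the most common one. Below is a minimal sketch of the voting step alone, assuming the final answers have already been collected as strings. The function name majorityVote is hypothetical, and ties go to whichever answer reached the top count first.

```typescript
interface VoteResult {
  answer: string;
  votes: number;
}

// Pick the most frequent final answer among several independent runs.
// Answers are compared as trimmed strings; empty answers are ignored.
// Returns null when no usable answer is present.
function majorityVote(answers: string[]): VoteResult | null {
  // Object.create(null) avoids clashes with inherited keys like "constructor".
  const counts: { [key: string]: number } = Object.create(null);
  let best: VoteResult | null = null;
  for (const raw of answers) {
    const key = raw.trim();
    if (key === "") continue;
    const n = (counts[key] || 0) + 1;
    counts[key] = n;
    if (best === null || n > best.votes) best = { answer: key, votes: n };
  }
  return best;
}

// Final answers from four hypothetical runs of the same question.
const sampledAnswers = ["-3918.5", "-3918.5", " -3919 ", "-3918.5"];
const winner = majorityVote(sampledAnswers);
if (winner !== null) {
  console.log(`${winner.answer} (${winner.votes} of ${sampledAnswers.length} votes)`);
  // -3918.5 (3 of 4 votes)
}
```

String comparison treats "6" and "6.0" as different answers, so a real setup would likely normalize answers before voting.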
Chain-of-Thought is powerful, but not free:
It increases token cost (often 2–5× more).
Models can become verbose without added accuracy.
Sometimes rationales are plausible-sounding but wrong (post-hoc justifications).
Balancing accuracy vs efficiency is the ongoing challenge.
Chain-of-Thought transforms LLMs from fluent guessers into step-by-step solvers. By forcing reasoning to be explicit, whether through fine-tuning, prompting, or coding tricks like the example above, we make models:
More accurate
More interpretable
Easier to debug
As the field progresses, the real frontier is making reasoning not only correct but also faithful, efficient, and safe enough for everyday use.