Building Self-Refining AI Agents with Ollama & Langfuse

Day 6 of 2026 Building!

Have you ever asked ChatGPT to write a report, read the output, and thought, "Eh, that's kind of generic"?

We've all been there. The standard "one-shot" approach to AI - prompt in, answer out - mimics a human blurting out the first thing that comes to mind. But real intelligence isn't just about generation; it's about refinement.

Today, I'm going to show you how to build a Self-Recursive Agent using Node.js, local LLMs (Ollama), and Langfuse. This agent doesn't just write; it critiques its own work, fixes mistakes, and keeps improving until it meets a high quality standard or runs out of retries.

Why Recursion > Fine-Tuning

If you want better model outputs, the industry standard advice is often "Fine-tune a model on your data!"

Fine-tuning is powerful, but it's also:

Expensive: Requires GPUs and compute.
Rigid: The model only learns what you show it.
Hard to Debug: If it fails, you don't know why.

Self-Recursion is the alternative. Instead of training a smarter model, you build a smarter workflow. You create a loop where "dumber" models can outperform "smarter" ones simply by having the chance to correct their own mistakes.

The Architecture: A Digital Newsroom

We simulate a team of experts working together:

The Planner (Llama 3.2): breaks the user's request into a JSON list of tasks.
The Executor (llama3.1:8b-instruct-q4_K_M): writes the content.
The Critic (gemma3:12b): reads the draft and critiques it against a strict rubric.
The Judge (gpt-oss:20b): assigns a numerical score (0-1).

If the score is below 0.9, the Critic's feedback is passed back to the Executor, and the loop runs again.
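Here is a minimal sketch of how the four roles could be wired to a local Ollama server. It is not the code from index.js: the ROLES map, askRole, and needsAnotherPass are hypothetical names, and "llama3.2" is my assumed Ollama tag for the Planner. It assumes Ollama is listening on its default port (11434), calls its /api/chat endpoint with streaming turned off, and relies on the global fetch available in Node 18+.

```javascript
// Model tags taken from the role list above; the planner tag is an assumption.
const ROLES = {
  planner: "llama3.2", // assumed Ollama tag for "Llama 3.2"
  executor: "llama3.1:8b-instruct-q4_K_M",
  critic: "gemma3:12b",
  judge: "gpt-oss:20b",
};

// Drafts scoring below this value go back to the Executor.
const PASS_THRESHOLD = 0.9;

// Sends one non-streaming chat request to a local Ollama server and
// returns the assistant's text. `fetchImpl` is injectable so the helper
// can be exercised without a running server.
async function askRole(
  role,
  prompt,
  { baseUrl = "http://localhost:11434", fetchImpl = fetch } = {}
) {
  const model = ROLES[role];
  if (!model) throw new Error(`Unknown role: ${role}`);

  const res = await fetchImpl(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      stream: false, // ask for a single JSON response instead of chunks
    }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);

  const data = await res.json();
  return data.message.content;
}

// True when a judged draft should be sent back for another pass.
function needsAnotherPass(score) {
  return score < PASS_THRESHOLD;
}
```

With Ollama running and the models pulled, a call like askRole("critic", "Analyze this draft for logic gaps: ...") should return the Critic's reply as a plain string.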

The Code: The "Thinking" Loop

The magic happens inside a simple while loop in Node.js. Here is the simplified logic from our index.js:

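let score = 0.0;
let attempt = 0;
let currentReport = ""; // latest draft, empty on the first pass
let feedback = "";      // latest critique, empty on the first pass

// Keep refining until we hit 90% quality or run out of retries
while (attempt < MAX_RETRIES && score < 0.9) {
  attempt++; // without this, a low score would loop forever

  // 1. Executor writes/rewrites
  const report = await executorModel.generate({
    tasks: plan,
    previousDraft: currentReport,
    criticFeedback: feedback // <--- The secret sauce
  });

  // 2. Critic reviews the work
  const critique = await criticModel.chat({
    prompt: `Analyze this draft for logic gaps: ${report}`
  });

  // 3. Judge evaluates
  const judgment = await judgeModel.evaluate(report, critique);
  score = judgment.score;

  // Carry the draft and critique into the next iteration
  currentReport = report;
  feedback = critique;

  // Send the score to Langfuse for tracking
  await langfuse.score({
    traceId: trace.id, // attach the score to the trace created for this run
    name: "quality-gate",
    value: score,
    comment: judgment.reasoning
  });
}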

Notice how we feed criticFeedback back into the Executor? That is the recursion. The model effectively "learns" from the critique in real-time, within the context window.
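To make the feedback path concrete, here is a small runnable sketch of the same loop with the three models passed in as plain async functions. This is an illustration, not the project's actual code: refine and parseJudgment are hypothetical helpers, and I'm assuming the Judge replies with JSON shaped like {"score": 0.8, "reasoning": "..."}.

```javascript
// Parses the Judge's reply (assumed JSON: {"score": 0.8, "reasoning": "..."}).
// Unparsable or non-numeric output counts as 0 so the loop keeps refining,
// and scores are clamped to the 0-1 range.
function parseJudgment(text) {
  try {
    const { score, reasoning = "" } = JSON.parse(text);
    const n = Number(score);
    if (!Number.isFinite(n)) return { score: 0, reasoning: "non-numeric score" };
    return { score: Math.min(1, Math.max(0, n)), reasoning: String(reasoning) };
  } catch (err) {
    return { score: 0, reasoning: "unparsable judge output" };
  }
}

// Runs execute -> critique -> judge until the score reaches the threshold
// or maxRetries passes have been used. Each pass hands the previous draft
// and the previous critique back to the executor.
async function refine({ plan, execute, critique, judge, threshold = 0.9, maxRetries = 3 }) {
  let draft = "";
  let feedback = "";
  let score = 0;
  const history = [];

  for (let attempt = 1; attempt <= maxRetries && score < threshold; attempt++) {
    draft = await execute({ tasks: plan, previousDraft: draft, criticFeedback: feedback });
    feedback = await critique(draft);
    ({ score } = parseJudgment(await judge(draft, feedback)));
    history.push({ attempt, score, feedback });
  }

  return { draft, score, passed: score >= threshold, history };
}
```

Passing the models in as functions keeps the loop testable with stubs, and the returned history gives you one record per pass to log or inspect.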

Observability: Seeing the Brain Work 🧠

When you have loops inside loops, console.log doesn't cut it. You need to see the trace of execution.

This is where Langfuse comes in. It's an open-source LLM engineering platform that lets us visualize exactly what our agent is doing.

In our code, we wrap the entire process in a trace:

const trace = langfuse.trace({ name: "Recursive-Research-Agent" });
// ... later ...
const loopSpan = trace.span({ name: `Refinement_Loop_V${attempt}` });
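One way to keep spans tidy is a small wrapper that opens a span, runs a step, and closes the span whether the step succeeds or throws. The sketch below is my own illustration, not the repo's code: withSpan is a hypothetical helper, and it assumes the parent object exposes span() and the span exposes end(), as the trace and span clients in the `langfuse` JS SDK do. The parent is passed in rather than created here, so the helper can also be driven by a fake.

```javascript
// Wraps one step of the agent in a span on `parent` (a trace or another span).
// Assumes parent.span({ name, input }) returns an object with end(body),
// mirroring the trace/span clients of the `langfuse` JS SDK.
async function withSpan(parent, name, input, work) {
  const span = parent.span({ name, input });
  try {
    const output = await work(span); // the step can open child spans on `span`
    span.end({ output });
    return output;
  } catch (err) {
    // Close the span even on failure so the trace shows where it broke.
    span.end({ level: "ERROR", statusMessage: String(err) });
    throw err;
  }
}
```

Inside the loop this could look like withSpan(trace, `Refinement_Loop_V${attempt}`, { attempt }, async (loopSpan) => { ... }), with the Executor and Critic calls wrapped in child spans on loopSpan. With the real SDK, I'd also expect to await langfuse.flushAsync() before the process exits so buffered events are sent.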

The result? A beautiful graph that looks like this:

Trace Start
- Planner (Output: "Task List")
- Loop 1 (Score: 0.6)
  - Executor → "Draft 1"
  - Critic → "You missed section X"
- Loop 2 (Score: 0.8)
  - Executor → "Draft 2 (Fixed X)"
  - Critic → "Tone is too casual"
- Loop 3 (Score: 0.95 - PASS) ✅

Without Langfuse, you're flying blind. With it, you can pinpoint exactly which model (Critic or Executor) is dropping the ball and adjust your prompts accordingly.

You don't need a massive cluster of H100s to build powerful AI. By chaining smaller, local models together in a self-correcting loop and monitoring them with tools like Langfuse, you can achieve results that rival those of much larger models.

The code is open source - go clone it, fire up Ollama, and watch your agent teach itself to be better! 🚀
