Frederick Abila

Mar 09, 2026 • 5 min read

OrcBot's Self-Training Sidecar: How to Improve an AI Agent Without Letting It Rewrite Itself Live

Most AI agent systems stop at orchestration.

They can call tools, browse the web, write files, and recover from failure. But when it comes to getting better over time, they fall into one of two traps:

1. They never learn from their own work at all.

2. They try to learn in place, inside the live runtime — which is risky, hard to audit, and easy to get wrong.

OrcBot takes a different approach.

Instead of letting the live agent mutate itself mid-flight, OrcBot uses a self-training sidecar. The agent keeps doing useful work. In the background, it captures examples of successful behavior, filters them, turns them into training data, evaluates candidate models, and lets an admin decide whether a new model should be promoted.

The agent learns from experience without turning production into an experiment.

The Problem With "Self-Improving" Agents

"Self-improving AI" sounds impressive until you ask what it really means.

In practice, most teams run into the same problems:

Good and bad outcomes get mixed together.

Sensitive data leaks into training logs.

The live system changes behavior without enough evaluation.

Nobody can explain why the model got better, worse, or just different.

For a production agent, that is not acceptable. If an agent is handling real user requests, messaging channels, files, and automation flows, then learning has to be controlled. You need traceability. You need evaluation. You need a rollback story.

That is what the OrcBot self-training sidecar is built for.

What It Actually Does

At a high level, OrcBot observes its own completed work and extracts useful training examples from it.

When the agent successfully finishes a task, the system records the trajectory of that run — the original task, the reasoning path, tool calls and outcomes, delivery quality signals, and the final result.
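To make that concrete, here is a minimal sketch of what one captured trajectory could look like. The class names, field names, and the `to_record` helper are assumptions chosen to mirror the list above; they are not OrcBot's actual schema.

```python
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class ToolCall:
    name: str                   # hypothetical tool name
    arguments: dict[str, Any]
    outcome: str                # assumed vocabulary, e.g. "ok" or "error"


@dataclass
class Trajectory:
    task: str                           # the original task
    reasoning: list[str]                # the reasoning path, one step per entry
    tool_calls: list[ToolCall]          # tool calls and their outcomes
    delivery_signals: dict[str, Any]    # delivery quality signals
    final_result: str                   # the final result


def to_record(trajectory: Trajectory) -> dict[str, Any]:
    """Convert a trajectory (including nested tool calls) to a plain dict."""
    return asdict(trajectory)


example = Trajectory(
    task="Summarize the open issues",
    reasoning=["List the issues", "Group them by label"],
    tool_calls=[ToolCall("list_issues", {"state": "open"}, "ok")],
    delivery_signals={"delivered": True},
    final_result="3 open issues, 2 labelled bug",
)
record = to_record(example)
```

A plain dict like `record` is easy to serialize, redact, and filter later, which is why the sketch converts to one early.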

But not every run becomes training data.

The system filters out weak examples, unresolved failures, empty status loops, and low-value traces. Only accepted examples get exported into a training-ready dataset.
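A filter of this kind might look roughly like the sketch below. The field names (`resolved`, `quality`, `tool_calls`, `final_result`), the `status` tool name, and the 0.7 threshold are all assumptions for illustration; OrcBot's real acceptance rules are not documented here.

```python
from typing import Any, Optional

MIN_QUALITY = 0.7  # assumed threshold, not an OrcBot default


def rejection_reason(traj: dict[str, Any]) -> Optional[str]:
    """Return why a trajectory is rejected, or None if it is accepted."""
    if not traj.get("resolved", False):
        return "unresolved_failure"
    tool_calls = traj.get("tool_calls", [])
    # A run that only ever polled status did no real work.
    if tool_calls and all(call.get("name") == "status" for call in tool_calls):
        return "empty_status_loop"
    if traj.get("quality", 0.0) < MIN_QUALITY:
        return "weak_example"
    if not traj.get("final_result", "").strip():
        return "low_value_trace"
    return None


def filter_trajectories(trajectories: list[dict[str, Any]]):
    """Split trajectories into accepted ones and (trajectory, reason) rejections."""
    accepted, rejected = [], []
    for traj in trajectories:
        reason = rejection_reason(traj)
        if reason is None:
            accepted.append(traj)
        else:
            rejected.append((traj, reason))
    return accepted, rejected
```

Keeping the rejection reason next to each rejected trace is a deliberate choice in this sketch: it means the filter itself can be audited later.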

From there, OrcBot can build a JSONL export for training, generate a manifest for an offline training job, evaluate candidate models against accepted trajectories, register trained candidates, and promote a candidate into live use only after explicit approval.
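As a rough illustration of the export step, the sketch below writes accepted examples to a JSONL file and a small manifest next to it. The two file names come from the artifact list later in this post; the `prompt`/`completion` row shape and the manifest fields are assumptions, since the real formats are not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def export_dataset(accepted: list[dict[str, Any]], out_dir: Path) -> dict[str, Any]:
    """Write one JSON object per line, then a manifest describing the dataset."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dataset_path = out_dir / "self-training-trajectories.jsonl"
    with dataset_path.open("w", encoding="utf-8") as handle:
        for traj in accepted:
            # Assumed row shape: many fine-tuning tools accept prompt/completion pairs.
            row = {"prompt": traj["task"], "completion": traj["final_result"]}
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    manifest = {
        "dataset": dataset_path.name,
        "format": "jsonl",
        "example_count": len(accepted),
    }
    manifest_path = out_dir / "self-training-job.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


# Usage: export two accepted examples into a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    manifest = export_dataset(
        [
            {"task": "Summarize the open issues", "final_result": "3 open issues"},
            {"task": "Rename report.txt", "final_result": "Renamed to report-final.txt"},
        ],
        Path(tmp),
    )
    lines = (Path(tmp) / "self-training-trajectories.jsonl").read_text(encoding="utf-8").splitlines()
```

Because the manifest only references the dataset by name, an external training pipeline can pick up both files from the same directory without knowing anything about the agent.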

This is not a vague "AI learns over time" claim. It is a concrete pipeline.

The Simple Version

If you know nothing about model training, think of it like this.

Imagine OrcBot is an employee. Every time it handles a task well, someone saves the best parts of that case into a notebook. After enough good cases, you use that notebook to train a new employee. Then you test that new employee before giving them real responsibility.

That is OrcBot's self-training sidecar.

The live agent keeps working.

The notebook gets better.

Candidate models get tested.

Rollout stays deliberate.

What It Does Not Do

This system does not train a frontier LLM from scratch. It is not trying to build the next GPT-class foundation model from zero parameters.

Instead, it is built for the realistic version of self-improvement: fine-tuning an existing base model, instruction-tuning a smaller local model, creating a more specialized model for OrcBot-style tasks, and learning from your own workflows instead of generic internet data.

Why This Matters

Generic LLMs are broad. But broad is not the same as aligned.

An agent like OrcBot has a specific job: plan multi-step tasks, call tools correctly, recover from failures, communicate results clearly, and operate safely. A model that has seen examples of exactly that behavior can become much better at those tasks than a generic baseline.

The self-training sidecar helps you move from "this model is generally smart" to "this model is unusually good at being our agent."

The Workflow

1. Capture — Completed actions become trajectories: task, tool use, result signals, delivery outcome.

2. Filter — The system rejects low-quality traces, unresolved failures, and weak outcomes. Bad training data does not just waste time. It actively teaches the wrong behavior.

3. Redact — Sensitive information is cleaned before persistence and export. If you are going to train on real operational history, this step is non-negotiable.

4. Prepare — Once enough accepted examples exist, OrcBot writes a JSONL dataset and a job manifest for offline training — usable by any external training pipeline.

5. Evaluate — Candidate models are scored against accepted trajectories. The hard question: is the new model actually better, or just different?

6. Register — OrcBot tracks what each candidate is, where it came from, and how it performed.

7. Promote — Explicit and admin-controlled. The model only moves into live config after the evaluation gate passes and a human chooses to roll it out.

That separation between training, evaluation, and rollout is the whole safety story.
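Of the steps above, redaction is the easiest to show in a few lines. This is a minimal sketch, assuming regex-based scrubbing of email addresses and key-like tokens; the patterns and placeholder strings are illustrative, and a production redactor would need far broader coverage than this.

```python
import re
from typing import Any

# Illustrative patterns only; real redaction needs many more cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_-]{8,}\b")


def redact(value: Any) -> Any:
    """Recursively scrub strings inside dicts and lists; leave other types alone."""
    if isinstance(value, str):
        value = EMAIL.sub("[EMAIL]", value)
        return SECRET.sub("[SECRET]", value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


clean = redact({
    "task": "Email bob@example.com with key sk-abcdef123456",
    "tool_calls": [{"name": "send_mail", "arguments": {"to": "bob@example.com"}}],
    "attempts": 3,
})
```

Running the scrubber on the whole trajectory, not just the task text, matters because tool arguments are where addresses and credentials usually hide.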

Why the Safety Model Is the Real Feature

A lot of systems talk about learning. Very few talk enough about control.

The most important thing about this system is not that it creates training data. Plenty of systems can dump logs into a file. The important part is that it treats training and rollout as separate concerns.

That gives you redaction before export, quality gating before training, evaluation before promotion, explicit rollout decisions, and a clear rollback path.

It is designed for teams that want improvement without surrendering operational discipline.
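The last three guarantees can be sketched as a small gate. Everything here is hypothetical: the `Candidate` and `LiveConfig` types, the single `eval_score` number, and the 0.02 improvement margin are assumptions used to show the shape of the control flow, not OrcBot's actual promotion logic.

```python
from dataclasses import dataclass, field
from typing import Optional

MIN_IMPROVEMENT = 0.02  # assumed margin so "just different" does not pass


@dataclass
class Candidate:
    model_id: str
    eval_score: float  # assumed: one score from evaluation on accepted trajectories


@dataclass
class LiveConfig:
    model_id: str
    history: list[str] = field(default_factory=list)  # previous models, newest last


def passes_gate(candidate: Candidate, baseline_score: float) -> bool:
    """Evaluation before promotion: require a clear gain over the baseline."""
    return candidate.eval_score >= baseline_score + MIN_IMPROVEMENT


def promote(config: LiveConfig, candidate: Candidate,
            baseline_score: float, approved_by: Optional[str]) -> bool:
    """Switch the live model only if the gate passes and a human approved it."""
    if not passes_gate(candidate, baseline_score):
        return False
    if not approved_by:
        return False
    config.history.append(config.model_id)  # remember what to roll back to
    config.model_id = candidate.model_id
    return True


def rollback(config: LiveConfig) -> bool:
    """Restore the previously live model, if there is one."""
    if not config.history:
        return False
    config.model_id = config.history.pop()
    return True
```

Note that a passing evaluation is necessary but not sufficient in this sketch: without an approver, `promote` leaves the live config untouched.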

Artifacts It Produces

self-training-trajectories.json — all captured trajectories

self-training-trajectories.jsonl — accepted examples only

self-training-job.json — current offline training manifest

self-training-eval-report.json — evaluation results

self-training-launch.json — launch history and audit trail

self-training-candidates.json — registered candidate models

self-training-promotion.json — latest promotion record

You are not guessing what happened. You can open the artifacts and review the chain from captured work to promoted model.
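A quick way to start that review is to check which artifacts exist. The file names below are the ones listed above; the directory they live in is an assumption (the sketch uses a temporary directory), and nothing is assumed about the contents of each file.

```python
import tempfile
from pathlib import Path

ARTIFACTS = [
    "self-training-trajectories.json",
    "self-training-trajectories.jsonl",
    "self-training-job.json",
    "self-training-eval-report.json",
    "self-training-launch.json",
    "self-training-candidates.json",
    "self-training-promotion.json",
]


def artifact_status(data_dir: Path) -> dict[str, bool]:
    """Map each expected artifact name to whether it exists in data_dir."""
    return {name: (data_dir / name).is_file() for name in ARTIFACTS}


# Usage: a directory where only the job manifest has been written so far.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "self-training-job.json").write_text("{}", encoding="utf-8")
    status = artifact_status(Path(tmp))
```

A missing promotion record next to a present evaluation report, for example, would suggest a candidate was evaluated but never rolled out.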

The Core Idea in One Sentence

OrcBot turns real work into reviewable training data, evaluates candidate models offline, and only promotes new behavior when the evidence is good enough.

Final Thought

The future of agent systems is not just better prompting. It is better feedback loops.

The winners will be the systems that can learn from their own real work without becoming unstable, opaque, or impossible to trust. OrcBot's self-training sidecar is an early version of that future — a practical loop for turning production experience into better candidate models while keeping runtime behavior under control.

"It gets smarter over time" is marketing.

"It captures successful trajectories, filters them, produces training-ready datasets, evaluates candidates, and promotes them under admin control" is engineering.

OrcBot is built around the second idea.
