Why AI Evals are needed

Most AI startups ship a model and call it done.

If you’re in a regulated industry, that’s the beginning, not the end.

Nic Romanos and I build AI that generates safety documents for high-risk construction work in Australia.

👷🏽‍♀️ SWMS (Safe Work Method Statements) are legally required.

They protect workers from hazards.

Getting one wrong isn’t a bad user experience.

It’s a regulatory liability and, more importantly, a safety risk. We talked about it here: https://lnkd.in/dzV3thby

So we built an evaluation framework, and we’re sharing it for FREE.

What it is

A continuous, automated system that scores every AI model we use, on every type of scenario our clients actually encounter, every time we run it.

Here are the internals:

• Real scenarios, not synthetic benchmarks.
We extracted test cases from production — demolition near load-bearing structures, asbestos discovery mid-renovation, electrical work near live services, confined space entry, excavation near gas mains. These are the exact situations that show up on Australian job sites.

• Known correct answers.
Each scenario has a reference output derived from WHS regulations, Safe Work Australia Codes of Practice, and state-specific requirements. Not “generally good safety advice.” The actual regulatory standard.

• Five scoring dimensions.
Completeness (30%) — did it cover all required hazards and controls?
Accuracy (30%) — are risk ratings and reg references correct?
Format (15%) — does it match the structure our application needs?
Domain relevance (15%) — does it cite real Australian standards?
Clarity (10%) — could a site supervisor use this on the ground?
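
To make the weighting concrete, here is a minimal Python sketch of how five dimension scores could roll up into one number. The weights come from the list above; the dictionary keys, the 0-100 scale, and the function name are assumptions for illustration, not the framework's actual code.

```python
# Weights are the percentages listed above; key names are illustrative.
WEIGHTS = {
    "completeness": 0.30,
    "accuracy": 0.30,
    "format": 0.15,
    "domain_relevance": 0.15,
    "clarity": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 0-100) into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Made-up scores for a single output, purely to show the arithmetic.
example = {
    "completeness": 90,
    "accuracy": 80,
    "format": 100,
    "domain_relevance": 70,
    "clarity": 60,
}
print(round(weighted_score(example), 1))  # 27 + 24 + 15 + 10.5 + 6 = 82.5
```

Because completeness and accuracy carry 60% of the weight between them, a well-formatted but incomplete document cannot score well.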

• Dual-judge scoring.
Every output is evaluated by two independent AI judges from different providers, and their scores are averaged. This reduces single-provider bias: no model gets to mark its own homework.
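
A rough Python sketch of the averaging step, assuming each judge is a function that takes the model output and the reference and returns a 0-100 score. The judge functions below are stubs standing in for real API calls to two providers; their names and return values are hypothetical.

```python
from statistics import mean
from typing import Callable

# Assumed judge interface: (model_output, reference_output) -> score in 0-100.
Judge = Callable[[str, str], float]


def dual_judge_score(output: str, reference: str, judges: dict[str, Judge]) -> dict:
    """Score one output with two judges and average the results."""
    if len(judges) != 2:
        raise ValueError("expected exactly two judges")
    per_judge = {name: judge(output, reference) for name, judge in judges.items()}
    return {"per_judge": per_judge, "score": mean(list(per_judge.values()))}


# Stub judges; a real setup would call two different providers here.
def judge_a(output: str, reference: str) -> float:
    return 84.0


def judge_b(output: str, reference: str) -> float:
    return 90.0


result = dual_judge_score(
    "draft SWMS text",
    "reference SWMS text",
    {"provider_a": judge_a, "provider_b": judge_b},
)
print(result["score"])  # 87.0
```

Keeping the per-judge scores alongside the average makes it easy to spot cases where the two judges disagree sharply.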

• Reliability testing.
Each scenario runs three times per model. A model that scores 95 once and 60 the next time gets flagged as unreliable. Consistency matters as much as peak performance.
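
One way to express the reliability flag in Python. The three runs come from the description above; the 10-point spread threshold is an assumption picked for illustration, since no cut-off is stated here.

```python
from statistics import mean

RUNS_PER_SCENARIO = 3  # stated above
MAX_SPREAD = 10.0      # assumed threshold, not a published value


def check_reliability(run_scores: list[float], max_spread: float = MAX_SPREAD) -> dict:
    """Flag a model as unreliable when its repeated runs differ too much."""
    if len(run_scores) != RUNS_PER_SCENARIO:
        raise ValueError(f"expected {RUNS_PER_SCENARIO} runs, got {len(run_scores)}")
    spread = max(run_scores) - min(run_scores)
    return {
        "mean": mean(run_scores),
        "spread": spread,
        "unreliable": spread > max_spread,
    }


print(check_reliability([95, 60, 90])["unreliable"])  # True: spread is 35
print(check_reliability([88, 90, 91])["unreliable"])  # False: spread is 3
```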

• Continuous monitoring.
This runs on a schedule.
If a model provider pushes an update that degrades our safety output quality, we catch it before it reaches production. If a new model enters the market, we run it through the same suite and compare scores objectively.
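
A minimal sketch of the comparison a scheduled run could make, assuming each run produces one overall score per model and that a stored baseline exists. The 5-point tolerance, the model names, and both function names are hypothetical.

```python
MAX_DROP = 5.0  # assumed tolerance before a drop counts as a regression


def find_regressions(
    baseline: dict[str, float],
    latest: dict[str, float],
    max_drop: float = MAX_DROP,
) -> dict[str, float]:
    """Return models whose score fell by more than max_drop since the baseline."""
    regressions = {}
    for model, base_score in baseline.items():
        if model in latest and base_score - latest[model] > max_drop:
            regressions[model] = base_score - latest[model]
    return regressions


def rank_models(latest: dict[str, float]) -> list[str]:
    """Order models by score, best first, so new entrants are compared on the same suite."""
    return sorted(latest, key=latest.get, reverse=True)


baseline = {"model_a": 90.0, "model_b": 85.0}
latest = {"model_a": 82.0, "model_b": 84.0, "model_c": 88.0}  # model_c is new

print(find_regressions(baseline, latest))  # {'model_a': 8.0}
print(rank_models(latest))                 # ['model_c', 'model_b', 'model_a']
```

In practice this check would run from whatever scheduler is already in place, with an alert or a blocked deploy when the regression dictionary is not empty.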

Summary

A demo needs to work once.
A production system needs to work every time, prove it, and catch itself when it doesn’t.
That’s the difference between “we tested it” and “we have a testing system.”
Regulators, insurers, and enterprise clients can tell the difference. Your users’ safety depends on you knowing it too.
