Most AI startups ship a model and call it done.

If you’re in a regulated industry, that’s the beginning, not the end.
Nic Romanos and I build AI that generates safety documents for high-risk construction work in Australia.
👷🏽♀️ SWMS — Safe Work Method Statements are legally required.
They protect workers from hazards.
Getting one wrong isn’t a bad user experience.
It’s a regulatory liability and, more importantly, a safety risk. We talked about it here: https://lnkd.in/dzV3thby
So we built an evaluation framework, and we’re sharing it for FREE.
A continuous, automated system that scores every AI model we use, on every type of scenario our clients actually encounter, every time we run it.
Here are the internals:
• Real scenarios, not synthetic benchmarks.
We extracted test cases from production — demolition near load-bearing structures, asbestos discovery mid-renovation, electrical work near live services, confined space entry, excavation near gas mains. These are the exact situations that show up on Australian job sites.
• Known correct answers.
Each scenario has a reference output derived from WHS regulations, Safe Work Australia Codes of Practice, and state-specific requirements. Not “generally good safety advice.” The actual regulatory standard.
• Five scoring dimensions.
Completeness (30%) — did it cover all required hazards and controls?
Accuracy (30%) — are risk ratings and reg references correct?
Format (15%) — does it match the structure our application needs?
Domain relevance (15%) — does it cite real Australian standards?
Clarity (10%) — could a site supervisor use this on the ground?
• Dual-judge scoring.
Every output is evaluated by two independent AI judges from different providers, and their scores are averaged. This reduces single-provider bias: no model gets to mark its own homework.
• Reliability testing.
Each scenario runs three times per model. A model that scores 95 once and 60 the next time gets flagged as unreliable. Consistency matters as much as peak performance.
• Continuous monitoring.
This runs on a schedule.
If a model provider pushes an update that degrades our safety output quality, we catch it before it reaches production. If a new model enters the market, we run it through the same suite and compare scores objectively.
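For anyone who wants to see the scoring logic in code, here is a minimal Python sketch of the ideas above. It is an illustration, not our production framework. The weights come from the list above; the function names, the 0 to 100 score scale, and the two thresholds (max_spread and tolerance) are assumptions made up for this example.

```python
from statistics import mean

# Weights from the five scoring dimensions listed above.
# The dictionary keys are illustrative names, not a real schema.
WEIGHTS = {
    "completeness": 0.30,
    "accuracy": 0.30,
    "format": 0.15,
    "domain_relevance": 0.15,
    "clarity": 0.10,
}


def weighted_score(dimension_scores):
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def dual_judge_score(judge_a, judge_b):
    """Average the weighted scores given by two independent judges."""
    return mean([weighted_score(judge_a), weighted_score(judge_b)])


def is_unreliable(run_scores, max_spread=15.0):
    """Flag a model whose repeated runs of one scenario differ too much.

    max_spread is a hypothetical threshold chosen for this sketch.
    """
    return max(run_scores) - min(run_scores) > max_spread


def has_regressed(baseline_score, current_score, tolerance=5.0):
    """Report whether a scheduled run dropped below the stored baseline.

    tolerance is a hypothetical allowance for normal run-to-run noise.
    """
    return baseline_score - current_score > tolerance


if __name__ == "__main__":
    judge_a = {"completeness": 90, "accuracy": 80, "format": 100,
               "domain_relevance": 70, "clarity": 60}
    judge_b = {"completeness": 80, "accuracy": 90, "format": 100,
               "domain_relevance": 90, "clarity": 80}
    print(dual_judge_score(judge_a, judge_b))
    print(is_unreliable([95, 60, 90]))
    print(has_regressed(baseline_score=88.0, current_score=80.0))
```

The reliability check here uses the gap between the best and worst run, which matches the "95 once and 60 the next time" case. A standard deviation would work too if you run each scenario more than three times.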
A demo needs to work once.
A production system needs to work every time, prove it, and catch itself when it doesn’t.
That’s the difference between “we tested it” and “we have a testing system.”
Regulators, insurers, and enterprise clients can tell the difference. Your users’ safety depends on you knowing it too.