Maintaining AI accuracy at scale with continuous evaluation
A leading UK hospitality group uses an LLM-powered computer vision system to analyze competitor menu pricing at scale. This system plays a critical role in pricing strategy and commercial decision-making.
Over time, as menu formats and data sources evolved, extraction quality began to degrade. Accuracy dropped, hallucinations increased, and issues were only being caught through manual review. What was once a high-impact AI system was becoming difficult to trust and costly to operate at scale.

The challenge
The company had no reliable way to continuously measure AI performance in production.
As model drift and data changes accumulated, accuracy steadily declined and hallucinations increased, while manual quality checks became a growing bottleneck. Without a way to detect issues early, deploy improvements safely, or reduce reliance on manual quality assurance, the system could not scale reliably.
The solution
Making AI quality measurable, observable, and enforceable in production
We built and implemented a production-grade evaluation and monitoring layer that turns AI quality into a first-class system capability.
The platform continuously measures real-world extraction quality, detects drift and hallucinations, and enforces quality gates before and after deployment. This allows the client to:
- Trust AI outputs in business-critical workflows
- Improve models without risking regressions
- Catch failures early instead of relying on manual review
- Scale AI usage without scaling their Quality Assurance teams
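A quality gate of the kind described above can be sketched as a simple pre-release check: a candidate model's evaluation scores must clear absolute thresholds and must not regress against the currently deployed baseline. The names and thresholds below are illustrative, not the client's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    accuracy: float            # fraction of fields extracted correctly
    hallucination_rate: float  # fraction of outputs with unsupported values

def passes_quality_gate(candidate: EvalReport, baseline: EvalReport,
                        min_accuracy: float = 0.95,
                        max_hallucination: float = 0.02) -> bool:
    """Gate a release: the candidate must clear fixed thresholds
    and must not regress against the deployed baseline."""
    return (candidate.accuracy >= min_accuracy
            and candidate.hallucination_rate <= max_hallucination
            and candidate.accuracy >= baseline.accuracy
            and candidate.hallucination_rate <= baseline.hallucination_rate)

baseline = EvalReport(accuracy=0.96, hallucination_rate=0.015)
candidate = EvalReport(accuracy=0.97, hallucination_rate=0.010)
print(passes_quality_gate(candidate, baseline))  # True
```

Running the same check before and after deployment is what turns "quality" from a manual review step into an enforceable release criterion.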
How we delivered
Built automated evaluation pipelines to measure accuracy and detect hallucinations
Deployed continuous monitoring with alerts for drift and performance drops
Enabled model benchmarking and A/B testing before production release
Automated quality assessment workflows to reduce manual review
Delivered real-time dashboards for operational visibility
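An automated evaluation pipeline of the kind listed above can be sketched as scoring extracted values against a labelled ground-truth set, treating items with no basis in the source as hallucinations. The data shapes and matching rule here are illustrative assumptions, not the client's actual pipeline.

```python
def evaluate_extractions(predictions: dict, ground_truth: dict) -> dict:
    """Score extracted menu prices against labelled ground truth.

    Both arguments map item name -> price. An item predicted but
    absent from the ground truth is counted as a hallucination;
    a price matching within a penny counts as correct.
    """
    correct = hallucinated = 0
    for item, price in predictions.items():
        if item not in ground_truth:
            hallucinated += 1
        elif abs(price - ground_truth[item]) < 0.01:
            correct += 1
    total = len(predictions)
    return {
        "accuracy": correct / total if total else 0.0,
        "hallucination_rate": hallucinated / total if total else 0.0,
    }

truth = {"fish and chips": 14.50, "burger": 12.00, "salad": 9.50}
preds = {"fish and chips": 14.50, "burger": 12.50, "lobster": 35.00}
print(evaluate_extractions(preds, truth))
# one correct item, one wrong price, one hallucinated item
```

In production these per-batch metrics would feed the monitoring and alerting layer rather than being printed.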
Technology stack: LLM-as-judge evaluation models, drift detection algorithms, and an observability platform with real-time dashboards and alerting.
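The drift-detection idea in the stack above can be sketched minimally: compare the mean quality score of a recent window against a healthy reference period and alert when the drop exceeds a tolerance. A real deployment would use a proper statistical test (e.g. a two-sample test or population stability index); the scores and threshold here are illustrative.

```python
from statistics import mean

def detect_drift(reference_scores: list, recent_scores: list,
                 tolerance: float = 0.05) -> tuple:
    """Flag drift when the mean quality score of a recent window
    drops more than `tolerance` below the reference period's mean.
    Returns (drifted, observed_drop)."""
    drop = mean(reference_scores) - mean(recent_scores)
    return drop > tolerance, drop

reference = [0.96, 0.95, 0.97, 0.96]  # scores from a healthy baseline period
recent = [0.90, 0.88, 0.89, 0.91]     # scores from the current window
drifted, drop = detect_drift(reference, recent)
print(drifted)  # True
```

Wiring this check into the monitoring layer is what makes drift an alert rather than a surprise found in manual review.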
Impact
Delivering measurable gains in accuracy, reliability, and operational efficiency
By introducing continuous evaluation in production, the company moved from reactive, manual quality checks to always-on, measurable AI quality control. This enabled faster iteration, greater confidence in AI-driven decisions, and significantly reduced operational overhead, allowing the company to operate and scale its AI systems with quality control built in.
- Hallucination rate in production reduced from approximately 10%, through automated detection and real-time evaluation
- Production accuracy improved from 82%, following evaluation-led model improvements and continuous monitoring
- Manual quality assurance effort reduced by automating evaluation and exception-handling workflows
Ready to safeguard your AI investments?
At G10X, we help organizations implement production-grade AI evaluation frameworks that maintain accuracy, prevent drift, and ensure reliable performance over time.



