GenAIProduction

KPI Evidence Store

Hallucination rate, latency, task success and data quality dashboards for 13 production AI systems.

13 AI Systems Tracked · Hallucination Rate Monitoring · p95 Latency Benchmarks · Audit Trail for Claims

Centralized KPI evidence store tracking production metrics across 13 AI systems: hallucination rates, p50/p95 latency, task success rates, and data quality scores. Designed as a credibility layer — every claim in a resume bullet has a source record here.
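Conceptually, the store reduces to one append-only table of metric records. A minimal DuckDB sketch of that shape (the table and column names here are illustrative, not the repo's actual schema):

```python
# Minimal sketch of the evidence store as a single metric-record table.
# Table and column names are assumptions for illustration.
import duckdb

con = duckdb.connect("evidence.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS kpi_records (
        system_id   VARCHAR,    -- one of the 13 tracked AI systems
        metric      VARCHAR,    -- e.g. 'hallucination_rate', 'latency_ms'
        value       DOUBLE,
        measured_at TIMESTAMP,
        source      VARCHAR     -- span id, eval run, or spot-check batch
    )
""")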

Every metric is sourced. This repo exists so 'I improved latency by 40%' has an audit trail.
Python · DuckDB · Pandas · Evidence.dev · SQL · Markdown · GitHub Actions

Hallucination Tracking

Per-system hallucination rates measured with Ragas plus manual spot checks, trended over time.
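A sketch of how a per-system rate could be rolled up from spot-check counts with Pandas; the column names and weekly grain are assumptions, not the repo's actual layout:

```python
# Hypothetical rollup: hallucination rate per system per week from
# spot-check counts. Column names are illustrative assumptions.
import pandas as pd

checks = pd.DataFrame({
    "system_id":    ["rag-search", "rag-search", "summarizer", "summarizer"],
    "week":         ["2024-W01", "2024-W02", "2024-W01", "2024-W02"],
    "hallucinated": [3, 2, 5, 4],     # responses flagged by Ragas or a reviewer
    "total":        [100, 100, 80, 80],
})

agg = checks.groupby(["system_id", "week"])[["hallucinated", "total"]].sum()
agg["hallucination_rate"] = agg["hallucinated"] / agg["total"]
print(agg["hallucination_rate"])
```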

Latency Benchmarks

p50/p95/p99 latency for each AI system, with before/after measurements for every optimization.
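DuckDB can compute these percentiles directly with `quantile_cont`. A self-contained sketch, where the `latency_spans` table and its columns are hypothetical stand-ins for the store's real tables:

```python
# Sketch of a p50/p95/p99 latency rollup per system in DuckDB.
import duckdb

con = duckdb.connect()  # in-memory for the example
con.execute("""
    CREATE TABLE latency_spans AS
    SELECT 'rag-search' AS system_id, 100 + 10 * range AS latency_ms
    FROM range(100)
""")
print(con.execute("""
    SELECT system_id,
           quantile_cont(latency_ms, 0.50) AS p50,
           quantile_cont(latency_ms, 0.95) AS p95,
           quantile_cont(latency_ms, 0.99) AS p99
    FROM latency_spans
    GROUP BY system_id
""").fetchdf())
```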

Task Success Rates

Agent task completion rates, retry counts, and fallback triggers per system.
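A sketch of what that per-system rollup could look like as a DuckDB query; the `task_runs` schema is an assumption for illustration:

```python
# Hypothetical rollup of agent task outcomes: success rate, mean retries,
# and fallback frequency per system.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE task_runs (system_id VARCHAR, succeeded BOOLEAN,
                            retries INTEGER, fell_back BOOLEAN)
""")
con.execute("""
    INSERT INTO task_runs VALUES
        ('agent-a', TRUE, 0, FALSE), ('agent-a', FALSE, 2, TRUE),
        ('agent-b', TRUE, 1, FALSE), ('agent-b', TRUE, 0, FALSE)
""")
print(con.execute("""
    SELECT system_id,
           avg(CASE WHEN succeeded THEN 1 ELSE 0 END) AS success_rate,
           avg(retries)                               AS avg_retries,
           sum(CASE WHEN fell_back THEN 1 ELSE 0 END) AS fallback_count
    FROM task_runs
    GROUP BY system_id
""").fetchdf())
```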

Data Quality Scores

Great Expectations + custom validators tracking data quality across pipeline stages.
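A minimal sketch using Great Expectations' classic pandas API (GE 0.x); the repo's actual expectation suites and custom validators may differ:

```python
# Sketch: validate a metric batch with Great Expectations' classic
# pandas-dataset API (GE 0.x). Columns and bounds are illustrative.
import great_expectations as ge
import pandas as pd

batch = ge.from_pandas(pd.DataFrame({
    "system_id":  ["rag-search", "summarizer"],
    "latency_ms": [210.0, 480.0],
}))
not_null = batch.expect_column_values_to_not_be_null("latency_ms")
in_range = batch.expect_column_values_to_be_between("latency_ms", 0, 60_000)
print(not_null.success and in_range.success)  # True if the batch passes
```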

01
Capture
OpenTelemetry spans + custom metrics → DuckDB evidence store (sketched below the steps)
02
Validate
Great Expectations assertions run on each metric batch
03
Report
Evidence.dev dashboard renders KPI trends from SQL queries
04
Audit
Each resume bullet links to a specific KPI record in this store
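The capture step (01) can be implemented as a custom OpenTelemetry span exporter that appends span timings to the DuckDB store. A minimal sketch, assuming a hypothetical `spans` table; the real exporter, schema, and metrics are likely richer:

```python
# Sketch: OpenTelemetry spans -> DuckDB evidence store via a custom exporter.
import duckdb
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (SimpleSpanProcessor,
                                            SpanExporter, SpanExportResult)

class DuckDBSpanExporter(SpanExporter):
    def __init__(self, path="evidence.duckdb"):
        self.con = duckdb.connect(path)
        self.con.execute("CREATE TABLE IF NOT EXISTS spans "
                         "(name VARCHAR, duration_ms DOUBLE)")

    def export(self, spans):
        for s in spans:
            ms = (s.end_time - s.start_time) / 1e6  # OTel times are in ns
            self.con.execute("INSERT INTO spans VALUES (?, ?)", [s.name, ms])
        return SpanExportResult.SUCCESS

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(DuckDBSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("kpi-demo")
with tracer.start_as_current_span("llm_call"):
    pass  # the instrumented AI-system call would run here
```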