13 AI Systems Tracked
Hallucination Rate Monitoring
p95 Latency Benchmarks
Audit Trail for Claims
Centralized KPI evidence store tracking production metrics across 13 AI systems: hallucination rates, p50/p95 latency, task success rates, and data quality scores. Designed as a credibility layer — every claim in a resume bullet has a source record here.
Every metric is sourced. This repo exists so 'I improved latency by 40%' has an audit trail.
Tech Stack
Python · DuckDB · Pandas · Evidence.dev · SQL · Markdown · GitHub Actions
Features
Hallucination Tracking
Per-system hallucination rates measured with Ragas + manual spot checks. Trend over time.
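The trend computation can be sketched in plain Python. This is a minimal, stdlib-only illustration, not the repo's actual Ragas pipeline; the record shape and system names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical spot-check records: (system, check_date, hallucinated?)
checks = [
    ("rag-search", date(2024, 5, 1), False),
    ("rag-search", date(2024, 5, 1), True),
    ("rag-search", date(2024, 6, 1), False),
    ("rag-search", date(2024, 6, 1), False),
]

def hallucination_trend(records):
    """Monthly hallucination rate per system: flagged checks / total checks."""
    counts = defaultdict(lambda: [0, 0])  # (system, month) -> [flagged, total]
    for system, day, flagged in records:
        key = (system, day.strftime("%Y-%m"))
        counts[key][0] += int(flagged)
        counts[key][1] += 1
    return {key: flagged / total for key, (flagged, total) in sorted(counts.items())}

print(hallucination_trend(checks))
# → {('rag-search', '2024-05'): 0.5, ('rag-search', '2024-06'): 0.0}
```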
Latency Benchmarks
p50/p95/p99 latency for each AI system. Before/after for every optimization.
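Percentiles like these can be computed from raw latency samples with the stdlib `statistics` module. A small sketch with made-up before/after samples (the numbers are illustrative, not real benchmark data):

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 from raw latency samples via statistics.quantiles."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical samples around one optimization
before = [120, 135, 150, 180, 240, 310, 400, 90, 110, 160] * 10
after = [70, 80, 95, 100, 130, 160, 210, 60, 75, 90] * 10
b, a = latency_percentiles(before), latency_percentiles(after)
print({k: f"-{100 * (b[k] - a[k]) / b[k]:.0f}%" for k in b})  # per-percentile change
```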
Task Success Rates
Agent task completion rates, retry counts, fallback triggers per system.
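These three counters roll up from raw task events. A stdlib-only sketch, with hypothetical system names and a simple event shape (the real store presumably derives these from OpenTelemetry spans):

```python
from collections import Counter

# Hypothetical task events: (system, outcome) where outcome is
# "success", "retry", or "fallback"
events = [
    ("scheduler-agent", "success"),
    ("scheduler-agent", "retry"),
    ("scheduler-agent", "success"),
    ("scheduler-agent", "fallback"),
    ("triage-agent", "success"),
]

def per_system_stats(events):
    """Success rate, retry count, and fallback count per system."""
    tallies = {}
    for system, outcome in events:
        tallies.setdefault(system, Counter())[outcome] += 1
    return {
        system: {
            "success_rate": c["success"] / sum(c.values()),
            "retries": c["retry"],
            "fallbacks": c["fallback"],
        }
        for system, c in tallies.items()
    }

print(per_system_stats(events))
```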
Data Quality Scores
Great Expectations + custom validators tracking data quality across pipeline stages.
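The custom-validator half can be sketched without Great Expectations: rules as named predicates run over a batch, returning pass/fail plus the offending rows. Everything here (field names, rule names, sample rows) is hypothetical.

```python
def validate_batch(rows, rules):
    """Run named column rules over a batch of dict rows; report failures per rule."""
    results = {}
    for name, check in rules.items():
        failures = [i for i, row in enumerate(rows) if not check(row)]
        results[name] = {"passed": not failures, "failing_rows": failures}
    return results

# Hypothetical batch from one pipeline stage
batch = [
    {"system": "rag-search", "latency_ms": 240, "score": 0.91},
    {"system": "rag-search", "latency_ms": None, "score": 1.4},
]

rules = {
    "latency_not_null": lambda r: r["latency_ms"] is not None,
    "score_in_range": lambda r: r["score"] is not None and 0.0 <= r["score"] <= 1.0,
}

print(validate_batch(batch, rules))
```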
How It Works
01 Capture: OpenTelemetry spans + custom metrics → DuckDB evidence store
02 Validate: Great Expectations assertions run on each metric batch
03 Report: Evidence.dev dashboard renders KPI trends from SQL queries
04 Audit: Each resume bullet links to a specific KPI record in this store
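The capture-to-audit loop can be sketched end to end. This uses the stdlib `sqlite3` module as a stand-in for DuckDB (whose Python API is similar: `duckdb.connect(...).execute(...)`); the table schema, claim IDs, and metric values are hypothetical.

```python
import sqlite3

# In-memory stand-in for the DuckDB evidence store
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE kpi_evidence (
        system TEXT, metric TEXT, value REAL, captured_at TEXT, claim_id TEXT
    )
""")

# 01 Capture: a validated metric batch lands in the store
con.executemany(
    "INSERT INTO kpi_evidence VALUES (?, ?, ?, ?, ?)",
    [
        ("rag-search", "p95_latency_ms", 400.0, "2024-05-01", "resume-bullet-7"),
        ("rag-search", "p95_latency_ms", 240.0, "2024-06-01", "resume-bullet-7"),
    ],
)

# 04 Audit: resolve a resume claim back to its source records
rows = con.execute(
    "SELECT captured_at, value FROM kpi_evidence "
    "WHERE claim_id = ? ORDER BY captured_at",
    ("resume-bullet-7",),
).fetchall()
before, after = rows[0][1], rows[-1][1]
print(f"p95 improved {100 * (before - after) / before:.0f}%")
# → p95 improved 40%
```

The design choice is the `claim_id` column: a resume bullet is just a query over the evidence table, so the "I improved latency by 40%" claim resolves to specific dated records.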