
Tech Stack
Before / After
Before
Hope-Based Data
The data looks clean because nobody has shaken it hard enough. Duplicate report IDs and weird received-dates slide into the dashboard smiling.
Nobody can tell if the data is actually fresh. A table can be stale while the page still looks freshly baked.
Failures go quiet. The pipeline can die in the back room while the dashboard keeps serving confident numbers.
The metrics look polished, but there is no receipt. A manager asks where a number came from and everyone turns to a script like it is a fortune teller.
The agent sees more raw adverse-event detail than it should — free-text narrative and unredacted fields. That is not a feature; it is a privacy incident waiting for better lighting.
The team finds out after users complain. There is no real gate, just a well-dressed README and hope.
After
Data With Receipts
Bad Rows Stop At The Door
Duplicates, broken received-dates, missing fields, and drug-name drift are caught and quarantined before they can poison the dashboard or the agent.
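As a rough illustration of the quarantine step, here is a minimal Python sketch. The field names (`report_id`, `received_date`, `drug_name`), the `YYYYMMDD` date format, and the `reason` labels are assumptions made for the example, not the project's actual schema. Drug-name drift is left out because it needs a reference vocabulary.

```python
from datetime import datetime

# Hypothetical required fields; the real contract may differ.
REQUIRED_FIELDS = ("report_id", "received_date", "drug_name")


def parse_received_date(value):
    """Return a date for a YYYYMMDD string, or None if it does not parse."""
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except (TypeError, ValueError):
        return None


def split_rows(rows):
    """Split rows into (clean, quarantined); quarantined rows carry a reason."""
    clean, quarantined, seen_ids = [], [], set()
    for row in rows:
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            quarantined.append({**row, "reason": "missing:" + ",".join(missing)})
        elif parse_received_date(row["received_date"]) is None:
            quarantined.append({**row, "reason": "bad_received_date"})
        elif row["report_id"] in seen_ids:
            quarantined.append({**row, "reason": "duplicate_report_id"})
        else:
            seen_ids.add(row["report_id"])
            clean.append(row)
    return clean, quarantined
```

Keeping the rejected rows with a reason, rather than dropping them, means the quarantine can be audited later.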
Self-Healing: Detect → Recover → Verify
An independent watchdog reads the durable BigQuery run-ledger and detects a stale end-to-end run. It then runs a bounded recovery (re-ingest → quarantine → gates → GE → dbt → freshness) against BigQuery and advances the ledger watermark ONLY after verification passes. Proven: a forced-stale run recovered and verified in ~134s and moved the watermark, while a failed recovery escalates and leaves it untouched.
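The detect, recover, verify loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's watchdog: `Ledger`, `watchdog_tick`, the `recover` and `verify` callables, and the attempt bound are hypothetical names, and the ledger here is an in-memory stand-in for the BigQuery one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class Ledger:
    """In-memory stand-in for the durable run-ledger (hypothetical)."""
    watermark: datetime


def watchdog_tick(
    ledger: Ledger,
    now: datetime,
    max_age: timedelta,
    recover: Callable[[], None],
    verify: Callable[[], bool],
    max_attempts: int = 2,
) -> str:
    """Detect a stale run, attempt bounded recovery, advance only on verify."""
    if now - ledger.watermark <= max_age:
        return "fresh"
    for _ in range(max_attempts):
        try:
            recover()
        except Exception:
            continue  # a failed attempt still counts against the bound
        if verify():
            ledger.watermark = now  # the only place the watermark moves
            return "recovered"
    return "escalated"  # watermark is left untouched
```

The watermark is assigned in exactly one place, after `verify()` returns true, so a failed or unverified recovery cannot make the pipeline look fresh.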
Failures Get Loud Early
Great Expectations, custom checks, and CI make breakage show up before a user trusts the wrong number.
Every Metric Has A Receipt
Cockpit numbers trace back to committed files, BigQuery tables, quality reports, or live API payloads. Not trust-me math.
The Agent Reads The Clean Layer
/api/ask retrieves from the redacted corpus (BM25), answers with [doc N] citations, and refuses when the evidence isn't there, instead of crawling raw adverse-event narratives. Verified: a grounded run cites a real report and an out-of-evidence question is declined.
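To make the retrieve, cite, or refuse behaviour concrete, here is a small self-contained sketch of Okapi BM25 scoring with a refusal threshold. It is not the code behind /api/ask. The whitespace tokenizer, the `min_score` cutoff, the refusal wording, and the 1-based `[doc N]` numbering are all assumptions for the example, and it expects a non-empty corpus.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def answer(query, docs, min_score=0.5, top_k=2):
    """Return cited evidence, or a refusal when nothing scores high enough."""
    ranked = sorted(enumerate(bm25_scores(query, docs)),
                    key=lambda pair: pair[1], reverse=True)
    hits = [(i, s) for i, s in ranked[:top_k] if s >= min_score]
    if not hits:
        return "I can't answer that from the available evidence."
    return " ".join(f"{docs[i]} [doc {i + 1}]" for i, _ in hits)
```

The refusal path is the part that matters for grounding: when no document clears the threshold, the function returns a decline rather than the best of a bad set.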
Features Are Discoverable + Leak-Free (Feast)
A Feast feature view (openfda_drug_features, 4 features) is defined over the openFDA fact. It is discoverable in the registry, with point-in-time-correct historical retrieval and online serving, so the model and any downstream consumer pull the same governed features without future leakage.
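Point-in-time correctness means that, for each training row, only feature values known at or before that row's timestamp may be used. The sketch below shows that rule in plain Python rather than through the Feast API. The function name and the `(timestamp, value)` history layout are assumptions for illustration.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value with timestamp <= as_of, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    # idx counts the entries at or before as_of; none means no value yet.
    return history[idx - 1][1] if idx else None
```

A lookup dated before the first recorded value returns `None` instead of borrowing a later value, which is the leak this rule prevents.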
BigQuery Does More Than Store Boxes
Partitioning and clustering cut scan size, materialized views pre-compute the hot path, and idempotent MERGE proves reruns do not duplicate the mart.
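The idempotence claim rests on MERGE being keyed: a matching key updates the existing row, and only an unseen key inserts a new one. The Python sketch below mimics that semantics on a dictionary to show why a rerun leaves the row count unchanged. It is an analogy, not BigQuery code, and `report_id` as the merge key is an assumption.

```python
def merge_into(mart, batch, key="report_id"):
    """Upsert batch rows into mart by key: update on match, insert otherwise."""
    for row in batch:
        mart[row[key]] = {**mart.get(row[key], {}), **row}
    return mart


if __name__ == "__main__":
    batch = [{"report_id": "A1", "drug_name": "aspirin"},
             {"report_id": "A2", "drug_name": "ibuprofen"}]
    mart = merge_into({}, batch)
    merge_into(mart, batch)  # rerun the same load
    print(len(mart))  # still one row per key
```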
Batch + Event-Driven Ingestion · MEASURED
openFDA's source is batch; a native Pub/Sub→BigQuery path ingests records as an event-driven feed (replay), proving the streaming architecture without a live-stream claim or a money leak.
Governed + Least-Privilege, Proven
Row counts reconcile source → BigQuery on every load; a versioned data contract, audit ledger, column masking, and retention/deletion cover the lifecycle. Least-privilege is proven empirically: a restricted identity reads the reduced 5-column authorized view (300 rows) but is DENIED 403 on the full base table.
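A row-count reconciliation can be as small as the check below. It is a sketch under one assumption that the source does not state: that quarantined rows count as accounted for, so source rows must equal loaded rows plus quarantined rows. The function and parameter names are hypothetical.

```python
def reconcile(source_count, loaded_count, quarantined_count=0):
    """Raise if source rows are not fully accounted for after a load."""
    accounted = loaded_count + quarantined_count
    if accounted != source_count:
        raise ValueError(
            f"row-count mismatch: source={source_count}, accounted={accounted}"
        )
    return True
```

Raising instead of logging makes the mismatch stop the load, so the discrepancy cannot be overlooked.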
Releases Face A Gate
If quality, retrieval, grounding, or cost discipline regresses, the release should fail before a human has to smell smoke.
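One way to express such a gate is a table of thresholds checked against the latest measured metrics, as in the sketch below. The metric names and limits in the usage note are invented for illustration, and the `("min" | "max", limit)` threshold format is an assumption rather than the project's configuration.

```python
def evaluate_gate(metrics, thresholds):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")  # an unmeasured metric fails closed
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures
```

In CI, a wrapper could call `evaluate_gate` with, say, `{"grounding_rate": ("min", 0.9), "cost_usd": ("max", 1.0)}` and exit non-zero when the returned list is not empty. Treating a missing metric as a failure keeps a silently skipped evaluation from passing the gate.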