Healthcare AI Data Engineer

The part of healthcare AI nobody sees until it breaks — a trusted BigQuery backbone where bad rows stop early, every number has a receipt, and the agent reads the clean layer.

🗄️ 300
Real openFDA FAERS reports
💊 156
Drugs tracked
🚨 68.3%
Serious adverse events
🔖 [doc N]
Grounded citations + refusal
Python · dbt · BigQuery · Great Expectations · Feast · Cloud Scheduler · Cloud Monitoring · FastAPI · Vertex AI · Gemini 2.5 Flash · BM25 · Pandas · pytest · Docker · GCP Cloud Run · GitHub Actions
Before

Hope-Based Data
openFDA pull → Cleanup scripts → Tables look fine → Dashboard + AI eat the mess
× The data looks clean because nobody has shaken it hard enough. Duplicate report IDs and weird received-dates slide into the dashboard smiling.

× Nobody can tell if the data is actually fresh. A table can be stale while the page still looks freshly baked.

× Failures go quiet. The pipeline can die in the back room while the dashboard keeps serving confident numbers.

× The metrics look polished, but there is no receipt. A manager asks where a number came from and everyone turns to a script like it is a fortune teller.

× The agent sees more raw adverse-event detail than it should: free-text narrative and unredacted fields. That is not a feature; it is a privacy incident waiting for better lighting.

× The team finds out after users complain. There is no real gate, just a well-dressed README and hope.

After

Data With Receipts
BigQuery load → dbt medallion → Quality gates → Grounded answer

Bad Rows Stop At The Door
Duplicates, broken received-dates, missing fields, and drug-name drift are caught and quarantined before they can poison the dashboard or the agent.
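
A minimal sketch of what such a door could look like, not the project's actual gate. The column names (`report_id`, `received_date`, `drug_name`), the `YYYYMMDD` date format, and the reason labels are assumptions made for illustration; name drift is handled here only by normalising case and whitespace.

```python
from datetime import date, datetime

# Hypothetical column names; the real schema is not shown in the source.
REQUIRED_FIELDS = ("report_id", "received_date", "drug_name")


def parse_received_date(value):
    """Return a date for a YYYYMMDD string, or None if it cannot be parsed."""
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except (TypeError, ValueError):
        return None


def split_rows(rows, today):
    """Split rows into (clean, quarantined); quarantined rows carry a reason."""
    clean, quarantined, seen_ids = [], [], set()
    for row in rows:
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        received = parse_received_date(row.get("received_date"))
        if missing:
            reason = "missing:" + ",".join(missing)
        elif received is None or received > today:
            reason = "bad_received_date"
        elif row["report_id"] in seen_ids:
            reason = "duplicate_report_id"
        else:
            reason = None
        if reason:
            quarantined.append({**row, "reason": reason})
            continue
        seen_ids.add(row["report_id"])
        # Collapse whitespace and upper-case so one drug is not counted as several.
        normalised = " ".join(row["drug_name"].split()).upper()
        clean.append({**row, "drug_name": normalised})
    return clean, quarantined
```

Keeping the rejected rows with a reason, rather than dropping them, is what lets a later count of quarantined rows serve as a receipt.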

Self-Healing: Detect → Recover → Verify
An independent watchdog reads the durable BigQuery run ledger and detects a stale end-to-end run. It then runs a bounded recovery (re-ingest → quarantine → gates → Great Expectations → dbt → freshness) against BigQuery and advances the ledger watermark only after verification passes. Proven: a forced-stale run recovered and verified in ~134s and moved the watermark, while a failed recovery escalates and leaves the watermark untouched.
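
The detect → recover → verify rule can be sketched as below. This is an illustration under assumptions, not the project's watchdog: the ledger is a plain dict standing in for the BigQuery run ledger, `recover` and `verify` are hypothetical callables, and a single attempt stands in for the bounded recovery.

```python
from datetime import datetime, timedelta


def watchdog_tick(ledger, now, max_age, recover, verify):
    """Detect a stale run, try one recovery, and move the watermark only on success."""
    if now - ledger["watermark"] <= max_age:
        return "fresh"
    try:
        recover()      # hypothetical: re-ingest, quarantine, gates, dbt, freshness
        ok = verify()  # hypothetical: re-check the recovered run end to end
    except Exception:
        ok = False
    if not ok:
        # A failed recovery escalates and leaves the watermark untouched.
        ledger["escalated"] = True
        return "escalated"
    ledger["watermark"] = now
    return "recovered"
```

The point of the ordering is that the watermark is written last, so a crash or a failed check can never make a stale run look fresh.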

Failures Get Loud Early
Great Expectations, custom checks, and CI make breakage show up before a user trusts the wrong number.

Every Metric Has A Receipt
Cockpit numbers trace back to committed files, BigQuery tables, quality reports, or live API payloads. Not trust-me math.

The Agent Reads The Clean Layer
/api/ask retrieves from the redacted corpus (BM25) instead of crawling raw adverse-event narratives, answers with [doc N] citations, and refuses when the evidence isn't there. Verified: a grounded run cites a real report, and an out-of-evidence question is declined.
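
A small pure-Python sketch of BM25 retrieval with a refusal rule, to show the shape of the idea. It covers only retrieval and the decision to refuse, not answer generation; the `min_score` threshold, `top_k`, and the function names are assumptions, and the real endpoint may score and gate differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the standard BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_or_refuse(query, docs, min_score=0.5, top_k=2):
    """Return [doc N] citations for strong matches, or refuse when none qualify."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    hits = [i for i in ranked[:top_k] if scores[i] >= min_score]
    if not hits:
        return {"refused": True, "citations": []}
    return {"refused": False, "citations": [f"[doc {i + 1}]" for i in hits]}
```

Only the documents that clear the threshold would be passed on to the model, so an out-of-evidence question never reaches the generation step with an empty context.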

Features Are Discoverable + Leak-Free (Feast)
A Feast feature view (openfda_drug_features, 4 features) over the openFDA fact table is discoverable in the registry and supports point-in-time-correct historical retrieval and online serving, so the model and any downstream consumer pull the same governed features without future leakage.
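
Point-in-time correctness boils down to one rule: for a given event time, use the latest feature value recorded at or before that time, never a later one. The sketch below illustrates that rule in plain Python; it is not Feast's API, and the feature history shown is made up for the example.

```python
from bisect import bisect_right
from datetime import date


def point_in_time_value(history, as_of):
    """Return the latest value whose timestamp is <= as_of, or None if there is none.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical history of a per-drug report count, for illustration only.
REPORT_COUNT_HISTORY = [
    (date(2024, 1, 1), 3),
    (date(2024, 2, 1), 7),
    (date(2024, 3, 1), 12),
]
```

A training row dated mid-February would see the February value, not the March one, which is what keeps future information out of the features.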

BigQuery Does More Than Store Boxes
Partitioning and clustering cut scan size, materialized views pre-compute the hot path, and idempotent MERGE proves reruns do not duplicate the mart.
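
The idempotency claim can be illustrated without BigQuery. The sketch below mimics the matched-then-update, not-matched-then-insert behaviour of a MERGE with a dict keyed on `report_id`; the key name and the in-memory table are assumptions, and the real mart is merged in SQL.

```python
def merge_into(target, batch, key="report_id"):
    """Upsert batch rows into target (a dict keyed by `key`).

    Matched keys are updated in place and unmatched keys are inserted,
    so applying the same batch twice leaves the same table.
    """
    for row in batch:
        target[row[key]] = {**target.get(row[key], {}), **row}
    return target
```

Because every row lands on its key, a rerun overwrites rather than appends, which is the property a duplicate-free mart depends on.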

Batch + Event-Driven Ingestion · MEASURED
openFDA's source is batch; a native Pub/Sub→BigQuery path ingests records as an event-driven feed (replay), proving the streaming architecture without a live-stream claim or a money leak.

Governed + Least-Privilege, Proven
Row counts reconcile source → BigQuery on every load; a versioned data contract, audit ledger, column masking, and retention/deletion cover the lifecycle. Least privilege is proven empirically: a restricted identity can read the reduced 5-column authorized view (300 rows) but is denied with a 403 on the full base table.

Releases Face A Gate
If quality, retrieval, grounding, or cost discipline regresses, the release should fail before a human has to smell smoke.
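
One way such a gate could be expressed is a list of thresholds checked before release, sketched below. The metric names and limits in the usage example are hypothetical and chosen only to show the mechanism; the source does not state the real thresholds.

```python
def release_gate(metrics, thresholds):
    """Return the failed checks; an empty list means the release may proceed.

    `thresholds` maps a metric name to ("min", limit) or ("max", limit).
    A metric that was not reported counts as a failure.
    """
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures
```

In CI, a non-empty result would translate into a non-zero exit code, so the regression blocks the release instead of waiting for someone to notice.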