AI Data · Production

Healthcare AI Data Engineer

L1 data backbone: dbt medallion, FastAPI, Vertex AI enrichment, and a 7-check L1 quality gate over 55,500 synthetic patient encounters.

55,500 encounters
7/7 quality checks pass
$0.0005 Vertex cost/row
40,235 unique patients

Healthcare data backbone in five parts: a dbt medallion (bronze → silver → gold) star schema; an 11-endpoint FastAPI service over 55,500 synthetic encounters; LLM-augmented enrichment via Vertex AI gemini-2.5-flash (497 rows · $0.0005/row · 100% JSON-schema success); a patient identity resolver (55,500 encounters → 40,235 patients); and a 7-check L1 quality gate that runs in CI and exits 1 on any critical failure.

Trusted L1 layer that catches the dumb-but-pipeline-killing failures BEFORE the GenAI layer hallucinates around bad input.
Python · dbt · FastAPI · Pandas · Vertex AI · Gemini 2.5 Flash · Pydantic · pytest · Docker · GCP Cloud Run · GitHub Actions

dbt medallion star schema

Bronze → silver → gold. fact_patient_encounters + 7 dim_*. Full schema.yml with not_null + unique + relationships (FK) + accepted_values for clinical enums.
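As a rough illustration of what the four dbt generic tests named above assert, here is a plain-Python approximation over in-memory rows. It is a sketch, not the project's schema.yml: the column names (encounter_id, patient_key, admission_type), the dim_patient table, and the allowed enum values are hypothetical, and each helper returns the offending values so an empty list means the test passes.

```python
from collections import Counter

# Hypothetical miniature tables; names and values are illustrative assumptions.
ADMISSION_TYPES = {"Emergency", "Elective", "Urgent"}
dim_patient = [{"patient_key": "p1"}, {"patient_key": "p2"}]
fact_patient_encounters = [
    {"encounter_id": "e1", "patient_key": "p1", "admission_type": "Emergency"},
    {"encounter_id": "e2", "patient_key": "p2", "admission_type": "Elective"},
    {"encounter_id": "e3", "patient_key": "p1", "admission_type": "Urgent"},
]


def not_null(rows, column):
    """Rows whose column is missing or None (mirrors dbt's not_null)."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Non-null values that occur more than once (mirrors dbt's unique)."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


def relationships(child_rows, child_col, parent_rows, parent_col):
    """Non-null child values with no matching parent row (FK check)."""
    parents = {r[parent_col] for r in parent_rows}
    return [
        r[child_col]
        for r in child_rows
        if r.get(child_col) is not None and r[child_col] not in parents
    ]


def accepted_values(rows, column, allowed):
    """Non-null values outside the allowed enum set."""
    return [
        r[column]
        for r in rows
        if r.get(column) is not None and r[column] not in allowed
    ]
```

In dbt itself these checks are declared per column in schema.yml and compiled to SQL; the Python above only mirrors their pass/fail semantics.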

7-check L1 quality gate

schema_drift · critical_nulls · duplicate_encounters · temporal_sanity · pii_in_narrative · patient_identity · audit_lineage. Runs in CI on every PR, exits 1 on any critical failure.
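A gate of this shape can be sketched as a list of check functions plus a runner that turns any critical failure into exit code 1. The sketch below implements three of the seven named checks; the row fields (encounter_id, patient_id, admission_date, discharge_date), the CheckResult type, and the run_gate helper are hypothetical names chosen for illustration, not the project's actual code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""
    critical: bool = True  # only critical failures fail the build


def check_critical_nulls(rows):
    bad = [r for r in rows
           if r.get("encounter_id") is None or r.get("patient_id") is None]
    return CheckResult("critical_nulls", not bad, f"{len(bad)} rows with null keys")


def check_duplicate_encounters(rows):
    ids = [r["encounter_id"] for r in rows if r.get("encounter_id") is not None]
    dupes = len(ids) - len(set(ids))
    return CheckResult("duplicate_encounters", dupes == 0, f"{dupes} duplicate ids")


def check_temporal_sanity(rows):
    # A discharge dated before its admission is treated as corrupt input.
    bad = [r for r in rows
           if date.fromisoformat(r["discharge_date"])
           < date.fromisoformat(r["admission_date"])]
    return CheckResult("temporal_sanity", not bad,
                       f"{len(bad)} discharges before admission")


CHECKS = [check_critical_nulls, check_duplicate_encounters, check_temporal_sanity]


def run_gate(rows, checks=CHECKS):
    """Run every check, print a report, and return a process exit code."""
    results = [check(rows) for check in checks]
    for r in results:
        print(f"{'PASS' if r.passed else 'FAIL'} {r.name}: {r.detail}")
    return 1 if any(r.critical and not r.passed for r in results) else 0


good_rows = [
    {"encounter_id": "e1", "patient_id": "p1",
     "admission_date": "2024-01-02", "discharge_date": "2024-01-05"},
    {"encounter_id": "e2", "patient_id": "p2",
     "admission_date": "2024-02-10", "discharge_date": "2024-02-10"},
]
# In a CI entry point: raise SystemExit(run_gate(rows))
```

Returning the exit code from run_gate, rather than exiting inside it, keeps the gate testable under pytest while still letting the CI entry point fail the job.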

Vertex AI enrichment

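gemini-2.5-flash + response_schema → 100% JSON parse success on 497 rows. Ground truth for chief complaint (CC), history of present illness (HPI), vitals, labs, and Emergency Severity Index (ESI) generated for $0.25 total (497 rows × $0.0005/row). At the same per-row rate, 1M rows would cost ≈ $500.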
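The cost figures follow from simple per-row arithmetic, and the "100% JSON parse success" claim can be checked with a small validator on each response. The sketch below uses only the standard library and does not call Vertex AI; the field names (chief_complaint, hpi, vitals, labs, esi) and the helpers are assumptions for illustration, while the 1 to 5 range reflects the five ESI triage levels.

```python
import json

COST_PER_ROW_USD = 0.0005  # per-row Vertex cost reported above

# Hypothetical field names and types for one enriched encounter.
REQUIRED_FIELDS = {
    "chief_complaint": str,
    "hpi": str,
    "vitals": dict,
    "labs": dict,
    "esi": int,
}


def projected_cost(rows, cost_per_row=COST_PER_ROW_USD):
    """Linear cost projection: rows times the observed per-row cost."""
    return rows * cost_per_row


def validate_enrichment(raw_json):
    """Parse one model response; return a list of problems (empty if valid)."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} is not {expected.__name__}")
    esi = record.get("esi")
    if isinstance(esi, int) and not 1 <= esi <= 5:
        problems.append("esi outside 1-5")
    return problems


print(f"sample:  ${projected_cost(497):.4f}")
print(f"1M rows: ${projected_cost(1_000_000):,.0f}")
```

The projection assumes cost stays linear in row count; changes in prompt length, output length, or pricing would shift the per-row rate.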

Patient identity bridge

55,500 encounters → 40,235 unique patients via a SHA-256 short hash. Catches the 'same patient, 12 encounters' pattern that breaks cross-patient leak guards in eval.
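One way to derive such a key is to normalize a few identity fields, hash them with SHA-256, and keep a short prefix of the hex digest. The sketch below is an assumption about the approach, not the project's resolver: the identity fields (name, date of birth, gender), the 12-character length, and both helper names are hypothetical.

```python
import hashlib
from collections import defaultdict


def patient_key(name, dob, gender, length=12):
    """Short, stable patient key from normalized identity fields."""
    # Normalizing case and whitespace makes 'Ana Diaz ' and 'ana diaz' match.
    normalized = "|".join(part.strip().lower() for part in (name, dob, gender))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


def group_by_patient(encounters):
    """Map each patient key to the encounter ids that belong to it."""
    groups = defaultdict(list)
    for enc in encounters:
        key = patient_key(enc["name"], enc["dob"], enc["gender"])
        groups[key].append(enc["encounter_id"])
    return dict(groups)


encounters = [
    {"encounter_id": "e1", "name": "Ana Diaz", "dob": "1980-03-14", "gender": "F"},
    {"encounter_id": "e2", "name": "ana diaz ", "dob": "1980-03-14", "gender": "f"},
    {"encounter_id": "e3", "name": "Raj Patel", "dob": "1975-11-02", "gender": "M"},
]
patients = group_by_patient(encounters)
```

Splitting eval data by this key rather than by encounter keeps all of one patient's encounters on the same side of a train/eval boundary. A shorter prefix raises the collision risk, so the length is a trade-off between readability and safety.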

01 · Raw CSV
55K synthetic encounters → dbt staging (bronze) with PII hashing + type casting.
02 · Enrichment
Stratified 497-row sample → Vertex AI gemini-2.5-flash with response_schema → CC/HPI/vitals/labs/ESI.
03 · Mart
fact_patient_encounters + 7 dim_*. Full FK + accepted_values + unique tests in schema.yml.
04 · Gate + API
7-check L1 quality gate in CI · FastAPI 11 endpoints + OpenAPI docs at /docs · live on Cloud Run.
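The stratified sample in step 02 can be drawn with proportional allocation, so each stratum contributes rows in proportion to its share of the full dataset. This is a minimal standard-library sketch under stated assumptions: the stratification column ("condition" here) is hypothetical, since the source does not say which field the sample was stratified on, and largest-remainder rounding is one reasonable way to hit an exact sample size.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n, seed=0):
    """Draw exactly n rows, allocated to strata in proportion to their size."""
    if n > len(rows):
        raise ValueError("sample size exceeds population")
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    total = len(rows)
    # Integer arithmetic keeps the allocation exact (no float rounding).
    quotas = {name: (n * len(group)) // total for name, group in strata.items()}
    # Hand leftover slots to the strata with the largest remainders.
    by_remainder = sorted(
        strata, key=lambda name: (n * len(strata[name])) % total, reverse=True
    )
    for name in by_remainder[: n - sum(quotas.values())]:
        quotas[name] += 1
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    sample = []
    for name, group in strata.items():
        sample.extend(rng.sample(group, quotas[name]))
    return sample


# Hypothetical population: 50 / 30 / 20 rows across three strata.
rows = (
    [{"id": i, "condition": "A"} for i in range(50)]
    + [{"id": i, "condition": "B"} for i in range(50, 80)]
    + [{"id": i, "condition": "C"} for i in range(80, 100)]
)
sample = stratified_sample(rows, key="condition", n=10)
```

A fixed seed means the same rows are sent for enrichment on every run, which keeps the per-row cost and the downstream ground truth reproducible.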