Invisible and Immortal: The Dead in Clinical-AI Data

Florian Odi Stummer

doi:10.33774/coe-2026-tbpk6

Clinical artificial intelligence (AI) is trained on coded electronic health record (EHR) and claims data. Survivorship and informative-censoring bias are recognised, but usually framed as analytic problems. We ask a prior, structural question: is death, the outcome that matters most, even present in the coded substrate that models learn from? We examined two substrates: MIMIC-IV, a real hospital EHR with ICD-coded diagnoses and administrative death fields; and a Synthea cohort with SNOMED CT coded conditions, in which the generator knows every death exactly (ground truth). Death was essentially never coded as a diagnosis or condition: 0 of 4,506 diagnosis rows in MIMIC-IV and 0 of 118 Synthea concepts encoded death, despite ground-truth mortality of 31% and 31.3%. In MIMIC-IV, 51.6% of deaths were invisible to encounter data, knowable only via an external death-record link. Decedents carried more diagnoses than survivors in real data (18.8 vs 14.9 per admission) but not in synthetic data, locating the survivorship gradient in real-world coding rather than the generative model. Yet the dead are not deleted: in JSON and FHIR exports, decedents' full histories persisted (about 2,980 resources each) and 0 of 104,291 clinical resources carried a death flag, which was confined to one demographic field. The dead are therefore simultaneously invisible to care and immortal in storage, a permanence that documentation-retention duties mandate. A clinical AI trained on coded data has no death feature, learns the dead as if living, and cannot tell.
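As an illustration of the export check described above, the following is a minimal sketch, not the author's pipeline. It assumes a FHIR R4 style bundle in which death is recorded only on the `Patient` resource through a `deceased[x]` element (`deceasedBoolean` or `deceasedDateTime`); the abstract does not name the demographic field, so this is an assumption. The function name `count_death_flags` and the toy bundle are hypothetical.

```python
def count_death_flags(bundle):
    """Count clinical resources and how many carry a deceased[x] element.

    Assumption: death is marked via FHIR's Patient.deceased[x]
    (deceasedBoolean or deceasedDateTime); any resource type other
    than Patient is treated here as a clinical resource.
    """
    counts = {"clinical_total": 0, "clinical_flagged": 0, "patient_flagged": 0}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        # True if any top-level key is a deceased[x] variant.
        has_flag = any(key.startswith("deceased") for key in resource)
        if resource.get("resourceType") == "Patient":
            counts["patient_flagged"] += int(has_flag)
        else:
            counts["clinical_total"] += 1
            counts["clinical_flagged"] += int(has_flag)
    return counts


# Hypothetical toy bundle: one decedent with three clinical resources.
toy_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1",
                      "deceasedDateTime": "2020-01-01T00:00:00Z"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
        {"resource": {"resourceType": "Condition", "id": "c2"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
    ],
}

result = count_death_flags(toy_bundle)
print(result)
```

In this toy case the death marker appears once, on the demographic record, and on none of the clinical resources. That is the pattern the abstract reports at scale: a model that reads only the clinical resources never sees the marker.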
