Invisible and Immortal: The Dead in Clinical-AI Data

05 June 2026, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Clinical artificial intelligence (AI) is trained on coded electronic health record (EHR) and claims data. Survivorship and informative-censoring bias are recognised, but usually framed as analytic problems. We ask a prior, structural question: is death, the outcome that matters most, even present in the coded substrate that models learn from? We examined two substrates: MIMIC-IV, a real hospital EHR with ICD-coded diagnoses and administrative death fields; and a Synthea cohort with SNOMED CT coded conditions, in which the generator knows every death exactly (ground truth). Death was essentially never coded as a diagnosis or condition: 0 of 4,506 diagnosis rows in MIMIC-IV and 0 of 118 Synthea concepts encoded death, despite ground-truth mortality of 31% and 31.3%, respectively. In MIMIC-IV, 51.6% of deaths were invisible to encounter data and knowable only via an external death-record link. Decedents carried more diagnoses than survivors in real data (18.8 vs 14.9 per admission) but not in synthetic data, which locates the survivorship gradient in real-world coding rather than in the generative model. Yet the dead are not deleted: in JSON and FHIR exports, decedents' full histories persisted (about 2,980 resources each), and 0 of 104,291 clinical resources carried a death flag, which was confined to one demographic field. The dead are therefore simultaneously invisible to care and immortal in storage, a permanence that documentation-retention duties mandate. A clinical AI trained on coded data has no death feature, learns the dead as if living, and cannot tell the difference.
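The FHIR finding above (a death flag confined to one demographic field, absent from clinical resources) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the export is a FHIR Bundle already parsed into a Python dict, treats `Patient.deceasedBoolean` and `Patient.deceasedDateTime` as the demographic death field, and uses a hypothetical heuristic for "death flag on a clinical resource": any top-level key beginning with `deceased`. The function name `death_flag_audit` and the toy bundle are invented for illustration.

```python
from typing import Any


def death_flag_audit(bundle: dict[str, Any]) -> dict[str, int]:
    """Count where death is recorded in a parsed FHIR Bundle.

    Assumption (not from the source): a clinical resource "carries a
    death flag" if any top-level key starts with "deceased".
    """
    patients_flagged = 0   # Patient resources with deceased[x] set
    clinical_total = 0     # every non-Patient resource
    clinical_flagged = 0   # non-Patient resources with a death-like key

    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "Patient":
            # Patient.deceased[x] is either a boolean or a dateTime.
            if resource.get("deceasedDateTime") or resource.get("deceasedBoolean") is True:
                patients_flagged += 1
        else:
            clinical_total += 1
            if any(key.startswith("deceased") for key in resource):
                clinical_flagged += 1

    return {
        "patients_flagged": patients_flagged,
        "clinical_total": clinical_total,
        "clinical_flagged": clinical_flagged,
    }


# Toy bundle (hypothetical data): one decedent with two clinical resources.
toy_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1",
                      "deceasedDateTime": "2020-01-01T00:00:00Z"}},
        {"resource": {"resourceType": "Condition", "id": "c1",
                      "subject": {"reference": "Patient/p1"}}},
        {"resource": {"resourceType": "Encounter", "id": "e1",
                      "subject": {"reference": "Patient/p1"}}},
    ],
}

result = death_flag_audit(toy_bundle)
print(result)
```

In the toy bundle the only trace of death sits on the Patient resource; a model that consumes only the Condition and Encounter resources would see nothing that distinguishes this patient from a survivor.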

Keywords

clinical AI
electronic health records
mortality ascertainment
survivorship bias
data retention
FHIR
SNOMED CT
data quality
ghost records
EHR data quality framework
