Feb 16, 2026
Research
A review of the technologies, methodologies and endpoints of predictive methods for risk stratification in cardiovascular disease
Background and Rationale
The Evolution of Risk Stratification Technologies
Cardiovascular disease (CVD) remains the leading cause of morbidity and mortality globally, driving the need for accurate risk stratification tools to guide primary prevention (Liu et al., 2025). The landscape of risk prediction has evolved into two distinct technological paradigms: conventional statistical modeling and emerging machine learning (ML) systems.
Conventional Statistical Models: The current clinical standard relies on hypothesis-driven risk scores derived from linear regression techniques, primarily Cox proportional hazards models (NHLBI, 2013). Prominent examples include the Framingham Risk Score (FRS), the Pooled Cohort Equations (PCE) used in US guidelines, and the QRISK algorithms utilized in the United Kingdom (Liu et al., 2025). These models estimate 10-year risk based on a finite set of traditional risk factors (e.g., age, systolic blood pressure, lipids, smoking) (NHLBI, 2013). While these scores have been independently externally validated and implemented in clinical settings for years, they often underperform in diverse populations (Liu et al., 2025). Because of their linearity, conventional models may oversimplify complex biological interactions, leading to underestimation or overestimation of risk in specific subpopulations (Liu et al., 2025).
Machine Learning (ML) Approaches: To address the analytic limitations of regression-based modeling, recent technological advancements have integrated ML algorithms with Electronic Health Records (EHRs) (Liu et al., 2025). Unlike conventional models, ML algorithms—such as Neural Networks (NN), Support Vector Machines (SVM), and gradient boosting methods—are capable of modeling non-linear relationships and complex interactions among variables (Liu et al., 2025). By leveraging the granularity of longitudinal patient data found in EHRs, including unstructured text and multi-modal inputs, these systems can facilitate more refined, precision-driven risk estimation (Liu et al., 2025). Systematic reviews indicate that ML models frequently demonstrate superior predictive performance compared to conventional statistical approaches in terms of both discrimination and calibration (Liu et al., 2025).
Methodological Divergence
A core challenge in the field is the disparity in methodological rigor between established and emerging methods.
Performance Metrics
The primary metric used to evaluate both traditional and ML-based models is the C-statistic, or Area Under the Receiver Operating Characteristic Curve (AUC), which measures discriminatory power (Liu et al., 2025; NHLBI, 2013). However, discrimination alone is insufficient for clinical utility; calibration—the agreement between predicted probabilities and observed event rates—is critical (Liu et al., 2025). Metrics such as the calibration chi-squared statistic and calibration slopes are employed to assess this reliability (NHLBI, 2013; Liu et al., 2025). Furthermore, to determine the added value of new models or biomarkers, studies utilize reclassification metrics, specifically the Net Reclassification Improvement (NRI) and the Integrated Discrimination Improvement (IDI) indices (NHLBI, 2013).
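The three metric families above can be made concrete with a short sketch. This is an illustrative, pure-Python implementation (the function and variable names are ours, not drawn from the cited studies); the NRI shown is the standard categorical formulation of events reclassified up plus non-events reclassified down.

```python
def c_statistic(y, p):
    """C-statistic (AUC): probability that a randomly chosen event carries a
    higher predicted risk than a randomly chosen non-event (ties score 0.5)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = sum(1.0 if e > n else 0.5 if e == n else 0.0
                     for e in events for n in nonevents)
    return concordant / (len(events) * len(nonevents))

def calibration_in_the_large(y, p):
    """Crudest calibration check: observed event rate minus mean predicted
    risk (0 = perfect agreement on average)."""
    return sum(y) / len(y) - sum(p) / len(p)

def categorical_nri(y, old_cat, new_cat):
    """Net Reclassification Improvement across ordinal risk categories:
    net fraction of events moved up plus net fraction of non-events moved
    down by the new model relative to the old one."""
    n_events = sum(y)
    n_nonevents = len(y) - n_events
    up_e = dn_e = up_ne = dn_ne = 0
    for yi, old, new in zip(y, old_cat, new_cat):
        if yi == 1:
            up_e += new > old
            dn_e += new < old
        else:
            up_ne += new > old
            dn_ne += new < old
    return (up_e - dn_e) / n_events + (dn_ne - up_ne) / n_nonevents
```

A model can score well on `c_statistic` yet fail `calibration_in_the_large`, which is precisely why the literature reports both families rather than discrimination alone.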
Validation Frameworks
Methodological rigor in validation varies significantly between technologies. Traditional models like the PCE were derived from large, geographically and racially diverse community-based cohorts (e.g., Framingham, ARIC, CARDIA) and validated using techniques such as 10x10 cross-validation and external validation in separate cohorts like MESA and REGARDS (NHLBI, 2013). In contrast, ML models frequently suffer from a lack of independent external validation (Liu et al., 2025). Many ML studies rely on internal validation methods (e.g., data splitting or cross-validation) without testing on external, independent populations, which limits their generalizability and clinical translatability (Liu et al., 2025). The absence of standardized reporting frameworks, such as TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis), further complicates the reproducibility of ML methodologies (Liu et al., 2025).
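The distinction drawn above is structural, and a minimal sketch makes it explicit: internal validation recycles a single cohort, while external validation freezes the model on its derivation cohort and scores it, untouched, on an independent population. The `fit` and `score` callables below are placeholders for any model-fitting and performance routine, not a specific library API.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and partition them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def internal_cv(cohort, fit, score, k=10):
    """Internal validation: train and test repeatedly within ONE cohort."""
    scores = []
    for fold in k_fold_indices(len(cohort), k):
        held_out = set(fold)
        train = [row for i, row in enumerate(cohort) if i not in held_out]
        test = [cohort[i] for i in fold]
        scores.append(score(fit(train), test))
    return sum(scores) / len(scores)

def external_validation(derivation_cohort, independent_cohort, fit, score):
    """External validation: the model is frozen on the derivation cohort
    and evaluated on a separate, independent population."""
    frozen_model = fit(derivation_cohort)
    return score(frozen_model, independent_cohort)
```

The generalizability concern raised above is that `internal_cv` alone, however many folds it uses, cannot reveal how the frozen model behaves on a population with a different case mix; only `external_validation` against an independent cohort can.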
Variability in Endpoints
The definition of the primary endpoint is a fundamental variable in the design and utility of risk prediction technologies.
Hard vs. Soft Endpoints: The NHLBI (2013) guidelines advocate for "Hard ASCVD" (fatal coronary heart disease, non-fatal MI, fatal/non-fatal stroke), a definition that intentionally excludes "softer" endpoints such as angina, transient ischemic attack (TIA), and revascularization to minimize variability caused by subjective clinical decision-making and physician recommendation biases (NHLBI, 2013). However, many legacy scores and newer ML models utilize broad composite outcomes or "Total CVD," which may include softer endpoints like angina, revascularization, or heart failure (Liu et al., 2025).
Temporal Differences: While conventional scores typically standardize on a 10-year risk horizon, ML and EHR-based methods often predict over shorter, variable timeframes (e.g., 1-year or 5-year risk) due to data censoring issues in electronic records (Liu et al., 2025).
Outcome Breadth: Older risk scores and some modern ML applications utilize broader definitions of "Total CVD," which may include angina pectoris, heart failure, and peripheral artery disease (NHLBI, 2013). Recent reviews of ML applications highlight a trend toward predicting specific long-term outcomes, including incident heart failure and stroke, or composite Major Adverse Cardiovascular Events (MACE) (Liu et al., 2025). However, heterogeneity in how these CVD outcomes are defined across studies remains a significant challenge for comparing the efficacy of emerging technologies (Liu et al., 2025).
Study Aim and Objectives
Primary Aim
The primary aim of this review is to provide a narrative synthesis of the technological foundations, methodological rigor, and endpoint definitions of predictive methods—ranging from conventional risk scores to advanced AI/ML algorithms—used for cardiovascular disease risk stratification.
Specific Objectives
To achieve this aim, the review will address the following specific objectives:
To Compare Predictive Performance: To qualitatively compare the predictive performance trends (C-statistic/AUC) and calibration metrics of AI/ML algorithms (e.g., Neural Networks, Random Forests, Gradient Boosting) against established conventional risk scores (e.g., Framingham Risk Score, Pooled Cohort Equations, QRISK) in adult populations without prior CVD.
To Evaluate Data Modalities: Assess the utility and incremental value of diverse data sources - specifically Electronic Health Records (EHRs), General Practice (GP) data, and longitudinal wearable technology data - in enhancing risk prediction accuracy compared to traditional risk factor assessment, and examine how these data sources shape the design and performance of risk prediction tools.
To Critique Methodological Rigor: Analyze the validation strategies employed in current AI/ML studies, specifically identifying the prevalence of independent external validation versus internal validation (e.g., cross-validation), and evaluating adherence to reporting frameworks such as TRIPOD.
To Analyze Endpoint Heterogeneity: Investigate the variation in primary outcome definitions across studies, distinguishing between "Hard ASCVD" endpoints (fatal/non-fatal MI and stroke) and broader "Total CVD" or composite MACE endpoints, and determine how this heterogeneity affects model comparability.
To Assess Generalizability and Equity: Evaluate the performance of AI/ML models across diverse demographic subgroups, specifically looking for evidence of improved risk estimation in populations historically underserved or misclassified by conventional linear models (e.g., specific racial/ethnic groups, younger populations, or women).
Methodology
3.1 Study Design
This study is designed as a narrative review driven by a systematic search strategy. It will adhere to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for the identification, screening, and selection of studies. Due to the anticipated heterogeneity in model architectures (e.g., regression scores vs. deep neural networks) and outcome definitions, a quantitative meta-analysis is not feasible. Instead, a narrative synthesis will be employed to critically appraise the methodological landscape.
3.2 Search Strategy
A systematic search will be conducted across four major databases: MEDLINE (PubMed), Embase, Web of Science, and the Cochrane Library. The search strategy combines controlled vocabulary (MeSH/Emtree) and free-text terms covering four core concepts:
Cardiovascular Disease (Target Condition)
Predictive Technologies (Conventional Risk Scores & AI/ML Models)
Data Sources (EHR, GP Data, Cohorts, Wearables)
Performance and Validation (AUC, Calibration, External Validation)
(See Section 4 for full Search Strings)
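Mechanically, the full string in Section 4 is one parenthesised OR-block per concept, with the blocks joined by AND. A small sketch of that construction, with deliberately truncated term lists for illustration (the full lists appear in Section 4):

```python
def or_block(terms):
    """OR-join already-tagged PubMed terms into one parenthesised block."""
    return "(" + " OR ".join(terms) + ")"

def build_query(*concept_blocks):
    """AND-join one OR-block per concept, mirroring the Section 4 string."""
    return " AND ".join(or_block(block) for block in concept_blocks)

# Truncated example term lists, one per core concept.
condition = ['"Cardiovascular Diseases"[Mesh]', '"ASCVD"[Title/Abstract]']
technology = ['"Machine Learning"[Mesh]', '"QRISK"[Title/Abstract]']
sources = ['"Electronic Health Records"[Mesh]', '"UK Biobank"[Title/Abstract]']

query = build_query(condition, technology, sources)
```

Keeping the concept blocks as separate term lists makes it straightforward to translate the same strategy into Embase (Emtree) or Web of Science syntax by swapping only the field tags.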
3.3 Selection Criteria
Studies will be selected based on the following inclusion and exclusion criteria:
| Criterion | Inclusion | Exclusion |
| --- | --- | --- |
| Population | Adults (>18 years) generally free of baseline CVD (Primary Prevention). | Acute inpatient settings; secondary prevention; pediatric populations. |
| Technology | 1. Conventional: Cox models, logistic regression, established scores (Framingham, QRISK, PCE). 2. AI/ML: Neural Networks, Random Forests, Gradient Boosting, SVM. | Descriptive epidemiology without prediction; single-biomarker association studies. |
| Outcome | CVD events: Hard ASCVD, MACE, or specific subtypes (Stroke, HF, MI). | "Soft" endpoints alone; all-cause mortality without CVD-specific data. |
| Study Type | Observational cohort studies, validation studies, and randomized trials reporting performance metrics (AUC, Calibration). | Editorials, commentaries, conference abstracts without full text, reviews. |
3.4 Data Extraction and Quality Assessment
Data will be extracted using a standardized form capturing:
Model Characteristics: Type of algorithm (Linear vs. ML), input variables, and data source (EHR vs. Cohort).
Methodology: Validation method (Internal split vs. External validation), sample size, and missing data handling.
Outcomes: Definition of the primary endpoint (e.g., ICD codes used) and time horizon (10-year vs. other).
Performance: Discrimination (C-statistic/AUC) and Calibration measures.
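As a sketch, the standardized extraction form above could be represented as a typed record per included study; the field names below are ours, chosen only to mirror the four bullet groups, and would be refined during piloting.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExtractionRecord:
    """One row of the standardized extraction form (illustrative schema)."""
    study_id: str
    algorithm_type: str                        # "linear" vs "ML"
    data_source: str                           # "EHR" vs "cohort"
    input_variables: List[str] = field(default_factory=list)
    validation_method: str = "internal_split"  # or "external"
    sample_size: int = 0
    missing_data_handling: str = ""
    endpoint_definition: str = ""              # e.g. ICD codes used
    time_horizon_years: float = 10.0           # 10-year vs other
    c_statistic: Optional[float] = None
    calibration_measure: Optional[str] = None
```

Typing the form this way makes downstream tabulation trivial (each record is one row) and forces every study to be coded against the same fields, which supports the PROBAST assessment described below.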
Risk of Bias Assessment: Quality and risk of bias will be assessed using the PROBAST (Prediction model Risk Of Bias ASsessment Tool) framework, which is specifically designed for diagnostic and prognostic prediction models.
3.5 Data Synthesis (Narrative Approach)
A narrative synthesis will be conducted to map the evidence, structured around three thematic pillars:
Technological Performance: A qualitative comparison of discrimination and calibration between conventional scores and ML models.
Methodological Rigor: An analysis of validation practices, focusing on the prevalence of external validation and reporting standards (TRIPOD adherence).
Endpoint Heterogeneity: A taxonomy of outcome definitions used across studies and their impact on reported model efficacy.
Search Strategy and Methodology
("Cardiovascular Diseases"[Mesh] OR "cardiovascular disease*"[Title/Abstract] OR "heart disease*"[Title/Abstract] OR "coronary disease*"[Title/Abstract] OR "stroke"[Title/Abstract] OR "heart failure"[Title/Abstract] OR "myocardial infarction"[Title/Abstract] OR "ASCVD"[Title/Abstract]) AND ("Machine Learning"[Mesh] OR "Artificial Intelligence"[Mesh] OR "Proportional Hazards Models"[Mesh] OR "Logistic Models"[Mesh] OR "Risk Assessment"[Mesh] OR "machine learning"[Title/Abstract] OR "artificial intelligence"[Title/Abstract] OR "deep learning"[Title/Abstract] OR "neural network*"[Title/Abstract] OR "random forest"[Title/Abstract] OR "gradient boosting"[Title/Abstract] OR "logistic regression"[Title/Abstract] OR "Cox regression"[Title/Abstract] OR "risk score"[Title/Abstract] OR "risk prediction"[Title/Abstract] OR "prediction model*"[Title/Abstract] OR "Framingham"[Title/Abstract] OR "QRISK"[Title/Abstract] OR "Pooled Cohort Equations"[Title/Abstract]) AND ("Electronic Health Records"[Mesh] OR "Medical Records Systems, Computerized"[Mesh] OR "General Practice"[Mesh] OR "Epidemiologic Studies"[Mesh] OR "Wearable Electronic Devices"[Mesh] OR "electronic health record*"[Title/Abstract] OR "EHR"[Title/Abstract] OR "EMR"[Title/Abstract] OR "general practice"[Title/Abstract] OR "primary care"[Title/Abstract] OR "wearable*"[Title/Abstract] OR "CPRD"[Title/Abstract] OR "UK Biobank"[Title/Abstract] OR "cohort study"[Title/Abstract]) AND ("Area Under Curve"[Mesh] OR "Reproducibility of Results"[Mesh] OR "calibration"[Title/Abstract] OR "discrimination"[Title/Abstract] OR "C-statistic"[Title/Abstract] OR "AUC"[Title/Abstract] OR "validation"[Title/Abstract])
Title screening results (PRISMA-ready)
Titles screened: 1,210
Included for abstract screening: 513
Excluded at title stage: 697
| Exclusion reason | n |
| --- | --- |
| Not clearly CVD risk prediction/stratification | 394 |
| Outcome not CVD (arrhythmia/ECG) | 87 |
| Diagnostic detection/classification (not risk prediction) | 44 |
| Outcome not CVD (hypertension only) | 44 |
| Study type: review/editorial/guideline | 39 |
| Outcome not CVD | 35 |
Results
Performance and Calibration of Conventional Risk Algorithms
The landscape of cardiovascular disease (CVD) risk stratification is currently defined by the transition from established regression-based algorithms to novel equations derived from broader, more contemporary cohorts. The Pooled Cohort Equations (PCE) and Framingham Risk Score remain widely utilized benchmarks; however, their performance in modern populations is variable. In a large community-based cohort in Olmsted County, Minnesota, the PCE demonstrated robust discrimination (C-statistic 0.78) and was unaffected by statin initiation during follow-up. Conversely, evidence suggests significant heterogeneity in performance across different healthcare settings. In an analysis of the UK Biobank, the recently introduced American Heart Association PREVENT equations demonstrated superior calibration compared to the PCE, which substantially overestimated 10-year atherosclerotic cardiovascular disease (ASCVD) risk (mean predicted risk 12.18% vs observed 5.23% in men). However, this calibration advantage was not universal; across four major US integrated healthcare systems, PREVENT underestimated risk in specific subpopulations, including Black adults and those with diabetes, while the PCE consistently overestimated risk.
In the United Kingdom, the QRISK algorithms have evolved to address these limitations. While QRISK2 showed improved discrimination over the Framingham equation in the QRESEARCH database and independent cohorts, recent external validations of QRISK3 have highlighted significant calibration drift. In the UK Biobank, QRISK3 demonstrated moderate discrimination (C-statistic ~0.72) but systematically over-predicted CVD risk, particularly in older adults, by as much as 20%. Furthermore, when accounting for competing mortality risks in older multimorbid populations, QRISK3 performance deteriorated significantly, leading to overestimation of risk.
Similarly, in a Dutch primary care setting, the European SCORE2 model was found to underestimate the 10-year risk of cardiovascular events (observed 10.1% vs predicted 6.2%), potentially leaving 35% of high-risk patients untreated. Specific populations remain inadequately served by general population models; for example, the standard ACC/AHA PCE underestimated risk in women veterans, prompting the development of a tailored ‘VA Women CVD Risk Score’ which improved discrimination (C-statistic 0.70) compared to the standard model (C-statistic 0.61).
Efficacy of Machine Learning compared with Statistical Models
The application of machine learning (ML) to electronic health records (EHR) has yielded mixed results regarding the superiority of algorithmic prediction over traditional statistical methods. Deep learning approaches capable of modeling longitudinal patient trajectories have shown promise. The ‘BEHRT’ transformer model, for instance, outperformed QRISK3 and Framingham models in predicting heart failure (HF), stroke, and coronary heart disease, demonstrating greater resilience to temporal data shifts. Similarly, the ‘AutoPrognosis’ automated ML framework achieved a significantly higher AUC (0.774) than Framingham scores (0.724) in the UK Biobank by identifying non-linear interactions between variables. However, gains in discrimination are often marginal when compared to well-tuned statistical models. In the prediction of incident myocardial infarction, a deep neural network (DNN) achieved an AUC of 0.835, which was only incrementally superior to a regularized logistic regression (AUC 0.829), and both models suffered from poor calibration due to the rarity of the outcome. This finding was echoed in heart failure prediction, where Extreme Gradient Boosting (XGBoost) and Random Survival Forests performed similarly to traditional Cox proportional hazards models (C-indices ~0.79–0.83), although ML methods were able to identify novel high-ranking predictors such as spirometry measures. The generalizability of these advanced models remains a critical barrier; survival neural networks trained in UK cohorts significantly underpredicted risk when applied to Chinese populations, necessitating the development of novel population-based recalibration methods to restore accuracy.
Incremental Value of Novel Digital and Biological Modalities
A significant body of evidence supports the integration of non-traditional data sources—ranging from digital sensors to multi-omics—into risk prediction frameworks.
Digital Phenotyping (AI-ECG and PPG): Artificial intelligence applied to standard 12-lead electrocardiograms (AI-ECG) has emerged as a potent digital biomarker. A deep learning model designed to detect left ventricular systolic dysfunction from ECG images successfully predicted future incident heart failure with hazard ratios ranging from 3.88 to 23.5 across diverse multinational cohorts, independent of traditional clinical risk factors. Furthermore, a novel ‘sex discordance score’—quantifying the difference between a patient’s biological sex and their AI-predicted sex from ECG morphology—identified females with a ‘male-like’ ECG phenotype who were at disproportionately higher risk of cardiovascular mortality. Beyond clinical settings, deep learning applied to photoplethysmography (PPG) signals collected via pulse oximeters was non-inferior to office-based risk scores requiring physical measurements, suggesting a viable pathway for scalable screening in low-resource environments.
Genomics and Multi-omics:
The utility of polygenic risk scores (PRS) has been enhanced through multi-ancestry integration. The ‘GPSMult’ score significantly improved risk discrimination over clinical factors alone, particularly for individuals of South Asian ancestry, and reclassified 7% of the population across decision thresholds. Integrating PRS with a ‘polysocial score’ (capturing social determinants of health) further improved net benefit and reclassification for coronary heart disease prediction compared to clinical calculators alone. In the domain of proteomics, a data-driven selection of 222 protein biomarkers improved the prediction of major adverse cardiovascular events (MACE) and dementia beyond the PREVENT risk score. Similarly, adding specific metabolomic biomarkers (e.g., lipid and amino acid clusters) to the SCORE2 and PCP-HF models yielded statistically significant improvements in C-statistics for cardiovascular risk and incident heart failure. Conversely, Apolipoprotein B (ApoB), while showing a dose-response relationship with risk, offered limited standalone predictive utility and did not significantly improve the discrimination of the SCORE2 model, although it aided in identifying low-risk individuals.
Discussion
This review highlights a paradigm shift in cardiovascular risk stratification, moving from static, regression-based equations toward dynamic, multimodal, and algorithmic prediction tools. The evidence suggests that while machine learning and novel biomarkers offer tangible improvements in discrimination, they concurrently introduce complex challenges regarding calibration, interpretability, and generalizability.
The Discrimination-Calibration Trade-off
A recurring theme across the literature is that while ML algorithms consistently yield higher C-statistics than linear models, they frequently fail to solve—and occasionally exacerbate—issues of calibration. As observed in the prediction of incident myocardial infarction, sophisticated deep learning models can be poorly calibrated in rare-event scenarios. In contrast, the PREVENT equations, which utilize modern statistical methods on traditional data, demonstrated superior calibration in the UK Biobank compared to the older PCE. This dissociation between discrimination and calibration implies that for population-level resource allocation (e.g., determining statin eligibility), well-calibrated statistical models may retain utility over ‘black-box’ ML models unless the latter undergo rigorous local recalibration.
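The dissociation is easy to demonstrate: discrimination depends only on the rank ordering of predictions, so any monotone distortion of a model's outputs leaves its C-statistic untouched while wrecking calibration. A toy sketch with invented values:

```python
def auc(y, p):
    """Pairwise C-statistic: fraction of event/non-event pairs ranked correctly."""
    ev = [pi for yi, pi in zip(y, p) if yi == 1]
    ne = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(1.0 if e > n else 0.5 if e == n else 0.0
               for e in ev for n in ne) / (len(ev) * len(ne))

y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]                # 30% observed event rate
p_calibrated = [0.9, 0.35, 0.5, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]
p_inflated = [pi ** 0.25 for pi in p_calibrated]  # monotone distortion

assert auc(y, p_calibrated) == auc(y, p_inflated)  # identical discrimination
mean_pred = sum(p_inflated) / len(p_inflated)      # but mean predicted risk
assert mean_pred > 2 * (sum(y) / len(y))           # now grossly overestimates
```

Nothing in the AUC changed, yet every decision threshold (for example, a 7.5% statin-eligibility cut-off) would now fire on essentially the whole cohort; this is the recalibration problem that ‘black-box’ models must solve before population-level deployment.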
The Emergence of Digital Phenotyping
The capacity of AI to extract prognostic signals from raw data sources represents a significant advancement. The ability of AI-ECG to predict future heart failure and of PPG to approximate office-based risk scores facilitates opportunistic screening. Unlike traditional scores requiring ordered laboratories and physical measurements, these digital biomarkers can be derived from data already resident in the electronic medical record or collected via ubiquitous consumer devices. This establishes a precedent for ‘passive’ risk stratification, potentially reducing barriers to entry for preventive care.
Redefining Risk Architectures
The integration of Polygenic Risk Scores (PRS) and Polysocial Scores (PSS) indicates a move towards precision medicine that acknowledges the additive nature of genetic and environmental risk. Importantly, the development of multi-ancestry scores like GPSMult addresses historical Eurocentric biases in genetic risk assessment, offering significant improvements for non-European populations. This is paralleled by the specific validation of risk scores in women veterans and the identification of sex-specific ECG risk markers, underscoring the necessity of risk models that account for biological sex and social context.
Limitations
The current literature remains heavily skewed toward data-rich environments, such as the UK Biobank and US academic health systems. The performance of these advanced models in fragmented or low-resource healthcare systems remains largely characterized by proxy. Furthermore, the ‘black box’ nature of deep learning algorithms raises interpretability concerns; while feature importance analyses provide some insight, the biological rationale for why an AI model predicts mortality from a normal-appearing ECG remains partially opaque.
Conclusion
The field of cardiovascular risk stratification is rapidly diversifying. While conventional statistical models such as the Pooled Cohort Equations and QRISK3 remain the clinical standard, they are increasingly challenged by evidence of miscalibration in modern cohorts and an inability to capture complex, non-linear risk determinants. Machine learning offers a solution to the discrimination ceiling, consistently identifying high-risk individuals missed by standard scores. However, the most significant advances appear to lie in the integration of novel data modalities—specifically AI-analyzed ECGs, polygenic risk scores, and passive digital sensing—which offer independent prognostic value on top of traditional clinical factors. Future implementation strategies must prioritize the rigorous recalibration of these algorithmic tools to local populations and the seamless integration of digital biomarkers into routine clinical workflows to transition from risk prediction to effective risk prevention.
