Understanding the Impact of Socioeconomic Status on Premature Deaths by Race

Authors

Saima Rahman - Colby College

Maximus Liu - University of Texas at Austin

Meris McElveen - University of North Carolina at Chapel Hill

Published

July 25, 2025


Introduction

Premature death, defined as death occurring before age 75, represents one of the most pressing public health challenges in the United States, with profound disparities across racial and geographic lines. A 2017 study by the Centers for Disease Control and Prevention (CDC) reported that, in the early 2000s, annual age-adjusted death rates for the five leading causes of death in the US (heart disease, cancer, unintentional injury, chronic lower respiratory disease, and stroke) were higher in rural areas than in urban areas, and emphasized the significance of the growth of that gap (Garcia et al. 2017). Data from 2024 show that Black, Hispanic, and American Indian and Alaska Native (AIAN) people face worse outcomes than white people across most examined measures of health and social determinants of health, including infant, pregnancy-related, diabetes, and cancer mortality (Ndugga, Hill, and Artiga 2025).

The relationship between socioeconomic factors and premature mortality is particularly pronounced in rural communities, with increased rates of premature death in rural counties with higher unemployment rates. Conversely, lower premature mortality rates in counties with higher median incomes and more primary care physicians per capita demonstrate the critical role of socioeconomic and social determinants of health (SDOH). These disparities are further aggravated by race and ethnicity, as research has shown that rural U.S. counties with a majority of non-Hispanic Black and AIAN residents had up to twice the rates of premature death in comparison to rural counties with a majority of non-Hispanic white residents.

Understanding these disparities is crucial for several reasons. First, premature deaths represent individual tragedies as well as major economic losses to communities and healthcare systems. Second, many of these deaths may be preventable through targeted interventions addressing social determinants of health. Third, the high concentration of premature deaths in specific geographic areas and racial groups suggests that evidence-based interventions could yield substantial public health benefits. By focusing on the intersection of geography, race, and socioeconomic status, this research aims to identify specific factors that contribute to health disparities and provide actionable recommendations for targeted public health interventions.

Data

This analysis utilizes two primary data sources to examine the relationship between socioeconomic factors and premature death rates across racial groups at the county level. The foundation of our dataset comes from the County Health Rankings Data, collected by the University of Wisconsin Population Health Institute, which ranks every county in each state on its health outcomes and health factors. The premature mortality data by race were obtained from the Health Disparities Data Portal, which provides county-level premature death rates disaggregated by racial and ethnic groups. This portal, maintained by the National Institute on Minority Health and Health Disparities (NIMHD), offers critical insights into health disparities across different populations.

Sampling

Our study focuses specifically on the Southern United States, chosen for its generally higher rates of premature mortality and pronounced health disparities. After systematically examining all 50 states, we identified 17 Southern states with relatively high premature death rates in the most recent data. Within these 17 states, we employed a purposive sampling strategy to address the significant challenge of missing data across racial categories. Many counties had incomplete data for certain racial groups, which would have limited our ability to conduct meaningful comparative analysis. To ensure sufficient data coverage across all racial groups of interest, we reviewed every county within our target states and selected those with the most complete data across all of the racial categories. This process resulted in a final sample of 4 to 5 counties per state, yielding a dataset of 70 counties with comprehensive data. This approach allows for comparisons both within and between counties.

Key Variables

The final analytical dataset was created by merging county-level premature mortality data by race with the corresponding County Health Rankings data using Federal Information Processing Standards (FIPS) codes to ensure accurate geographic alignment.
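The merge step described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the field names (`fips`, `group`, `premature_death_rate`, `pct_rural`) and all values are hypothetical, and in practice a data-frame library would typically perform the join. The point it illustrates is that FIPS codes should be normalized to five-character, zero-padded strings before joining, since a code read as a number loses its leading zero.

```python
def normalize_fips(code):
    """Return a county FIPS code as a 5-character, zero-padded string."""
    return str(code).strip().zfill(5)


def merge_on_fips(mortality_rows, rankings_rows):
    """Inner-join two lists of county records on their FIPS code."""
    rankings_by_fips = {normalize_fips(r["fips"]): r for r in rankings_rows}
    merged = []
    for row in mortality_rows:
        key = normalize_fips(row["fips"])
        match = rankings_by_fips.get(key)
        if match is None:
            continue  # county absent from the rankings data: drop it
        # Combine both records and store the normalized key.
        merged.append({**match, **row, "fips": key})
    return merged


# Made-up records for illustration only (not values from the study).
mortality = [
    {"fips": 1001, "group": "Black", "premature_death_rate": 500.0},
    {"fips": "01001", "group": "White", "premature_death_rate": 400.0},
    {"fips": "99999", "group": "White", "premature_death_rate": 300.0},
]
rankings = [{"fips": "1001", "pct_rural": 42.0}]

merged = merge_on_fips(mortality, rankings)
for record in merged:
    print(record)
```

An inner join is assumed here, so counties present in only one source are dropped rather than kept with missing values.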

We focused on variables that serve as metrics for determining a county’s socioeconomic status (SES).

Some key variables included within the analysis were as follows:

Socioeconomic Factors:

  • Median household income
  • Income ratio
  • Unemployment rate
  • High school graduation rate
  • Primary care physician rate

Environmental Factors:

  • Air pollution levels (PM2.5)
  • Presence of water violation
  • Food environmental index

Our exploratory data analysis (EDA) revealed that rural counties consistently had significantly higher premature death rates than their urban counterparts. This finding motivated the addition of a Rural factor variable to our analysis, created from the % Rural feature.

This choropleth map displays Years of Potential Life Lost (YPLL, a premature death metric) across the United States. We used this map to narrow the scope of the analysis to counties from states with high YPLL rates.

Figure 1. Choropleth map showing Premature Death rates across the United States

We also computed a correlation matrix to understand the relationships among a few selected socioeconomic factors. The correlation coefficients assume paired data points, linear relationships, and normally distributed variables. Because of the variability of our data, we used this matrix as a starting point for understanding variable relationships and not as the basis for any statistical analyses.

Figure 2. Correlation Matrix of various socioeconomic variables.
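As a rough illustration of how such a matrix is built, the sketch below computes pairwise Pearson correlations in plain Python. The use of the Pearson coefficient is an assumption (the text does not name the coefficient), and the column names and values are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}


# Made-up values for illustration only (not data from the study).
data = {
    "median_income": [40.0, 50.0, 60.0, 70.0],
    "pct_unemployed": [8.0, 6.0, 4.0, 2.0],
    "pct_rural": [10.0, 30.0, 20.0, 40.0],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Each entry only measures the strength of a linear association between two variables, which is why the matrix is best treated as an exploratory aid.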

Methods

To gain a better understanding of how various socioeconomic factors impact the premature death rate across racial groups, we used the following machine learning techniques to determine feature importance and perform county-level premature death rate predictions.

Random Forest

Random forest combines the predictions of many decision trees into a single result for either regression or classification. Each tree is grown on a bootstrap sample of the data, and a random subset of features is considered at each split. Given the size and variability of our data set, we used random forest to gain an initial understanding of which socioeconomic features contribute most to premature death. The method makes minimal distributional assumptions: it requires independent observations, handles various data types, and, depending on the implementation, can accommodate missing values.
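The idea can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not the model used in the analysis: each "tree" is a single-split stump, and one randomly chosen feature is used per tree instead of a random feature subset at every split of a deeper tree. The data are made up.

```python
import random


def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree (stump) on a single feature."""
    overall = sum(targets) / len(targets)
    best = None  # (sse, threshold, left_mean, right_mean)
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lm) ** 2 for t in left)
               + sum((t - rm) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # feature is constant in this sample: predict the mean
        return lambda row: overall
    _, thr, lm, rm = best
    return lambda row: lm if row[feature] <= thr else rm


def fit_forest(rows, targets, n_trees=30, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feature))
    return trees


def predict_forest(trees, row):
    """Average the predictions of all trees."""
    return sum(tree(row) for tree in trees) / len(trees)


# Made-up data: feature 0 drives the target, feature 1 is noise.
noise = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
rows = [[x, noise[x]] for x in range(10)]
targets = [10.0 if x < 5 else 50.0 for x in range(10)]

trees = fit_forest(rows, targets)
low = predict_forest(trees, [0, 4])
high = predict_forest(trees, [9, 4])
print(low, high)
```

Averaging over many trees, each seeing a different resample of the data, is what reduces the variance of any single tree.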

XGBoost

Like random forest, XGBoost is an ensemble of decision trees, but it builds the trees sequentially (gradient boosting) and is optimized for efficiency, speed, and performance. A base decision tree is fit first and its residual errors are calculated; each additional tree is then trained on the errors left by the previous trees. XGBoost is designed to limit over-fitting, determine the best data split, prioritize important features, and handle missing data. Because our data still contained missing values after cleaning, XGBoost appeared to be the best method. It assumes that the relationship between the features and the target variable can be approximated by a combination of weak learners (decision trees) and that observations are independent; it makes no assumptions about feature distributions and can capture non-linear relationships.
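The residual-fitting loop at the core of boosting can be sketched as below. This is plain gradient boosting with squared-error loss and one-split stumps on a single feature. It is not XGBoost itself, which adds regularization, second-order gradient information, and native missing-value handling. The learning rate, number of rounds, and data are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None  # (sse, threshold, left_value, right_value)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, residuals)
        stumps.append((thr, lv, rv))
        # Shrink each correction by the learning rate before adding it.
        preds = [p + learning_rate * (lv if x <= thr else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if x <= thr else rv
                                      for thr, lv, rv in stumps)


def mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)


# Made-up non-linear data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [x * x for x in xs]

few = fit_boosted(xs, ys, n_rounds=1)
many = fit_boosted(xs, ys, n_rounds=50)
print(mse(few, xs, ys), mse(many, xs, ys))
```

The training error falls as rounds are added because every new stump corrects part of what the previous ones missed, which is also why boosted models need safeguards against over-fitting.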

We trained both of these models using identical feature sets to ensure fair comparison across the methods. Model performance was evaluated using Root Mean Square Error (RMSE) to measure prediction accuracy and R2 values to indicate the proportion of variance explained in premature death rates. Following this, we were able to determine the best-performing model based on the lowest RMSE and highest R2 values across racial groups and rank the features by importance scores from the optimal XGBoost model to identify the most influential socioeconomic and environmental factors. Lastly, we conducted separate analyses for each racial group to identify the variation in feature importance and patterns.
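For reference, the two evaluation metrics can be computed as in the sketch below, using the standard definitions of RMSE and R2. The observed and predicted rates are made-up numbers, not results from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: typical size of a prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up premature death rates for illustration only.
actual = [300.0, 400.0, 500.0, 600.0]
predicted = [320.0, 380.0, 520.0, 580.0]

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

RMSE is expressed in the units of the outcome, so it is not directly comparable across groups whose death rates differ greatly in scale, whereas R2 is unitless.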

Following the machine learning analysis, we fit a multiple linear regression (MLR) model using manually selected features identified as most important. This provided interpretable coefficients essential for policy recommendations and assessed linear relationships between socioeconomic factors and premature mortality. We checked the MLR conditions: linear relationships between the predictors and the outcome, independence of observations, normality of residuals, constant variance of residuals, absence of multicollinearity, and a representative sample.
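In general form, a multiple linear regression of this kind can be written as below. The notation is an assumption introduced here for illustration: y_i is the premature death rate in county i, x_ij is the value of the j-th selected feature, and p is the number of features.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{independently}
```

Each coefficient \beta_j is read as the expected change in the premature death rate for a one-unit increase in feature j, holding the other features fixed. The normality and constant-variance conditions apply to the error term \varepsilon_i.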

All models were evaluated using consistent metrics across racial groups, including RMSE for prediction accuracy, R-squared for explained variance, five-fold cross-validation performance to assess generalizability, and feature stability analysis to ensure consistent variable importance across different data subsets. This comprehensive approach enabled both accurate prediction of premature death rates and interpretable insights into the most influential socioeconomic determinants affecting different racial groups.
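Five-fold cross-validation can be sketched as follows. This is a generic plain-Python illustration under assumed settings (shuffled indices, fixed seed), not the study's implementation. The sample size of 70 matches the number of counties in the dataset.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i
                     for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(70, k=5))
for train_idx, test_idx in splits:
    # A model would be fit on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))
```

Every county is held out exactly once, so averaging the per-fold RMSE and R2 gives an estimate of how the model performs on counties it was not trained on.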

Results

We conducted a model comparison to determine which method to use for our final prediction model.

Best Models per Group by RMSE and R²
Group     Model          RMSE    R2    Best_By
AIAN      XGBoost        311.62  0.52  Highest R2
AIAN      XGBoost        311.62  0.52  Lowest RMSE
Asian     XGBoost         63.64  0.61  Highest R2
Asian     XGBoost         63.64  0.61  Lowest RMSE
Black     Random Forest   83.48  0.71  Highest R2
Black     XGBoost         81.94  0.70  Lowest RMSE
Hispanic  XGBoost         71.74  0.46  Highest R2
Hispanic  XGBoost         71.74  0.46  Lowest RMSE
White     XGBoost         84.92  0.66  Highest R2
White     XGBoost         84.92  0.66  Lowest RMSE

Figure 3a. Table of Model Comparison Results. This was used to obtain the final prediction model.

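For four of the five racial groups, XGBoost was the best model on both metrics, with higher R2 and lower RMSE (Root Mean Squared Error) values. The exception is the Black group, where Random Forest had a marginally higher R2 (0.71 vs. 0.70) while XGBoost had the lower RMSE (81.94 vs. 83.48). These metrics measure predictive accuracy and model performance.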

Random Forest vs XGBoost Performance by Race
Group     RMSE (RF)  R² (RF)  RMSE (XGB)  R² (XGB)  Δ RMSE (RF - XGB)  Δ R² (RF - XGB)
Black         83.48     0.71       81.94      0.70               1.54             0.01
White         93.26     0.56       84.92      0.66               8.34            -0.10
AIAN         344.76     0.41      311.62      0.52              33.13            -0.11
Asian         68.31     0.57       63.64      0.61               4.66            -0.04
Hispanic      83.03     0.10       71.74      0.46              11.29            -0.36

Figure 3b. Table of Model Comparison Results.
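The selection rule behind the comparison (lowest RMSE, highest R2) can be applied to the reported values as in the sketch below. The numbers are copied from the comparison table. The function and key names are hypothetical.

```python
# (group, RMSE_RF, R2_RF, RMSE_XGB, R2_XGB), copied from the table above.
results = [
    ("Black", 83.48, 0.71, 81.94, 0.70),
    ("White", 93.26, 0.56, 84.92, 0.66),
    ("AIAN", 344.76, 0.41, 311.62, 0.52),
    ("Asian", 68.31, 0.57, 63.64, 0.61),
    ("Hispanic", 83.03, 0.10, 71.74, 0.46),
]


def best_models(rows):
    """Pick the better model per group under each metric."""
    best = {}
    for group, rmse_rf, r2_rf, rmse_xgb, r2_xgb in rows:
        best[group] = {
            "lowest_rmse": "Random Forest" if rmse_rf < rmse_xgb else "XGBoost",
            "highest_r2": "Random Forest" if r2_rf > r2_xgb else "XGBoost",
        }
    return best


best = best_models(results)
for group, choice in best.items():
    print(group, choice)
```

The two criteria agree for every group except Black, where Random Forest has the higher R2 by 0.01 and XGBoost has the lower RMSE.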

Final XGBoost Model Feature Selection

From the initial XGBoost model, we extracted the top 15 most important features. Feature importance varies across racial groups. We gained some understanding of the physical health determinants related to premature death, including child mortality, firearm fatalities, and driving deaths with alcohol involvement. However, the initial model provided a limited understanding of premature death through a socioeconomic lens.

Figure 4. Top features influencing Premature Death rates across racial groups. Rural is consistently a top feature among all races.

There are some overlaps in features, such as food environment index, percent of some college completed, percent rural population, and primary care physician rate. The majority of the most important features are historically used as markers to assess an area’s socioeconomic status, as they may point to lack of access to resources or facilities involving life quality or expectancy.

We created a final XGBoost model, using the following features that were cross-selected from both the MLR and initial XGBoost models. Additionally, these features showed up at varying importance levels across all racial groups.

  • % Rural
  • Food Environment Index
  • Primary care physician rate
  • % Limited access to healthy food
  • % Some college
  • Average PM2.5
  • % Unemployed
  • Income Ratio

Premature Death Rate Predictions Across Race

We fit a final XGBoost model with cross-validation for each racial group. Our performance metrics indicate that the selected features have moderately strong predictive power for premature death. The predictions are scattered around the identity line rather than falling tightly along it, which indicates prediction error. Given the sample size for each racial group, the model provides reasonable predictive capability, but limitations remain.

Figure 5. XGBoost model predictions across all racial groups.

We found the rural population rate of a county was one of the most important features across all races. In our initial XGBoost model and MLR, percent rural was a consistent and significant variable. This suggests that rurality plays an important role in premature death.

Figure 6. Top features of XGBoost model across all racial groups.

Our feature importance results led us to shift our focus to rurality in America. We conducted another predictive analysis on all rural US counties. For practicality, we defined a rural county as any county whose population is at least 30% rural, a threshold 10 percentage points above the national rural population average.
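The rural definition above amounts to a simple threshold on the percent-rural value, as in the sketch below. The field names and county values are made up for illustration. Only the 30% cutoff comes from the text.

```python
RURAL_THRESHOLD = 30.0  # percent rural; the cutoff stated in the text


def is_rural(pct_rural, threshold=RURAL_THRESHOLD):
    """Return True when a county meets the rural definition."""
    return pct_rural >= threshold


# Made-up counties for illustration only.
counties = [
    {"fips": "00001", "pct_rural": 12.5},
    {"fips": "00002", "pct_rural": 30.0},
    {"fips": "00003", "pct_rural": 87.1},
]
rural = [c["fips"] for c in counties if is_rural(c["pct_rural"])]
print(rural)
```

A county exactly at the threshold counts as rural, consistent with "at least 30%".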

Using our previously defined model features to predict Years of Potential Life Lost (a metric that quantifies the impact of premature death), we obtained a well-performing model: the majority of predicted values cluster around the identity line. However, variability does exist.

Figure 7. XGBoost model predictions across all rural counties in the United States.

We also applied this model to our selected counties from the southeastern United States and Alaska. These predictions showed higher variability, with values scattered more widely around the identity line.

Figure 8. XGBoost model predictions across all rural counties in selected states.

After viewing our feature importance in both analyses, we saw that % rural was still included but not listed as a top feature impacting this model’s predictive power.

Figure 9. Top features of XGBoost model across all rural US counties.

Figure 10. Top features of XGBoost model across all rural counties in selected states.

Our selected features proved not to be the best fit for modeling premature death rates in rural America. To understand which socioeconomic features play a role in rural counties, we fit a simple XGBoost model and used its most important features to build a better model for predicting premature death in rural America.

Figure 11. Table of all features from simple XGBoost model. This was used to obtain final features to predict Premature Death rates in rural counties.

We manually selected features from a list of 405 variables, prioritizing the gain, coverage, and frequency of each variable. Our final selection consisted of the following features:

  • Not Proficient in English
  • % Enrolled in Free or Reduced Lunch
  • Segregation Index
  • Number of Unemployed Persons
  • Number of Uninsured Children
  • % Children in Poverty
  • Number of People with Some College
  • Presence of Water Violation
  • % Excessive Drinking
  • Income Ratio
  • % Limited Access to Healthy Foods
  • % Uninsured

Results improved markedly with the new feature set: the predicted values lie closer to the identity line, and the model achieved better RMSE and R2 values.

Figure 12. Final XGBoost model predictions across all rural counties in selected states.

Associations do exist between rurality and various socioeconomic measures. Both the specific factors affecting premature death and the strength of their effects differ between urban and rural populations.

Recommendations

Based on our analysis, which found that rural counties consistently have significantly higher premature death rates and identified eleven key socioeconomic factors associated with these disparities, we recommend that UnitedHealth Group/Optum implement targeted interventions addressing the critical determinants identified in our models.

The analysis identified unemployment, educational attainment, and economic constraints as significant predictors of premature mortality in rural areas: counties with higher unemployment rates and lower levels of college completion consistently showed elevated death rates across all racial groups. We recommend creating comprehensive development programs for rural communities, including partnerships with local community colleges to improve graduation rates.

The models also showed that counties with fewer primary care physicians per capita and higher uninsured rates had significantly higher premature death rates. A rural health access initiative should focus on establishing health clinics or specialist programs that serve multiple counties, expanding telehealth access through partnerships, and creating incentive programs to recruit primary care physicians to underserved communities.

It is also important to recognize that language barriers, segregation indices, and child poverty rates impact premature mortality, with notable variations across racial groups in rural areas. Culturally responsive programs could establish multilingual health navigation services, as well as maternal and child health programs that address the higher rates of uninsured children. Such programs would target the concerning intersection of rurality, race, and socioeconomic status.

In addition, limited access to healthy foods, air pollution levels, and water quality violations emerged as critical predictors across our models. Partnering with local grocery stores, farmers markets, and similar providers would improve food access in rural communities. Coordinating with environmental agencies to address water quality violations and air quality concerns could help create sustainable improvements in the environmental health factors that directly affect mortality outcomes.

Discussion

Our comprehensive analysis of premature mortality across 70 counties in the Southern United States provides valuable insights into the relationship between socioeconomic factors and health outcomes, while revealing significant gaps in current health disparities research. Our primary research question focused on understanding how various socioeconomic determinants impact premature death rates across racial groups, yet our most compelling finding emerged from an unexpected direction: rurality consistently outweighed racial factors as a predictor of premature mortality across all demographic groups.

Our machine learning models revealed that rural counties face fundamentally different health challenges than their urban counterparts, with 11 key socioeconomic factors, clustered around economic security, healthcare access, food security, and environmental quality, emerging as critical predictors of premature death. The superior performance of our XGBoost model over Random Forest demonstrated that complex machine learning techniques can effectively identify patterns in health disparities data despite significant data quality challenges. However, while our models achieved strong predictive performance, we must acknowledge that the feature importance rankings from XGBoost cannot fully capture the complex interrelationships between socioeconomic factors that sociological research has established. These findings align with established literature identifying physician shortages, socioeconomic deprivation, and lack of health insurance as primary drivers of rural health disparities (Gong et al. 2019). Importantly, our analysis provides a data-driven framework for prioritizing interventions based on measurable county-level characteristics.

Our dataset proved to be our most significant limitation despite systematic data collection and purposive sampling. This may reflect broader challenges in health surveillance systems, where failure to collect accurate racial and ethnic data hurts efforts to improve care quality (Ulmer, McFadden, and Nerenz 2009). Additionally, our focus on Southern states limits generalizability to other regions where rural health challenges may manifest differently. Finally, our cross-sectional design prevents us from establishing definitive causal relationships between socioeconomic improvements and mortality reductions.

Several critical next steps emerge from our analysis. Community-level analysis within rural areas could reveal important variations obscured by county-level aggregation. Understanding how local healthcare infrastructure, social capital, and community resources influence health outcomes would inform more targeted interventions. Most importantly, intervention research testing programs addressing our identified factors would provide crucial evidence for translating findings into practice, potentially demonstrating whether comprehensive rural health interventions yield greater benefits than single-factor approaches. Our analysis ultimately demonstrates that addressing premature mortality requires recognizing rurality as a critical social determinant of health while developing better data systems to understand how geographic, racial, and socioeconomic factors interact.

References

Garcia, M. C., M. Faul, G. Massetti, et al. 2017. “Reducing Potentially Excess Deaths from the Five Leading Causes of Death in the Rural United States.” MMWR Surveillance Summaries 66 (SS-2): 1–7. https://doi.org/10.15585/mmwr.ss6602a1.
Gong, Gang, Susan G. Phillips, Charles Hudson, Daniel Curti, and Brandi U. Philips. 2019. “Higher US Rural Mortality Rates Linked to Socioeconomic Status, Physician Shortages, and Lack of Health Insurance.” Health Affairs 38 (12): 2003–10. https://doi.org/10.1377/hlthaff.2019.00722.
Ndugga, Nambi, Lauren Hill, and Samantha Artiga. 2025. “Key Data on Health and Health Care by Race and Ethnicity.” https://www.kff.org/key-data-on-health-and-health-care-by-race-and-ethnicity/?entry=executive-summary-introduction.
Ulmer, Cheryl, Bernadette McFadden, and David R. Nerenz. 2009. Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement. Edited by Institute of Medicine (US) Subcommittee on Standardized Collection of Race/Ethnicity Data for Healthcare Quality Improvement. National Academies Press (US). https://www.ncbi.nlm.nih.gov/books/NBK219020/.