36-315 Project Report

Unlocking Opportunity: How Income, Scores, and Location Shape Access to Elite Colleges

Authors

  • Zaina Saif
  • Mia Cody
  • Layla Hurwitz
  • Adithi Jawahar

Introduction

This project examines a large-scale dataset on U.S. college admissions, attendance, and socioeconomic background, obtained from the CMU Statistics GitHub. The dataset covers students entering 139 selective colleges (including Ivy League and other elite institutions) between 2010 and 2015. Each row summarizes a group of students rather than one individual: it links a parental income bin and college tier to college outcomes, standardized test scores, and residency status. Key variables include the parental income bin (par_income_bin), attendance rates (attend, rel_attend), relative admission rates conditional on applying (rel_att_cond_app), application rates (rel_apply), SAT-based attendance rates (attend_sat and rel_attend_sat), college tier classification (tier_name), and in-state versus out-of-state relative attendance (rel_attend_instate, rel_attend_oostate).

Guided by this dataset, our project investigates four main research questions. First, we examine how parental income influences overall access to elite colleges. Second, we take a closer look at Ivy Plus institutions to understand how income affects attendance decisions among admitted students with similar academic qualifications. Third, we study how residency status impacts attendance rates, particularly across public and private college tiers. Finally, we explore how attendance patterns differ across college tiers based on SAT performance. Together, these questions aim to uncover how socioeconomic and institutional factors shape college access and enrollment outcomes.

Analysis

The Influence of Income on Elite College Admissions

We first examine the question: How does parental income influence access to elite colleges overall? To address this, we analyze the distribution of students across parental income brackets and examine how attendance rates vary by income level and school tier.

Number of Elite College Students by Parental Income Group

# Load packages first; adjust the path to wherever the CSV is stored.
library(dplyr)
library(ggplot2)
dataset <- read.csv("/Users/zainasaif/Desktop/315/CollegeAdmissions_Data.csv")
mutate(dataset, income_group = cut(par_income_bin,
                            breaks = seq(0, 100, by = 10),
                            labels = paste0(seq(0, 90, by = 10), "–", seq(10, 100, by = 10)),
                            include.lowest = TRUE)) %>%
  count(income_group) %>%
  ggplot(aes(x = income_group, y = n)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Number of Elite College Students by Parental Income Group",
    x = "Parental Income Percentile Group",
    y = "Number of Students"
  )

At first glance, this graph may seem misleading because it looks like the study selected the same number of students from each income group up to the 90th percentile, then added a disproportionately large number from the top 10%. However, the data is not a simple random sample. Each row represents a combination of income percentile and college tier, and while most bins appear uniform because of this structure, the top 10% includes more combinations to reflect how much more likely wealthy students are to attend elite colleges.
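
The row structure described above can be checked directly by counting rows per income group. The following is a minimal sketch on toy data, not the real file: the bin values and tier names are hypothetical, and it assumes the top decile is recorded in finer bins than the rest, which the text suggests but which should be confirmed against the actual dataset.

```r
# Toy rows (hypothetical values): one row per income bin within each tier.
toy <- expand.grid(
  par_income_bin = c(10, 30, 50, 70, 90, 95, 97, 99, 100),
  tier_name = c("Ivy Plus", "Selective public"),
  stringsAsFactors = FALSE
)

# Same decile grouping as the bar chart above.
toy$income_group <- cut(toy$par_income_bin,
                        breaks = seq(0, 100, by = 10),
                        include.lowest = TRUE)

# Rows per decile: in this toy example the top decile holds more rows
# only because it is split into finer bins.
rows_per_group <- table(toy$income_group)
print(rows_per_group)

# Rows per decile within each tier, to confirm the structure is shared.
rows_by_tier <- table(toy$income_group, toy$tier_name)
print(rows_by_tier)
```

Running the same two `table()` calls on the real data would show whether the tall final bar reflects finer binning, more tiers, or both.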

What the graph truly shows is that students from the highest income percentiles are significantly overrepresented in elite college populations. Even before adjusting for test scores or breaking things down by school type, we can already see that wealth is strongly linked to access. In the next plot, we examine this relationship more closely by looking at how attendance rates vary across income percentiles for each college tier, providing a clearer view of where and how income makes the biggest difference.

Parental Income vs. College Attendance Rate by School Tier

ggplot(dataset, aes(x = par_income_bin, y = attend, color = tier_name)) +
  geom_smooth(se = FALSE, method = "loess") +
  annotate("text", 
           x = 30, y = 0.015,
           label = "Ivy Plus schools show the steepest rise \nin attendance at high income levels.",
           size = 3.5, hjust = 0, color = '#00B200') +
  labs(
    title = "Parental Income vs. College Attendance Rate by School Tier",
    x = "Parental Income Percentile",
    y = "College Attendance Rate",
    color = "School Tier"
  )

The most notable pattern in the plot is the sharp upward curve for the Ivy Plus tier among the wealthiest families, which shows that attendance at Ivy Plus schools is heavily concentrated among top-income students. The Ivy Plus line remains very low and mostly flat, even slightly decreasing, for the bottom 80–90% of income percentiles. This suggests that students from lower- and middle-income families have similarly limited access: moving from the bottom of the income distribution to the middle changes the attendance rate very little.

It is only at the very top (around the 90th percentile and above) where the line shoots up dramatically, highlighting how elite college access becomes almost exclusive to the wealthiest students. Other tiers also show increasing attendance rates for higher parental income percentiles, but the trend is less dramatic. The two tiers that do not show a significant rise are Highly Selective Public and Selective Public, suggesting that income matters less at public institutions.

Ivy Plus stands out both for the steepness of its upward trend and for consistently having the highest attendance rate at any income level (which we will examine thoroughly next). However, this does not mean that low-income students are highly likely to attend Ivy Plus schools—rather, among the relatively few low-income students in this sample, a slightly higher share go to Ivy Plus than to other tiers.

Overall, the visualization highlights a strong correlation between wealth and access to the most elite institutions, reinforcing the idea that socioeconomic status is a key driver of opportunity in higher education.
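
One way to put a number on how steeply each tier rises at the top is to compare mean attendance in the top income decile with mean attendance below it. The sketch below uses toy data with hypothetical values, and the 90th-percentile cutoff is an assumption chosen to match the discussion above.

```r
# Toy data (hypothetical numbers): attendance rate by income bin and tier.
toy <- data.frame(
  tier_name = rep(c("Ivy Plus", "Selective public"), each = 4),
  par_income_bin = rep(c(20, 50, 80, 99), times = 2),
  attend = c(0.002, 0.002, 0.003, 0.012,
             0.010, 0.011, 0.012, 0.013)
)

# Flag rows in the top income decile.
toy$top_decile <- toy$par_income_bin > 90

# Mean attendance per tier, split by top decile versus everyone else.
means <- tapply(toy$attend, list(toy$tier_name, toy$top_decile), mean)

# Ratio of top-decile attendance to the rest: larger means a steeper rise.
top_ratio <- means[, "TRUE"] / means[, "FALSE"]
print(round(top_ratio, 2))
```

A ratio near 1 indicates a flat income gradient for that tier, while a ratio well above 1 indicates that attendance is concentrated in the top decile.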

Analyzing Who Attends Ivies

Diving deeper into Ivy Plus admissions, we next explore the question: To what extent does parental income influence attendance given admission to Ivy Plus colleges, even when academic qualifications are similar? To address this, we focus only on students attending Ivy Plus institutions and analyze their parental income percentile, relative admission rates, and attendance rates.

Parental Income vs. Attendance Density Plot

ivy_data <- subset(dataset, tier_name == "Ivy Plus")
ggplot(ivy_data, aes(x = par_income_bin, y = rel_attend)) +
  stat_density2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE) +
  geom_point(alpha = 0.5) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(
    title = "Parental Income vs Attendance (Ivy Plus Colleges)",
    x = "Parental Income Percentile Bracket",
    y = "Relative Attendance Rate",
    fill = "Density"
  )

This plot shows the distribution of attendance rates across different parental income percentiles for Ivy Plus colleges. Density shading highlights where students are concentrated, while scatter points reveal variation. Most attendance is concentrated among students from the 90th percentile and above, with a sharp rise starting around the 80th percentile. Lower-income students show both lower attendance and less density, suggesting barriers beyond academic achievement.

The graph is clear, with well-labeled axes and effective color gradients. The scatter points improve visibility of the underlying spread and help convey the real-world significance: among Ivy Plus colleges, higher parental income goes with higher relative attendance.

To assess the relationship formally, we conducted a Pearson correlation test between parental income percentile and relative attendance rate:

cor.test(ivy_data$par_income_bin, ivy_data$rel_attend)

    Pearson's product-moment correlation

data:  ivy_data$par_income_bin and ivy_data$rel_attend
t = 4.8587, df = 166, p-value = 2.717e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2128141 0.4786897
sample estimates:
      cor 
0.3528542 

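The correlation coefficient was 0.353, with a 95% confidence interval of [0.213, 0.479] and a p-value of 2.7e-06 (t = 4.86, df = 166). This statistically significant, moderate positive correlation indicates that higher parental income is associated with higher relative attendance rates at Ivy Plus colleges. Because Pearson's coefficient measures linear association, it may understate a relationship that is flat at lower incomes and rises sharply at the top.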
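
Since the density plot suggests a curved rather than linear relationship, a rank-based test could complement the Pearson test. The sketch below is illustrative only: it uses toy vectors with hypothetical values, not the Ivy Plus data, and Spearman's test is a suggested robustness check rather than part of the original analysis.

```r
# Toy data (hypothetical values): attendance rises slowly, then sharply.
income <- c(10, 30, 50, 65, 75, 85, 92, 96, 99)
rel_attend_toy <- c(0.50, 0.52, 0.55, 0.60, 0.66, 0.80, 1.10, 1.60, 2.90)

# Pearson measures linear association.
pearson <- cor.test(income, rel_attend_toy, method = "pearson")

# Spearman measures monotone association, so curvature at the top of
# the income distribution does not pull the estimate down.
spearman <- cor.test(income, rel_attend_toy, method = "spearman")

print(unname(pearson$estimate))
print(unname(spearman$estimate))
```

On the real data the same call would be `cor.test(ivy_data$par_income_bin, ivy_data$rel_attend, method = "spearman")`; repeated income bins create ties, in which case R reports an approximate rather than exact p-value.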

Parental Income vs. Attendance Given Acceptance Scatterplot

dataset$tier <- as.numeric(dataset$tier)
df_ivy <- dataset |>
  filter(tier_name == "Ivy Plus", !is.na(rel_att_cond_app), !is.na(attend), !is.na(par_income_lab))
ggplot(df_ivy, aes(x = rel_att_cond_app, y = attend, color = par_income_lab)) +
  geom_point(alpha = 0.8, size = 3) +
  labs(
    title = "Admission vs Attendance at Ivy Plus Colleges by Parental Income",
    x = "Relative Admission Rate (Conditional on Application)",
    y = "Attendance Rate",
    color = "Parental Income Bracket"
  ) + annotate("text", x = 2.75, y = 0.045, label = "Top 1% Income", size = 3, color = "black")

This scatterplot provides a deeper view of how admission likelihood translates into attendance, segmented by parental income bracket. Students from the highest income brackets tend to maintain higher attendance rates across a range of relative admission rates. In contrast, students from lower income brackets are less likely to enroll even when admitted, showing that income influences final college attendance decisions beyond academic achievement alone.

Overall, both graphs highlight a clear socioeconomic divide: even among academically qualified students admitted to Ivy Plus colleges, those from wealthier backgrounds are more likely to ultimately attend.

How Residency Status Shapes Attendance

We also examine the question: How does residency status impact attendance rates across colleges that collect residency status data? We analyze differences between in-state and out-of-state students across various college tiers.

Mean Relative Attendance Rate: In-State vs Out-of-State by College Tier

library(tidyr)

residency_data <- dataset %>%
  select(tier_name, rel_attend_instate, rel_attend_oostate)

# Reshape to long format, label residency status, and drop missing values.
residency_long <- residency_data %>%
  pivot_longer(cols = c(rel_attend_instate, rel_attend_oostate),
               names_to = "residency_status",
               values_to = "relative_attendance") %>%
  mutate(residency_status = recode(residency_status,
                                   "rel_attend_instate" = "In-State",
                                   "rel_attend_oostate" = "Out-of-State")) %>%
  filter(!is.na(relative_attendance))
residency_summary <- residency_long %>%
  group_by(tier_name, residency_status) %>%
  summarise(mean_attendance = mean(relative_attendance, na.rm = TRUE), .groups = "drop")

ggplot(residency_summary, aes(x = residency_status,
                              y = mean_attendance,
                              fill = tier_name)) +
  geom_col(position = position_dodge(width = 0.8)) +
  labs(
    x = "Residency Status",
    y = "Mean Relative Attendance Rate",
    fill = "College Tier",
    title = "Mean Attendance Rate of In vs Out of State by College Tier"
  ) 

This plot displays the mean relative attendance rates by residency status for each college tier. The x-axis distinguishes between in-state and out-of-state students, while the y-axis shows the average attendance rate. Different college tiers are represented using color.

The graph is clearly labeled and uses grouped bar charts for easy comparison between residency statuses. From the plot, we observe that among public colleges, selective public schools show the greatest variation in mean attendance rates between in-state and out-of-state students. Highly selective public schools show a smaller but noticeable gap, while other elite colleges (including private institutions) have relatively smaller differences.

The plot suggests that residency status plays a stronger role in shaping attendance decisions for students at public institutions, particularly selective public colleges. This is consistent with the real-world understanding that in-state tuition discounts and admissions policies heavily influence public college attendance patterns.
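
The size of the residency gap in each tier can also be computed directly instead of being read off the bars. This is a minimal base R sketch on toy data in the same long layout as residency_long; all numbers are hypothetical.

```r
# Toy data (hypothetical numbers) in the same long layout as residency_long.
toy_long <- data.frame(
  tier_name = rep(c("Selective public", "Ivy Plus"), each = 4),
  residency_status = rep(c("In-State", "Out-of-State"), times = 4),
  relative_attendance = c(0.8, 1.6, 1.0, 1.8,
                          1.0, 1.1, 1.2, 1.1)
)

# Mean relative attendance for each tier and residency status.
means <- tapply(toy_long$relative_attendance,
                list(toy_long$tier_name, toy_long$residency_status),
                mean)

# Out-of-state minus in-state gap per tier, largest gap first.
gap <- means[, "Out-of-State"] - means[, "In-State"]
gap <- sort(gap, decreasing = TRUE)
print(gap)
```

Sorting the gaps makes the ranking of tiers explicit, which is harder to judge from a grouped bar chart when several bars have similar heights.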

Attendance Rates by College Tier and Residency Status

ggplot(residency_long, aes(x = tier_name, y = relative_attendance, fill = residency_status)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA, position = position_dodge(width = 0.8)) +
  labs(
    title = "Attendance Rates by College Tier and Residency Status",
    x = "College Tier",
    y = "Relative Attendance Rate",
    fill = "Residency Status" ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.title = element_text(size = 13),
        plot.title = element_text(face = "bold", size = 14))

The second plot uses boxplots to display the distribution of relative attendance rates by college tier and residency status. Each box represents the spread of attendance rates, making it easier to observe differences between groups and spot variability.

The graph is well-labeled and clean, with tilted axis text to improve readability. From the boxplots, we observe that out-of-state students generally have higher relative attendance rates compared to in-state students across all college tiers. Selective public colleges again show the largest variation between residency statuses, while highly selective public colleges show a smaller difference.

This visualization reinforces the conclusion that residency status is a meaningful factor in shaping college attendance, particularly for students attending public institutions. It highlights the financial and institutional factors that can influence enrollment decisions beyond just academic performance.

Examining SAT-Based Attendance Patterns Across College Tiers

We also wanted to explore: How does college tier affect the relative attendance rates of students from different SAT score bands? Specifically, we examine whether certain tiers show more consistency or variation across SAT bands in student attendance.

Relative SAT-Based Attendance by College Tier

dataset <- dataset %>%
  mutate(tier_name = factor(tier_name, levels = c(
    "Ivy Plus",
    "Other elite schools (public and private)",
    "Selective public",
    "Selective private",
    "Highly selective public",
    "Highly selective private"
  )))

ggplot(dataset, aes(x = tier_name, y = rel_attend_sat, fill = tier_name)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Relative SAT-Based Attendance by College Tier",
    subtitle = "Higher values = disproportionately high attendance for this SAT band",
    x = "College Tier",
    y = "Relative Attendance Rate (SAT Band)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  annotate("text", 
           x = 5.8, y = 6.85, 
           label = "Greater inconsistency\nin attendance by SAT band", 
           size = 3, color = "black") +
  annotate("segment", 
           x = 6, xend = 6, 
           y = 6.2, yend = 5, 
           arrow = arrow(length = unit(0.15, "inches")), 
           color = "black")

Warning: Removed 294 rows containing non-finite outside the scale range
(`stat_boxplot()`).

The first graph is a boxplot showing the distribution of relative SAT-based attendance rates for each college tier, from “Ivy Plus” to “Highly Selective Private” schools. Each box represents how SAT bands are represented within that tier, with wider spreads and more outliers suggesting greater inconsistency in who attends based on SAT scores. The main takeaway is that Highly Selective Private colleges show the greatest variation in relative attendance across SAT bands. This suggests that students at these schools are drawn unevenly from different score ranges, with some bands being much more heavily represented than others. In contrast, Ivy Plus and Other Elite Schools have tighter distributions, indicating a more consistent representation across SAT score bands.
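
The visual comparison of box heights can be backed by a numeric measure of spread such as the interquartile range per tier. The sketch below uses toy data with hypothetical values and also counts the non-missing rows behind each box, since ggplot drops missing values with a warning.

```r
# Toy data (hypothetical numbers): relative SAT-based attendance by tier.
toy <- data.frame(
  tier_name = rep(c("Ivy Plus", "Highly selective private"), each = 5),
  rel_attend_sat = c(0.9, 1.0, 1.0, 1.1, NA,
                     0.2, 0.8, 1.5, 3.0, 5.0)
)

# Interquartile range per tier; na.rm = TRUE skips missing values.
spread <- tapply(toy$rel_attend_sat, toy$tier_name, IQR, na.rm = TRUE)

# Number of non-missing rows behind each box.
n_used <- tapply(!is.na(toy$rel_attend_sat), toy$tier_name, sum)

print(spread)
print(n_used)
```

Reporting the row counts alongside the spread helps show whether a wide box reflects real variation or simply a tier with few usable rows.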

Elbow Plot of Mean Relative Attendance by College Tier

ggplot(dataset, aes(x = tier_name, y = rel_attend_sat, group = 1)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line") +
  labs(
    title = "Elbow Plot of Mean Relative Attendance by College Tier",
    x = "College Tier",
    y = "Mean Relative Attendance Rate (SAT Band)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The elbow plot shows that mean relative SAT-based attendance rates vary noticeably across college tiers. Highly Selective Private colleges have the highest average relative attendance across SAT bands, suggesting that these schools disproportionately attract students from certain SAT ranges. As we move toward Highly Selective Public and Ivy Plus colleges, there is a sharp decline in mean attendance rates, creating a clear “elbow” in the trend. This suggests that the relationship between SAT band and attendance is weaker at these tiers. After this initial drop, the trend levels off, with smaller differences among Selective Private and Selective Public colleges. Because tier is a categorical variable, the position of the elbow depends on the order in which tiers are placed on the x-axis, so the shape is best read as a comparison of tier means rather than a trend. Overall, the elbow plot indicates that college tier is associated with how strongly SAT scores shape student attendance, with Highly Selective Private schools standing out as the most differentiated by SAT band.

Conclusions and Future Work

Our analysis highlights the strong influence of socioeconomic factors on college attendance at selective U.S. institutions. Across both elite colleges broadly and Ivy Plus colleges specifically, we find that higher parental income is consistently associated with higher attendance rates, even among academically qualified students. Additionally, residency status plays a significant role, particularly for public colleges, where out-of-state students often have higher attendance rates than in-state students. Finally, differences in attendance patterns based on SAT scores and college tier further reinforce how access to higher education is shaped by multiple structural factors.

While our findings shed light on these important relationships, there are still open questions that could be explored with future work. For example, we could not assess how financial aid packages influence attendance decisions, which could be a major factor for lower-income students. Another limitation is that the dataset focuses only on selective colleges, meaning we cannot generalize our conclusions to community colleges or less selective institutions. Further analysis could incorporate additional variables to better understand intersectional influences on the average student’s college access. More advanced statistical modeling could also help clarify the relative importance of income, test scores, and geography.