Sleep Duration vs. GPA for Freshmen

Author

Tiffany Zhang, Bob Yang, Annie Yang, Steven Ma

Published

April 28, 2025

Introduction

The data we are using in this analysis is the Nightly Sleep Time and GPA dataset. It covers 634 participants from Carnegie Mellon University, the University of Notre Dame, and the University of Washington who used Fitbit devices to track their sleep activity. GPAs were collected from the university registrars. There are 15 variables in total. Our graphs focus on the following 8 variables:

-cum_gpa: Cumulative GPA (out of 4.0)

-term_units: Number of course units carried in the term

-TotalSleepTime: Average time in bed, in minutes

-demo_gender: Gender of the students (male = 0, female = 1)

-daytime_sleep: Average daytime sleep, including short naps or sleep that occurred during the daytime, in minutes

-Zterm_units_ZofZ: Student’s units were Z-scored relative to the mean and standard deviation of all students in their study cohort

-midpoint_sleep: Average midpoint of bedtime and wake time, in minutes after 11 pm

-bedtime_mssd: Mean successive squared difference of bedtime

Our study will focus on four research questions:

  1. How does students’ bedtime interact with academic course load to influence their term GPA?

  2. How does a student’s sleep time affect cumulative GPA for different genders?

  3. How does students’ sleep timing relate to bedtime consistency?

  4. How does daytime sleep duration relate to cumulative GPA, and does this relationship differ across academic workloads?

Research Question 1

In this problem, we aim to investigate how students’ bedtime, measured by sleep midpoint, interacts with academic course load to affect their term GPA.

We first performed a data cleaning step to remove observations with missing values in any of the three relevant variables: midpoint_sleep, Zterm_units_ZofZ (a standardized measure of course load), and term_gpa. This ensures that every observation used in the analysis is complete for all three variables.

To better understand the interaction effects, we created two categorical variables based on tertiles (33.3% quantiles) of midpoint_sleep and Zterm_units_ZofZ:

-Bedtime Category (bedtime_group): grouped as Early, Average, and Late sleepers.

-Course Load Level (load_group): grouped as Light, Average, and Heavy workloads.

These groups were constructed using the cut() function with quantile-based breakpoints, which ensures roughly equal-sized bins and interpretable groupings.
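As a side note on the binning step, the sketch below shows quantile-based `cut()` on simulated values (not the study data) and why `include.lowest = TRUE` matters: without it, the first interval is open on the left and the minimum value is left unassigned. The vector `x` and the helper names `grp_closed` and `grp_open` are illustrative only.

```r
# Simulated stand-in for a continuous variable such as midpoint_sleep
# (these are NOT the study values).
set.seed(1)
x <- rnorm(300, mean = 400, sd = 60)

# Tertile breakpoints: minimum, 1/3 quantile, 2/3 quantile, maximum.
breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE closes the first interval on the left, so the
# minimum value is assigned to "Early" instead of becoming NA.
grp_closed <- cut(x, breaks = breaks,
                  labels = c("Early", "Average", "Late"),
                  include.lowest = TRUE)

# Same call without include.lowest, for comparison.
grp_open <- cut(x, breaks = breaks,
                labels = c("Early", "Average", "Late"))

table(grp_closed)      # roughly equal-sized bins
sum(is.na(grp_open))   # number of values left unassigned
```

This also explains why a follow-up `drop_na()` on the group columns is a reasonable safety net after binning.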

library(tidyverse)
sleep_data <- read_csv("~/Desktop/cmu-sleep.csv")
sleep_clean <- sleep_data %>%
  drop_na(midpoint_sleep, Zterm_units_ZofZ, term_gpa)

sleep_clean <- sleep_clean %>%
  mutate(
    bedtime_group = cut(midpoint_sleep,
                        breaks = quantile(midpoint_sleep, 
                        probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
                        labels = c("Early", "Average", "Late"),
                        include.lowest = TRUE),
    load_group = cut(Zterm_units_ZofZ,
                     breaks = quantile(Zterm_units_ZofZ,
                     probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
                     labels = c("Light", "Average", "Heavy"),
                     include.lowest = TRUE)
  ) %>%
  drop_na(bedtime_group, load_group)

sleep_clean$bedtime_group <- factor(sleep_clean$bedtime_group, 
levels = c("Early", "Average", "Late"))
sleep_clean$load_group <- factor(sleep_clean$load_group, 
levels = c("Light", "Average", "Heavy"))

ggplot(sleep_clean, aes(x = load_group, y = term_gpa, fill = bedtime_group)) +
  geom_violin(trim = FALSE, alpha = 0.6, position = position_dodge(0.9)) +
  geom_boxplot(width = 0.1, position = position_dodge(0.9), 
  outlier.shape = NA, color = "black") +
  labs(
    title = "GPA Distribution by Course Load and Bedtime Category",
    x = "Course Load Level",
    y = "Term GPA",
    fill = "Bedtime"
  ) +
  theme_minimal()

As shown in the figure above, we used a violin-boxplot hybrid to visualize the distribution of term GPA across the bedtime and course load categories. The violin plots display kernel density estimates of GPA, giving a smoothed view of its distribution within each subgroup. The overlaid boxplots make the medians and interquartile ranges (IQR) easy to read.

A clear pattern emerges: students classified as early sleepers tend to have higher term GPAs than their average and late-sleeping peers at every level of academic course load. Under the heavy workload condition, the violin for late sleepers is more vertically stretched and flatter, which reflects greater dispersion: GPA outcomes in this group are not only lower on average but also less consistent. In contrast, early sleepers show tighter distributions with more density near the upper GPA range, indicating both higher and more stable academic outcomes.

The wider vertical spread for late sleepers, particularly in the heavy course load group, also points to more extreme GPA values. Some late sleepers still perform well, but the variability and the risk of low performance both appear larger.

The pattern across all load levels suggests a main effect of bedtime rather than a strong interaction, which agrees with our later regression findings. The visual evidence is consistent with earlier sleep timing being associated with better outcomes, including under elevated academic demand.

library(GGally)

pair_vars <- sleep_clean %>%
  select(Zterm_units_ZofZ, midpoint_sleep, term_gpa)

colnames(pair_vars) <- c("Course Load", "Bedtime (midpoint)", "GPA")

ggpairs(pair_vars,
        lower = list(continuous = wrap("points", alpha = 0.5, size = 1.5)),
        upper = list(continuous = wrap("cor", size = 4, method = "pearson")),
        diag = list(continuous = wrap("densityDiag"))) +
  theme_minimal()

To complement our earlier visualization, we constructed a pairwise correlation matrix using the ggpairs() function. This matrix enables simultaneous inspection of marginal distributions, pairwise scatterplots, and Pearson correlation coefficients among three key variables: Course Load, Bedtime (midpoint), and Term GPA.

The correlation between midpoint_sleep and GPA is -0.174 (p < 0.001), indicating a statistically significant negative association: students who go to bed later tend to achieve lower academic performance. This aligns with our previous observations from the violin-boxplot, reinforcing the conclusion that later sleep schedules may hinder GPA.

Interestingly, the correlation between course load and GPA is positive (r = 0.141, p < 0.01), albeit weaker. This suggests that students taking more demanding course loads tend to perform slightly better—possibly reflecting self-selection, where more capable or motivated students opt into heavier schedules.

On the other hand, the correlation between course load and bedtime is negligible (r = -0.025), implying that students’ sleep timing has essentially no linear relationship with their academic workload. This is consistent with bedtime being associated with GPA independently of workload, rather than simply standing in for it.

From a distributional perspective, the density plots on the diagonal show that bedtime is right-skewed: most students cluster at earlier sleep midpoints, with a long tail of students who sleep considerably later. GPA, by contrast, is slightly left-skewed, with a noticeable mode around 3.7–3.8. The scatterplot of GPA against bedtime (bottom row, middle panel) shows a mildly downward-trending cloud, consistent with the negative correlation.
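The correlation coefficients and significance levels discussed above can also be obtained one pair at a time. The following is a minimal sketch using base R's `cor.test()` on simulated data; the variables `midpoint` and `gpa` and the effect size built into the simulation are assumptions for illustration, not the study values.

```r
# Simulated data only: a weak negative relationship between a sleep-midpoint
# stand-in and a GPA stand-in. These are NOT the study values.
set.seed(42)
n <- 500
midpoint <- rnorm(n, mean = 400, sd = 60)
gpa <- 3.9 - 0.001 * midpoint + rnorm(n, sd = 0.5)

# cor.test() reports the Pearson r together with a t-based p-value and a
# 95% confidence interval for r.
ct <- cor.test(midpoint, gpa, method = "pearson")

ct$estimate   # Pearson correlation coefficient
ct$p.value    # two-sided p-value
ct$conf.int   # 95% confidence interval
```

Reporting the confidence interval alongside r gives a better sense of how precisely a weak correlation is estimated than the p-value alone.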

To formally test the relationships among bedtime, academic course load, and term GPA, we constructed a multiple linear regression model that includes both main effects and their interaction.

The outcome variable is term GPA, while the predictors include the midpoint of sleep (in minutes after 11 PM), standardized course load (Zterm_units_ZofZ), and their interaction term. This specification allows us to examine not only the independent effects of sleep and workload but also whether the relationship between bedtime and GPA differs across workload levels.

The regression results are summarized below:

model <- lm(term_gpa ~ midpoint_sleep * Zterm_units_ZofZ, data = sleep_clean)

summary(model)

Call:
lm(formula = term_gpa ~ midpoint_sleep * Zterm_units_ZofZ, data = sleep_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6513 -0.2501  0.1075  0.3610  0.8450 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.8613919  0.1261754  30.603  < 2e-16 ***
midpoint_sleep                  -0.0011855  0.0003084  -3.844 0.000137 ***
Zterm_units_ZofZ                 0.0113948  0.1352024   0.084 0.932869    
midpoint_sleep:Zterm_units_ZofZ  0.0001511  0.0003263   0.463 0.643528    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5237 on 483 degrees of freedom
Multiple R-squared:  0.04934,   Adjusted R-squared:  0.04344 
F-statistic: 8.357 on 3 and 483 DF,  p-value: 1.996e-05

The model reveals a statistically significant negative coefficient for midpoint_sleep (p < 0.001), meaning that later bedtimes are associated with lower GPA on average. Specifically, for each additional minute later that a student’s midpoint of sleep occurs, GPA is expected to decrease by approximately 0.0012 points (about 0.07 points per hour) for a student with an average course load (Zterm_units_ZofZ = 0).
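As a worked illustration, the fitted equation can be written out from the coefficient table above and the midpoint_sleep slope rescaled from minutes to hours. The symbols M (midpoint of sleep, in minutes) and Z (standardized course load) are shorthand introduced here, not names from the original analysis.

```latex
\begin{align*}
\widehat{\text{GPA}} &= 3.8614 - 0.0011855\,M + 0.0113948\,Z + 0.0001511\,M Z \\[4pt]
\frac{\partial\,\widehat{\text{GPA}}}{\partial M} &= -0.0011855 + 0.0001511\,Z \\[4pt]
\text{At } Z = 0:\quad 60 \times (-0.0011855) &\approx -0.071 \ \text{GPA points per hour} \\
\text{At } Z = 1:\quad 60 \times (-0.0011855 + 0.0001511) &\approx -0.062 \ \text{GPA points per hour}
\end{align*}
```

Because the model includes an interaction, the midpoint_sleep coefficient on its own is the slope at Z = 0, that is, at the cohort-average course load.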

The main effect of course load is not statistically significant (p = 0.933), and neither is the interaction term (p = 0.643). This indicates that the effect of bedtime on GPA does not vary significantly across course load levels—in other words, the impact of later bedtimes is consistent, regardless of how many courses a student is taking.
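One way to check whether an interaction term is needed is to compare the interaction model against the additive model with a nested F-test. The sketch below does this on simulated data with no true interaction; the variables `x`, `z`, and `y` are placeholders rather than the study variables.

```r
# Simulated data with a main effect of x only (no true interaction).
# These are NOT the study values.
set.seed(7)
n <- 400
x <- rnorm(n)
z <- rnorm(n)
y <- 3.5 - 0.2 * x + rnorm(n, sd = 0.5)

fit_add <- lm(y ~ x + z)   # main effects only
fit_int <- lm(y ~ x * z)   # main effects plus the x:z interaction

# Nested-model F-test: does adding the interaction improve the fit?
cmp <- anova(fit_add, fit_int)
p_f <- cmp[["Pr(>F)"]][2]

# With a single added term, this F-test matches the t-test on x:z.
p_t <- summary(fit_int)$coefficients["x:z", "Pr(>|t|)"]

c(anova_p = p_f, t_test_p = p_t)
```

When the interaction is not supported, the additive model is easier to interpret because each coefficient is then a slope that applies at every level of the other predictor.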

Although the model’s R-squared is relatively low (4.9%), this is not unexpected in educational or behavioral datasets where many unobserved factors influence outcomes. Still, the statistical significance of bedtime confirms its predictive value.

In conclusion, our analysis shows that students who go to bed earlier tend to earn higher GPAs, regardless of how heavy their course load is. This pattern was evident in both our violin plots and pairwise correlation matrix, and confirmed by a regression model where later bedtimes significantly predicted lower GPA, while course load had no meaningful effect.

That said, our model explains only a small fraction of GPA variance, suggesting that other factors—such as stress, time management, or study habits—may also play important roles. Future research could explore these variables or apply more flexible models to capture non-linear patterns.

Overall, our findings highlight sleep timing as a consistent, if modest, correlate of academic success.

Research Question 2

Now, we want to learn about how students’ sleep time affects cumulative GPA for different genders, which suggests we should examine variables TotalSleepTime, demo_gender, and cum_gpa.

sleep_gender_data <- sleep_data |> 
  mutate(
    gender = ifelse(demo_gender == 0, "Male", "Female"),
    gpa = case_when(cum_gpa < 2.5 ~ "GPA (<2.5)",
                    cum_gpa >= 2.5 & cum_gpa < 3.5 ~ "GPA (2.5-3.5)",  
                    cum_gpa >= 3.5 ~ "GPA (>=3.5)"))

sleep_gender_data <- sleep_gender_data |>
  filter(!is.na(gender))


sleep_gender_data |> 
  ggplot(aes(x = TotalSleepTime, fill = gpa)) +
  geom_histogram(alpha = 0.8, position = "stack", bins = 30) +
  facet_wrap(~ gender) +
  labs(title = "Plot of Sleep Time by Cumulative GPA and Gender",
    x = "Average Sleep Time (minutes)",
    y = "Number of Students")

The graph above shows stacked histograms of sleep time, color-coded by GPA band and split by gender, with females on the left and males on the right. It suggests that students with higher GPAs tend to sleep slightly more: the green bars (GPA greater than or equal to 3.5) are more concentrated on the right side of the distribution. This holds for both females and males, so longer sleep may be associated with better academic performance. There are also more female students with higher GPAs, clustered around 400–460 minutes of sleep. Students with GPAs below 2.5 show no specific pattern; they are spread out in small numbers across different sleep times. Overall, students with higher GPAs tend to sleep slightly more, especially among females.
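A visual impression such as "slightly more sleep" can be backed by a small summary table. The sketch below computes the median sleep time for each gender and GPA band on simulated data (not the study data). The helper `gpa_band()` is a hypothetical name; it uses `cut()` with `right = FALSE` so that the bands match the `case_when()` boundaries above.

```r
# Simulated stand-in data; these are NOT the study values.
set.seed(3)
n <- 200
toy <- data.frame(
  TotalSleepTime = rnorm(n, mean = 400, sd = 50),
  cum_gpa        = pmin(4, pmax(1.5, rnorm(n, mean = 3.4, sd = 0.5))),
  demo_gender    = rbinom(n, 1, 0.5)
)

# Hypothetical helper: bands are [.., 2.5), [2.5, 3.5), [3.5, ..),
# i.e. the same boundaries as the case_when() call.
gpa_band <- function(g) {
  cut(g, breaks = c(-Inf, 2.5, 3.5, Inf), right = FALSE,
      labels = c("<2.5", "2.5-3.5", ">=3.5"))
}

toy$gender <- ifelse(toy$demo_gender == 0, "Male", "Female")
toy$band   <- gpa_band(toy$cum_gpa)

# Median sleep time (minutes) for every gender x GPA-band combination.
med <- aggregate(TotalSleepTime ~ gender + band, data = toy, FUN = median)
med
```

Medians are used here because they are less sensitive than means to the handful of very short or very long sleepers.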

In addition, we can explore this relationship further by plotting a dendrogram to see whether students cluster into groups based on sleep time, cumulative GPA, and gender, and how those groups relate to GPA.

library(dendextend)
library(dplyr)

sleep_cluster_data <- sleep_data |>
  filter(!is.na(TotalSleepTime), !is.na(cum_gpa), !is.na(demo_gender)) |>
  select(TotalSleepTime, cum_gpa, demo_gender)
sleep_cluster_data$demo_gender <- as.numeric(sleep_cluster_data$demo_gender)

sleep_scaled <- scale(sleep_cluster_data, center = TRUE, scale = TRUE)
sleep_dist <- dist(sleep_scaled)
hc_complete <- hclust(sleep_dist, method = "complete")
hc_complete_dend <- as.dendrogram(hc_complete)
hc_complete_dend <- set(hc_complete_dend, "branches_k_color", k=3)



gpa_colors <- ifelse(sleep_cluster_data$cum_gpa < 2.5, "red",
                     ifelse(sleep_cluster_data$cum_gpa < 3.5, "orange", "green"))

hc_dend <- set(hc_complete_dend, "labels_colors", 
order_value = TRUE, value = gpa_colors)


plot(hc_dend,
     main = "Dendrogram of Sleep Time, GPA, and Gender")

Above is a dendrogram with three main clusters, formed by similarity across three variables: TotalSleepTime, demo_gender, and cum_gpa. Each student’s GPA band is shown by the label color along the x-axis: red for a GPA below 2.5, orange for a GPA from 2.5 up to (but not including) 3.5, and green for a GPA of 3.5 or above. Based on the dendrogram, high-GPA students tend to group in the cluster drawn with red branches, which suggests that a particular combination of sleep time and gender goes along with higher GPAs. For instance, students who sleep more and belong to a specific gender group may be more likely to have higher GPAs. Because cum_gpa is itself one of the clustering variables, some of this grouping is expected by construction. Even so, the pattern is in line with the stacked histogram, which shows female students with GPAs of 3.5 or above concentrated at longer sleep times.
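Reading cluster composition off a dendrogram by eye is difficult, so a cross-tabulation of cluster membership against GPA band can help. The sketch below uses base R's `hclust()` and `cutree()` on simulated data (not the study data); the data frame `toy` and the band labels are illustrative assumptions.

```r
# Simulated stand-in data; these are NOT the study values.
set.seed(11)
n <- 120
toy <- data.frame(
  TotalSleepTime = rnorm(n, mean = 400, sd = 50),
  cum_gpa        = pmin(4, pmax(1.5, rnorm(n, mean = 3.4, sd = 0.5))),
  demo_gender    = rbinom(n, 1, 0.5)
)

# Same recipe as above: standardize, Euclidean distance, complete linkage.
hc <- hclust(dist(scale(toy)), method = "complete")

# Cut the tree into three clusters and label each student.
cluster <- cutree(hc, k = 3)

# GPA bands with the same boundaries used for the label colors.
band <- cut(toy$cum_gpa, breaks = c(-Inf, 2.5, 3.5, Inf), right = FALSE,
            labels = c("<2.5", "2.5-3.5", ">=3.5"))

# Counts and row-wise shares of each GPA band within each cluster.
tab <- table(cluster, band)
tab
round(prop.table(tab, margin = 1), 2)
```

Since cum_gpa is one of the clustering inputs, some separation by GPA band is expected; rerunning the clustering without cum_gpa would be one way to see how much of the grouping comes from sleep time and gender alone.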

In conclusion, based on the two graphs above, female students with longer sleep times tend to have slightly higher GPAs. The dendrogram further supports this: students who share common traits (here, being female and sleeping longer) tend to have similar GPAs.

Research Question 3

We also want to see how students’ sleep timing relates to bedtime consistency. The first part of the analysis checks the link between the midpoint of sleep and bedtime irregularity, measured by the mean successive squared difference (MSSD) of bedtime. A 2D density contour plot helps determine whether students who sleep later have more erratic sleep patterns.
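To make the MSSD measure concrete, here is a minimal sketch of how it could be computed for one student from a sequence of nightly bedtimes. The function name `mssd()` and the example bedtimes are hypothetical, and the units used for bedtime_mssd in the dataset may differ.

```r
# Mean successive squared difference: the average of the squared
# night-to-night changes in a sequence.
mssd <- function(x) {
  mean(diff(x)^2)
}

# Hypothetical bedtimes, in minutes after 11 PM, for five nights.
# Both sequences contain the same values (same mean and variance),
# but the second one jumps around from night to night.
gradual <- c(0, 10, 20, 30, 40)
jumpy   <- c(0, 40, 10, 30, 20)

mssd(gradual)   # 100
mssd(jumpy)     # 750
```

Unlike the variance, the MSSD depends on the order of the nights, so it captures night-to-night instability rather than overall spread.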

library(ggplot2)    
sleep_data <- sleep_data %>%
  mutate(
    midpoint_sleep = as.numeric(midpoint_sleep),
    bedtime_mssd = as.numeric(bedtime_mssd)
  )
sleep_data_clean <- sleep_data %>%
  filter(!is.na(midpoint_sleep), !is.na(bedtime_mssd))

sleep_data_clean %>%
  ggplot(aes(x = midpoint_sleep, y = bedtime_mssd)) +
  geom_density2d(color = "blue") +  
  geom_point(color = "gray", alpha = 0.6, size = 2) +  
  labs(
    title = "Midpoint of Sleep vs Bedtime Variability",
    x = "Midpoint of Sleep (minutes after 11PM)",
    y = "Bedtime Variability (MSSD)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12)
  )

This figure depicts the relationship between students’ sleep midpoint and the variability in their bedtimes, measured by the MSSD. A large number of students have a sleep midpoint of 300 to 400 minutes after 11 PM (approximately 4:00 to 5:40 AM) and show minimal bedtime variability, indicating that they maintain stable sleep patterns despite the late hour. The contour lines show that most students are concentrated in this region. As the sleep midpoint shifts later, bedtime variability rises slightly, suggesting that students who go to bed much later are more likely to have irregular bedtime habits. In summary, most students keep consistent sleep schedules, while those who sleep later are somewhat more prone to erratic routines.
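Since midpoint_sleep is recorded in minutes after 11 PM, it helps to convert axis values to clock times when reading the plot. The helper below is a small sketch; the function name `minutes_after_11pm_to_clock()` is hypothetical.

```r
# Convert minutes after 11 PM into a 24-hour clock time string.
minutes_after_11pm_to_clock <- function(mins) {
  # 11 PM is minute 23 * 60 of the day; wrap around at midnight.
  total <- (23 * 60 + round(mins)) %% (24 * 60)
  sprintf("%02d:%02d", as.integer(total %/% 60), as.integer(total %% 60))
}

minutes_after_11pm_to_clock(c(300, 400))
# "04:00" "05:40"
```

The same helper could be used to relabel the x-axis breaks with clock times instead of raw minutes.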

Now we can continue to look at the relationship between GPA and bedtime consistency, using a scatterplot with a linear regression line to see if better-performing students also have more regular sleep schedules.

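ggplot(sleep_data, aes(x = term_gpa, y = bedtime_mssd)) +
  geom_point(alpha = 0.5) +
  # Linear fit with its default 95% confidence band
  geom_smooth(method = "lm", color = "blue") +
  labs(x = "Term GPA", y = "Bedtime Variability (MSSD)")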

This graph shows the relationship between students’ term GPAs and their bedtime variability, measured by MSSD. Each point represents an individual student, with the x-axis showing their term GPA and the y-axis showing how much their bedtime varied from night to night. A blue linear regression line, along with its confidence band, highlights the trend. The plot shows a slight negative correlation: students with higher GPAs tend to have less variability in their bedtimes, suggesting that better academic performers generally keep more consistent sleep schedules. The correlation is weak and individual variation is large, so the overall trend points to only a modest connection between academic success and sleep regularity. This analysis does not imply causation; it only examines the association between academic performance and bedtime consistency. Future research could explore whether promoting stable sleep routines supports better academic outcomes, or whether disciplined academic habits naturally extend to sleep behaviors.

In this section, we found interesting insights into college students’ sleep and study habits. The first graph shows that most students have a sleep midpoint between about 4:00 AM and 5:40 AM and keep a consistent sleep schedule despite the late nights. However, those who sleep even later tend to have more irregular bedtimes, suggesting a less structured routine. The second graph shows a slight negative correlation between GPA and bedtime variability, indicating that students with higher GPAs tend to have more regular bedtimes. Consistent sleep patterns may therefore go along with better academic performance, although individual differences exist and the relationship is modest. Overall, these findings point to the value of a regular sleep routine for both health and academic success. Encouraging students to maintain consistent sleep patterns could support their daily performance and academic outcomes, and could open up discussion of how good sleep habits benefit college student life.

Research Question 4

Lastly, we wanted to learn how daytime sleep habits relate to academic performance under different course loads. We explored this question with two complementary views of the same data. The first plot is a scatterplot of individual students’ daytime sleep against GPA, colored by workload. The second is a heatmap of filled 2-D density contours (with points overlaid), faceted by workload. Despite their differences, both graphs focus on the same three variables: daytime_sleep, cum_gpa, and term_units.

library(ggplot2)
library(readr)
library(dplyr)
sleep_data$term_units <- as.numeric(sleep_data$term_units)
sleep_data <- sleep_data %>%
  mutate(workload = case_when(
    term_units < 40 ~ "Light Load (<40)",
    term_units >= 40 & term_units <= 50 ~ "Standard Load (40-50)",
    term_units > 50 ~ "Heavy Load (>50)"
  ))
sleep_data$workload <- factor(sleep_data$workload, 
levels = c("Light Load (<40)", "Standard Load (40-50)", "Heavy Load (>50)"))
ggplot(sleep_data, aes(x = daytime_sleep, y = cum_gpa, color = workload)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Cumulative GPA vs. Daytime Sleep by Academic Workload",
    x = "Daytime Sleep (min)",
    y = "Cumulative GPA",
    color = "Term Workload"
  ) 

Starting with the first graph, this scatterplot maps each student’s daytime sleep (minutes per day) on the x-axis against cumulative GPA on the y-axis, with point colors showing term workload: light, standard, heavy, or gray for missing. A dense cluster stands out in the top-left: students who nap under about 60 minutes and maintain GPAs above 3.5. A few outliers nap much longer or have lower GPAs, and the gray points are a reminder to handle missing workload data carefully.

Visually, this pattern supports our core claim that shorter daytime sleep tends to coincide with higher GPAs, and it appears in every workload category: as daytime sleep increases, GPA tends to decrease. The cluster is tighter among standard and heavy-load students, suggesting that as course demands rise, short naps become even more common among high achievers, while lighter-load students show greater GPA variation for similar nap times.

sleep_data <- sleep_data %>%
  mutate(
    workload = case_when(
      term_units < 40                     ~ "Light (<40 units)",
      term_units >= 40 & term_units <= 50 ~ "Standard (40–50)",
      term_units > 50                     ~ "Heavy (>50)"
    ),
    workload = factor(workload,
                      levels = c("Light (<40 units)",
                                 "Standard (40–50)",
                                 "Heavy (>50)"))
  ) %>%
  filter(!is.na(workload))

ggplot(sleep_data, aes(x = daytime_sleep, y = cum_gpa)) +
  stat_density_2d_filled(contour_var = "ndensity", alpha = 0.8) +
  geom_point(alpha = 0.3, size = 0.8)+
  scale_fill_viridis_d(name = "Relative\ndensity") +
  facet_wrap(~ workload, nrow = 1) +
  labs(
    title = "Daytime Sleep vs. Cumulative GPA by Academic Workload",
    x = "Daytime Sleep (minutes per day)",
    y = "Cumulative GPA"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    panel.spacing.x = unit(1, "lines"),
    strip.text = element_text(face = "bold")
  )

This heatmap plots daytime sleep (x-axis, minutes per day) against cumulative GPA (y-axis), faceted by academic workload (Light, Standard, Heavy). The filled contours show relative student density, with the lightest yellows marking where most data points lie, while the overlaid dots show individual observations. In all three panels, the highest density sits in the region of 0–60 minutes of daytime sleep and a GPA of 3.5–4.0, indicating that most students nap briefly and that these students tend to have top GPAs. Moving right (more daytime sleep), the density fades and GPA values spread downward, which points to a negative association between nap length and academic performance. The dense yellow clusters in the upper-left of each panel support the claim that shorter daytime sleep goes along with higher GPA, and the similarity across panels suggests that the pattern does not depend strongly on workload.
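To illustrate what a "relative density" surface is, the sketch below estimates a 2-D kernel density with `MASS::kde2d()` on simulated data and rescales it so that its maximum is 1. This is only an approximation of the idea behind `contour_var = "ndensity"`; the exact grid, bandwidth, and per-panel scaling used by ggplot2 are not reproduced here, and the variables `nap` and `gpa` are simulated stand-ins.

```r
# Simulated stand-in data; these are NOT the study values.
set.seed(5)
nap <- rexp(300, rate = 1 / 30)                       # daytime sleep, minutes
gpa <- pmin(4, pmax(1.5, rnorm(300, mean = 3.5, sd = 0.4)))

# 2-D kernel density estimate on a 50 x 50 grid.
dens <- MASS::kde2d(nap, gpa, n = 50)

# Rescale so the highest point of the surface equals 1.
ndens <- dens$z / max(dens$z)

# Grid location of the density peak.
peak <- which(ndens == max(ndens), arr.ind = TRUE)
c(nap = dens$x[peak[1, 1]], gpa = dens$y[peak[1, 2]])
```

If the rescaling is done separately within each facet, the brightest color marks the densest region of that panel only, so equal colors in different panels need not correspond to equal numbers of students.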

Conclusion & Future Improvement

Despite the different research questions, our analyses point to a consistent theme: sleep timing, duration, and regularity are all related to academic success. First, both the violin plots and the regression model from the first research question indicate that students who go to bed earlier tend to earn higher GPAs, as later bedtimes significantly predict lower GPAs. Moreover, students with more regular bedtimes also tend to have higher GPAs, suggesting that a stable sleep routine goes along with better academic performance. Together, these results identify sleep timing and consistency as consistent, though modest, correlates of GPA.

Moving on to the next research question, we observed that the relationship between sleep duration and GPA varies with student characteristics. Female students who sleep longer at night had slightly higher GPAs. In our dendrogram analysis, which segments students by sleep time, GPA, and gender, females with more sleep tended to fall in the cluster with higher GPAs. This suggests that gender-specific sleep patterns may further relate to academic outcomes.

Finally, daytime napping adds another dimension: across light, standard, and heavy course loads, shorter daytime sleep (under 60 minutes) is linked to higher GPAs. Both the scatterplot and the heatmap show dense clustering in the low-sleep/high-GPA region for every workload category, with heavier workloads producing an even tighter concentration and fewer outliers. As course demands rise, keeping daytime sleep brief appears to be increasingly associated with higher performance.

Taken together, these results show the importance of considering several factors at once when assessing how sleep habits relate to student achievement. They also provide a foundation for future work on the environmental and personal factors that shape both nighttime and daytime sleep.

While our study links sleep timing, duration, and consistency to GPA, there’s more to uncover. Next, we’d gather data on stress and daily routines to see what drives performance, use technologies or surveys to measure sleep quality and experiment with flexible models to spot things like gender differences. These steps need new data and tools, but they’ll give us a clearer picture of how different sleep habits shape academic success.