Introduction

We analyzed the Estimation of Obesity Levels dataset from the UCI Machine Learning Repository, which contains information on 2,111 individuals from Mexico, Peru, and Colombia. Each row represents one individual, described by 17 columns capturing demographics, eating habits, physical activity, and a labeled obesity class. The target variable, NObeyesdad, sorts individuals into seven obesity groups.

Key Variables

library(tibble)
library(knitr)

var_tbl = tribble(
  ~Variable, ~`Description (survey wording abridged)`, ~Type, ~`Levels / Units`,
  "Age", "Age (years)", "Quantitative", "numeric",
  "BMI", "Body‑mass index (Weight / Height²) — derived", "Quantitative", "numeric, kg/m²",
  "FAF", "Days/week with physical activity", "Ordered categorical", "0 = None, 1 = 1‑2 d, 2 = 2‑4 d, 3 = 4‑5 d",
  "TUE_cat", "Daily tech‑device use", "Ordered categorical", "0‑2 h / 3‑5 h / >5 h",
  "FCVC", "Vegetable frequency in meals", "Ordered categorical", "1 = Rarely, 2 = Sometimes, 3 = Always",
  "FAVC", "Eat high‑calorie food frequently?", "Binary categorical", "no / yes",
  "CH2O", "Daily water intake", "Ordered categorical", "1 = <1 L, 2 = 1‑2 L, 3 = >2 L",
  "Gender", "Self‑reported sex", "Binary categorical", "Female / Male",
  "MTRANS", "Primary transport mode", "Unordered categorical", "Automobile / Motorbike / Bike / Public Transportation / Walking",
  "NObeyesdad", "Expert‑labelled obesity class", "Ordered categorical", "7 levels (Insufficient → Obesity III)"
)

kable(
  var_tbl,
  caption  = "Variables used in the analysis",
  booktabs = TRUE,
  align    = c("l", "l", "l", "l")
  )
Table: Variables used in the analysis

| Variable | Description (survey wording abridged) | Type | Levels / Units |
|:--|:--|:--|:--|
| Age | Age (years) | Quantitative | numeric |
| BMI | Body‑mass index (Weight / Height²), derived | Quantitative | numeric, kg/m² |
| FAF | Days/week with physical activity | Ordered categorical | 0 = None, 1 = 1‑2 d, 2 = 2‑4 d, 3 = 4‑5 d |
| TUE_cat | Daily tech‑device use | Ordered categorical | 0‑2 h / 3‑5 h / >5 h |
| FCVC | Vegetable frequency in meals | Ordered categorical | 1 = Rarely, 2 = Sometimes, 3 = Always |
| FAVC | Eat high‑calorie food frequently? | Binary categorical | no / yes |
| CH2O | Daily water intake | Ordered categorical | 1 = <1 L, 2 = 1‑2 L, 3 = >2 L |
| Gender | Self‑reported sex | Binary categorical | Female / Male |
| MTRANS | Primary transport mode | Unordered categorical | Automobile / Motorbike / Bike / Public Transportation / Walking |
| NObeyesdad | Expert‑labelled obesity class | Ordered categorical | 7 levels (Insufficient → Obesity III) |

Research Questions

  1. Physical Activity & Screen Time

    How does weekly physical activity (FAF) relate to BMI, and is that relationship moderated by technology use (TUE)?

    Regular exercise is a cornerstone of weight management, but its effectiveness can be undermined by prolonged sedentary behaviors, especially leisure screen‐time. By examining how days‐per‐week of physical activity relate to BMI and whether that slope changes across low, moderate, and high device‑use groups, we can identify whether screen‑time weakens the benefits of exercise. This informs whether public‐health guidance should pair exercise prescriptions with screen‐time reduction for optimal obesity prevention.

  2. Vegetables and High‑Calorie Food

    Do vegetable‑eating frequency (FCVC) and high‑calorie‑food habit (FAVC) interact in their effect on BMI?

    A diet rich in vegetables is broadly linked to lower caloric density and improved weight outcomes, yet many people consume both produce and calorie‑dense snacks or meals. Investigating the joint effect of vegetable frequency (FCVC) and the frequency of high‑caloric food intake (FAVC) on BMI allows us to see if greens alone are enough, or if they must be coupled with limiting indulgent foods.

  3. Hydration and Age

    Is higher water intake (CH2O) associated with lower BMI, and does that association vary across age groups?

    Water intake has been proposed to support weight regulation through satiety, metabolism, or as a substitute for sugary beverages. However, few studies examine how its association with BMI may change across different life stages. By mapping BMI distributions by hydration level and then overlaying age, we can determine if heavier water drinkers consistently carry less weight and whether that pattern holds for both younger and older adults.

  4. Mobility

    How do primary transport modes (MTRANS) relate to obesity risk, both in terms of BMI distributions and obesity class proportions?

    Some daily transport choices, such as walking or biking, act as built‑in sources of physical activity. Motorized commutes do not. Examining both BMI distributions and formal obesity class breakdowns by primary transport mode shows how active versus sedentary mobility patterns map onto weight status.

library(ggplot2)
library(ggridges)
library(dplyr)
obesity = read.csv("/Users/ava/Desktop/36-315/hw/Project/ObesityDataSet_raw_and_data_sinthetic.csv") |>
  mutate(Gender = factor(Gender, levels = c("Female","Male")),
         FAVC  = factor(FAVC, levels = c("no","yes")),
         # Exact matching: values other than 1, 2, or 3 become NA
         FCVC  = factor(FCVC, levels = c("1","2","3")),
         CH2O  = factor(CH2O, levels = c("1","2","3")),
         MTRANS = factor(MTRANS, levels = c("Automobile","Motorbike","Bike",
                                            "Public_Transportation","Walking")),
         # Obesity classes ordered from lightest to heaviest
         NObeyesdad = factor(NObeyesdad, levels = c("Insufficient_Weight",
                                                    "Normal_Weight",
                                                    "Overweight_Level_I",
                                                    "Overweight_Level_II",
                                                    "Obesity_Type_I","Obesity_Type_II",
                                                    "Obesity_Type_III")),
         Age = as.numeric(Age),
         Height = as.numeric(Height),
         Weight = as.numeric(Weight),
         # BMI in kg/m² (Weight in kg, Height in m)
         BMI = Weight / Height^2,
         # Label lookups by position: a non-integer index is truncated toward zero
         FAF = as.numeric(FAF),
         FAF_lab = factor(c("None","1-2 days","2-4 days","4-5 days")[FAF + 1],
                          levels = c("None","1-2 days","2-4 days","4-5 days")),
         TUE = as.numeric(TUE),
         TUE_cat = factor(c("0-2 h","3-5 h",">5 h")[TUE + 1],
                          levels = c("0-2 h","3-5 h",">5 h")),
         NCP_lab = factor(c("1-2 meals","3 meals",">3 meals")[as.numeric(NCP)],
                          levels = c("1-2 meals","3 meals",">3 meals"))
         )
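The factor conversions for FCVC and CH2O match values exactly, so any recorded value that is not exactly 1, 2, or 3 becomes NA. The "NA" groups in the later plots and the rows dropped in the warnings are consistent with that. The sketch below uses made-up values (not the obesity data) to contrast exact matching with rounding to the nearest code first. Rounding is an assumption shown for comparison, not the approach used in this report.

```r
# Toy values standing in for a survey column such as FCVC (made up for illustration).
fcvc_raw = c(1, 2, 3, 2.45, 2.96, 1.2)

# Exact matching, as in the loading code: non-integer values become NA.
fcvc_exact = factor(fcvc_raw, levels = c("1", "2", "3"))

# Alternative (an assumption, not the report's method): round to the nearest code first.
fcvc_rounded = factor(round(fcvc_raw), levels = 1:3,
                      labels = c("Rarely", "Sometimes", "Always"))

sum(is.na(fcvc_exact))   # number of values dropped by exact matching
table(fcvc_rounded)      # counts per category after rounding
```

Whether rounding is appropriate depends on how the non-integer values arose, so the choice should be stated alongside any results that rely on it.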

RQ 1: Physical Activity, Screen Time, and BMI

ggplot(obesity, aes(FAF, BMI, colour = Gender)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = 0:3,
                     labels = c("None","1-2d","2-4d","4-5d")) +
  labs(x = "Weekly physical activity (FAF categories)", y = "BMI")
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot shows a modest but consistent downward trend in BMI as weekly exercise frequency increases. Both the female (red) and male (blue) trendlines slope gently downward from “None” to “4–5 days”, indicating that people who report more days of physical activity tend to have slightly lower BMI. The nearly parallel lines suggest that this inverse relationship is of similar strength for women and men. The broad vertical spread of points at each FAF category reflects substantial individual variability: many individuals who exercise frequently still have high BMIs, and vice versa. In short, while increased exercise frequency is associated with lower BMI, the gentle slope and wide spread make it clear that physical activity frequency alone only partly explains differences in body mass.

mod = lm(BMI ~ FAF * Gender, data = obesity)
summary(mod)
## 
## Call:
## lm(formula = BMI ~ FAF * Gender, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7900  -5.4078  -0.8372   6.2775  21.8777 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     31.6121     0.3469  91.139  < 2e-16 ***
## FAF             -1.7496     0.2907  -6.018 2.08e-09 ***
## GenderMale      -0.5506     0.5424  -1.015    0.310    
## FAF:GenderMale   0.2267     0.4112   0.551    0.581    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.888 on 2107 degrees of freedom
## Multiple R-squared:  0.03205,    Adjusted R-squared:  0.03067 
## F-statistic: 23.25 on 3 and 2107 DF,  p-value: 8.312e-15

This model estimates that, for women (the reference group), each one‑category increase in weekly physical activity (e.g., from “None” to “1–2 days,” or from “1–2 days” to “2–4 days”) is associated with an average 1.75 kg/m² decrease in BMI (t = –6.02, p < .001). Because FAF enters as a numeric score, the model assumes equal steps between categories. The intercept of 31.6 kg/m² is the estimated BMI for women reporting no exercise. Men start about 0.55 kg/m² lower than women when FAF = 0, but that gender difference is not statistically significant (p = .31), nor is the interaction term (p = .58), so there is no evidence that the exercise–BMI slope differs between genders. The model’s R² of 0.032 means that physical activity and gender together explain about 3.2% of the variation in BMI: exercise frequency is a statistically significant predictor of lower BMI, but most individual differences in BMI are associated with other factors.
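To make the coefficients concrete, the following sketch plugs the estimates from the summary output into the model equation and tabulates fitted BMI for each gender and FAF score. The helper `predict_bmi` and the 0/1 coding of `Male` are illustrative names, not part of the original analysis.

```r
# Coefficients copied from the summary(mod) output.
b = c(intercept = 31.6121, faf = -1.7496, male = -0.5506, faf_male = 0.2267)

# Fitted BMI for a FAF score (0-3) and a gender indicator (0 = Female, 1 = Male).
predict_bmi = function(faf, male) {
  unname(b["intercept"] + b["faf"] * faf + b["male"] * male +
           b["faf_male"] * faf * male)
}

# All eight gender x FAF combinations.
grid = expand.grid(FAF = 0:3, Male = 0:1)
grid$BMI_hat = round(predict_bmi(grid$FAF, grid$Male), 2)
grid
```

Under these estimates the fitted drop from FAF = 0 to FAF = 3 is about 5.2 kg/m² for women and about 4.6 kg/m² for men, a small gap that matches the non-significant interaction term.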

ggplot(obesity, aes(FAF, BMI, colour = Gender)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "loess", se = TRUE) +
  facet_wrap(~TUE_cat, nrow = 1) +
  scale_x_continuous(breaks = 0:3,
                     labels = c("None","1-2d","2-4d","4-5d")) +
  labs(x = "Physical activity (FAF categories)")
## `geom_smooth()` using formula = 'y ~ x'

We split the exercise–BMI relationship by screen‑time category, which is the amount of time spent on technological devices such as cell phones, video games, television, and computers. In the 0–2 h panel, women show a substantial drop in BMI between no exercise and 1–2 days of exercise per week, followed by a modest rebound at 2–4 days and a renewed decline at 4–5 days. Men in this low screen‐time group, by contrast, show a slight increase in BMI at 1–2 days and again at 2–4 days before BMI falls at 4–5 days of activity. In the 3–5 h panel, both genders display a smooth, almost linear decrease in BMI as exercise frequency rises, so the inverse relationship is most consistent at moderate screen‐time. Finally, in the >5 h panel, the curves become far more erratic, with small bumps and dips replacing the clear downward trend. In this sample, then, the inverse exercise–BMI relationship is not clearly visible at very high screen time.

Across all three panels, more exercise generally corresponds to lower BMI, but the size of that difference shrinks as screen‐time increases. The drop in BMI from “None” to “4–5 days” is largest in the 0–2 h group, more modest in the 3–5 h group, and barely noticeable (and uneven) in the >5 h group. These are visual comparisons of smoothed curves from cross‑sectional data, so they describe associations rather than effects. Still, the pattern is consistent with the idea that heavy screen‐time weakens the link between exercise and lower BMI, which would support pairing regular physical activity with limits on leisure device use.
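The moderation reading above rests on visual comparison of the panels. One way to check it formally is to compare a model with a single FAF slope against one that allows a separate slope per screen-time group. The sketch below runs on simulated data (not the obesity dataset), and the object names `toy`, `fit_add`, and `fit_int` are illustrative.

```r
set.seed(1)  # simulated data for illustration only; not the obesity dataset

n = 300
toy = data.frame(
  FAF = sample(0:3, n, replace = TRUE),
  TUE_cat = factor(sample(c("0-2 h", "3-5 h", ">5 h"), n, replace = TRUE),
                   levels = c("0-2 h", "3-5 h", ">5 h"))
)
toy$BMI = 30 - 1.5 * toy$FAF + rnorm(n, sd = 5)

# Additive model: one common FAF slope across screen-time groups.
fit_add = lm(BMI ~ FAF + TUE_cat, data = toy)

# Interaction model: a separate FAF slope in each screen-time group.
fit_int = lm(BMI ~ FAF * TUE_cat, data = toy)

# F-test of whether the separate slopes improve the fit (i.e., moderation).
cmp = anova(fit_add, fit_int)
cmp
```

Applied to the real data, a small p-value in the second row of `cmp` would support the claim that the exercise–BMI slope differs by screen-time group, while a large one would suggest the differences between panels could be noise.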

RQ 2: Vegetables, high‑calorie food, and BMI

ggplot(obesity, aes(FCVC, BMI, fill = FAVC)) +
  geom_violin(trim = FALSE, alpha = .55) +
  geom_boxplot(width = .08, outlier.shape = NA) +
  scale_x_discrete(labels = c("Rarely","Sometimes","Always")) +
  labs(x = "Vegetable frequency (FCVC)", fill = "High calorie food?")

In the “Rarely” vegetable category, both groups display wide BMI distributions, but those who report eating high‑calorie foods (teal) have a noticeably higher median and a heavier upper tail than those who do not (red). In “Sometimes”, the distributions tighten: non‑high‑calorie eaters cluster around a median of ~24 kg/m², while high‑calorie consumers center closer to 27 kg/m², again with a broader spread toward higher BMIs. In the “Always” category, vegetable eaters who avoid high‑calorie foods show the lowest median BMI (~22 kg/m²), whereas those who still consume high‑calorie items have a median around 40 kg/m² and a long upper tail pushing into the low‑50s. The “NA” group follows the same pattern: teal violins sit above red ones. Overall, more frequent vegetable consumption corresponds with lower BMI only among those who avoid high‑calorie foods. At every frequency level, including “Always”, those who also consume high‑calorie foods have substantially higher BMIs. This suggests that while higher vegetable intake is linked to lower body weight, avoiding high‑calorie foods is at least as important for managing BMI.

obesity |>
  mutate(VegFreq_num = as.numeric(as.character(FCVC))) |>
  ggplot(aes(x = VegFreq_num, y = BMI)) +
    geom_point(alpha = 0.18) +
    # after_stat() and linewidth replace the deprecated ..level.. and size
    stat_density2d(aes(colour = after_stat(level)), linewidth = 0.9, show.legend = FALSE) +
    facet_wrap(~ FAVC, nrow = 1,
               labeller = as_labeller(c(`no` = "No high-cal food",
                                        `yes` = "Yes high-cal food"))) +
    scale_x_continuous(breaks = 1:3, labels = c("Rarely", "Sometimes", "Always")) +
    labs(x = "Vegetable frequency (FCVC)", y = "BMI",
         title = "BMI vs vegetable frequency"
    )
## Warning: Removed 826 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 826 rows containing missing values or values outside the scale range
## (`geom_point()`).

In the “No high‐calorie food” panel, the 2D density contours show that BMIs sit roughly between 19 and 26 kg/m² for those who rarely eat vegetables. The densest contour rises slightly to about 26–27 kg/m² for “sometimes” eaters and shifts to around 24 kg/m² among “always” eaters. In other words, among people who avoid high‐calorie foods, eating vegetables “always” corresponds with some of the leanest profiles. In the “Yes high‐calorie food” panel, by contrast, BMI increases with vegetable frequency. The cluster for “rarely” vegetable eaters sits around 20–26 kg/m², shifts up to about 27–30 kg/m² for “sometimes”, and, in the “always” group, shifts much higher to around 40–45 kg/m². Overall, eating more vegetables is associated with lower BMI only among people who do not also report frequent high‐calorie eating. Among those who do, higher vegetable intake does not offset the “BMI penalty”; in fact, the heaviest people often report the highest vegetable frequency, perhaps because they are already attempting to compensate.
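The interaction described above can also be summarized numerically by computing the median BMI in each FCVC by FAVC cell. The sketch below uses a small made-up data frame (the values are illustrative, not taken from the dataset) and base R only.

```r
# Toy data (made up): vegetable frequency code, high-calorie habit, and BMI.
toy = data.frame(
  FCVC = factor(c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3)),
  FAVC = factor(rep(c("no", "yes"), each = 6), levels = c("no", "yes")),
  BMI  = c(22, 24, 23, 25, 21, 23, 25, 27, 27, 29, 38, 42)
)

# Median BMI in each FCVC x FAVC cell.
cell_medians = aggregate(BMI ~ FCVC + FAVC, data = toy, FUN = median)

# Arrange as a grid: rows = FCVC level, columns = FAVC level.
# Each cell holds one value, so the sum xtabs() computes equals the median.
median_grid = xtabs(BMI ~ FCVC + FAVC, data = cell_medians)
median_grid
```

If the two columns of such a grid move in opposite directions as FCVC increases, that is the numerical signature of the interaction seen in the plots.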

RQ 3: Hydration and Age

ggplot(obesity, aes(BMI, CH2O, fill = CH2O)) +
  geom_density_ridges(scale = 1.1, alpha = .7, rel_min_height = .01) +
  scale_y_discrete(labels = c("<1 L","1-2 L",">2 L")) +
  labs(y = "Daily water group (CH2O)")
## Picking joint bandwidth of 1.71

The ridgeline plot shows how BMI is distributed across daily water‐intake groups. People who report drinking less than 1 L per day (bottom ridge) exhibit the widest range of BMIs, with a pronounced peak in the mid‑20s but a substantial tail extending well above 30, so a sizable share of this group has a high BMI. Those in the 1–2 L group have a tighter, more unimodal distribution centered around the high‑20s, with far fewer extremely high BMI values. Finally, the >2 L drinkers (third ridge) show a similar but slightly lower and narrower peak (mid‑20s). The topmost “NA” ridge collects rows whose CH2O value did not match one of the three coded levels (1, 2, or 3) when converted to a factor, and it is set aside here. Overall, the peaks of the three groups differ only modestly, so any inverse relationship between water intake and BMI is weak. The clearer difference is in spread: higher water intake tends to coincide with a narrower, more consistent BMI distribution.

ggplot(obesity, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.15) +
  stat_density2d(color = "blue") +
  facet_wrap(~ CH2O,
             labeller = as_labeller(c(
               "1" = "<1 L",
               "2" = "1–2 L",
               "3" = ">2 L"))
             ) +
  labs(x = "Age (years)")

Across all four panels, there is no clear, monotonic association between age and BMI within any water‐intake category. BMI remains highly variable at every age. In the <1 L group, the densest cluster sits around age 20 with BMI around 20, with a trailing cloud of higher‑BMI individuals. The 1–2 L panel shifts slightly older and heavier: most participants are in their mid‑20s with BMIs around 25–30. Among the >2 L drinkers, the tightest contour appears at younger ages (around 20) but with BMIs clustered in the mid‑20s rather than lower, suggesting that higher water intake in this sample does not by itself correspond to lower BMI. The “NA” facet holds rows whose CH2O value did not match a coded level and displays a diffuse spread. Overall, the overlapping contours and scattered points indicate that within each hydration group, age alone does little to predict BMI.
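Since the research question asks about age groups, one complementary check is to bin age and tabulate median BMI within each hydration by age-group cell. The sketch below uses made-up values, and the cut points at 25 and 35 years are arbitrary assumptions chosen only to illustrate the mechanics.

```r
# Toy data (made up): age in years, water-intake code, and BMI.
toy = data.frame(
  Age  = c(18, 22, 27, 33, 41, 19, 24, 29, 36, 45),
  CH2O = factor(c(1, 1, 2, 2, 3, 3, 1, 2, 3, 2), levels = 1:3,
                labels = c("<1 L", "1-2 L", ">2 L")),
  BMI  = c(21, 24, 27, 30, 26, 23, 25, 28, 27, 31)
)

# Bin age into groups; the cut points 25 and 35 are arbitrary assumptions.
toy$AgeGroup = cut(toy$Age, breaks = c(-Inf, 25, 35, Inf),
                   labels = c("<=25", "26-35", ">35"))

# Median BMI and group size for each non-empty hydration x age-group cell.
by_cell = aggregate(BMI ~ CH2O + AgeGroup, data = toy,
                    FUN = function(x) c(median = median(x), n = length(x)))
by_cell
```

Reporting the cell size next to each median matters here, because sparse cells (for example, older adults in a given hydration group) can produce medians that look meaningful but rest on very few people.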

RQ 4: Transport Mode and Obesity Risk

ggplot(obesity, aes(MTRANS, fill = NObeyesdad)) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Primary transport", y = "% within mode", fill = "Obesity class")

In the filled‐bar chart, each transport mode shows a distinct obesity‐class profile. Walking stands out as the leanest: over half of walkers fall into “Normal Weight,” with the remainder split between “Overweight I,” “Overweight II,” and “Insufficient Weight” and very few in other categories. Biking is similar but with a slightly larger slice of “Overweight I” and a modest presence of “Obesity Type II.” Public Transportation users exhibit a much broader mix, with a much smaller share of “Normal Weight” and sizable segments in each overweight and obesity class, including the highest share of “Obesity Type III.” Motorbike riders resemble bike riders in the “Normal Weight” category and have a sizable share of “Obesity Type I,” with some in “Overweight Level I” and “Overweight Level II.” Automobile drivers show the highest proportion of obesity overall, particularly in the “Obesity Type I” and “Obesity Type II” brackets, with a relatively small “Normal Weight” slice.
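The within-mode shares drawn by a filled bar chart are row-wise proportions of a two-way table, and they are easier to interpret alongside the number of people in each mode. The sketch below computes both for a small made-up data frame with a collapsed three-level weight class. The values and the column name `Class` are illustrative only.

```r
# Toy data (made up): transport mode and a collapsed three-level weight class.
toy = data.frame(
  MTRANS = c("Walking", "Walking", "Walking", "Walking",
             "Automobile", "Automobile", "Automobile", "Automobile", "Automobile",
             "Bike"),
  Class  = c("Normal", "Normal", "Normal", "Overweight",
             "Normal", "Overweight", "Obese", "Obese", "Obese",
             "Normal")
)

# Counts per mode: small groups make within-mode percentages unstable.
n_by_mode = table(toy$MTRANS)

# Row-wise proportions, the quantities drawn by geom_bar(position = "fill").
prop_by_mode = prop.table(table(toy$MTRANS, toy$Class), margin = 1)

n_by_mode
round(prop_by_mode, 2)
```

In the toy table the single "Bike" row is 100% "Normal", which shows why a mode with very few respondents should not be read as strongly as one with hundreds.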

ggplot(obesity, aes(x = MTRANS, y = BMI)) +
  geom_boxplot(fill = "blue", alpha = 0.6) +
  coord_flip() +
  labs(x = "Transport mode", y = "BMI")

The boxplots reinforce these patterns in raw BMI distributions. Walkers have the lowest median BMI (~23 kg/m²), the tightest IQR, and minimal outliers above 30. Likewise, bikers cluster around the mid‑20s with a compact spread. Motorbike riders shift upward to a median around 24 kg/m² and display a broader IQR, while Public Transportation users have the widest spread (IQR spanning mid‑20s to mid‑30s). Automobile drivers top the list with a median near 28 kg/m² and a wide IQR. Together, these visuals suggest that active modes (walking, biking) are associated with lower and more consistent BMIs, whereas motorized and mass‐transit modes are associated with higher and more variable BMI profiles.

Conclusion

Across our four research questions, we found that no single behavior fully determines BMI, but a constellation of lifestyle factors helps explain variation in obesity risk among adults in Mexico, Peru, and Colombia.

Taken together, these findings point to the value of integrated lifestyle interventions that combine regular physical activity, limited screen‑time, a plant‑forward diet, adequate hydration, and active transportation. Because the data are cross‑sectional, several questions remain open for future research. Longitudinal or intervention studies are needed to establish whether increasing exercise or cutting screen‑time actually drives BMI changes, rather than merely reflecting preexisting differences. We also need to pinpoint dose–response thresholds for hydration, vegetable intake, and active commuting, and to incorporate richer measures of diet quality (beyond frequency) and screen‑time context (e.g., passive TV vs. interactive gaming). Finally, linking in sociodemographic data, such as income and education, along with neighborhood walkability, could reveal key moderators of these lifestyle–BMI relationships.