36315 Final Report: Global Food Wastage

Author

Miao Lei, Sophia Mou, Zi Liu

Data Overview

The data analyzed in this report come from the Kaggle dataset titled Global Food Wastage Dataset (2018-2024), available at https://www.kaggle.com/datasets/atharvasoundankar/global-food-wastage-dataset-2018-2024. The dataset contains 5,000 observations covering food wastage across 20 countries, spanning seven years and eight food categories. There are eight variables in total, including two categorical variables and six quantitative variables. Each observation records food waste data for a specific food category within a given country and year. The variables of interest are listed below:

Country: Name of the country
Food Category: Type of food wasted (Fruits & Vegetables, Prepared Food, Dairy Products, Beverages, Meat & Seafood, Grains & Cereals, Frozen Food, Bakery Items)
Year: Year of data collection (2018-2024)
Total Waste: Total amount of food wasted in tons
Economic Loss: Estimated financial loss from food waste in million dollars
Avg Waste per Capita: Average waste per person in kilograms
Population: Population of the country in millions
Household Waste: Percentage of food waste from households

Motivated by our interest in sustainability, we posed the following three research questions:

1. Is there a consistent pattern in the ranking of the most wasted food categories over time?
2. Are there any factors that distinguish food categories?
3. How does a country’s population relate to economic loss from food waste, and how does this vary by food category?

Question 1: Is there a consistent pattern in the ranking of the most wasted food categories over time?

Motivated by our interest in global food waste solutions, we wanted to explore which food category is most wasted and whether the ranking remains consistent over time. To begin, we examined the most recent data from 2024. We created a bar plot where the x-axis shows eight food categories and the y-axis indicates the total waste amount (in tons) aggregated across all countries. To highlight the rankings, we ordered the bars in descending order based on waste amount.

Code

library(dplyr)
library(ggplot2)

data <- read.csv("global_food_wastage_dataset.csv")

# Total waste per food category in 2024, summed across all countries
total_2024 <- data |>
  filter(Year == 2024) |>
  group_by(Food.Category) |>
  summarize(Total_Wastage = sum(Total.Waste..Tons., na.rm = TRUE))

# Bars ordered from most to least wasted
total_2024 |>
  ggplot(aes(x = reorder(Food.Category, -Total_Wastage), y = Total_Wastage)) +
  geom_col(fill = "darkblue") +
  theme(axis.text.x = element_text(angle = 25, size = 8)) +
  labs(title = "Global Waste Amount for Different Food Categories in 2024",
       x = "Food Category", y = "Total Waste Amount (Tons)")

From this graph, we observe that Prepared Food, Fruits & Vegetables, and Frozen Food are the top three most wasted categories, while Meat & Seafood ranks the lowest. This result is somewhat surprising: since prepared and frozen foods are generally long-lasting, one might expect them to be wasted less than more perishable items like meat. This finding could suggest issues such as overproduction or inefficient consumption of long-shelf-life food products. Additionally, the graph provides insight into the global scale of waste for each category, with the most wasted category, Prepared Food, reaching around 2.7 million tons, while the least wasted, Meat & Seafood, is around 1.9 million tons.

To examine whether the ranking remains consistent across years, we created a time-series plot covering 2018 to 2024. The x-axis represents the year, and the y-axis shows the total waste amount, with each colored line tracking a different food category over time.

Code

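library(dplyr)
library(ggplot2)
library(ggrepel)  # provides geom_text_repel()

# `data` is the data frame read from global_food_wastage_dataset.csv.
# Total waste per year and food category, summed across all countries
data_summary <- data |>
  group_by(Year, Food.Category) |>
  summarize(Global_Wastage = sum(Total.Waste..Tons., na.rm = TRUE),
            .groups = "drop")

# 2024 rows, used to place the category labels at the right edge
data_2024 <- data_summary |>
  filter(Year == 2024)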

data_summary |>
  ggplot(aes(x = Year, y = Global_Wastage, color = Food.Category)) +
  geom_line() +
  geom_line(data = subset(data_summary, Food.Category == "Prepared Food"),
            aes(x = Year, y = Global_Wastage), linewidth = 1.5) + 
  # Add the labels:
  geom_text_repel(data = data_2024,
                  aes(label = Food.Category),
                  size = 3, 
                  # Drop the segment connection:
                  segment.color = NA, 
                  # Move labels up or down based on overlap
                  direction = "y",
                  # Try to align the labels horizontally on the left hand side
                  hjust = "left") +
  scale_x_continuous(breaks = unique(data_summary$Year),
                     labels = unique(data_summary$Year),
                     # Update the limits so that there is some padding on the
                     # x-axis but don't label the new maximum
                     limits = c(min(data_summary$Year),
                                max(data_summary$Year) + 1)) +
  theme_bw() +
  # Drop the legend
  theme(legend.position = "none") +
  labs(title = "Global Food Wastage Trends by Category (2018–2024)",
       subtitle = "Prepared Food appears in the top 3 across all years",
       x = "Year", y = "Total Wastage Amount (Tons)",
       color = "Food Category") +
  theme(plot.title = element_text(face = "bold", size = 13))

The lines intersect frequently, and each category shows large year-to-year fluctuations, suggesting there is no consistent ranking pattern across years. However, Prepared Food stands out by remaining within the top three categories every year, rising steadily after 2020 and peaking at over 2.75 million tons in 2023. In 2024, its waste amount is noticeably higher than that of every other category. In contrast, categories like Grains & Cereals and Bakery Items display more erratic patterns. For example, Grains & Cereals generally ranks low but briefly peaks around 2021, while Bakery Items declines sharply in 2021 and then increases substantially through 2023.

Overall, the ranking of most wasted food categories shifts over time, but Prepared Food consistently contributes a high amount of waste, pointing to a potential focus area for reducing global food waste.
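
One way to make the ranking comparison concrete is to rank the categories within each year and record the worst rank each category ever reaches. The following is a minimal base-R sketch on a hypothetical toy table (the numbers are invented for illustration and are not taken from the dataset); it only assumes the same column names as `data_summary`.

```r
# Toy yearly totals (hypothetical values, not from the dataset)
toy <- data.frame(
  Year = rep(c(2018, 2019), each = 3),
  Food.Category = rep(c("Prepared Food", "Bakery Items", "Beverages"), times = 2),
  Global_Wastage = c(300, 200, 100, 250, 260, 90)
)

# Rank 1 = most wasted category within each year
toy$Rank <- ave(-toy$Global_Wastage, toy$Year, FUN = rank)

# Worst (numerically largest) rank each category reaches across all years
worst_rank <- tapply(toy$Rank, toy$Food.Category, max)
print(worst_rank)
```

Applied to `data_summary`, a category whose worst rank is 3 or better has stayed in the top three in every year.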

Question 2: Are there any factors that distinguish food categories?

Neither the time-series plot from the previous research question nor the box plot below shows an obvious pattern in total waste across food categories, so we expanded our scope to compare the categories on additional variables.

Code

library(dplyr)
library(ggplot2)
food <- read.csv("global_food_wastage_dataset.csv")

# Per-observation total waste in 2024, one box per food category
tf <- food[food$Year == 2024, ]
tf |>
  ggplot(aes(x = Food.Category, y = Total.Waste..Tons.)) +
  geom_boxplot(width = .2) +
  coord_flip() +
  labs(title = "Distribution of Total Waste across Food Categories (2024)",
       x = "Food Category", y = "Total Waste (Tons)")

This led to our second research question: Are there any factors that distinguish food categories? We examined all numerical variables other than population (total waste, economic loss, average waste per capita, and household waste) and used principal component analysis (PCA) to look for patterns in all of them at once. To make potential patterns easier to see, we restricted this analysis to observations from the United States.

We first made the following elbow plot to decide how many principal components to analyze. The dashed line marks the average share of variance per component (25% with four variables). The first two components lie above this line, meaning that each explains more variance than an average component, and the third lies only slightly below it, so it still accounts for a meaningful share of the variation. We therefore focus on the first three components, which together explain most of the variation in the data.

Code

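library(factoextra)

# Keep U.S. observations and drop population, which is excluded from the PCA
us <- food[food$Country == "USA", ]
us <- select(us, -"Population..Million.")

# Columns 4-7: total waste, economic loss, average waste per capita, household waste
food_quant <- us[, 4:7]
food_pca <- prcomp(food_quant, center = TRUE, scale. = TRUE)

# Elbow plot; the dashed line marks the average variance share per component
fviz_eig(food_pca, addlabels = TRUE) +
  geom_hline(yintercept = 100 * (1 / ncol(food_quant)),
             linetype = "dashed", color = "darkred")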
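
The comparison against the average line can also be checked numerically. The sketch below uses R's built-in `USArrests` data as a stand-in (an assumption made only so the example runs without the CSV; like our PCA input, it has four numeric columns) and computes each component's share of variance alongside the 1 / (number of variables) threshold.

```r
# Stand-in data: USArrests ships with R and has four numeric columns
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Average share per component: 1 / (number of variables)
avg_share <- 1 / ncol(USArrests)

scree <- data.frame(
  PC = paste0("PC", seq_along(prop_var)),
  prop_var = round(prop_var, 3),
  cumulative = round(cumsum(prop_var), 3),
  above_average = prop_var > avg_share
)
print(scree)
```

Replacing `USArrests` with `food_quant` would give the numbers behind the elbow plot for the U.S. subset.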

We then examined the principal components through their loadings, that is, the coefficients of the linear combination that defines each component. Based on the following table, the first component loads strongly and positively on total waste (0.690) and economic loss (0.689). The second component has a strong positive loading on average waste per capita (0.615) and a strong negative loading on household waste (-0.760). The third component has strong positive loadings on both average waste per capita (0.765) and household waste (0.639).

Code

# linear combination table
comb = data.frame(food_pca$rotation[,1:3])

library(knitr)
kable(comb, caption = "Linear Combination between Original Numerical Variables and the First Three PCs")

Linear Combination between Original Numerical Variables and the First Three PCs

| Variable | PC1 | PC2 | PC3 |
|---|---|---|---|
| Total.Waste..Tons. | 0.6897859 | -0.1455000 | -0.0492398 |
| Economic.Loss..Million... | 0.6889130 | -0.1520055 | -0.0540996 |
| Avg.Waste.per.Capita..Kg. | 0.1900456 | 0.6148281 | 0.7654006 |
| Household.Waste.... | -0.1160903 | -0.7600727 | 0.6393830 |
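
To illustrate what the values in this table mean, the sketch below shows that each column of a `prcomp` rotation matrix is a unit-length weight vector and that the component scores are the standardized variables multiplied by those weights. It again uses the built-in `USArrests` data as a stand-in, which is an assumption made for illustration and not part of our analysis.

```r
# Stand-in data with four numeric columns
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
loadings <- pca$rotation

# Each column of the rotation matrix has squared length 1
col_norms <- colSums(loadings^2)

# Scores are the standardized data multiplied by the loadings
scores_manual <- scale(USArrests) %*% loadings
max_diff <- max(abs(scores_manual - pca$x))

print(round(col_norms, 6))
print(max_diff)
```

Because the weights are normalized, a loading near 0.69 on two variables (as for PC1 above) means those two variables account for almost all of that component.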

Next, we created three biplots, one for each pair of the first three principal components (PC1 and PC2, PC1 and PC3, PC2 and PC3), to see whether the food categories form visible clusters. None of the biplots showed obvious clusters, which suggests that the food categories do not differ noticeably along any of these components.

Code

library(factoextra)

# PC scores for each U.S. observation
food_pc_matrix <- food_pca$x
us <- us |>
  mutate(pc1 = food_pc_matrix[, 1],
         pc2 = food_pc_matrix[, 2],
         pc3 = food_pc_matrix[, 3],
         pc4 = food_pc_matrix[, 4])

Code

fviz_pca_biplot(food_pca, label = "var", repel = TRUE,
                habillage = us$Food.Category, pointshape = 19, alpha = 0.7) + ggtitle("Biplot of PC1 & PC2") + labs(color = "Food Categories")

Code

fviz_pca_biplot(food_pca, axes = c(1, 3), label = "var", repel = TRUE,
                habillage = us$Food.Category, pointshape = 19, alpha = 0.7) + ggtitle("Biplot of PC1 & PC3") + labs(color = "Food Categories")

Code

fviz_pca_biplot(food_pca, axes = c(2, 3), label = "var", repel = TRUE,
                habillage = us$Food.Category, pointshape = 19, alpha = 0.7) + ggtitle("Biplot of PC2 & PC3") + labs(color = "Food Categories")

Thus, even though there are clear correlations between the principal components and the numerical variables, the PCA plots suggest no obvious differences between food categories based on the numerical variables we are analyzing.

Question 3: How does a country’s population relate to economic loss caused by food waste, and how does this vary by food category?

Our previous PCA investigated the numerical variables related to waste amount, but it did not reveal any clear patterns distinguishing the food categories. We therefore turned our attention to a different factor: a country’s population size. Intuitively, one might expect that population influences the financial impact of food waste. Motivated by this assumption, we hypothesized that population might meaningfully predict the economic losses caused by food waste. Understanding this relationship could inform global policies aimed at reducing food waste, particularly in densely populated regions. However, as our analysis reveals, population alone is not a strong predictor of economic loss.

To explore this question, we first created a scatterplot faceted by food category, where each point represents one observation (a country, year, and food category) and is colored by total waste volume in tons. A separate linear regression line was fit for each food category. We chose scatterplots with regression overlays because they allow us to visually assess both the direction and strength of linear relationships across multiple categories simultaneously.

Code

#| warning: false
#| message: false

food_data <- read.csv("global_food_wastage_dataset.csv")
library(ggplot2)



food_data |>
  ggplot(aes(x = Population..Million., 
             y = Economic.Loss..Million..., 
             color = Total.Waste..Tons.)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "#e0d4f7", high = "#5e3c99") + 
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.8) +
  facet_wrap(~ Food.Category) +
  labs(
    title = "Population is Not a Strong Predictor of Food Waste Economic Loss",
    subtitle = "Across all food categories, black regression lines are flat or weak, 
suggesting other factors matter more",
    x = "Population (millions)",
    y = "Economic Loss (USD, millions)",
    color = "Waste Volume (tons)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    plot.subtitle = element_text(size = 11),
    axis.title = element_text(size = 12),
    legend.title = element_text(size = 11),
    legend.text = element_text(size = 9),
    axis.text.x = element_text(angle = 0)
  )

The regression lines are nearly flat across all categories, indicating no strong association between population size and economic loss. For instance, in categories such as Bakery Items and Dairy Products, the regression lines exhibit slight downward trends, while in Frozen Food and Grains & Cereals, the slopes are slightly positive. However, in all cases, the slopes are weak. This suggests that factors other than population size are likely driving food waste-related economic costs. Additionally, the color gradient reveals that observations with larger total waste volumes tend to show higher economic losses, regardless of population size.
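
The per-category slopes drawn by `geom_smooth(method = "lm")` can also be extracted as numbers by fitting one regression per category. The following is a minimal base-R sketch on a hypothetical toy data frame (categories "A" and "B" and their values are invented for illustration); with the real data, the grouping column would be `Food.Category` and the formula would use the population and economic loss columns.

```r
# Toy data: category A has slope 2, category B has slope -1 (hypothetical)
toy <- data.frame(
  Food.Category = rep(c("A", "B"), each = 4),
  population = rep(c(1, 2, 3, 4), times = 2)
)
toy$loss <- ifelse(toy$Food.Category == "A",
                   1 + 2 * toy$population,
                   5 - toy$population)

# Fit a separate simple linear regression within each category
slopes <- sapply(split(toy, toy$Food.Category), function(d) {
  coef(lm(loss ~ population, data = d))[["population"]]
})
print(slopes)
```

Slopes close to zero relative to the scale of the response correspond to the nearly flat lines seen in the facets.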

Code

model <- lm(Economic.Loss..Million... ~ Population..Million., data = food_data)
summary(model)


Call:
lm(formula = Economic.Loss..Million... ~ Population..Million., 
    data = food_data)

Residuals:
   Min     1Q Median     3Q    Max 
-24811 -12530   -678  12070  34155 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.447e+04  4.177e+02  58.584   <2e-16 ***
Population..Million. 8.082e-01  5.136e-01   1.574    0.116    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14620 on 4998 degrees of freedom
Multiple R-squared:  0.0004952, Adjusted R-squared:  0.0002952 
F-statistic: 2.476 on 1 and 4998 DF,  p-value: 0.1156

To formally quantify this relationship, we fit a simple linear regression with Economic Loss as the response variable and Population as the predictor, pooling all food categories. The estimated coefficient for Population was 0.81 (standard error = 0.51, p = 0.116), suggesting that an increase of 1 million people is associated with an average increase of only $0.81 million in economic loss, which is small relative to the estimated intercept of roughly $24,470 million. Moreover, the p-value indicates that this relationship is not statistically significant at the 5% level. The model’s multiple R-squared was 0.0005, meaning that population size explains only about 0.05% of the variation in economic loss. Together, these results indicate that population size is a very poor predictor of economic loss due to food waste.
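
The quantities in the regression output are tied together by simple arithmetic, which the sketch below reproduces from the reported estimate, standard error, and residual degrees of freedom. It relies on the standard identities for a one-predictor model (the F-statistic equals the squared t-value, and R-squared equals F / (F + residual df)); the variable names are ours.

```r
# Values copied from the summary(model) output above
estimate  <- 8.082e-01
std_error <- 5.136e-01
df_resid  <- 4998

# t-statistic and two-sided p-value for the slope
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With a single predictor, F = t^2 and R^2 = F / (F + residual df)
f_stat    <- t_value^2
r_squared <- f_stat / (f_stat + df_resid)

print(round(c(t = t_value, p = p_value, F_stat = f_stat), 4))
print(signif(r_squared, 3))
```

The results agree with the printed output up to rounding of the inputs, which shows that the tiny R-squared and the non-significant slope are two views of the same weak relationship.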

To check for non-linear patterns that a straight line could miss, we divided the observations into four population quartiles, from smallest to largest population, and calculated the mean economic loss within each quartile for every food category. A heatmap summarizes these group-level patterns.

Code

food_data |>
  mutate(pop_bin = ntile(Population..Million., 4)) |>   
  group_by(pop_bin, Food.Category) |>                   
  summarize(mean_loss = mean(Economic.Loss..Million..., na.rm = TRUE), .groups = "drop") |>  
  ggplot(aes(x = factor(pop_bin), y = Food.Category, fill = mean_loss)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "#5e3c99") +      
  labs(
    title = "Mean Economic Loss by Population Quartile and Food Category",
    x = "Population Quartile (1 = Smallest Population, 4 = Largest Population)",
    y = "Food Category",
    fill = "Mean Loss (Million USD)"
  ) +
  theme_minimal()

The heatmap reveals that economic loss is not consistently highest among the most populous countries. For example, in categories such as Frozen Food and Beverages, Quartile 4 (the largest population) does show elevated average losses. However, in categories such as Bakery Items, Prepared Food, and Fruits & Vegetables, smaller population quartiles (Quartile 1 or 2) often exhibit comparable or even higher economic losses, highlighting that population size is not a consistently reliable predictor.
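
To show how the quartile grouping behind the heatmap behaves, the sketch below applies `dplyr::ntile()` to a hypothetical toy data frame (all values invented for illustration). One point worth noting is that `ntile()` splits rows into equally sized bins by rank, so rows with identical population values can land in different bins.

```r
library(dplyr)

# Toy data (hypothetical values, not from the dataset)
toy <- data.frame(
  population = c(10, 20, 30, 40, 50, 60, 70, 80),
  loss = c(5, 7, 6, 8, 4, 6, 9, 11)
)

# Assign each row to one of four equally sized bins by population rank
toy$pop_bin <- ntile(toy$population, 4)

# Mean loss within each bin
mean_loss <- tapply(toy$loss, toy$pop_bin, mean)
print(mean_loss)

# Tied values are still spread across bins
tied_bins <- ntile(c(5, 5, 5, 5), 4)
print(tied_bins)
```

Because the bins are formed over rows, the quartiles in the heatmap group observations rather than whole countries.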

Overall, this analysis shows that a country’s population size is not a strong predictor of economic loss from food waste. Scatterplots, linear regression, and heatmap analyses all indicated weak or inconsistent relationships across different food categories. While some large-population countries showed higher losses in specific categories, overall patterns were inconsistent, with smaller countries sometimes experiencing comparable losses. These results suggest that factors other than population size, possibly including category-specific inefficiencies, drive economic losses from food waste.

Conclusion

Based on our analysis, there is no consistent pattern in the ranking of the most wasted food categories from 2018 to 2024. However, prepared food consistently ranks among the top three most wasted categories, accounting for a significant portion of global food waste. Additionally, we did not find any clear factor within our dataset that distinguishes between the categories, suggesting that food wastage patterns are likely influenced by variables not captured here. Moreover, contrary to our initial hypothesis, population size does not appear to be a strong indicator of the economic loss caused by food waste, further pointing to the existence of other potential factors affecting food wastage trends.

Future Work

While our analysis explored food waste patterns across food categories, waste amount variables, and the potential influence of population size, several important country-level factors not recorded in our dataset remain to be explored. Future work could incorporate additional variables such as GDP per capita, food supply chain efficiency, urbanization rates, and infrastructure development into multivariate regression models. Exploring non-linear relationships or applying machine learning methods could also help uncover more complex patterns. Investigating these factors may lead to a deeper understanding of the drivers behind the economic burden of food waste across different countries and food categories.

--- title: "36315 Final Report: Global Food Wastage" author: "Miao Lei, Sophia Mou, Zi Liu" format: html: toc: true toc-location: left code-tools: true code-fold: true self-contained: true editor: visual --- ## Data Overview The data analyzed in this report comes from the Kaggle dataset titled [Global Food Wastage Dataset (2018-2024)](https://www.kaggle.com/datasets/atharvasoundankar/global-food-wastage-dataset-2018-2024). The dataset contains 5,000 observations covering food wastage across 20 countries, spanning seven years and eight food categories. There are eight variables in total, including two categorical variables and six quantitative variables. Each observation records food waste data for a specific food category within a given country and year. The variables of interest are listed below: - Country: Name of the country - Food Category: Type of food wasted (Fruits & Vegetables, Prepared Food, Dairy Products, Beverages, Meat & Seafood, Grains & Cereals, Frozen Food, Bakery Items) - Year: Year of data collection (2018-2024) - Total Waste: Total amount of food wasted in tons - Economic Loss: Estimated financial loss from food waste in million dollars - AvgWaste per Capita: Average waste per person in kilograms - Population: Population of the country in millions - Household Waste: Percentage of food waste from households Led by our passion for sustainability, we proposed the following three research questions along our investigation: 1. Is there a consistent pattern in the ranking of the most wasted food categories over time? 2. Are there any factors that distinguish food categories? 3. How does a country’s population relate to economic loss from food waste, and how does this vary by food category? ## Question 1: Is there a consistent pattern in the ranking of the most wasted food categories over time? 
Motivated by our interest in global food waste solutions, we wanted to explore which food category is most wasted and whether the ranking remains consistent over time. To begin, we examined the most recent data from 2024. We created a bar plot where the x-axis shows eight food categories and the y-axis indicates the total waste amount (in tons) aggregated across all countries. To highlight the rankings, we ordered the bars in descending order based on waste amount. ```{r, include=FALSE} data <- read.csv("global_food_wastage_dataset.csv") library(ggplot2) library(dplyr) library(ggcorrplot) library(patchwork) library(ggrepel) data_summary <- data |> group_by(Year, Food.Category) |> summarize(Global_Wastage = sum(Total.Waste..Tons., na.rm = TRUE)) ``` ```{r} total_2024 <- data |> filter(Year == 2024) |> group_by(Food.Category) |> summarize(Total_Wastage = sum(Total.Waste..Tons., na.rm = TRUE)) total_2024 |> ggplot(aes(x = reorder(Food.Category, -Total_Wastage), y = Total_Wastage)) + geom_bar(stat = "identity", fill = "darkblue") + theme(axis.text.x = element_text(angle = 25, size = 8)) + labs(title = "Global Wasage Amount For Different Food Categories in 2024", x = "Food Category", y = "Total Wastege Amount (Tons)") ``` From this graph, we observe that Prepared Food, Fruits & Vegetables, and Frozen Food are the top three most wasted categories, while Meat & Seafood ranks the lowest. This result is somewhat surprising: since prepared and frozen foods are generally long-lasting, one might expect them to be wasted less than more perishable items like meat. This finding could suggest issues such as overproduction or inefficient consumption of long-shelf-life food products. Additionally, the graph provides insight into the global scale of waste for each category, with the most wasted category, Prepared Food, reaching around 2.7 million tons, while the least wasted, Meat & Seafood, is around 1.9 million tons. 
To examine whether the ranking remains consistent across years, we created a time-series plot covering 2018 to 2024. The x-axis represents the year, and the y-axis shows the total waste amount, with each colored line tracking a different food category over time. ```{r} data_2024 <- data_summary |> filter(Year == 2024) data_summary |> ggplot(aes(x = Year, y = Global_Wastage, color = Food.Category)) + geom_line() + geom_line(data = subset(data_summary, Food.Category == "Prepared Food"), aes(x = Year, y = Global_Wastage), linewidth = 1.5) + # Add the labels: geom_text_repel(data = data_2024, aes(label = Food.Category), size = 3, # Drop the segment connection: segment.color = NA, # Move labels up or down based on overlap direction = "y", # Try to align the labels horizontally on the left hand side hjust = "left") + scale_x_continuous(breaks = unique(data_summary$Year), labels = unique(data_summary$Year), # Update the limits so that there is some padding on the # x-axis but don't label the new maximum limits = c(min(data_summary$Year), max(data_summary$Year) + 1)) + theme_bw() + # Drop the legend theme(legend.position = "none") + labs(title = "Global Food Wastage Trends by Category (2018–2024)", subtitle = "Prepared Food appears in the top 3 across all years", x = "Year", y = "Total Wastage Amount (Tons)", color = "Food Category") + theme(plot.title = element_text(face = "bold", size = 13)) ``` The lines intersect frequently, and each category shows large year-to-year fluctuations, suggesting there is no consistent ranking pattern across years. However, Prepared Food stands out by consistently remaining within the top three categories each year, showing a steady increase after 2020 and peaking at over 2.75 million tons in 2023. In 2024, its wastage amount is noticeably higher than all other categories. In contrast, categories like Grains & Cereals and Bakery Items display more erratic patterns. 
For example, Grains & Cereals generally ranks low but briefly peaks around 2021, while Bakery Items experience a sharp decline in 2021 followed by a substantial increase through 2023. Overall, the ranking of most wasted food categories shifts over time, but Prepared Food consistently contributes a high amount of waste, pointing to a potential focus area for reducing global food waste. ## Question 2: Are there any factors that distinguish food categories? Because there are no obvious patterns in the total waste amount among the food categories across time from the previous research question and this box plot below, we expanded our research scope to include more variables to compare these food categories. ```{r, warning=FALSE, message=FALSE} library(dplyr) library(ggplot2) food = read.csv("global_food_wastage_dataset.csv") tf = food[food$Year == 2024, ] tf |> ggplot(aes(x = Food.Category, y = Total.Waste..Tons.)) + geom_boxplot(width = .2) + coord_flip() + ggtitle("Distribution on Total Waste across Food Categories") ``` This led to our second research question: Are there any factors that distinguish food categories? We decided to investigate all numerical variables other than population, including average waste per capita, household waste, total waste, and economic loss, so we made a PCA plot to look for obvious trends in all of them at once. To make the potential patterns more obvious, we decided to only focus on the food waste of the United States for this question. We first made the following elbow plot to assess how many principal components we should analyse. The result based on the graph is that we should focus on the first three principal components because the first two components are above the average line, meaning that they explain a greater amount of variance than the average level, and the third principal component is slightly below the average line, which is still a relatively important proportion to explain to variation in the data. 
Hence, the first three components would do a good job of explaining most of the variation in the data. ```{r, warning=FALSE, message=FALSE} # elbow plot us = food[food$Country == "USA",] us = select(us, -"Population..Million.") food_quant <- us[,4:7] food_pca <- prcomp(food_quant, center = TRUE, scale. = TRUE) library(factoextra) fviz_eig(food_pca, addlabels = TRUE) + geom_hline(yintercept = 100 * (1 / ncol(food_quant)), linetype = "dashed", color = "darkred") ``` We then looked into the principal components through their linear combinations. Based on the following table, the first component has a relatively strong positive correlation with the total waste (linear combination = 0.690) and the economic loss (linear combination = 0.680). And the second component has a strong positive correlation with average waste per capita (linear combination = 0.615), and a strong negative correlation with population (linear combination = - 0.760). The third component has a strong negative correlation with average waste per capita (linear combination = - 0.765) and household waste (linear combination = 0.639). ```{r, warning=FALSE, message=FALSE} # linear combination table comb = data.frame(food_pca$rotation[,1:3]) library(knitr) kable(comb, caption = "Linear Combination between Original Numerical Variables and the First Three PCs") ``` Next, we created three biplots on the first two principal components, the first and third principal components, and the second and third principal components to see if there are obvious clusters in food categories related to these principal components. However, none of the biplots showed any obvious clusters of food categories, which suggests that there are no obvious differences between different food categories in terms of different principal component combinations. 
```{r, warning=FALSE, message=FALSE} # pca's library(factoextra) # pc food_pc_matrix <- food_pca$x us = us |> mutate(pc1 = food_pc_matrix[, 1], pc2 = food_pc_matrix[, 2], pc3 = food_pc_matrix[, 3], pc4 = food_pc_matrix[, 4]) ``` ```{r, warning=FALSE, message=FALSE} fviz_pca_biplot(food_pca, label = "var", repel = TRUE, habillage = us$Food.Category, pointshape = 19, alpha = 0.7) + ggtitle("Biplot of PC1 & PC2") + labs(color = "Food Categories") fviz_pca_biplot(food_pca, axes = c(1, 3), label = "var", repel = TRUE, habillage = us$Food.Category, pointshape = 19, alpha = 0.7) + ggtitle("Biplot of PC1 & PC3") + labs(color = "Food Categories") fviz_pca_biplot(food_pca, axes = c(2, 3), label = "var", repel = TRUE, habillage = us$Food.Category, pointshape = 19, alpha = 0.7) + ggtitle("Biplot of PC2 & PC3") + labs(color = "Food Categories") ``` Thus, even though there are clear correlations between the principal components and the numerical variables, the PCA plots suggest no obvious differences between food categories based on the numerical variables we are analyzing. ## Question 3: How does a country’s population relate to economic loss caused by food waste, and how does this vary by food category? Our previous PCA analysis investigated all numerical variables related to waste amount, but we did not reveal any clear patterns distinguishing different food categories. Then, we turned our attention to a different factor: a country's population size. Intuitively, one might expect that population influences the financial impact of food waste. Motivated by this assumption, we hypothesized that population might meaningfully predict the economic losses caused by food waste. Understanding this relationship could inform global policies aimed at reducing food waste, particularly in densely populated regions. However, as our analysis reveals, population alone is not a strong predictor of economic loss. 
To explore this question, we first created a faceted scatterplot, where each point represents a country and is colored by total waste volume in tons. Separate linear regression lines were fit for each food category. We chose scatterplots with regression overlays because they allow us to visually assess both the direction and strength of linear relationships across multiple categories simultaneously. ```{r, warning=FALSE, message = FALSE} #| warning: false #| message: false food_data <- read.csv("global_food_wastage_dataset.csv") library(ggplot2) food_data |> ggplot(aes(x = Population..Million., y = Economic.Loss..Million..., color = Total.Waste..Tons.)) + geom_point(alpha = 0.5) + scale_color_gradient(low = "#e0d4f7", high = "#5e3c99") + geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.8) + facet_wrap(~ Food.Category) + labs( title = "Population is Not a Strong Predictor of Food Waste Economic Loss", subtitle = "Across all food categories, black regression lines are flat or weak, suggesting other factors matter more", x = "Population (millions)", y = "Economic Loss (USD, millions)", color = "Waste Volume (tons)" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 13), plot.subtitle = element_text(size = 11), axis.title = element_text(size = 12), legend.title = element_text(size = 11), legend.text = element_text(size = 9), axis.text.x = element_text(angle = 0) ) ``` The regression lines are nearly flat across all categories, indicating no strong association between population size and economic loss. For instance, in categories such as Bakery Items and Dairy Products, the regression lines exhibit slight downward trends, while in Frozen Food and Grains & Cereals, the slopes are slightly positive. However, in all cases, the slopes are weak. This suggests that factors other than population size are likely driving food waste-related economic costs. 
Additionally, the color gradient reveals that countries with larger total waste volumes tend to suffer higher economic losses, regardless of population size. ```{r} model <- lm(Economic.Loss..Million... ~ Population..Million., data = food_data) summary(model) ``` To formally quantify these relationships, we conducted simple linear regression analysis with Economic Loss as the response variable and Population as the predictor. The estimated coefficient for Population was 0.81 (Standard Error = 0.51, p = 0.116), suggesting that an increase of 1 million people is associated with an average increase of only \$0.81 million in economic loss — a relatively small magnitude given the overall scale of losses. Moreover, the p-value indicates that this relationship is not statistically significant at the 5% level. The model's multiple R-squared value was 0.00049, meaning that population size explains less than 0.05% of the variation in economic loss. Together, these results confirm that population size is a very poor predictor of economic loss due to food waste. To further investigate potential non-linear group patterns, we grouped countries into four population quartiles from smallest to largest populations and calculated the mean economic loss within each quartile for every food category. A heatmap summarizes these group-level patterns. 

```{r}
# mutate(), ntile(), group_by(), and summarize() come from dplyr
library(dplyr)

food_data |>
  # Assign each observation to a population quartile (1 = smallest)
  mutate(pop_bin = ntile(Population..Million., 4)) |>
  group_by(pop_bin, Food.Category) |>
  summarize(mean_loss = mean(Economic.Loss..Million..., na.rm = TRUE),
            .groups = "drop") |>
  ggplot(aes(x = factor(pop_bin), y = Food.Category, fill = mean_loss)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "#5e3c99") +
  labs(
    title = "Mean Economic Loss by Population Quartile and Food Category",
    x = "Population Quartile (1 = Smallest Population, 4 = Largest Population)",
    y = "Food Category",
    fill = "Mean Loss (Million USD)"
  ) +
  theme_minimal()
```

The heatmap reveals that economic loss is not consistently highest in the largest population quartile. For example, in categories such as Frozen Food and Beverages, Quartile 4 (the largest population) does show elevated average losses. However, in categories such as Bakery Items, Prepared Food, and Fruits & Vegetables, smaller population quartiles (Quartile 1 or 2) often exhibit comparable or even higher economic losses, highlighting that population size is not a consistently reliable predictor.

Overall, our analysis shows that a country's population size is not a strong predictor of economic loss from food waste. The scatterplots, linear regression, and heatmap all indicated weak or inconsistent relationships across food categories. While the largest population quartile showed higher losses in specific categories, overall patterns were inconsistent, with smaller quartiles sometimes showing comparable losses. These results suggest that factors other than population size, possibly category-specific inefficiencies, more strongly drive economic losses from food waste.

## Conclusion

Based on our analysis, there is no consistent pattern in the ranking of the most wasted food categories from 2018 to 2024. However, Prepared Food consistently ranks among the top three most wasted categories, accounting for a significant portion of global food waste.
Additionally, we did not find any clear factor within our dataset that distinguishes between the categories, suggesting that food wastage patterns are likely influenced by variables not captured here. Moreover, contrary to our initial hypothesis, population size does not appear to be a strong indicator of the economic loss caused by food waste, further pointing to the existence of other potential factors affecting food wastage trends.

## Future Work

While our analysis explored food waste patterns across food categories, waste amount variables, and the potential influence of population size, several important country-level factors not recorded in our dataset remain to be explored. Future work could incorporate additional variables such as GDP per capita, food supply chain efficiency, urbanization rates, and infrastructure development into multivariate regression models. Exploring non-linear relationships or applying machine learning methods could also help uncover more complex patterns. Investigating these factors may lead to a deeper understanding of the drivers behind the economic burden of food waste across different countries and food categories.
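
As one possible starting template for such multivariate models, the sketch below uses only columns already in the dataset and tests whether the population slope differs by food category. It is an illustration under stated assumptions, not an analysis we ran: `compare_slope_models` is a hypothetical helper, and the column names are assumed to match those produced by `read.csv()` for this dataset. Additional country-level covariates would enter the model formulas in the same way once merged into the data.

```r
# Hypothetical helper (not part of the original analysis): compare a
# common-slope model against one with category-specific population slopes.
compare_slope_models <- function(df) {
  # Same population slope for every category, separate intercepts
  common <- lm(
    Economic.Loss..Million... ~ Population..Million. + Food.Category,
    data = df
  )

  # Population slope allowed to differ by category (interaction term)
  varying <- lm(
    Economic.Loss..Million... ~ Population..Million. * Food.Category,
    data = df
  )

  # Nested-model F-test: a small p-value would suggest that the
  # population slope is not the same across categories
  anova(common, varying)
}
```

Running `compare_slope_models(food_data)` returns an ANOVA table with one row per model; the second row carries the F-statistic and p-value for the interaction terms.
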