36-315 Final Project

Author

Aaryan Lalwani, Arya Mane, Diva Shah, and Eli Forcucci

Published

April 28, 2025

Introduction

According to the UN, by 2050, three out of four people worldwide could face drought impacts. Current drought costs already exceed $307 billion annually. Water scarcity can be attributed to several reasons such as overuse, lack of appropriate infrastructure, insufficient rainwater, etc. Identifying the exact cause can help address the problem directly.

Climate change and population growth have increased demand on freshwater supplies. We aim to investigate the relationship between rainfall and water use, explore regional consumption trends, analyze how water use is divided among sectors, and evaluate how per capita usage responds to changing climate patterns. Together, these research questions aim to uncover patterns that can help inform policy and identify causes of, and solutions to, water scarcity.

Data

Data Set: Our data comes from the Global Water Consumption Dataset (2000-2024) by Atharva Soundankar, on Kaggle. The dataset spans 25 years (2000-2024) and provides insights into water consumption trends across countries. Key categorical variables include Country and Water Scarcity Level (Low, Medium, High), while numerical variables include Total Water Consumption, Per Capita Water Use, Agricultural, Industrial, and Household Water Use (%), Rainfall Impact, and Groundwater Depletion Rate (%). This mix allows us to analyze regional disparities in water use, the impact of rainfall on consumption, and how different sectors contribute to water demand. With growing concerns over sustainability and resource management, we are interested in examining how different regions manage their water resources and identifying potential trends in water scarcity.

Research Questions: The key research questions that we aim to answer in this report are:

What is the relationship, if there is one, between rainfall and water consumption and scarcity levels across countries?
How has total water consumption changed over time across different continents?
How does the distribution of water usage between agriculture, industry, and households vary by country, and what are the global patterns or sectoral dependencies that can be identified in 2024?
How does per capita water consumption change with yearly rainfall across low and high rainfall countries?

Loading libraries and the data

library(ggplot2)
library(tidyr)
library(readr)
library(dplyr)
library(dendextend)
library(factoextra)
library(countrycode)
library(ggseas)
library(ggimage)
library(ggpubr)

# Assumes the Kaggle CSV has been downloaded to the working directory
# (signed Kaggle download URLs are time-limited and expire after a few days)
water_data <- read_csv("cleaned_global_water_consumption.csv")

Rows: 500 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, Water Scarcity Level
dbl (8): Year, Total Water Consumption (Billion Cubic Meters), Per Capita Wa...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 1: What is the relationship, if there is one, between rainfall and water consumption and scarcity levels across countries?

water_data <- water_data %>%
  mutate(Scarcity_Level = case_when(
    `Groundwater Depletion Rate (%)` >= 3 ~ "Severe",
    `Groundwater Depletion Rate (%)` >= 2 ~ "High",
    `Groundwater Depletion Rate (%)` >= 1 ~ "Moderate",
    TRUE ~ "Low"
  ))


ggplot(water_data, aes(x = `Rainfall Impact (Annual Precipitation in mm)`, 
                       y = `Per Capita Water Use (Liters per Day)`, 
                       color = Scarcity_Level)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  annotate("text", x = 1500, y = 500, label = "Slight positive trend", size = 4, hjust = 0) +
  labs(
    title = "Rainfall vs. Water Use Colored by Scarcity Level",
    x = "Annual Rainfall (mm)",
    y = "Per Capita Water Use (Liters per Day)",
    color = "Water Scarcity Level"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

model <- lm(`Per Capita Water Use (Liters per Day)` ~ `Rainfall Impact (Annual Precipitation in mm)`, 
            data = water_data)
summary(model)


Call:
lm(formula = `Per Capita Water Use (Liters per Day)` ~ `Rainfall Impact (Annual Precipitation in mm)`, 
    data = water_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-161.762  -25.739    0.842   24.966  128.075 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                    2.587e+02  1.024e+01  25.268
`Rainfall Impact (Annual Precipitation in mm)` 1.122e-02  6.511e-03   1.723
                                               Pr(>|t|)    
(Intercept)                                      <2e-16 ***
`Rainfall Impact (Annual Precipitation in mm)`   0.0855 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.59 on 498 degrees of freedom
Multiple R-squared:  0.005925,  Adjusted R-squared:  0.003929 
F-statistic: 2.968 on 1 and 498 DF,  p-value: 0.08553

To assess whether there is a relationship between rainfall, water use, and water scarcity levels, we used the scatter plot above. The x-axis shows Annual Rainfall (mm) and the y-axis shows Per Capita Water Use (Liters per Day). Each point represents one country in one year, and its color shows the scarcity level we derived from the groundwater depletion rate. This makes it easy to spot patterns across different types of countries. There is a slight upward trend in the data, which might suggest that countries with more rainfall tend to use a bit more water per person, but the pattern is not strong. To check this more formally, we ran a linear regression and looked at the t-test for the slope. The p-value of 0.0855 means the relationship is not statistically significant at the 5% level, and the R-squared of about 0.006 shows that rainfall explains less than 1% of the variation in per capita use. So, while rainfall might play a small role, it is clearly not the main factor influencing water use. The plot also shows no obvious relationship between rainfall and groundwater depletion or water scarcity: the colored points do not cluster together or follow a visible pattern.
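The slope test in the regression summary can be reproduced by hand from the printed estimate and standard error. The sketch below is illustrative only: it plugs in the rounded values from the output above rather than refitting the model, so it agrees with the summary only to the printed precision. The variable names are ours.

```r
# Values copied from the summary(model) output above (rounded as printed)
slope_estimate <- 1.122e-02   # change in liters/day per additional mm of rain
slope_se       <- 6.511e-03   # standard error of the slope
df_resid       <- 498         # residual degrees of freedom (500 rows - 2 coefficients)

# t statistic for H0: slope = 0
t_stat <- slope_estimate / slope_se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# R-squared implied by the t statistic in a one-predictor regression
r_squared <- t_stat^2 / (t_stat^2 + df_resid)

round(c(t = t_stat, p = p_value, r2 = r_squared), 4)
```

Read this way, the estimated slope implies roughly 1.1 extra liters per person per day for every additional 100 mm of annual rainfall, which is small relative to the residual standard error of 42.59.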

To address the second part of the question and see whether there is a relationship across countries, we can use a dendrogram.

water_cluster_data <- water_data %>%
  select(Country,
         `Rainfall Impact (Annual Precipitation in mm)`,
         `Per Capita Water Use (Liters per Day)`,
         `Groundwater Depletion Rate (%)`,
         `Agricultural Water Use (%)`,
         `Industrial Water Use (%)`,
         `Household Water Use (%)`) %>%
  na.omit()

# Save country labels
country_names <- water_cluster_data$Country
scaled_data <- scale(water_cluster_data[, -1])  # drop country before scaling

# 2. Compute distance and clustering
distance_matrix <- dist(scaled_data, method = "euclidean")
hc <- hclust(distance_matrix, method = "ward.D2")

# 3. Convert to dendrogram object
dend <- as.dendrogram(hc)
dend <- set(dend, "labels", country_names)

# 4. Color branches by cluster
dend_colored <- color_branches(dend, k = 4)  # choose k clusters to color

# 5. Plot dendrogram
fviz_dend(hc,
          k = 4,  # number of clusters
          cex = 0.5,  # label size
          k_colors = c("#D55E00", "#009E73", "#0072B2", "#CC79A7"),
          main = "Dendrogram of Countries Based on Water Use & Rainfall",
          ylab = "Height (Dissimilarity)",
          rect = TRUE,  # add rectangles around clusters
          horiz = FALSE)

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

The dendrogram above helps us visualize clusters of countries that have similar water usage profiles, so we can see whether countries with similar rainfall levels also tend to have similar water consumption and scarcity characteristics. Hierarchical clustering groups observations based on similarity across multiple variables, not just rainfall. To better understand which variables drive that similarity, and whether rainfall is a key factor, we cut the tree into 4 clusters and calculated average rainfall, groundwater depletion, and per capita use for each group. By comparing these averages and visualizing the distributions, we can infer whether rainfall is a major factor influencing how countries are grouped.
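The scale, distance, cluster, and cut sequence described above can be seen on a small scale in the following sketch. It uses six made-up observations (labeled A to F, not taken from the dataset) with two obviously different rainfall and usage profiles; all names here are hypothetical.

```r
# Toy data: six hypothetical observations (NOT from the real dataset)
toy <- data.frame(
  rainfall_mm    = c(200, 220, 210, 1800, 1750, 1820),
  per_capita_lpd = c(150, 160, 155, 300, 310, 305),
  row.names      = c("A", "B", "C", "D", "E", "F")
)

# Standardize each column so no variable dominates the distance because of its units
toy_scaled <- scale(toy)

# Pairwise Euclidean distances, then agglomerative clustering with Ward's method
toy_hc <- hclust(dist(toy_scaled, method = "euclidean"), method = "ward.D2")

# Cut the tree into k groups and summarize each group
toy$cluster <- cutree(toy_hc, k = 2)
aggregate(cbind(rainfall_mm, per_capita_lpd) ~ cluster, data = toy, FUN = mean)
```

Because every input column is standardized, each variable passed to `dist()` contributes on the same scale, so the choice of which columns to include directly shapes the resulting groups.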

# STEP 1: Select only numeric columns for scaling
# (this drops Country and other character columns but keeps Year)
numeric_data <- water_data %>%
  select(where(is.numeric))

# STEP 2: Scale the numeric columns and run hierarchical clustering
# (hclust() uses its default complete linkage here, unlike the Ward dendrogram above)
scaled_data <- scale(numeric_data)
distance_matrix <- dist(scaled_data, method = "euclidean")
hc <- hclust(distance_matrix)

# STEP 3: Cut into clusters
clusters <- cutree(hc, k = 4)
water_data$Cluster <- as.factor(clusters)

# STEP 4: Summarize cluster-level stats
cluster_summary <- water_data %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Rainfall = mean(`Rainfall Impact (Annual Precipitation in mm)`),
    Avg_Depletion = mean(`Groundwater Depletion Rate (%)`),
    Avg_PerCapitaUse = mean(`Per Capita Water Use (Liters per Day)`),
    n = n()
  )

print(cluster_summary)

# A tibble: 4 × 5
  Cluster Avg_Rainfall Avg_Depletion Avg_PerCapitaUse     n
  <fct>          <dbl>         <dbl>            <dbl> <int>
1 1              1485.          2.78             260.   110
2 2              1587.          2.53             281.   322
3 3              1228.          2.84             274.    25
4 4              1568.          2.26             285.    43

# Barplot - Avg Rainfall by Cluster
ggplot(cluster_summary, aes(x = Cluster, y = Avg_Rainfall, fill = Cluster)) +
  geom_col() +
  labs(title = "Average Rainfall by Cluster",
       y = "Average Annual Rainfall (mm)") +
  theme_minimal()

#Boxplot - Rainfall by Cluster
ggplot(water_data, aes(x = Cluster, y = `Rainfall Impact (Annual Precipitation in mm)`, fill = Cluster)) +
  geom_boxplot() +
  labs(title = "Rainfall Distribution by Cluster",
       y = "Annual Rainfall (mm)") +
  theme_minimal()

#Scatterplot - Rainfall vs. Depletion
ggplot(water_data, aes(x = `Rainfall Impact (Annual Precipitation in mm)`, 
                                y = `Groundwater Depletion Rate (%)`, 
                                color = Cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Rainfall vs. Groundwater Depletion by Cluster",
       x = "Annual Rainfall (mm)",
       y = "Groundwater Depletion Rate (%)") +
  theme_minimal()

# ANOVA: Does rainfall differ significantly across dendrogram-based clusters?
anova_result <- aov(`Rainfall Impact (Annual Precipitation in mm)` ~ Cluster, data = water_data)

# View the ANOVA summary
summary(anova_result)

             Df   Sum Sq Mean Sq F value   Pr(>F)    
Cluster       3  3500833 1166944   14.74 3.31e-09 ***
Residuals   496 39275434   79184                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The bar plot above shows that Cluster 3 has the lowest average rainfall, followed by Cluster 1, with Clusters 2 and 4 receiving the most rain. The box plot shows that there are clear differences in rainfall across clusters. We can also run a more formal ANOVA test to see whether rainfall differs across clusters. In the ANOVA test we are testing:

Null Hypothesis (H₀): Mean rainfall is the same across all clusters

Alternative Hypothesis (H₁): At least one cluster has a different mean rainfall

The results above show an F-value of 14.74 and a p-value below 0.001, so we reject the null hypothesis: at least one cluster has a significantly different mean rainfall level. This supports our earlier visual findings and indicates that rainfall plays a meaningful role in how countries are grouped based on water-related characteristics.
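As a check on how the F-value arises, the sketch below rebuilds it from the sums of squares and degrees of freedom printed in the ANOVA table. It only reuses the printed numbers and does not refit anything; the variable names are ours.

```r
# Values copied from the ANOVA table above
ss_between <- 3500833;  df_between <- 3     # Cluster row
ss_within  <- 39275434; df_within  <- 496   # Residuals row

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The F statistic is the ratio of between-cluster to within-cluster variability, so a value far above 1 means the cluster means differ by more than the spread inside the clusters would explain.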

Based on the graphs and the statistical tests, we can conclude that rainfall contributes to how countries are grouped. Additionally, the scatter plot of rainfall vs. groundwater depletion colored by cluster shows that countries in Cluster 3 tend to lie farther to the left, reinforcing their lower rainfall levels. Although depletion rates vary across all clusters, the consistent separation in rainfall patterns suggests that rainfall meaningfully contributes to the similarity between countries in this clustering method.

Overall, the analysis and graphs above suggest that while there is no strong direct relationship between rainfall and water consumption or scarcity levels, the dendrogram reveals that countries with similar rainfall patterns often share similar water consumption and groundwater depletion rates, and can be meaningfully grouped together. So there is a relationship in how these variables vary across countries, but not directly between the variables themselves. This shows that the relationship between rainfall and water resource patterns is more complex than we might have originally thought. Rainfall contributes to differences across countries but is not the sole or primary driver when considered in isolation.

Question 2: How has total water consumption changed over time across different continents?

water_data$continent <- countrycode(sourcevar = water_data$Country,
                            origin = "country.name",
                            destination = "continent")

ggplot(water_data, aes(x = Year, y = continent, 
                      fill = `Total Water Consumption (Billion Cubic Meters)`)) +
  geom_tile(color = "white")  +
  scale_fill_gradient(low = "lightblue", high = "navy") +
  labs(title = "Total Water Consumption (Billions Cubic Meters)",
       x = "Year",
       y = "Continent",
       fill = "Water Use") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This is a heatmap of total water consumption (in billions of cubic meters) by continent from 2000 to 2024. The x-axis represents the year, and the y-axis shows the continent. The color fill represents the intensity of water use, where darker blue indicates higher water consumption and lighter blue indicates lower water consumption.

This plot helps visually detect changes and trends in water use by continent over the last 25 years. You can see temporal fluctuations, but no continent shows an obvious, consistent increase or decrease.

For Africa, we see fairly stable usage over time, as shown by the consistent medium shades. There are no strong increases or decreases (barring 2013), suggesting consistent water needs. For the Americas, we see generally dark shades, especially in 2009–2011, indicating high water usage, with slight dips around 2014–2015 and 2021. These fluctuations suggest sensitivity to domestic factors. In Asia, there are noticeably lighter periods, especially in the early 2000s and around 2024, suggesting dips in consumption, while darker tiles in the middle years (2005–2010) suggest higher water use due to domestic factors. This high variation implies a changing consumption landscape. In Europe, the shades are mostly a consistent medium-to-dark with minor fluctuations, suggesting moderately high water use that is relatively stable over time. For Oceania, we see generally lighter shades, with one sudden spike in 2023–2024.

There’s no clear upward or downward trend globally, but rather regional variation and year-to-year fluctuation. This supports the idea that water consumption is shaped by domestic factors more than global ones.

acf_data <- data.frame()
for (Continent in unique(water_data$continent)) {
  temp_data <- water_data %>%
    filter(continent == Continent) %>%
    arrange(Year)
  acf_result <- acf(temp_data$`Total Water Consumption (Billion Cubic Meters)`, 
                    plot = FALSE, lag.max = 10)

  acf_df <- data.frame(
    Lag = acf_result$lag[, 1, 1],
    ACF = acf_result$acf[, 1, 1],
    Continent = Continent
  )
  acf_data <- rbind(acf_data, acf_df)
}
acf_all <- acf(water_data$`Total Water Consumption (Billion Cubic Meters)`,
               plot = FALSE, lag.max = 10)
acf_all_df <- data.frame(Lag = acf_all$lag[, 1, 1], ACF = acf_all$acf[, 1, 1], Continent = "All")
acf_data <- bind_rows(acf_data, acf_all_df)

acf_data$Continent <- factor(
  acf_data$Continent,
  levels = c("All", "Africa", "Americas", "Asia", "Europe", "Oceania")
)

annotation_data <- data.frame(
  Continent = "All", 
  Lag = 4.5, 
  ACF = 0.1,
  label = "   High initial\nautocorrelation"
)

acf_data <- acf_data |> filter(Lag != 0)

ggplot(acf_data, aes(x = Lag, y = ACF)) +
  geom_col(aes(fill = Continent), show.legend = FALSE) +
  facet_wrap(~ Continent) +
  labs(
    title = "ACF of Water Consumption by Continent",
    x = "Lag (Years)", y = "Autocorrelation"
  ) +
  theme_minimal() +
  geom_hline(yintercept = 0, color = "gray") +
  geom_text(
    data = annotation_data,
    aes(x = Lag, y = ACF, label = label),
    hjust = 0, vjust = 0, size = 3
  ) +
  theme(
    panel.grid.major = element_blank(),
    panel.spacing = unit(1, "lines"),           # Increases space between facets
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5)  # adds box around each facet
  ) +
  scale_x_continuous(breaks = seq(min(acf_data$Lag), max(acf_data$Lag)))

This is an autocorrelation function (ACF) plot, showing the relationship between current water consumption and its past values at different time lags (in years). The aim is to assess whether water consumption in previous years carries over into later years. Lag 0 was removed because it always equals 1 (perfect correlation with itself) and gives no insight into trends or persistence over time. The x-axis is the time lag, that is, how many years back we are comparing. The y-axis shows the autocorrelation coefficient, which ranges between -1 and 1. A positive coefficient suggests persistence, meaning that past values help predict current ones. A negative coefficient implies an inverse pattern, where high water consumption tends to be followed by low water consumption. The subplots are faceted by continent, based on the countries in the dataset.
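To make the autocorrelation coefficient concrete, the sketch below computes it by hand for a short made-up yearly series and compares the result with base R's `acf()`. The series and the helper `manual_acf()` are hypothetical and exist only for illustration.

```r
# Hypothetical yearly series (NOT from the dataset), one value per year
x <- c(10, 12, 11, 13, 12, 14, 13, 15)
n <- length(x)
x_centered <- x - mean(x)

# Lag-k autocorrelation: sum of products of the centered series with itself
# shifted by k, divided by the lag-0 sum of squares
manual_acf <- function(k) {
  sum(x_centered[1:(n - k)] * x_centered[(1 + k):n]) / sum(x_centered^2)
}

manual_lags <- sapply(1:3, manual_acf)

# Base R's acf() returns lag 0 first, so drop it before comparing
builtin_lags <- acf(x, lag.max = 3, plot = FALSE)$acf[-1, 1, 1]

all.equal(manual_lags, builtin_lags)
```

Note that `acf()` treats consecutive rows as consecutive time steps, so a lag corresponds to a number of years only when each year contributes exactly one value to the series.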

The “All” plot shows positive initial autocorrelation at lag 1, which declines quickly, indicating that global water consumption is somewhat persistent, but the correlation fades fast. For Asia, we see moderate positive autocorrelation for lags 2–4, suggesting some short-term consistency. For Americas and Oceania, the plot fluctuates between positive and negative suggesting less predictable trends. For Africa and Europe, we see mostly negative or weak autocorrelation indicating more randomness in usage. Overall, we see that globally, water consumption is somewhat stable from year to year, but patterns don’t persist strongly beyond 1–2 years. Asia shows the most persistent patterns, suggesting stable or slowly shifting usage trends. Oceania and the Americas are more volatile, indicating that there are more shocks and short-term trends which might affect usage more dramatically.

This information can help assess whether forecasting water use based on past values would be effective. Based on the data, the trends are more consistent in Asia, but less so in Africa or Oceania.

Overall, the graphs reinforce the idea that water consumption is not highly volatile. It is relatively stable, though slight patterns vary by continent. Using both plots together, we can assess whether any volatility is in the same direction, i.e., whether there are any long-term trends. For example, even though the amount of water used in Asia fluctuates year to year, the direction or structure of those changes is predictable.

Question 3: How does the distribution of water usage between agriculture, industry, and households vary by country, and what are the global patterns or sectoral dependencies that can be identified in 2024?

water_df <- read.csv("cleaned_global_water_consumption.csv")
# The file has 10 columns, so all 10 need a name (the last is the scarcity level)
colnames(water_df) <- c(
  "Country",
  "Year",
  "Total_Water_Consumption_BCM",
  "Per_Capita_Use_LPD",
  "Agricultural_Use_Percent",
  "Industrial_Use_Percent",
  "Household_Use_Percent",
  "Annual_Rainfall_mm",
  "Groundwater_Depletion_Percent",
  "Water_Scarcity_Level"
)

sector_df <- water_df |>
  filter(Year == 2024) |>
  filter(Country %in% c("Argentina", "Australia", "Brazil", "Canada", "China", "France", "Germany",
                        "India", "Indonesia", "Italy", "Japan", "Mexico", "Russia", "Saudi Arabia",
                        "South Africa", "South Korea", "Spain", "Turkey", "UK", "USA")) |>
  select(Country, Agricultural_Use_Percent, Industrial_Use_Percent, Household_Use_Percent)

sector_long <- sector_df |>
  pivot_longer(cols = -Country, names_to = "Sector", values_to = "Percent") |>
  mutate(Sector = gsub("_Use_Percent", "", Sector),
         Sector = gsub("Agricultural", "Agriculture", Sector),
         Sector = gsub("Industrial", "Industry", Sector)) |>
  group_by(Country) |>
  mutate(Percent = Percent / sum(Percent) * 100) |>
  ungroup()

# Add flags
sector_long$iso2 <- tolower(countrycode(sector_long$Country, "country.name", "iso2c"))
sector_long$image <- paste0("https://flagcdn.com/w80/", sector_long$iso2, ".png")

flag_labels <- sector_long |>
  group_by(Country) |>
  summarize(image = first(image), .groups = "drop")

# Plot
ggplot(sector_long, aes(x = reorder(Country, -Percent), y = Percent, fill = Sector)) +
  geom_col(width = 0.75) +
  geom_image(data = flag_labels, aes(x = Country, y = 0, image = image),
             size = 0.06, asp = 1.5, inherit.aes = FALSE) +
  scale_fill_manual(
    values = c(
      "Agriculture" = "#E58606", 
      "Household" = "#5D69B1",   
      "Industry" = "#52BCA3"    
    )
  ) +
  labs(
    title = "Water Use by Sector (2024)",
    subtitle = "Stacked by share of total water use in each sector\n\n",
    x = NULL,
    y = "Percent of Total Water Use",
    fill = "Sector"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    legend.position = "right",
    plot.margin = margin(10, 10, 10, 10),
    legend.title = element_text(margin = margin(b = 10)) 
  )

The graph shows the distribution of water use across agriculture, industry, and household sectors for different countries in 2024. Each country is listed on the x-axis, and the percent of water use by each sector is stacked on the y-axis. The research question asks how sectoral water usage varies across countries and what global patterns exist, which is motivated by real-world concerns around water management and resource planning. It shows that agriculture is the dominant water user in most countries, especially in countries like India, Indonesia, and Saudi Arabia, where agricultural use takes up a very large majority of total water use. In contrast, countries like Germany and South Korea have a much larger share of water use in industry compared to agriculture. Household use remains fairly consistent across countries but is slightly higher in developed nations like the USA, Canada, and the UK. There is a clear pattern where less industrialized countries rely more heavily on agriculture, while highly industrialized economies show a more even split between sectors or a stronger industry presence. For statistical analysis, we can summarize that agriculture often accounts for more than 50 percent of water use across most countries, while industry and household sectors usually split the remainder, with some variation. The main conclusion is that agriculture is the leading sector in global water consumption, but the degree of dependency on agriculture versus industry and household sectors strongly reflects a country’s stage of economic development. For future work, we could explore how climate, GDP, irrigation technology, or population characteristics influence sectoral water use, which would require more detailed data beyond just usage percentages.
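Before stacking, the plotting code rescales each country's three sector percentages so that they sum to 100. The sketch below shows that step on two made-up countries using base R's `ave()` in place of the grouped `mutate()`; the data and column names are hypothetical.

```r
# Hypothetical shares for two made-up countries (NOT from the dataset);
# the raw percentages deliberately do not sum to 100
shares <- data.frame(
  Country = rep(c("X", "Y"), each = 3),
  Sector  = rep(c("Agriculture", "Industry", "Household"), times = 2),
  Percent = c(50, 30, 25, 40, 35, 15)
)

# Rescale within each country so the three sectors sum to 100
shares$Percent_norm <- ave(shares$Percent, shares$Country,
                           FUN = function(p) p / sum(p) * 100)

# Check: every country's normalized shares now total 100
tapply(shares$Percent_norm, shares$Country, sum)
```

After this rescaling every bar has the same total height, so the stacked chart compares the mix of sectors within each country rather than the amount of water used.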

hist_df <- water_df |>
  filter(Year == 2024) |>
  select(Country, Agricultural_Use_Percent, Industrial_Use_Percent, Household_Use_Percent) |>
  pivot_longer(cols = -Country, names_to = "Sector", values_to = "Percent") |>
  mutate(Sector = gsub("_Use_Percent", "", Sector),
         Sector = gsub("Agricultural", "Agriculture", Sector),
         Sector = gsub("Industrial", "Industry", Sector))



ggplot(hist_df, aes(x = Sector, y = Percent, fill = Sector)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  geom_jitter(width = 0.15, size = 2, alpha = 0.6) +
  stat_summary(fun = median, geom = "point", shape = 18, size = 4, color = "red") +
  coord_flip() +
  scale_fill_manual(values = c("Agriculture" = "#66c2a5", "Industry" = "#fc8d62", "Household" = "#8da0cb")) +
  labs(
    title = "Distribution of Water Use by Sector (2024)",
    subtitle = "Violin plot shows density + country-level values",
    x = "Sector",
    y = "Percent of Total Water Use"
  ) +
  theme_minimal()

The graph shows the distribution of water use across the agriculture, industry, and household sectors for different countries in 2024 using a violin plot. The x-axis shows the percent of total water use and the y-axis separates the three sectors. Each black dot represents a country's value, while the red diamond marks the median for each sector. The research question asks how sectoral water usage varies across countries and what global patterns exist, which is important for understanding resource management. The plot shows that agriculture has the highest typical share of water use compared to industry and households, and that agricultural water use is more tightly clustered around 50 percent across countries. Household water use is generally between 20 and 30 percent but has more spread compared to agriculture. Industry shows the widest variation, with countries ranging from about 15 percent to nearly 40 percent water use. This suggests that agriculture dominates water use consistently across most countries, while the share of water used by industry varies much more based on a country's level of industrialization. The violin shapes also highlight that household water use is relatively stable across countries but still shows some variability. The main conclusion is that agriculture remains the main water-consuming sector globally, but industrial and household water uses depend more heavily on a country's economic structure and development stage.

Question 4: How does per capita water consumption change with yearly rainfall across low and high rainfall countries?

As the black trend line in the figure below shows, the average trend across all countries was essentially flat, so we wanted to see whether the trends differed between low and high rainfall countries. To address this, we subset the data into the 5 countries with the most and the 5 countries with the least average annual precipitation, and graphed those against the overall 20-country average. The five highest rainfall countries were the United States, Spain, Brazil, India, and the U.K., in that order. The five lowest rainfall countries were Saudi Arabia, South Korea, Australia, South Africa, and Mexico, also in that order. We used locally estimated scatterplot smoothing (LOESS) to regress the per capita water usage values against the annual precipitation, giving us the trend lines shown in the figure below.

# Mean annual precipitation per country, sorted from driest to wettest
sum_rain <- water_data %>% 
  group_by(Country) %>% 
  summarize(mean_rain = mean(`Rainfall Impact (Annual Precipitation in mm)`)) %>%
  arrange(mean_rain)

# Row indices for the 5 lowest- and 5 highest-rainfall countries
low <- which(water_data$Country %in% head(sum_rain$Country, 5))
high <- which(water_data$Country %in% tail(sum_rain$Country, 5))

outliers <- c(low, high)

# Label only the selected rows; all other rows stay NA
water_data$`Rainfall Level` <- NA_character_
water_data$`Rainfall Level`[low] <- "Low"
water_data$`Rainfall Level`[high] <- "High"

ggplot(water_data[outliers, ], aes(x = `Rainfall Impact (Annual Precipitation in mm)`, y = `Per Capita Water Use (Liters per Day)`)) +
  geom_smooth(aes(color = `Rainfall Level`), se = FALSE, method = "loess") +
  geom_smooth(data = water_data, aes(x = `Rainfall Impact (Annual Precipitation in mm)`, y = `Per Capita Water Use (Liters per Day)`, color = "Mean"), se = FALSE, method = "loess") +
  geom_point(aes(color = `Rainfall Level`),alpha = 0.2, show.legend = FALSE) +
  labs(title = "Per Capita Water Use over Varying Annual Precipitation Amounts", subtitle = "Between High-Rainfall and Low-Rainfall Countries") +
  theme_minimal() + 
  xlab("Annual Precipitation (mm)") + 
  scale_color_manual(values = c("Mean" = "black", "High" = "orange", "Low" = "blue"), labels = c("Mean" = "Average of All Countries", "High" = "5 Highest Rainfall Countries \n (USA, Spain, Brazil, India, UK) \n", "Low" = "5 Lowest Rainfall Countries \n (Saudi Arabia, S. Korea, \n Australia, S. Africa, Mexico)\n")) +
  stat_cor(aes(group = `Rainfall Level`),
           method = "spearman")  # rank-based correlation, matching the discussion below

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

While the graph initially seems to show differing trends among low and high rainfall countries, particularly at higher annual precipitation levels, there is not a lot of data, so it may just be noise.

To test the significance of the trends, we looked at Spearman's R and its p-value for each trend line. Spearman's R is valid under an assumption of monotonicity (the data consistently trending in either the positive or negative direction) without requiring a linearity assumption, which is reasonable for a relationship between precipitation and water usage.
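The difference between a rank-based and a linear correlation can be seen in the following sketch, which uses made-up data where `y` always increases with `x` but not along a straight line. The data are hypothetical and unrelated to the water dataset.

```r
# Hypothetical data (NOT from the dataset): y rises with x, but not linearly
x <- 1:10
y <- x^3

pearson_r  <- cor(x, y, method = "pearson")   # measures linear association
spearman_r <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks

# cor.test() adds the p-value used to judge significance
spearman_test <- cor.test(x, y, method = "spearman")

c(pearson = pearson_r, spearman = spearman_r, p = spearman_test$p.value)
```

Because Spearman's coefficient depends only on the ordering of the values, a perfectly monotonic relationship scores 1 even when the Pearson coefficient falls short of it.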

What we found, though, was that neither trend line showed a significant R value, as both p-values were greater than 0.05. This leads us to conclude that we cannot say that annual precipitation in a high or low rainfall country has any relationship with per capita water use in that country in any given year.

Exploring per capita water use further, we considered that, while there may not be a relationship between it and annual precipitation, it could still have some difference in distribution between low and high annual rainfall countries. The figure below graphs the probability density curve of the observed values with a default bandwidth.

water_data[outliers,] %>% 
  ggplot(aes(x = `Per Capita Water Use (Liters per Day)`,color = `Rainfall Level`)) +
  geom_density(linewidth = 1) +
  labs(title = "Density of Per Capita Water Use Between High and Low Rainfall Countries") +
  scale_color_manual(values = c("High" = "orange", "Low" = "blue")) +
  theme_minimal() +
  ylab("Probability Density")

From this graph, we can see that there appears to be less variation in per capita water usage among low-rainfall countries, with greater clustering around the mean and smaller tails. However, the shapes and centers of the two distributions appear very similar. This suggests that, while typical per capita water use is about the same in both groups, its spread may differ between countries with particularly low and particularly high average yearly rainfall.

Conclusion and Further Research:

The analysis in this report reveals a number of interesting conclusions. While rainfall alone does not strongly predict water consumption or scarcity levels, it plays a meaningful role in shaping broader water usage patterns across countries. Countries with similar rainfall levels tend to share similarities in groundwater depletion and sectoral water use, as shown through clustering analysis. Over time, total water consumption has remained relatively stable within continents, with short-term fluctuations driven by regional factors. The autocorrelation analysis highlights that Asia shows the most persistent patterns, suggesting a more stable trend in water use, while other continents display more volatile usage. In 2024, agriculture remains the dominant sector for water use globally, especially in less industrialized nations, while household and industrial water use vary more by economic development. Finally, no significant relationship was found between per capita water use and yearly rainfall, regardless of whether a country is typically wet or dry.

This analysis has also opened up questions that could be answered with further research. First, how do economic factors like GDP influence sectoral water use, water scarcity, and per capita water use? Second, are there threshold effects in rainfall, where extremely low or high levels start to affect consumption patterns (sectoral and per capita) more sharply? The first question requires economic data not present in our data set, and the second would require more advanced statistical techniques to isolate non-linear patterns at the extreme ends of the data. These questions could deepen our understanding of how countries manage water under the varying environmental and economic conditions that we explored in this analysis.