Final Report

Author

Sarah Wu, Jess Zheng, John Wang

Published

April 29, 2025

Description of Dataset

This dataset contains coffee bean sales from cities in Saudi Arabia in 2023 and 2024, covering 100 customers and 5 types of coffee beans. We use it to examine sales trends, sales amounts, product types, customer locations, and consumer sensitivity to price. In particular, we ask how total coffee sales vary by city and product type, and whether final sales amounts differ among coffee types.

The categorical variables are: City, Product, Category, Used Discount. The quantitative variables are: Unit Price, Quantity, Sales Amount, Discount_Amount, Final Sales.

The source of the data is https://www.kaggle.com/datasets/halaturkialotaibi/coffee-bean-sales-dataset.

Research Question 1

Are there significant differences among average final sales amounts made for each coffee type, regardless of discount? If there are, which coffee types significantly differ from the others?

The relevant variables are: Product, Final Sales, and Used Discount.

Graph 1

coffee <- read.csv("/Users/johnwang/Desktop/DatasetForCoffeeSales2.csv")
library(tidyverse)

coffee_marginal <- coffee |> 
  group_by(Product, Used_Discount) |> 
  summarize(average_sales = mean(Final.Sales), .groups = "drop") |> 
  mutate(Discount_Label = ifelse(Used_Discount == 0, "Not Discounted", "Discounted"))

coffee_marginal |> 
  ggplot(aes(x = Product, y = average_sales)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black", width = 0.5) +
  facet_grid(~Discount_Label) +
  labs(title = "Bar Chart of Average Sales by Coffee Type & Discount",
       x = "Coffee Type", y = "Average Sales Amount") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Description and Interpretation for Graph 1

From the graph above, we can see apparent differences among certain pairs of coffee types. For sales that did not use discounts, there is a large gap between Ethiopian coffee, with average final sales of about 1,200, and Brazilian coffee, at about 750. For sales that used discounts, the gaps are less pronounced, but the average sales still differ across coffee types. To test whether these differences are statistically significant, we ran a one-way ANOVA followed by pairwise t-tests to determine which coffee types differ from each other.

Statistical Analysis for Graph 1

summary(aov(Final.Sales ~ Product, coffee))

             Df    Sum Sq Mean Sq F value   Pr(>F)    
Product       4  10531074 2632768    10.7 1.97e-08 ***
Residuals   725 178362958  246018                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
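
As a quick sanity check on the ANOVA table above, the F value can be recomputed from the printed sums of squares and degrees of freedom. The sketch below uses only the numbers shown in the table. The eta-squared effect size is an addition that does not appear in the original output, and the variable names are chosen for illustration.

```r
# Numbers copied from the ANOVA table printed above
ss_product <- 10531074    # Sum Sq, Product
df_product <- 4           # Df, Product
ss_resid   <- 178362958   # Sum Sq, Residuals
df_resid   <- 725         # Df, Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_product <- ss_product / df_product
ms_resid   <- ss_resid / df_resid

# F statistic: between-group mean square over within-group mean square
f_value <- ms_product / ms_resid    # about 10.7, matching the table

# Upper-tail probability of the F distribution; should reproduce Pr(>F) up to rounding
p_value <- pf(f_value, df_product, df_resid, lower.tail = FALSE)

# Eta squared: share of total variation in Final.Sales explained by Product
eta_sq <- ss_product / (ss_product + ss_resid)    # about 0.056

print(c(F = f_value, p = p_value, eta_sq = eta_sq))
```

An eta squared of roughly 0.056 suggests that, although the differences among coffee types are statistically significant, coffee type accounts for only a small share of the variation in final sales.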

pairwise.t.test(coffee$Final.Sales, coffee$Product)


    Pairwise comparisons using t tests with pooled SD 

data:  coffee$Final.Sales and coffee$Product 

           Brazilian Colombian Costa Rica Ethiopian
Colombian  0.00014   -         -          -        
Costa Rica 0.03836   0.32378   -          -        
Ethiopian  1.5e-07   0.32565   0.00698    -        
Guatemala  0.32565   0.00978   0.32565    4.4e-05  

P value adjustment method: holm

We performed a one-way ANOVA and pairwise t-tests to test our hypothesis. The null hypothesis is that the mean final sales amount is the same for every coffee type. The ANOVA yielded a p-value of 1.97e-08 < 0.05, leading us to reject the null hypothesis and conclude that at least one coffee type has a different mean. We then ran pairwise t-tests with pooled SD and Holm-adjusted p-values to determine which coffee types differ. At the 0.05 level, the pairs (Colombian, Brazilian), (Costa Rica, Brazilian), (Ethiopian, Brazilian), (Ethiopian, Costa Rica), (Guatemala, Colombian), and (Guatemala, Ethiopian) have significantly different average final sales. These differences can inform future sales decisions. For example, coffee sellers may want to focus on stocking and selling more Ethiopian coffee, given its high average final sales.
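
The list of significant pairs can also be extracted programmatically rather than read off the table by eye. The sketch below re-enters the Holm-adjusted p-values exactly as printed in the output above and filters them at the 0.05 level. The names `p_adj` and `sig_pairs` are illustrative.

```r
# Holm-adjusted p-values copied from the pairwise.t.test output above
p_adj <- matrix(
  c(0.00014, NA,      NA,      NA,
    0.03836, 0.32378, NA,      NA,
    1.5e-07, 0.32565, 0.00698, NA,
    0.32565, 0.00978, 0.32565, 4.4e-05),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Colombian", "Costa Rica", "Ethiopian", "Guatemala"),
    c("Brazilian", "Colombian", "Costa Rica", "Ethiopian")
  )
)

# Positions of all adjusted p-values below 0.05 (NA entries are skipped by which())
idx <- which(p_adj < 0.05, arr.ind = TRUE)

# One row per significant pair, sorted from strongest to weakest evidence
sig_pairs <- data.frame(
  type_1 = rownames(p_adj)[idx[, "row"]],
  type_2 = colnames(p_adj)[idx[, "col"]],
  p_holm = p_adj[idx]
)
sig_pairs <- sig_pairs[order(sig_pairs$p_holm), ]
print(sig_pairs, row.names = FALSE)
```

With the full dataset loaded, the same matrix is available directly as the `p.value` component of the object returned by `pairwise.t.test()`, so the filtering step can be applied without retyping the values.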

Graph 2

coffee |> 
  ggplot(aes(x = Final.Sales, y = Product, fill = Product)) +
  geom_violin(fill = "skyblue", alpha = 0.5) +
  geom_boxplot(width = 0.2, alpha = 0.5) +
  labs(title = "Violin plot and Boxplot of Final Sales")

Description and Interpretation for Graph 2

We used a violin plot to further visualize the distribution of final sales data for each coffee type. There does not appear to be any clustering around the median. Rather, all the final sales amounts are fairly evenly distributed. We can see the purchasing patterns of customers here: instead of buying at a set quantity, customers prefer to purchase at varying amounts to accommodate their needs. Therefore, for coffee sellers, it is important to have flexibility when it comes to selling coffee beans.

Research Question 2

Are coffee bean customers sensitive to price? In other words, how do discounts affect quantity purchased?

The relevant variables are: Date, Unit Price, Used Discount, Quantity, and Final Sales.

Graph 3

coffee2 <- coffee %>%
  group_by(Unit.Price, Used_Discount) %>%
  summarize(Total.Quantity = sum(Quantity), .groups = "drop")

coffee2 %>%
  ggplot(aes(x = factor(Unit.Price), y = Total.Quantity, fill = Used_Discount)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Consumer Purchase Amount vs Price", x = "Price", y = "Quantity Sold",
       fill = "Used Discount")

Description and Interpretation for Graph 3

It seems that coffee beans priced at 35 dollars were the most popular and those priced at 45 dollars were the least popular. This suggests that most customers are satisfied with mid-priced coffee beans. Approximately half of the quantity sold was purchased with a discount and half without. There are two possible explanations: either consumers were not very sensitive to the discount (20% off), or customers must meet certain conditions to receive a discount. If all customers had the same opportunity to receive a discount, then customers' demand appears inelastic, that is, not responsive to small changes in price.
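
One way to put the price-sensitivity reading on firmer footing is a two-sample test comparing the quantity per purchase with and without a discount. The sketch below is illustrative only: the data frame `toy` is simulated and is not taken from the Kaggle file, and it assumes the column names `Quantity` and `Used_Discount` used in the code above.

```r
# Simulated toy data for illustration only; NOT the real coffee dataset
set.seed(1)
toy <- data.frame(
  Used_Discount = rep(c(FALSE, TRUE), each = 50),
  Quantity      = c(rpois(50, lambda = 25), rpois(50, lambda = 26))
)

# Welch two-sample t-test: does mean quantity per purchase differ by discount use?
res <- t.test(Quantity ~ Used_Discount, data = toy)

res$estimate   # mean quantity in each group
res$conf.int   # confidence interval for the difference in means
res$p.value    # a large p-value would be consistent with low price sensitivity
```

On the real data, the equivalent call would be `t.test(Quantity ~ Used_Discount, data = coffee)`. A non-significant result would not prove that demand is inelastic, but it would be consistent with the interpretation given above.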

Graph 4

coffee3 <- coffee %>%
  group_by(Date, Used_Discount) %>%
  summarize(Total.Amount = sum(Final.Sales), .groups = "drop")

# parse the date strings so that the x-axis is ordered chronologically
coffee3$Date <- as.Date(coffee3$Date, format = "%m/%d/%Y")

coffee3 |>
  ggplot(aes(x = Date, y = Total.Amount, color = Used_Discount)) +
  geom_line() +
  scale_x_date(date_breaks = "2 month", date_labels = "%b %d") +
  labs(y = "Total Sales", 
  title = "Total Sales Amount of Discounted and Non-Discounted Coffee Beans Over Time")

Description and Interpretation for Graph 4

In general, purchases made without discounts had higher total sales throughout the two years. There are two possible reasons for this: the non-discounted purchases involved larger quantities, or the effective unit price was higher. The effective unit price was certainly higher, since no discount was applied, so a question for further study is whether less price-sensitive customers also buy larger quantities. Throughout each year, total sales oscillate with no clear pattern. Overall, however, the total sales amount of non-discounted products increased from 2023 to 2024.
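
The year-over-year claim can be checked with a small table of yearly totals by discount status. The sketch below uses a made-up data frame `toy` (six invented rows, not the real data) with the same column names and date format as the code above.

```r
# Made-up rows for illustration only; NOT the real coffee dataset
toy <- data.frame(
  Date          = c("1/15/2023", "6/3/2023", "11/20/2023",
                    "2/9/2024", "7/4/2024", "12/1/2024"),
  Used_Discount = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Final.Sales   = c(900, 640, 1100, 1250, 700, 1300)
)

# Parse the dates and extract the calendar year
toy$Date <- as.Date(toy$Date, format = "%m/%d/%Y")
toy$Year <- format(toy$Date, "%Y")

# Total final sales for each year (rows) and discount status (columns)
yearly <- xtabs(Final.Sales ~ Year + Used_Discount, data = toy)
print(yearly)

# Relative change from 2023 to 2024 within each discount group
growth <- yearly["2024", ] / yearly["2023", ] - 1
print(round(growth, 3))
```

Replacing `toy` with `coffee` would give the actual yearly totals, which could then be quoted alongside the line chart.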

Research Question 3

How do coffee’s total sales vary by city and product type over time?

The relevant variables are: Date, Final Sales, Product, and City.

Graph 5

# convert dates into uniform format
coffee1 <- coffee %>%
  mutate(Date = mdy(Date),
         Month = format(Date, "%Y-%m"))

# calculate final sales data, sorted by variables month, city, and product
grouped_coffee <- coffee1 %>%
  group_by(Month, City, Product) %>%
  summarise(Final_Sales = sum(`Final.Sales`, na.rm = TRUE), .groups = "drop")

# plot
ggplot(grouped_coffee, aes(x = Month, y = Final_Sales, color = City, shape = Product)) +
  geom_point(size = 3) +
  labs(title = "Monthly Final Sales, Sorted by City and Product",
       x = "Month",
       y = "Final Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Description and Interpretation for Graph 5

This plot shows how final sales of coffee vary over time across cities and coffee types. For example, certain products, such as Colombian and Ethiopian coffee, often show higher final sales. Encoding product as point shape and city as color helps identify which products are most popular in which regions over time. This visualization is informative because it allows us to look for sales trends, regional preferences, and potential seasonality.

Graph 6

heat_data_simple <- coffee1 %>%
  group_by(City, Product) %>%
  summarise(Total_Sales = sum(Final.Sales, na.rm = TRUE), .groups = "drop")

# Plot the heat map
ggplot(heat_data_simple, aes(x = Product, y = City, fill = Total_Sales)) +
  geom_tile() +
  labs(
    title = "Total Coffee Sales by City and Product",
    x = "Product Type",
    y = "City",
    fill = "Total Sales")

Description and Interpretation for Graph 6

The heat map shows how total coffee sales vary across cities and product types. Lighter tiles represent higher sales volumes, and darker tiles represent lower sales volumes. Across product types, we observe that Ethiopian coffee has notably high sales in Riyadh, Mecca, and Jeddah, while Colombian coffee performs particularly well in Dammam. On the other hand, Brazilian coffee seems to be of less interest across all cities compared to other product types. Overall, this visualization highlights clear regional preferences, suggesting that certain cities have strong demand for specific coffee types.
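
Because raw totals mix a city's overall size with its product preferences, it can help to look at each product's share of sales within a city. The sketch below uses a made-up data frame `toy` (invented numbers, not the real data) with the same column names as the code above.

```r
# Made-up totals for illustration only; NOT the real coffee dataset
toy <- data.frame(
  City        = c("Riyadh", "Riyadh", "Riyadh", "Dammam", "Dammam", "Dammam"),
  Product     = rep(c("Brazilian", "Colombian", "Ethiopian"), times = 2),
  Final.Sales = c(100, 300, 600, 50, 100, 50)
)

# City-by-product table of total final sales
totals <- xtabs(Final.Sales ~ City + Product, data = toy)

# Divide each row by its row total so every city's shares sum to 1
shares <- prop.table(totals, margin = 1)
print(round(shares, 2))
```

A heat map of `shares` instead of raw totals would show regional preferences independently of how much coffee each city buys overall.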

Conclusion

In summary, our analysis of 2023 and 2024 coffee bean sales across five major bean types in Saudi cities reveals three key aspects: regional product preferences, price sensitivity and discount effects, and statistical differences in average sales by bean type. These findings could help those in the coffee bean industry design promotion strategies. Ethiopian and Colombian coffee beans consistently outperform others in Riyadh, Mecca, Jeddah, and Dammam, while Brazilian beans lag in sales across all cities. Time series and heat map visualizations show that Ethiopian coffee, in particular, drives the highest total sales volumes, signaling that retailers should prioritize stocking these popular products to meet strong local demand.

On pricing and promotions, we observe that mid-tier beans (around $35 per unit) achieve the highest sales volumes, indicating customer preference for moderate price points. Moreover, a simple uniform discount (20% off the original price) has only a marginal effect on revenue. Non-discounted sales still generate more income due to higher per-unit prices, and may even outperform discounted sales in volume. This suggests that more information on the incentive structure, such as the terms and conditions for the discount, is needed to understand how discounts shift purchasing behavior.

Finally, our statistical tests confirm significant differences in mean final sale amounts across bean types (ANOVA), with t-tests identifying specific pairs (such as Ethiopian vs. Brazilian, Colombian vs. Brazilian) that account for these disparities. Violin plot distributions further reveal that customers buy variable quantities rather than fixed pack sizes, implying the importance of flexible packaging and bundle options in boosting sales.

Together, these insights can guide inventory decisions, pricing strategies, and promotional designs to optimize revenue and better align offerings with customer preferences.

Future Work

While the above graphical analyses have provided valuable insights into coffee sales, there are several important questions that remain unanswered and could be explored in future work. One of the key questions involves understanding the impact of seasonal demand on coffee sales. Although we have analyzed sales by city and product type, we did not account for potential seasonal fluctuations in consumer behavior. Future work could include incorporating seasonal data to analyze how coffee sales vary during different times of the year, which may reveal trends such as increased coffee consumption during colder months or holidays.
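
A first pass at the seasonal question could average sales by calendar month across both years. The sketch below shows the idea on a made-up data frame `toy` (four invented rows, not the real data); the column name `MonthNum` is introduced here for illustration.

```r
# Made-up rows for illustration only; NOT the real coffee dataset
toy <- data.frame(
  Date        = as.Date(c("2023-01-10", "2023-07-12", "2024-01-08", "2024-07-15")),
  Final.Sales = c(1000, 600, 1200, 800)
)

# Calendar month as a number (1 to 12), pooled across years
toy$MonthNum <- as.integer(format(toy$Date, "%m"))

# Mean final sales for each calendar month
seasonal <- tapply(toy$Final.Sales, toy$MonthNum, mean)
print(seasonal)
```

With only two years of data, any monthly pattern found this way would be tentative and would need more years of data to confirm.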

Another area for future research is exploring customer price sensitivity in more depth. While we identified that non-discounted products generally had higher sales, a more in-depth analysis could assess how different price ranges and discount levels influence purchase volume across various coffee types. For instance, incorporating a broader set of data points, such as customer demographics or specific promotional campaigns, would help determine if certain customer segments are more price-sensitive than others. Additionally, exploring the effect of larger quantities or bulk purchases could also provide more insights into consumer behavior.

These unanswered questions are well-motivated by the work completed thus far. By extending the analysis in these areas, future research could provide deeper insights into the dynamics of coffee sales, which can allow for more targeted marketing strategies and better forecasting models.