36-315 Final Project Report

Author

Hariharan Manikandamurthy, Mahi Saraf, Fahad Hamdan, Saahil Lathi

Published

April 28, 2025

1 Introduction

In an increasingly digital world, cybersecurity threats have become one of the most pressing risks facing governments, businesses, and individuals. Our project seeks to explore the patterns, severity, and sources of cyberattacks globally over the last decade. Through exploratory data analysis, we aim to better understand how the nature of these threats has evolved, which industries and countries have been most impacted, and what types of vulnerabilities are most often exploited. This work can inform risk management strategies and future cybersecurity investments.

2 Dataset Description

We used the Global Cybersecurity Threats (2015–2024) dataset from Kaggle. This dataset was created by compiling real-world information from cybersecurity reports, threat intelligence platforms, and government archives, and was cross-verified with cybersecurity experts and firms.

Each row in the dataset represents a unique cybersecurity incident that occurred between 2015 and 2024. The key variables captured include:

Country: Where the attack occurred
Year: When the attack occurred
Attack.Type: Type of cyberattack (e.g., Phishing, Ransomware, DDoS)
Target.Industry: Industry affected (e.g., Finance, Healthcare, IT)
Financial.Loss..in.Million...: Estimated financial impact of the attack measured in millions of dollars
Number.of.Affected.Users: How many users were impacted
Attack.Source: Who perpetrated the attack (Hacker Group, Nation-state, Insider)
Security.Vulnerability.Type: Type of vulnerability exploited (e.g., weak passwords, social engineering)
Defense.Mechanism.Used: Defensive tools or methods used (e.g., VPN, Firewall, AI-based Detection)
Incident.Resolution.Time..in.Hours.: How long it took to resolve the incident, measured in hours

The dataset is rich enough to enable cross-sectional and time-series exploration across different threat vectors, industries, and countries.

3 Research Questions

In our project, we aimed to explore the following four research questions:

How has the severity of cybersecurity threats changed over time (2015–2024)?
We want to examine whether incidents have become more financially damaging or harder to resolve over the past decade.
Which industries have suffered the highest financial losses from cyberattacks?
We want to investigate which sectors experience the greatest financial impacts from cyber incidents.
How do the sources of cyberattacks vary across countries, and which nations face the highest volume of insider, nation-state, or hacker group threats?
We want to explore the global distribution of attack sources, uncovering regional differences in the nature of cybersecurity risks.
What are the fastest vs. slowest resolved attack types — and what do they cost?
We want to analyze which types of attacks take the longest to resolve, and whether longer resolution times correlate with higher financial losses.

To address each of these research questions, we will create a series of visualizations that explore the relevant variables, highlight key trends, and provide meaningful insights into global cybersecurity threats.

4 Question 1

The line chart shows how many “critical” cybersecurity incidents were logged each year between 2015 and 2024. Starting at 277 incidents in 2015, the count climbs to a peak of 319 by 2017, an increase of about 15% in two years. The count then slips to 310 in 2018 (-2.8%) and drops much more sharply in 2019 to 263 incidents, about 15% below the previous year and the lowest point on the graph. The trough is followed by an equally sharp rebound: in 2020 the count jumps back up to 315 incidents, a surge of almost 20% that essentially erases the 2019 drop.

From 2021 through 2024 the pattern is less volatile. Annual counts cluster in a fairly tight range, between roughly 299 and 318 incidents, with slight ups and downs rather than big swings. The smoothed blue trend line makes this clear: it rises through 2017, bottoms out in 2019, climbs again in 2020, and then flattens with a slight taper after 2022. Overall, the decade ends about 8% higher than it began, but the swings along the way show that risk can change quickly from one year to the next.
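The year-over-year percentages quoted for 2015 to 2020 can be checked with a few lines of arithmetic. The sketch below is in Python purely for illustration (the report's own analysis code is not shown) and uses only the incident counts stated in the text; the 2016 count is not given, so the first comparison spans 2015 to 2017.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100

# Incident counts quoted in the text (the 2016 value is not stated).
counts = {2015: 277, 2017: 319, 2018: 310, 2019: 263, 2020: 315}

# Pairs of years compared in the discussion above.
pairs = [(2015, 2017), (2017, 2018), (2018, 2019), (2019, 2020)]
changes = {
    (start, end): round(pct_change(counts[start], counts[end]), 1)
    for start, end in pairs
}

for (start, end), change in changes.items():
    print(f"{start} -> {end}: {change:+.1f}%")
```

The results (+15.2%, -2.8%, -15.2%, +19.8%) agree with the rounded figures in the text.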

In addition to frequency, we wanted to understand whether the financial severity of incidents also changed, so we made a graph showing the average financial loss over time.

We used a 3-year moving average because the window is long enough to smooth out the data while still short enough to show change from year to year. Based on the 3-year average, financial loss decreased from 2016 to 2019 and increased from 2019 to 2024, resulting in a slight overall increase from 2015 to 2024. As with the number of incidents, there was a spike from 2020 to 2021, which we attribute to the pandemic, but the 3-year average smooths this out so that the graph shows the overall trend rather than emphasizing a one-year spike from an outside event. From these two graphs, our overall conclusion is that from 2015 to 2024 cyberattacks became slightly more frequent and slightly more costly. After looking at the overall time trend, we wanted to see whether there were patterns between years, and whether a past year's financial loss and number of incidents predict those of future years.
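As a rough illustration of the smoothing step, the sketch below computes a 3-year moving average in plain Python. It assumes a trailing window (each value averages the current year and the two before it); the report does not say whether the window was trailing or centered. The yearly loss values are hypothetical placeholders, not figures from the dataset.

```python
def moving_average(values, window=3):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

years = list(range(2015, 2025))
# Hypothetical average losses per year (in millions of dollars), for illustration only.
avg_loss = [50.0, 53.0, 47.0, 50.0, 44.0, 56.0, 50.0, 53.0, 47.0, 56.0]

smoothed = moving_average(avg_loss, window=3)

# With a trailing window, the first smoothed value belongs to the third year.
for year, value in zip(years[2:], smoothed):
    print(year, round(value, 2))
```

A wider window would smooth more aggressively but leave fewer points from a ten-year series, which is the trade-off described above.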

To explore this, we created autocorrelation plots of average financial loss and incident count, with the y-axis showing autocorrelation and the x-axis showing lag, the number of years between observations. The dashed blue lines mark the bounds of a 95% confidence interval. Lag 1 shows a slightly negative autocorrelation for both financial loss and number of incidents, meaning that a year with high financial loss and many incidents tended to be followed by a slightly lower year. At lag 2, the autocorrelation stays slightly negative for incident count but is slightly positive for financial loss. Across all lags, the autocorrelations look random and fall within the dashed bounds, meaning they are not statistically significant. For financial loss, the autocorrelation does appear to alternate between negative and positive; however, the bars are too small to be significant. Overall, we concluded that there is no strong pattern: cyberattack severity (financial loss and number of incidents) in one year tells us little about the next.
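To make the autocorrelation plot concrete, here is a minimal Python sketch of the sample autocorrelation at each lag, together with the usual approximate 95% bounds of ±1.96/√n for a series with no autocorrelation. We assume the dashed lines in the plot follow this standard convention. The yearly series is hypothetical.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

def white_noise_bound(n, z=1.96):
    """Approximate 95% bound for the autocorrelation of an uncorrelated series."""
    return z / math.sqrt(n)

# Hypothetical yearly values (one per year, 2015-2024), for illustration only.
yearly_values = [50.0, 53.0, 47.0, 50.0, 44.0, 56.0, 50.0, 53.0, 47.0, 56.0]

r = acf(yearly_values, max_lag=5)
bound = white_noise_bound(len(yearly_values))

# Lags whose autocorrelation falls outside the bounds would be flagged as significant.
significant_lags = [k for k in range(1, len(r)) if abs(r[k]) > bound]
print([round(v, 3) for v in r], round(bound, 3), significant_lags)
```

With only ten yearly observations the bound is about ±0.62, so an autocorrelation has to be quite large before it would register as significant.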

5 Question 2

To explore which industries have been most financially affected by cyberattacks, we created a box plot that shows the distribution of financial loss for each industry. The graph shows the interquartile range and the median financial loss of each industry, and the dotted line represents the average financial loss across all industries. From the graph, we can see that the Government and IT industries have the highest median financial loss from cyberattacks.

To understand whether the differences from the overall mean were statistically significant, we ran a one-sample t-test for each industry, with the null hypothesis that the industry's mean equals the overall mean. The p-values for all industries were above 0.05. Specifically, the p-values for the Government and IT industries were 0.13 and 0.27, meaning that while these industries show relatively higher financial loss in the graph, their difference from the overall mean is not statistically significant. Even so, Government and IT show the highest median losses in our data, so both industries remain reasonable priorities for cybersecurity investment.
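The test statistic behind a one-sample t-test is simple to compute by hand. The sketch below, in Python with only the standard library, shows the calculation for one industry against a fixed reference mean. The loss values and the reference mean are hypothetical. Converting the statistic to a p-value needs the t distribution with n - 1 degrees of freedom, which the standard library does not provide, so that step is left out.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - mu0) / standard_error, n - 1

# Hypothetical losses (millions of dollars) for one industry, for illustration only.
industry_losses = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Hypothetical overall mean across all industries.
overall_mean = 4.0

t_stat, dof = one_sample_t(industry_losses, overall_mean)
print(round(t_stat, 4), dof)
```

A small absolute t statistic relative to the t distribution gives a large p-value, as reported above for Government (0.13) and IT (0.27).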

However, this graph does not tell us which types of attacks those investments should focus on preventing. To understand this better, we created a heat map that breaks down the financial loss in each industry by attack type.

In the heat map, each rectangle shows the financial loss for one combination of industry and attack type. We can see that DDoS accounts for the highest financial loss in IT at around $4.75M, followed closely by Man-in-the-Middle and phishing. The financial loss for Government is more evenly distributed, with DDoS, SQL injection, and ransomware all causing significant losses. From this, we can see that it makes sense to target IT cybersecurity investments at DDoS, Man-in-the-Middle, and phishing, because they account for the majority of the financial loss. In the Government sector, it might be smarter to invest more broadly across different areas of cybersecurity, as no attack type overwhelmingly dominates the financial losses.

Looking at the other industries, phishing causes the most financial loss in retail and banking, DDoS causes the most in telecommunications, and malware causes the most in healthcare. As with Government and IT, looking at the distributions for these industries is also useful. For example, the majority of financial loss in banking is caused by phishing, so investment there should be targeted at defenses against phishing, while losses in education are more evenly distributed, so investment there should be broader.

6 Question 3

To explore how the sources of cyberattacks vary across countries, we created a stacked bar chart which displays the distribution of attack sources across ten major countries. Each bar represents the total number of attacks recorded in a country from 2015 to 2024, segmented by attack source: hacker groups, insiders, nation-states, and unknown sources. We chose a stacked bar chart because it allows us to compare both the overall volume of cyberattacks and the proportional breakdown of attack sources within each country in a single visualization.

From the chart, we observe that hacker groups remain one of the most significant sources of cyberattacks across many countries, although in some cases, nation-state and insider attacks also contribute heavily to the overall threat landscape. For example, the United Kingdom shows a particularly high concentration of hacker-driven attacks, while India exhibits a relatively larger share of insider-driven threats compared to other nations. These findings suggest that while hacker groups represent a widespread global risk, different countries face distinct profiles of cybersecurity vulnerabilities.

One limitation of the stacked bar chart is that smaller categories like insider or nation-state attacks can be harder to compare precisely across countries. However, it still provides a strong overall understanding of the dominant threat sources by region. To build on this high-level view, we next apply Principal Component Analysis (PCA) to uncover deeper patterns and groupings among countries based on their cyberattack profiles.

Importance of components:
                          PC1    PC2    PC3     PC4
Standard deviation     1.3612 1.1493 0.7632 0.49392
Proportion of Variance 0.4632 0.3302 0.1456 0.06099
Cumulative Proportion  0.4632 0.7934 0.9390 1.00000

After performing Principal Component Analysis (PCA) on the attack source data across countries, we find that the first two principal components capture a substantial amount of the variability in the dataset. Specifically, the first principal component (PC1) accounts for approximately 46% of the total variance, while the second principal component (PC2) explains about 33%. Together, PC1 and PC2 capture roughly 79% of the overall variability in attack source patterns across countries. This indicates that a two-dimensional representation preserves most of the important structure in the data.
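The variance proportions in the PCA summary follow directly from the reported standard deviations: each component's variance is its standard deviation squared, and its proportion is that variance divided by the total. The Python sketch below reproduces the table's second and third rows from its first row, using only the numbers printed above.

```python
# Standard deviations of PC1..PC4 as printed in the PCA summary.
sdev = [1.3612, 1.1493, 0.7632, 0.49392]

variances = [s ** 2 for s in sdev]
total_variance = sum(variances)

# Share of total variance explained by each component.
proportion = [v / total_variance for v in variances]

# Running total of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.4f} cumulative={c:.4f}")
```

The variances sum to roughly 4, which is what one would expect if the four attack-source variables were standardized before the PCA; that is our inference from the numbers, not something stated in the output.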

Given this strong coverage of variance, we proceed to create a scatter plot of the first two principal components. This will allow us to visualize how countries cluster based on the similarities and differences in their cybersecurity threat profiles.

The plot places each country on the first two principal components of the country-level attack source data, which together capture roughly 79% of the total variance. Countries that are positioned closer together on the plot share more similar distributions of cyberattack sources, while those farther apart differ more in their threat profiles.

From the visualization, we observe that Australia, Russia, and China cluster relatively close together, suggesting that they experience similar mixes of hacker group, insider, and nation-state attacks. In contrast, India is positioned distinctly away from the main cluster, indicating a unique threat composition, potentially due to a higher share of insider-driven attacks. The United Kingdom and Brazil are also positioned farther from the central grouping, suggesting that they face comparatively different cybersecurity threat landscapes.

We decided to overlay density contours to help highlight these groupings more clearly, showing regions of higher concentration of similar countries. By adding contours, we can better identify where countries naturally group together and where outliers emerge.

Overall, the PCA plot provides a highly useful, reduced-dimension view of how countries differ in their cybersecurity risks. It complements the stacked bar chart that we plotted earlier, by uncovering hidden structural patterns that would have been harder to detect through simple visual comparisons alone.

7 Question 4

To explore the relationship between the financial impact and resolution time of different types of cyberattacks, we first constructed a bubble chart. This visualization maps the average resolution time (in hours) against the average financial loss (in millions of dollars) for each attack type, with bubble size representing the frequency of each attack type.
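The three quantities behind each bubble (frequency, average resolution time, and average financial loss per attack type) come from a simple group-and-summarize step. Here is a minimal Python sketch of that aggregation. The rows and the field names `attack_type`, `hours`, and `loss_musd` are hypothetical stand-ins for the dataset's columns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident rows, for illustration only.
incidents = [
    {"attack_type": "DDoS", "hours": 30, "loss_musd": 60.0},
    {"attack_type": "DDoS", "hours": 40, "loss_musd": 70.0},
    {"attack_type": "Phishing", "hours": 20, "loss_musd": 55.0},
    {"attack_type": "Malware", "hours": 50, "loss_musd": 40.0},
    {"attack_type": "Malware", "hours": 60, "loss_musd": 50.0},
]

def summarize_by_attack_type(rows):
    """Return count, mean resolution time, and mean loss for each attack type."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["attack_type"]].append(row)
    return {
        attack_type: {
            "count": len(group),                                # bubble size
            "mean_hours": mean([r["hours"] for r in group]),    # x-axis
            "mean_loss": mean([r["loss_musd"] for r in group]), # y-axis
        }
        for attack_type, group in groups.items()
    }

summary = summarize_by_attack_type(incidents)
for attack_type, stats in summary.items():
    print(attack_type, stats)
```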

From the bubble chart, we observed that certain attacks, such as DDoS and Phishing, tend to involve higher financial losses despite relatively moderate resolution times. In contrast, attacks like Malware and SQL Injection appeared to involve slightly longer resolution periods with varying levels of financial impact.

Motivated by these observations, we next shifted our focus toward understanding how operational response to cyberattacks has evolved over time. Specifically, we investigated whether the average resolution time for each attack type has systematically changed from 2015 to 2024. To examine this, we applied separate simple linear regressions for each attack type, modeling resolution time as a function of the year of occurrence.

# A tibble: 6 × 4
# Groups:   Attack Type [6]
  `Attack Type`       slope p_value r_squared
  <chr>               <dbl>   <dbl>     <dbl>
1 Malware            0.395   0.213  0.00321  
2 DDoS               0.114   0.718  0.000247 
3 Man-in-the-Middle  0.0814  0.810  0.000127 
4 Phishing           0.0700  0.829  0.0000887
5 Ransomware        -0.347   0.278  0.00240  
6 SQL Injection     -0.579   0.0706 0.00651

After fitting the separate linear regression models for each attack type, we examined whether resolution times have systematically changed over the years. In each model, Incident Resolution Time (in Hours) was regressed against Year for that attack type.
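For reference, the slope and R-squared of a simple linear regression can be computed directly from sums of squares. The Python sketch below fits resolution time against year for a single attack type. The data points are hypothetical, and the p-value calculation (which needs the t distribution) is omitted.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variation in y explained by x; reported as 0 if y is constant.
    r_squared = (sxy ** 2) / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared

# Hypothetical (year, resolution hours) observations for one attack type.
years = [2015, 2016, 2017, 2018, 2019]
hours = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept, r_squared = simple_linear_regression(years, hours)
print(slope, r_squared)
```

In the report's setting the same fit would be repeated once per attack type. A slope near zero with a very small R-squared, as in the table above, means year explains almost none of the variation in resolution time.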

The regression results show that most attack types have small slopes near zero, with both positive and negative directions. Notably, Ransomware and SQL Injection attacks exhibit negative slopes, suggesting a slight decrease in resolution times over the years. However, the p-values for all attack types exceed the α = 0.10 significance threshold, apart from SQL Injection (p ≈ 0.07), and the R-squared values are extremely low (all below 0.01), indicating that year explains very little of the variation in resolution times.

These results suggest that while there are directional trends for certain attack types, we cannot conclude from this data alone that resolution times have meaningfully improved for any specific attack: only SQL Injection is significant at the 0.10 level, and even there year explains less than 1% of the variation. To better visualize these patterns and the relative magnitudes of these trends, we next present a bar plot of regression slopes for each attack type.

The bar plot above summarizes the estimated slope of resolution time over year for each attack type. Positive slopes indicate that resolution times have increased over the years, while negative slopes indicate decreasing resolution times. A red dashed vertical line at zero helps to easily distinguish between these two trends.

From the visualization, we observe that SQL Injection and Ransomware attacks exhibit negative slopes, suggesting that average resolution times for these attack types have declined slightly between 2015 and 2024. Conversely, attack types such as Malware, DDoS, Man-in-the-Middle, and Phishing show positive slopes, indicating modest increases in resolution time over the same period.

However, it is important to interpret these trends cautiously. As previously discussed, none of the regression slopes were statistically significant at the α = 0.10 level except for SQL Injection, and all R-squared values were extremely low. This suggests that year explains little of the variation in resolution times, and the observed trends could simply reflect random variation rather than systematic improvements or deteriorations.

Overall, while there are directional patterns that warrant further exploration, our analysis does not provide strong evidence that resolution times for cyberattacks have meaningfully improved or worsened over the past decade.

8 Conclusion

Our analysis of global cybersecurity threats from 2015 to 2024 revealed several important insights. First, we found that the number of cybersecurity incidents and the average financial losses generally increased over the past 10 years, with a spike between 2019 and 2021 that coincided with the COVID-19 pandemic. Despite this rise, autocorrelation analysis showed little predictability from year to year, which suggests that year-to-year changes in cyberattack severity are mostly random.

After looking at the overall trend of cyberattacks, we examined the impacts on specific industries, and saw that the Government and IT industries suffered the highest financial losses, although the differences were not statistically significant in comparison to the overall average. Through the heat map analysis, we found that targeted cybersecurity investments for IT should focus on DDoS, Man-in-the-Middle, and phishing attacks, while investments for Government should be broader, as financial losses were more evenly spread across multiple attack types.

We then analyzed differences in cyberattacks among countries. The stacked bar chart and PCA indicated that while cyberattacks are a major threat all over the world, attack profiles vary between countries: the PCA showed a cluster of nations such as Russia, China, and Australia, and outliers such as India and the UK. Finally, when looking at operational efficiency, we did not find strong evidence that resolution times for different cyberattack types improved from 2015 to 2024, although SQL Injection showed a decrease that was significant only at the 0.10 level.

Overall, our findings emphasize that cybersecurity risks are growing and remain volatile, requiring targeted investment in defenses against dominant threat types and broad planning to address the fast-evolving future of cyberattacks.

9 Future Research

While our report reached several conclusions, it left a number of questions unanswered that future work could address. First, we did not explore which defense mechanisms are most effective. Analyzing the effectiveness of various defenses could help determine how to invest in cybersecurity in the future. However, this would require more data, because effectiveness depends on more than incident resolution time, such as the severity of the attack and whether the attack was prevented entirely. Second, we used the autocorrelation plots to check whether cyberattacks were related from year to year, but we could also examine whether cyberattacks influence each other within a shorter time frame, which might even reveal potentially coordinated attacks. This would require data with specific dates. Third, we could look at how different users, such as individual consumers versus larger public companies, are impacted. Our report focused on overall trends and on specific industries and countries, but this is an interesting question for the future. Lastly, while we looked at trends over the past 10 years, we did not forecast future cyber risk or compare different investment scenarios. This is more complex and would also require more data.