36-315_Final_Report - Identifying Trends in Patients with Brain Tumors

Shruti Srinivasan, Adrian Lee, Audrey Yang, Nicholas Mesa-Cucalon

Published

April 25, 2025

Data Overview

Our dataset describes brain tumors in patients. It comes from Kaggle and was created by Arif Miah. It contains 20,000 observations of 19 variables, both quantitative and qualitative, describing the tumors, patients, and treatments. It was originally created for testing machine learning models and consists of synthetic data designed to line up with past observations about brain tumors. It therefore contains no real patient information, is publicly available, and can be used with no HIPAA concerns.

Core Research Questions

In this analysis, each section investigates a different set of questions. Section 1 focuses on how tumor characteristics, such as tumor type and growth rate, affect survival rate, including whether tumor growth rate is correlated with survival rate. Section 2 focuses on the efficacy of different treatment types depending on the location of the tumor in the brain, and on whether the number of treatments affects survival rate in a statistically significant manner. Section 3 focuses on whether tumor growth rates differ across treatment combinations within a given stage, and on whether there is any relationship between treatment, tumor growth rate, and stage. Finally, Section 4 investigates how radiation treatment affects tumor size, both for each tumor type and for all tumors in aggregate, and whether there is a relationship between age, treatment type, and survival rate.

Section 1

We wanted to learn about the impact of a tumor’s characteristics, specifically those relating to its aggressiveness, on a patient’s estimated survival rate. Thus, we focused on the variables tumor growth rate, survival rate, and tumor type to answer this question.

Contour Map

The above graph shows no clear correlation between tumor growth rate and estimated survival rate, either overall or within the facets for tumor type. After plotting the data points as a scatterplot, we overlaid a contour plot and a linear model line to check for clusters or relationships that the scatterplot alone might hide, but these provided no additional information.

PCA

In this PCA plot of the variables tumor growth rate and survival rate, we see that the vectors for the two variables are at a right angle. This suggests that they are uncorrelated, supporting the conclusion of the first graph that the variables do not share a relationship for this dataset.

Linear Model

                    Estimate Std. Error     t value  Pr(>|t|)
(Intercept)       70.3499197  0.2569701 273.7669143 0.0000000
Tumor_Growth_Rate -0.1409965  0.1460852  -0.9651663 0.3344732

Non-Parametric Model


Regression Data: 20000 training points, in 1 variable(s)
              Tumor_Growth_Rate
Bandwidth(s):         0.8587004

Kernel Regression Estimator: Local-Constant
Bandwidth Type: Fixed
Residual standard error: 17.26969
R-squared: 0.0001241382

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

We also can analyze the relationship between the tumor’s growth rate and the patient’s survival rate by fitting one linear and one non-parametric model to the data with these variables in mind.

With the linear model, we can see from the p-value of the slope coefficient of the tumor growth rate variable that there is no statistically significant linear correlation between the tumor growth rate and estimated survival rate, further supporting our conclusion from the plots.
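As a rough check on the coefficient table, the t value is the estimate divided by its standard error, and with about 20,000 observations the t distribution is very close to a standard normal. The sketch below (Python, standard library only) recomputes both quantities from the printed table. The normal approximation is our assumption for illustration, not the exact computation R performs.

```python
from statistics import NormalDist

# Values copied from the coefficient table for Tumor_Growth_Rate.
estimate = -0.1409965
std_error = 0.1460852

# t statistic = estimate / standard error.
t_value = estimate / std_error

# Two-sided p-value, approximating the t distribution by a standard normal
# (an assumption; reasonable here because the degrees of freedom are ~20,000).
p_value = 2 * (1 - NormalDist().cdf(abs(t_value)))

print(round(t_value, 4), round(p_value, 3))
```

The recomputed values should agree with the t value and Pr(>|t|) columns to within rounding.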

The \(R^2\) value from the non-parametric model is nearly zero, meaning tumor growth rate explains nearly none of the variance of estimated survival rate.

T-Test


    Welch Two Sample t-test

data:  Survival_Rate by Tumor_Type
t = 0.38225, df = 19998, p-value = 0.7023
alternative hypothesis: true difference in means between group Benign and group Malignant is not equal to 0
95 percent confidence interval:
 -0.3853842  0.5721153
sample estimates:
   mean in group Benign mean in group Malignant 
               70.17852                70.08516

We also ran a two-sample t-test comparing survival rate between tumor types and found a p-value of 0.70, which is much higher than any reasonable alpha value. Thus, we fail to reject the null hypothesis and conclude that there is no significant difference in average survival rate between benign and malignant tumors.
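For readers unfamiliar with the Welch test, the statistic and its degrees of freedom can be sketched in a few lines. The function below is a minimal illustration in Python using only the standard library. The two toy samples are hypothetical and are not drawn from the tumor dataset.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Per-group squared standard errors of the mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical toy samples, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, welch_df = welch_t(group_a, group_b)
print(t_stat, welch_df)
```

Unlike the pooled-variance t-test, this version does not assume the two groups share a common variance, which is why the degrees of freedom are estimated rather than fixed at n1 + n2 - 2.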

Both models’ summary results and the t-test support our conclusion from the visualizations: there is no significant correlation between tumor growth rate and estimated survival rate, with or without conditioning on tumor type, and no significant difference in estimated survival between tumor types. We can therefore draw only null conclusions about the impact of a tumor’s characteristics, specifically those relating to its aggressiveness, on a patient’s estimated survival rate.

Section 2

For our next research question, we wanted to examine the relationship between the different types of treatment (Radiation, Chemotherapy, and Surgery) and its effectiveness (which was determined by survival rate) compared to the location of the tumor. Thus, we decided to focus on the following variables: Location (Temporal, Parietal, Occipital, Frontal), Radiation_Treatment (Yes/No), Chemotherapy (Yes/No), Surgery_Performed (Yes/No), and Survival_Rate.

Heatmap

We first grouped the 3 treatment types by using the mutate() function to add a new column, Treatment_Combo, which pastes the three Yes/No fields together into one of the 8 possible combinations of the 3 treatments. From there, we computed the mean survival rate for each location and treatment combination and created a heatmap in which each cell’s color encodes that group’s mean survival rate (blue ≈ 68% up to red ≈ 71%).
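The grouping step can be sketched outside of R as well. The snippet below is a minimal Python illustration of pasting the three Yes/No fields into a combined label and averaging survival rate per (location, combination) cell. The column names follow the report, but the rows and the `_` separator are hypothetical choices for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; values are made up for illustration.
rows = [
    {"Location": "Temporal", "Radiation_Treatment": "Yes", "Chemotherapy": "No",
     "Surgery_Performed": "Yes", "Survival_Rate": 72.0},
    {"Location": "Temporal", "Radiation_Treatment": "Yes", "Chemotherapy": "No",
     "Surgery_Performed": "Yes", "Survival_Rate": 70.0},
    {"Location": "Frontal", "Radiation_Treatment": "No", "Chemotherapy": "No",
     "Surgery_Performed": "No", "Survival_Rate": 69.0},
]

def treatment_combo(row):
    """Paste the three Yes/No fields into one label, e.g. 'Yes_No_Yes'."""
    fields = ("Radiation_Treatment", "Chemotherapy", "Surgery_Performed")
    return "_".join(row[f] for f in fields)

# Collect survival rates per heatmap cell, then average each cell.
cells = defaultdict(list)
for row in rows:
    cells[(row["Location"], treatment_combo(row))].append(row["Survival_Rate"])

cell_means = {key: mean(values) for key, values in cells.items()}
print(cell_means)
```

Each key of `cell_means` corresponds to one cell of the heatmap, and its value is the quantity mapped to color.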

The brightest red cell appears at temporal tumors treated with both radiation and surgery but no chemo, indicating the highest average survival (~71%). One possible explanation is that temporal lobe tumors are more surgically accessible, making surgery plus radiation highly effective. Chemotherapy, which often struggles to cross the blood-brain barrier, may add limited benefit and more side effects, which could explain why the no-chemo combination yields the highest survival in this group.

In contrast, combinations without surgery or without radiation fall into purpler/bluer tones, with mean survival around 69–70%. A plausible reason is that they leave more of the tumor behind. Surgery and radiation are the most direct, aggressive treatments for brain tumors: surgery removes the bulk of the tumor, and radiation targets what’s left. Without one or both, the cancer has more opportunity to grow or return, which likely contributes to the lower survival rates seen in those treatment groups.

To further test the validity of our findings, we also ran a one-sided Welch two-sample t-test to check whether the difference is statistically significant for patients with temporal tumors receiving radiation and surgery (the cell in the top right corner of the graph) compared with all other patients.


    Welch Two Sample t-test

data:  target and others
t = 1.6069, df = 656.49, p-value = 0.05427
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.02785864         Inf
sample estimates:
mean of x mean of y 
 71.21017  70.09743

The p-value (0.054) is slightly above 0.05, so we fail to reject the null hypothesis at the 5% level: patients with temporal tumors receiving radiation + surgery (no chemo) do not have a significantly higher mean survival rate than other treatment/location combinations.

Trendlines

So, the heatmap helped us look at what treatments were particularly effective given the location of the tumor. However, another question that has still not been answered within the scope of our research question is: Do more treatments lead to a higher survival rate?

To highlight and identify any potential trends, we decided to use a simple line‐and‐point plot of the mean survival against number of treatments (0–3).

A few patterns stand out in the trend lines. First, temporal-lobe tumors show a steady, almost linear gain in mean survival, from about 69.3% with no treatment up to 71.0% when all three treatments are administered, which suggests that each additional therapy adds some benefit. Occipital-lobe tumors, on the other hand, start with a higher mean survival rate (~70.5%), stay roughly constant at ~70.4% through two treatments, and then drop to ~69.4% at three treatments, suggesting diminishing or even negative returns when all treatments are combined. Lastly, the other two lobes (frontal and parietal, shown in grey) exhibit much smaller upward shifts, with means rising only around 0.5–1 percentage point across the full treatment range. Visually, then, more treatments generally appear to raise survival, but the size and consistency of that benefit vary by location: temporal tumors gain the most from a fully combined regimen, whereas occipital tumors may fare best with no more than two treatments.

Again, to further test the validity of our findings, we ran a two-way ANOVA to test whether mean survival differs by number of treatments, by location, or by their interaction.

                    Df  Sum Sq Mean Sq F value Pr(>F)
n_treat              3     751   250.4   0.839  0.472
Location             3     223    74.3   0.249  0.862
n_treat:Location     9    2933   325.9   1.093  0.364
Residuals        19984 5961429   298.3

Because all p-values exceeded 0.05, we fail to reject the null hypotheses of no differences in mean survival across treatment counts, no differences across locations, and no treatment-by-location interaction. While our plots suggested a visual upward trend with more treatments, especially for temporal tumors, the variability in survival rates is too large for those trends to be statistically distinguishable. We’ll need either more data, additional covariates (e.g., age, tumor grade), or a different modeling approach to uncover any true treatment effects.
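The F values in the ANOVA table follow directly from its other columns: each mean square is the sum of squares divided by its degrees of freedom, and each F value is that mean square divided by the residual mean square. The short Python check below recomputes them from the printed (rounded) sums of squares, so small discrepancies in the last digit are expected.

```python
# Df and Sum Sq copied from the ANOVA table (Sum Sq is rounded in the printout).
terms = {
    "n_treat": (3, 751),
    "Location": (3, 223),
    "n_treat:Location": (9, 2933),
}
resid_df, resid_ss = 19984, 5961429

# Residual mean square: the noise level each effect is compared against.
ms_resid = resid_ss / resid_df

# F = (Sum Sq / Df) / residual mean square, for each term.
f_values = {name: (ss / df) / ms_resid for name, (df, ss) in terms.items()}

for name, f in f_values.items():
    print(name, round(f, 3))
```

All three ratios are close to or below 1, meaning the between-group variation is about the size one would expect from noise alone.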

Section 3

Continuing our investigation into the effectiveness of the 3 treatments (Radiation, Chemotherapy, Surgery), our next research question focused on these treatments’ effects on Tumor Growth Rate, given any Stage of the tumor. Since tumor growth rates differ between stages, knowing the relationship between these treatments and/or a combination of the three with growth rates can help with developing treatment plans for tumors, depending on the stage. The variables of interest in this question are: Radiation_Treatment (Yes/No), Surgery_Performed (Yes/No), Chemotherapy (Yes/No), Stage (I, II, III, IV), and Tumor_Growth_Rate (cm/month).

Density Plots

We first wanted to determine whether there were any noticeably different tumor growth rates for a treatment combination, given a particular stage. Using the same 8 treatment combinations, we created density plots of the distribution of tumor growth rate for each treatment combination, faceted by stage.

At first glance, all of the distributions, regardless of treatment or stage, look fairly similar, being multimodal and symmetric. To confirm our observation, we used a simplified comparison procedure and chose 2 densities from each stage to run KS tests on. These 2 densities were chosen based on the range of density values between them: for each stage, the first chosen distribution had the highest density value in the plot, and the second had the lowest. Because the gap between these two values is the largest, we deemed these 2 densities the “most different”. As such, if the KS test did not find a significant difference between these 2 distributions, we could reasonably assume that KS tests between any other pair of distributions in that stage would lead to the same conclusion, namely that the distributions are similar to each other.
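To make the KS comparison concrete, the sketch below computes the two-sample KS statistic (the largest vertical gap between two empirical CDFs) and an asymptotic two-sided p-value from the Kolmogorov series. It is a minimal Python illustration using the standard library. The function names and toy samples are hypothetical, and the asymptotic formula is only a rough guide for samples as small as the toy ones here.

```python
import math

def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    d = 0.0
    for x in list(a) + list(b):
        fa = sum(v <= x for v in a) / len(a)
        fb = sum(v <= x for v in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

def ks_asymptotic_p(d, n, m, terms=100):
    """Two-sided p-value from the limiting Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0:
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * series))

# Hypothetical toy growth rates, not taken from the dataset.
sample_a = [1, 2, 3, 4]
sample_b = [3, 4, 5, 6]
d_stat = ks_statistic(sample_a, sample_b)
p_approx = ks_asymptotic_p(d_stat, len(sample_a), len(sample_b))
print(d_stat, round(p_approx, 4))
```

A small D (as in the outputs for each stage, all around 0.06) means the two empirical CDFs never stray far from each other anywhere along the growth-rate axis.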

Stage I (none vs. radiation + surgery):


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  s1_none$Tumor_Growth_Rate and s1_rs$Tumor_Growth_Rate
D = 0.055997, p-value = 0.29
alternative hypothesis: two-sided

Stage II (surgery + chemo vs. chemo):


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  s2_sc$Tumor_Growth_Rate and s2_chemo$Tumor_Growth_Rate
D = 0.059077, p-value = 0.2178
alternative hypothesis: two-sided

Stage III (none vs. surgery + chemo):


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  s3_none$Tumor_Growth_Rate and s3_sc$Tumor_Growth_Rate
D = 0.055285, p-value = 0.3042
alternative hypothesis: two-sided

Stage IV (radiation + surgery + chemo vs. radiation):


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  s4_rsc$Tumor_Growth_Rate and s4_radiation$Tumor_Growth_Rate
D = 0.060741, p-value = 0.1867
alternative hypothesis: two-sided

To account for multiple testing, we used a conservative Bonferroni correction for the 28 pairwise comparisons possible among the 8 treatment combinations within each stage. A KS test p-value would need to fall below 0.05/28 (≈ 0.0018) for us to reject the null hypothesis. Since the p-values from our KS tests (0.29, 0.2178, 0.3042, 0.1867) are all far above this threshold, and above 0.05 even without correction, we fail to reject the null hypothesis in every case, meaning that the distributions compared in each KS test are not significantly different from each other.
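The Bonferroni arithmetic can be verified in a few lines of Python: 8 treatment combinations give C(8, 2) = 28 unordered pairs, and dividing 0.05 by 28 gives the per-test threshold. The p-values are copied from the four KS outputs above.

```python
from math import comb

n_combos = 8                      # treatment combinations per stage
n_pairs = comb(n_combos, 2)       # number of pairwise comparisons
alpha = 0.05
threshold = alpha / n_pairs       # Bonferroni-corrected per-test level

# p-values reported by the four KS tests.
p_values = {"Stage I": 0.29, "Stage II": 0.2178,
            "Stage III": 0.3042, "Stage IV": 0.1867}

# True would mean "reject the null hypothesis" for that stage.
rejected = {stage: p < threshold for stage, p in p_values.items()}
print(n_pairs, round(threshold, 4), rejected)
```

None of the four p-values falls below the corrected threshold, so none of the null hypotheses is rejected.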

From our density plots and KS tests, we see that in our tumor dataset, the distributions of tumor growth rate conditioned on treatment do not significantly differ from each other; this is apparent for each stage as well. This indicates that for our dataset, the treatments may not significantly impact tumor growth rate.

Correlogram

To further support the conclusion above, we decided to create a correlogram to observe whether there were any correlations between Radiation_Treatment, Surgery_Performed, Chemotherapy, Tumor_Growth_Rate, and Stage. This could help us observe which treatments are likely to be paired with each other, and which treatments are most correlated with tumor growth rate and stage.

All of the correlation coefficients are about 0, if not actually 0. This indicates that these 5 variables are virtually uncorrelated, and that if we were to try to determine a relationship between any 2 of these variables, we would not be able to see a linear relationship.
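A correlogram over mixed variable types requires encoding the categorical columns numerically first. The report does not state which encoding was used, so the sketch below assumes Yes/No mapped to 1/0 and Stage I–IV mapped to 1–4, and computes the Pearson coefficient by hand in Python. The toy data are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Assumed encodings (not confirmed by the report).
yes_no = {"No": 0, "Yes": 1}
stage_rank = {"I": 1, "II": 2, "III": 3, "IV": 4}

# Hypothetical toy columns, for illustration only.
radiation = ["Yes", "No", "Yes", "No"]
stage = ["I", "II", "III", "IV"]
growth = [1.0, 2.0, 3.0, 4.0]

r_stage_growth = pearson([stage_rank[s] for s in stage], growth)
r_rad_growth = pearson([yes_no[v] for v in radiation], growth)
print(round(r_stage_growth, 3), round(r_rad_growth, 3))
```

Note that a coefficient near 0 only rules out a linear association under the chosen encoding. It does not by itself rule out non-linear relationships.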

To sum up, from the density plots, we were unable to identify any particularly different tumor growth rate distributions given treatment, for any stage. From the correlogram, we also saw no linear relationship between any of the 3 treatments, stage, and tumor growth rate. None of these variables can really explain variance in the others, and this is reflected in the similar densities. As a result, our dataset cannot tell us anything conclusive about whether the 3 treatments have any relationship with tumor growth rate or stage.

Section 4

Violin Plot

The main question we aim to answer in this section is how radiation treatment specifically affects tumor size, and if the tumor type affects the effect of radiation treatment. Our prior understanding is that different tumor types, benign vs malignant, will have different reactions to just radiation treatment. We believe that benign tumors should be more responsive to treatments than malignant tumors, and that in general they will be smaller after treatment. We are interested in radiation treatment since we understand that, besides chemotherapy, it is the most commonly used treatment for tumors, so we are interested if it is statistically effective on its own.

The above graph plots the distribution of tumor sizes with or without radiation treatment, faceted on tumor type. Visually, the distributions of tumor size with and without radiation treatment look nearly identical, for both benign and malignant tumors. This was an interesting finding, as it is not the result we anticipated. The KS tests below support it: neither p-value falls below 0.05, so the distributions are not significantly different.


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  yes_rad_malig$Tumor_Size and no_rad_malig$Tumor_Size
D = 0.024322, p-value = 0.103
alternative hypothesis: two-sided


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  yes_rad_benign$Tumor_Size and no_rad_benign$Tumor_Size
D = 0.017572, p-value = 0.4249
alternative hypothesis: two-sided

We were curious whether faceting on tumor type imposed too strong a prior, and whether there would be a statistical difference once tumor type was ignored. We decided to create another plot and perform another statistical test to investigate this hypothesis.

The above graph, alongside the KS test, indicates that the distributions of tumor size with and without radiation treatment are not significantly different. This is without faceting on tumor type, which suggests that, in this dataset, radiation treatment on its own is not associated with smaller tumor sizes. This conclusion makes sense given the context of cancer treatment in general, but it was important to verify for ourselves with the plots and KS test, as seen below.


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  yes_radiation$Tumor_Size and no_radiation$Tumor_Size
D = 0.018564, p-value = 0.06372
alternative hypothesis: two-sided

LOESS Regression

Another question we were interested in investigating was how age is correlated with survival rate under different treatments. Treatments such as chemotherapy are effective, but they may be riskier for older patients because they affect the immune system more. On the other hand, radiation and surgery should be safer for older patients, but may be less effective on their own, as we already saw for radiation treatment. To investigate this, we looked at the Age, Survival_Rate, Radiation_Treatment, Surgery_Performed and Chemotherapy variables, using patients who received none of these treatments as a baseline.

We use LOESS regression to plot the 4 graphs of interest, showing survival rate versus age for each treatment type. The treatment types were treated as mutually exclusive, so each patient in the plot received exactly one treatment or none. We chose this graphic because we did not know beforehand what type of relationship our variables would have, and LOESS provides a flexible way not only to show the relationships but also to motivate later tests for meaningful interactions between variables. The plot reveals interesting trends in how survival rate is influenced by both age and treatment. We see an intuitive result, which is that without treatment, the survival rate decreases with age to a minimum of roughly 68%. There is a slight bump around ages 50 to 65, where the survival rate seems to increase without treatment, but this is more likely than not a byproduct of the synthetic nature of the dataset. We see relatively similar results for Chemotherapy and Radiation Treatment, where the survival rate roughly decreases as age increases, with some occasional increases due to the nature of the data. However, Surgery does not follow this trend: as age increases, survival rate increases if Surgery is the treatment used. This was an interesting result that we wanted to look into further with a statistical test.
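The idea behind LOESS can be sketched as a weighted straight-line fit repeated at each evaluation point, with nearby observations weighted more heavily. The Python function below is a simplified illustration (local linear fit, tricube weights) and is not identical to R's `loess`, which by default fits local quadratics. The function names and toy data are hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel: positive inside (-1, 1), zero outside."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess_point(x0, xs, ys, span=0.75):
    """Local linear estimate at x0 using roughly span * n nearest points."""
    k = max(2, math.ceil(span * len(xs)))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    if sw == 0:
        raise ValueError("no points received positive weight")
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate case: fall back to a weighted mean
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw + slope * x0

# Hypothetical toy data: survival falls linearly with age.
ages = [20, 30, 40, 50, 60, 70, 80]
survival = [71 - 0.03 * a for a in ages]
fitted_at_50 = loess_point(50, ages, survival)
print(round(fitted_at_50, 3))
```

Because the toy data lie exactly on a line, the local fit reproduces that line. On noisy data, the span controls how much local wiggle (such as the bump described around ages 50 to 65) survives the smoothing.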


Call:
lm(formula = Survival_Rate ~ Age * Treatment_Group, data = tumor_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.826 -14.816   0.119  14.661  30.855 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     71.56212    1.06520  67.182   <2e-16 ***
Age                             -0.03443    0.02020  -1.705   0.0883 .  
Treatment_GroupRadiation        -2.03242    1.48860  -1.365   0.1722    
Treatment_GroupChemotherapy     -1.51554    1.48202  -1.023   0.3065    
Treatment_GroupSurgery          -2.07167    1.48054  -1.399   0.1618    
Age:Treatment_GroupRadiation     0.04008    0.02834   1.415   0.1572    
Age:Treatment_GroupChemotherapy  0.02687    0.02826   0.951   0.3417    
Age:Treatment_GroupSurgery       0.05496    0.02815   1.952   0.0509 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.21 on 9947 degrees of freedom
Multiple R-squared:  0.0007792, Adjusted R-squared:  7.605e-05 
F-statistic: 1.108 on 7 and 9947 DF,  p-value: 0.3546

This output unfortunately shows that our results are not statistically significant. We see that Age with Surgery has an almost statistically significant interaction term, but at \(\alpha = 0.05\), we cannot reject the null and claim statistical significance when the p-value is only 0.0509. The rest of the results were not close to statistical significance, but indicated somewhat intuitive results, such as all treatment groups helping older patients with their survival rates and that as age increases, survival rate decreases slightly (not statistically significantly).
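In a model with an Age × Treatment_Group interaction, the age slope for each treatment group is the baseline Age coefficient plus that group's interaction coefficient. The Python snippet below adds them up from the printed estimates. The label "None" for the baseline group is our naming, based on the report's description of untreated patients as the baseline.

```python
# Estimates copied from the regression output.
age_coef = -0.03443  # age slope for the baseline (no-treatment) group
interactions = {
    "Radiation": 0.04008,
    "Chemotherapy": 0.02687,
    "Surgery": 0.05496,
}

# Implied change in survival rate (percentage points) per year of age.
slopes = {"None": age_coef}
slopes.update({group: age_coef + coef for group, coef in interactions.items()})

for group, slope in slopes.items():
    print(group, round(slope, 5))
```

The implied slopes are all within a few hundredths of a percentage point per year, which is consistent with none of the interaction terms reaching significance.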

In hindsight, this makes sense. The survival rates were all in a small interval from 68% to 73%. As a result, though we observed these trends over a wide array of ages, we should have been more skeptical if our plot showed genuine statistical significance. These results ultimately correct our prior understanding of the situation, which was the belief that radiation treatment and chemotherapy would have a negative correlation with survival rate as age increased, due to how aggressive these treatments are. Ultimately, even for this synthetic dataset, we are confronted with the conclusion that cancer is an extremely multifaceted subject, and even with the breadth of statistical tools at our disposal, interactions are not as simple as we originally believed.

Conclusions and Future Works

In this work, we explored various questions regarding trends in patients with brain tumors. In Section 1, we investigated how tumor characteristics (tumor type and growth rate) affect survival rate, and used statistical tests to check whether tumor growth rate was correlated with survival rate. Ultimately, this section indicated that tumor characteristics such as type and growth rate have no individual correlation with survival rate. Section 2 focused on how effective different treatment types were depending on the location of the tumor in the brain. We saw that temporal tumors treated with radiation plus surgery seemed to have the highest survival rate, but this was not a statistically significant result. We also tested whether the number of treatments affected survival rate significantly, which it did not. Section 3 asked whether there were any noticeably different tumor growth rates for a treatment combination, given a particular stage, which there were not. We then checked whether there was any relationship between treatment, tumor growth rate, and stage, which there was not. Finally, Section 4 examined whether radiation treatment affected the distribution of tumor sizes, both in aggregate and conditioned on tumor type. There was no statistically significant difference. We also looked into the relationship between age, treatment type, and survival rate, but found no statistically significant relationship.

All in all, our work on this dataset bore an important conclusion, which is that brain tumor data and correlations are difficult to make out with the statistical visualizations and tests we used. The issue is complex and multifaceted, especially with such a large dataset. This prior understanding, coupled with the fact that this is synthetic data intended for training Machine Learning Models, helps us understand why we turned up so many negative results. Future work could be dedicated to having people with more field expertise help guide our visualizations and search for correlations. Ideally, this future work will also focus on more techniques for visualizing high dimensional data and corresponding correlations in high dimensions.