36-315 Final Project: An Analysis of the 1994 US Census Income Data
Dataset
Our dataset is provided by the UC Irvine Machine Learning Repository. It consists of a clean subset of historical demographic and income data from the 1994 US Census. Incomes are bucketed into less than or greater than $50,000 in the income variable.
The quantitative variables are:
- age: Age of respondent in years
- education_num: Variable loosely associated with respondent’s number of years of education
- capital_gain: Respondent capital gain
- capital_loss: Respondent capital loss
- hours_per_week: Hours per week worked by the respondent
The qualitative variables are:
- workclass: Sector of work respondent is in (e.g., public)
- education: Education level of respondent (e.g., Bachelors)
- marital_status: Respondent’s marital status (e.g., Divorced)
- occupation: Respondent’s occupation / field (e.g., Sales)
- relationship: Respondent’s status in the relationship (e.g., Wife)
- race: Respondent’s race (e.g., White)
- sex: Respondent’s sex (e.g., Male)
- native_country: Respondent’s native country (e.g., United States of America)
Each row of the dataset is one individual with at least some of these variables populated (there are some missing values).
Research Questions
We are interested in the following questions:
- What quantitative factors contribute the most to predicting income buckets?
- Is there a gender gap? If so, what factors affect it?
- How does native country affect income distribution?
RQ 1: What quantitative factors contribute the most to predicting income buckets?
We are interested in which factors place individuals in different income brackets. We start with the quantitative variables and conduct a principal component analysis (PCA) to see whether there are axes within the data that cleanly split it into the income buckets.
Variable | PC1 | PC2 | PC3 |
---|---|---|---|
age | -0.361 | 0.080 | -0.868 |
education.num | -0.565 | -0.011 | 0.423 |
capital.gain | -0.439 | -0.591 | -0.114 |
capital.loss | -0.273 | 0.802 | -0.002 |
hours.per.week | -0.532 | 0.034 | 0.236 |
From the PCA plot, we can see that there is no clear split between the income brackets that can be attributed directly to the quantitative variables. This suggests that the quantitative variables alone are not well suited to distinguishing the income buckets.
However, the principal components’ rotation matrix still offers some insight. Years of education has the largest loading on the first component (-0.565), and it is negative, so more education pushes a respondent toward lower values of the first component; correspondingly, we tend to see lower incomes as we move along the x-axis. On the second principal component, the largest negative loading belongs to capital gains (-0.591), and we see a large group of blue points towards the bottom of that axis. Additionally, the first two components explain only 46.5% of the variance in the data, so there are patterns in the income buckets that cannot be explained by this view of the quantitative variables alone.
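To make the reading of the rotation matrix concrete, the following minimal Python sketch computes a respondent's score on a component as the dot product of their standardized values with that component's loadings. The loadings are copied from the table above; the respondent is hypothetical, and the sketch assumes the variables were centered and scaled before the PCA, which the report does not state.

```python
import math

# Loadings copied from the rotation matrix above (first two components).
LOADINGS = {
    "PC1": {"age": -0.361, "education.num": -0.565, "capital.gain": -0.439,
            "capital.loss": -0.273, "hours.per.week": -0.532},
    "PC2": {"age": 0.080, "education.num": -0.011, "capital.gain": -0.591,
            "capital.loss": 0.802, "hours.per.week": 0.034},
}


def pc_score(standardized, component):
    """Dot product of standardized values with one component's loadings."""
    weights = LOADINGS[component]
    return sum(weights[name] * standardized[name] for name in weights)


def loading_norm(component):
    """Euclidean length of a loading vector (close to 1 for a PCA rotation)."""
    return math.sqrt(sum(w * w for w in LOADINGS[component].values()))


# Hypothetical respondent (an assumption for illustration): average on every
# variable except education, which is two standard deviations above the mean.
respondent = {"age": 0.0, "education.num": 2.0, "capital.gain": 0.0,
              "capital.loss": 0.0, "hours.per.week": 0.0}

pc1 = pc_score(respondent, "PC1")  # 2.0 * -0.565
pc2 = pc_score(respondent, "PC2")  # 2.0 * -0.011
print(f"PC1 = {pc1:.3f}, PC2 = {pc2:.3f}")
```

In this sketch, extra education moves the respondent noticeably along the first component and hardly at all along the second, which matches the relative sizes of the two education.num loadings.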
To further assess how well the quantitative variables separate the income buckets, we built a dendrogram.
This graph shows a dendrogram generated from the 2D PCA data via complete linkage. There does not seem to be any rhyme or reason to the plot; that is, there are no clear branches of incomes above or below $50,000. This further suggests that the quantitative variables do not do a good job of characterizing the incomes of workers in 1994.
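For readers unfamiliar with complete linkage, here is a minimal Python sketch of the idea: the distance between two clusters is the largest pairwise distance between their members, and the two closest clusters are merged repeatedly. The function names and the four points are hypothetical, and this naive loop is for illustration only; it is not the code behind the dendrogram above, and a real analysis would rely on a library routine.

```python
import math
from itertools import combinations


def complete_linkage_distance(cluster_a, cluster_b):
    """Complete linkage: the largest pairwise distance between two clusters."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage_merges(points):
    """Merge clusters bottom-up; return a list of (left, right, height)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: complete_linkage_distance(
                clusters[pair[0]], clusters[pair[1]]
            ),
        )
        height = complete_linkage_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical (PC1, PC2) scores forming two tight pairs far apart.
scores = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
merges = complete_linkage_merges(scores)
heights = [height for _, _, height in merges]
print(heights)
```

The merge heights are what a dendrogram draws on its vertical axis. If income buckets were well separated in the PCA plane, we would expect the last, tallest merges to join branches dominated by a single bucket.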
RQ 2: Is there a gender gap? If so, what factors affect it?
Next, we wanted to test for a gender gap. We first created a faceted bar plot on gender to explore how relationship status and race can affect wages.
We can see from the graph that relationship status does affect the size of the gender gap, as some of the distributions differ noticeably between men and women. This strongly suggests the existence of a wage gap for the largest groups (not-in-family, unmarried, and married). However, the differing proportions across the groups imply that the size of this gap changes with relationship status.
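The comparison behind such a faceted bar plot can be sketched as a grouped proportion: within each relationship status, the share of men and of women with incomes above $50,000. The Python sketch below uses a handful of hypothetical rows, and the income labels ">50K" and "<=50K" are assumed encodings rather than something stated in this report.

```python
from collections import defaultdict


def high_income_share(records, keys):
    """Share of rows with income '>50K' within each combination of `keys`."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for row in records:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        if row["income"] == ">50K":  # assumed label for the upper bucket
            high[group] += 1
    return {group: high[group] / totals[group] for group in totals}


# Hypothetical rows, for illustration only (not taken from the dataset).
records = [
    {"sex": "Male", "relationship": "Unmarried", "income": ">50K"},
    {"sex": "Male", "relationship": "Unmarried", "income": "<=50K"},
    {"sex": "Female", "relationship": "Unmarried", "income": "<=50K"},
    {"sex": "Female", "relationship": "Unmarried", "income": "<=50K"},
    {"sex": "Female", "relationship": "Unmarried", "income": ">50K"},
    {"sex": "Female", "relationship": "Unmarried", "income": "<=50K"},
]

shares = high_income_share(records, ["relationship", "sex"])
gap = shares[("Unmarried", "Male")] - shares[("Unmarried", "Female")]
print(shares, gap)
```

Computing the gap separately per relationship status is what lets us see whether its size changes from one facet to the next.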
To explore a different dimension, we looked at how race and sex interacted to determine income. We created the mosaic plot below:
From the mosaic plot, it seems that race does affect the gender (wage) gap, as there are many Pearson residuals with large magnitudes (> 4). White males are over-represented among individuals earning above $50,000, while Black females are over-represented among those earning below $50,000. These patterns suggest economic disparities linked to race and gender.
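As a reminder of what the shading in a mosaic plot measures, a Pearson residual compares an observed cell count with the count expected under independence: (observed - expected) / sqrt(expected). The Python sketch below computes these for a two-way table; the counts are hypothetical, and treating each race and sex combination as one row is a simplification of the three-way plot.

```python
import math


def pearson_residuals(table):
    """Pearson residuals of a two-way table under row/column independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            # Expected count if group membership and income were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            out.append((observed - expected) / math.sqrt(expected))
        residuals.append(out)
    return residuals


# Hypothetical counts: rows are two groups, columns are
# (at or below $50,000, above $50,000). Not taken from the dataset.
counts = [
    [60, 40],
    [90, 10],
]
residuals = pearson_residuals(counts)
print(residuals)
```

A positive residual means a cell holds more people than independence would predict (over-representation) and a negative one means fewer, so large magnitudes flag the cells that drive the association.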
RQ 3: How does native country affect income distribution?
To determine how one’s native country might affect their income, we first created a choropleth map showing the proportion of incomes above $50,000 by native country.
The choropleth map suggests that most continents / geographical regions are a mixed bag in terms of the proportion of high incomes. Both Europe and Asia contain countries with high and low proportions, and there is little to no representation from Africa and the Middle East. The only consistently colored region is Central / South America, with very low proportions of incomes above $50,000. This is surprising because we understand much of the immigration from this region to be economically motivated, yet these respondents still earn significantly less than most of their counterparts from around the world. Perhaps this goes to show how dire the situation was (in 1994), such that their incomes, despite being lower than $50,000, are still significantly higher than in their native countries.
However, there are many other factors that could influence income, not least of which is education. To see how education and income interact with respect to native country, we created a scatterplot with notable countries labeled (i.e., countries with at least 70 reported incomes).
Unsurprisingly, we see a strong positive correlation between education score (higher is better) and the proportion of incomes above $50,000. The Central / South American countries sit in the bottom-left quadrant of the scatterplot, suggesting that education levels could explain much of the pattern we identified.
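The country-level summary behind a scatterplot like this can be sketched in a few steps: group respondents by native country, drop countries with too few rows, then compute each country's education score and share of incomes above $50,000. The Python sketch below assumes the education score is the mean of education_num and that the upper bucket is labeled ">50K"; both are assumptions, and the rows and country names are hypothetical.

```python
import math
from collections import defaultdict


def summarize_by_country(records, min_count):
    """Map country -> (mean education_num, share of '>50K') for large groups."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["native_country"]].append(row)
    summary = {}
    for country, rows in grouped.items():
        if len(rows) < min_count:
            continue  # too few reported incomes to be reliable
        mean_edu = sum(r["education_num"] for r in rows) / len(rows)
        share = sum(r["income"] == ">50K" for r in rows) / len(rows)
        summary[country] = (mean_edu, share)
    return summary


def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def make_row(country, education_num, income):
    return {"native_country": country, "education_num": education_num,
            "income": income}


# Hypothetical rows for placeholder countries A to D (not real data).
records = [
    make_row("A", 8, "<=50K"), make_row("A", 10, "<=50K"),
    make_row("B", 10, "<=50K"), make_row("B", 12, ">50K"),
    make_row("C", 13, ">50K"), make_row("C", 15, ">50K"),
    make_row("D", 16, ">50K"),  # single row, dropped by the filter below
]

summary = summarize_by_country(records, min_count=2)
r = pearson_correlation([v[0] for v in summary.values()],
                        [v[1] for v in summary.values()])
print(summary, round(r, 3))
```

With the real data, min_count would be 70 to match the labeling threshold described above. Note that a correlation across country-level averages says nothing directly about individuals within a country.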
Conclusion
In this project, we examined the 1994 US Census Income dataset to explore how quantitative traits, gender, and native country relate to income disparities. Our analysis showed that while variables like education and capital gains influence income, they do not distinctly separate individuals into income brackets on their own. Gender gaps were evident across most relationship types, with additional disparities emerging along racial lines, particularly disadvantaging Black women. Country of origin also played a significant role, with immigrants from Central and South America earning lower incomes on average, a trend that appears tied to lower education levels. Together, these findings underscore how income inequality in 1994 reflected intersecting demographic and geographic factors, suggesting both individual and systemic influences on economic outcomes. Future work could examine how the effects of these factors on income within the United States have changed over time, for example by using more recent census reports. Another route would be a deeper analysis using continuous income data instead of data bucketed around $50,000.