36-315 Final Project: Airline Passenger Satisfaction

Introduction

Aviation is an essential industry, but many people find flying frustrating. To help airlines improve their service, we examine flight satisfaction survey data from Kaggle. This data set has nearly 130,000 individual responses for 23 variables, including 14 satisfaction ratings for different components of the flying experience and an overall rating of satisfaction. There are also delay times (both arrival and departure), flight distance, customer loyalty, type of travel (business or personal), class of ticket (economy, economy plus or business) and demographic information (age and gender).

Description of Data Set

The Airline Passenger Satisfaction dataset is a comprehensive collection of customer feedback from passengers. The dataset contains information on various aspects of the passengers’ travel experience, such as flight distance, gender, age, type of travel, class, seat comfort, inflight entertainment, onboard service, cleanliness, departure delay, arrival delay, and overall satisfaction. This dataset aims to provide insights into the factors that contribute to passenger satisfaction and dissatisfaction, which can be used by airlines to improve their services and enhance their customers’ travel experience. It contains the following columns:

Gender: Gender of the passengers (Female, Male)

Customer Type: The customer type (Loyal customer, disloyal customer)

Age: The actual age of the passengers

Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)

Class: Travel class in the plane of the passengers (Business, Economy, Economy Plus)

Flight distance: The flight distance of this journey (miles)

Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)

Departure/Arrival time convenient: Satisfaction level with the convenience of the departure/arrival time

Ease of Online booking: Satisfaction level of online booking

Gate location: Satisfaction level of Gate location

Food and drink: Satisfaction level of Food and drink

Online boarding: Satisfaction level of online boarding

Seat comfort: Satisfaction level of Seat comfort

Inflight entertainment: Satisfaction level of inflight entertainment

On-board service: Satisfaction level of On-board service

Leg room service: Satisfaction level of Leg room service

Baggage handling: Satisfaction level of baggage handling

Check-in service: Satisfaction level of Check-in service

Inflight service: Satisfaction level of inflight service

Cleanliness: Satisfaction level of Cleanliness

Departure Delay in Minutes: Minutes of delay at departure

Arrival Delay in Minutes: Minutes of delay at arrival

Satisfaction: Airline satisfaction level (Satisfaction, neutral or dissatisfaction)

Through our report we aim to answer the following questions:

1. Which features correspond to whether a customer is satisfied overall?

2. How does overall satisfaction compare between loyal and disloyal customers?

3. How is overall satisfaction related to flight distances and age demographics?

4. Do delays impact customer overall satisfaction with the timing of their flights?

5. Are higher class passengers more satisfied than lower class passengers?

Question 1: Which features correspond to whether a customer is satisfied overall?

Since this data set contains 25 variables, we use Principal Component Analysis (PCA), a dimension reduction technique, to identify the most important dimensions in the data, and we group the points by overall satisfaction to examine its relationship with both the components and the variables. PCA is typically applied to quantitative data, so to use it here we must first convert the qualitative variables into quantitative ones. The four qualitative variables were coded as follows: for gender, male individuals are coded as -1 and female individuals are coded as 1; for customer type, disloyal customers are coded as -1 and loyal customers are coded as 1; for type of travel, personal travel is coded as -1 and business travel is coded as 1; for class, economy and economy plus are coded as -1 and business is coded as 1. Lastly, the individual survey ratings are ordinal categorical variables, so their coding is maintained (the original scale is 0-5; once rows with a rating of 0 are dropped, the remaining values are 1-5). Note that we merge the training and testing data sets originally listed on Kaggle, omit rows with missing/NA values, and omit rows with ratings of 0. We also remove the “x” and “ID” columns since they do not provide useful information. After removing overall satisfaction from the data set (for the purposes of this specific research question), 22 features remain, and we perform PCA on these.
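The coding scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the report (whose output appears to come from R): the column names, the category spellings, the two-column rating subset, and the helper name `encode_row` are all hypothetical.

```python
# Hypothetical sketch of the -1/+1 coding and row filtering described above.
# Column names and category spellings are assumptions for illustration.
CODES = {
    "Gender": {"Male": -1, "Female": 1},
    "Customer Type": {"disloyal customer": -1, "Loyal customer": 1},
    "Type of Travel": {"Personal Travel": -1, "Business Travel": 1},
    "Class": {"Economy": -1, "Economy Plus": -1, "Business": 1},
}

# Illustrative subset of the 14 survey rating columns.
RATING_COLUMNS = ["Inflight wifi service", "Seat comfort"]


def encode_row(row):
    """Return a numeric copy of one response, or None if the row is dropped."""
    # Drop rows with any missing value.
    if any(value is None for value in row.values()):
        return None
    # Drop rows where a survey rating is 0 ("Not Applicable").
    if any(row.get(col) == 0 for col in RATING_COLUMNS):
        return None
    encoded = {}
    for column, value in row.items():
        # Qualitative columns are mapped to -1/+1; everything else is kept.
        encoded[column] = CODES[column][value] if column in CODES else value
    return encoded


example = {
    "Gender": "Female",
    "Customer Type": "Loyal customer",
    "Type of Travel": "Business Travel",
    "Class": "Economy Plus",
    "Age": 34,
    "Inflight wifi service": 4,
    "Seat comfort": 5,
}
print(encode_row(example))
```

Note that economy and economy plus collapse to the same code, so the encoded class variable only distinguishes business from non-business tickets.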

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0495 1.5784 1.47725 1.40293 1.34645 1.19141 1.00735
## Proportion of Variance 0.1909 0.1132 0.09919 0.08946 0.08241 0.06452 0.04613
## Cumulative Proportion  0.1909 0.3042 0.40336 0.49282 0.57523 0.63975 0.68587
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.97191 0.96533 0.90154 0.82431 0.71399 0.68603 0.65802
## Proportion of Variance 0.04294 0.04236 0.03694 0.03089 0.02317 0.02139 0.01968
## Cumulative Proportion  0.72881 0.77117 0.80811 0.83900 0.86217 0.88356 0.90324
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.63900 0.59471 0.56849 0.54144 0.52767 0.51071 0.41973
## Proportion of Variance 0.01856 0.01608 0.01469 0.01333 0.01266 0.01186 0.00801
## Cumulative Proportion  0.92180 0.93788 0.95257 0.96589 0.97855 0.99040 0.99841
##                           PC22
## Standard deviation     0.18687
## Proportion of Variance 0.00159
## Cumulative Proportion  1.00000

Based on the elbow plot, it is not clear how many dimensions we should visualize. There is no obvious point at which the plot begins to level off; relative to the other slopes, however, the most gradual decrease seems to start at around the twelfth principal component. That said, the proportion of variance explained by the 8th principal component falls below the dashed horizontal line, which marks the amount of variance that a single variable contributes (1/22, or about 4.5%, for 22 standardized variables). Since the goal of PCA is to reduce the dimensionality of the data and there is no clear elbow point, it seems that we should visualize the first seven principal components to get a better understanding of the data’s variance structure.
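As a quick check of the seven-component cutoff, the proportions of variance can be recomputed from the standard deviations in the table above. The following Python sketch is not the authors' code; it assumes the reference line sits at 1/p for p = 22 variables, which is the share a single standardized variable would contribute.

```python
# Standard deviations copied from the "Importance of components" table.
SDEV = [
    2.0495, 1.5784, 1.47725, 1.40293, 1.34645, 1.19141, 1.00735,
    0.97191, 0.96533, 0.90154, 0.82431, 0.71399, 0.68603, 0.65802,
    0.63900, 0.59471, 0.56849, 0.54144, 0.52767, 0.51071, 0.41973,
    0.18687,
]

# Each component's variance is its squared standard deviation.
variances = [s ** 2 for s in SDEV]
total_variance = sum(variances)
proportions = [v / total_variance for v in variances]

# Assumed reference line: the share of one variable out of p.
threshold = 1 / len(SDEV)
n_above = sum(1 for p in proportions if p > threshold)

# Variance explained by the first two components together.
first_two = proportions[0] + proportions[1]

print("components above 1/p:", n_above)
print("first two components explain:", round(first_two, 3))
```

Only the first seven components exceed the 1/22 line, and the first two together account for roughly 30% of the variance, in agreement with the table.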

Although we would ideally examine the first seven components, for simplicity we will only visualize the first two components (which together explain about 30.4% of the variation) and group the data points by overall satisfaction level to determine if there are any distinct clusters.

From the biplot, we can see that most of the variables (all of them except for gender, departure delay, and arrival delay) point towards the left indicating that passenger reports with a low first principal component tend to have higher values of those variables. Also, the fairly clear distinction between blue points (satisfied passengers) on the left and orange points (neutral or dissatisfied passengers) on the right shows that satisfied customers are strongly associated with a lower first principal component, and thus more closely associated with high values in the variables previously mentioned.

The three variables pointing to the right (gender, departure delay, and arrival delay) indicate that passenger surveys with a higher first principal component tend to have higher values of these variables, which is associated with neutral or dissatisfied passengers. Since gender is coded as -1 for male and 1 for female, this would suggest that female passengers are slightly more associated with dissatisfaction, although the gender loading on the first component (about 0.004) is so close to zero that any such relationship is negligible. Among the variables with negative loadings on the first principal component, four have relatively large loadings on the second principal component: inflight wifi service, ease of online booking, convenience of departure and arrival time, and gate location. In contrast to the clear trend for the first component, however, there is no clear separation between satisfied and neutral/dissatisfied passengers along the second principal component, so the relationship between those four variables and overall satisfaction is not clear.

##                                            PC1           PC2
## Gender                             0.003883067 -0.0004842386
## Customer.Type                     -0.083893852  0.0230591807
## Age                               -0.081945049 -0.0147426199
## Type.of.Travel                    -0.149363861 -0.0757461097
## Class                             -0.238704589 -0.0980880770
## Flight.Distance                   -0.158646926 -0.0768253859
## Inflight.wifi.service             -0.211861303  0.4326703328
## Departure.Arrival.time.convenient -0.060389964  0.4805048964
## Ease.of.Online.booking            -0.141009700  0.5211197369
## Gate.location                     -0.054911345  0.4640398412
## Food.and.drink                    -0.269989781 -0.0941787668
## Online.boarding                   -0.298663390  0.1055810561
## Seat.comfort                      -0.330686841 -0.1062050034
## Inflight.entertainment            -0.394661849 -0.1302093693
## On.board.service                  -0.269529446 -0.0639788809
## Leg.room.service                  -0.223657605 -0.0381212411
## Baggage.handling                  -0.244710012 -0.0529184935
## Checkin.service                   -0.179195921 -0.0273064810
## Inflight.service                  -0.247252549 -0.0563983333
## Cleanliness                       -0.322893899 -0.1113875408
## Departure.Delay.in.Minutes         0.037730621 -0.0059030262
## Arrival.Delay.in.Minutes           0.040685860 -0.0060135717

Since there are many dimensions and the variables closely associated with satisfied passengers are not easy to see, we can use the rotation matrix for more detailed analysis. The magnitude of the coefficients in the rotation matrix tells us the strength of the relationship between the specified variable and principal component. Since the first principal component seems closely related to overall satisfaction, we list the variables with a negative coefficient on it, from smallest to largest magnitude, to get a better understanding: gate location, convenience of departure/arrival time, age, customer type, ease of online booking, type of travel, flight distance, check-in service, inflight wifi service, leg room service, class, baggage handling, inflight service, on-board service, food and drink, online boarding, cleanliness, seat comfort, and lastly inflight entertainment. Due to the clear relationship between the first principal component and overall satisfaction, we can infer that these variables are increasingly associated with overall flight satisfaction (from lowest to highest). This result makes sense, especially considering that most of these variables are individual survey questions, so we would naturally expect higher ratings in each of these categories to be associated with higher satisfaction in general.
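The ordering by loading magnitude can be reproduced directly from the PC1 column of the rotation matrix. The Python sketch below copies those values from the table above; the variable name `PC1` and the sorting approach are illustrative choices rather than the authors' code.

```python
# PC1 loadings copied from the rotation matrix above.
PC1 = {
    "Gender": 0.003883067,
    "Customer.Type": -0.083893852,
    "Age": -0.081945049,
    "Type.of.Travel": -0.149363861,
    "Class": -0.238704589,
    "Flight.Distance": -0.158646926,
    "Inflight.wifi.service": -0.211861303,
    "Departure.Arrival.time.convenient": -0.060389964,
    "Ease.of.Online.booking": -0.141009700,
    "Gate.location": -0.054911345,
    "Food.and.drink": -0.269989781,
    "Online.boarding": -0.298663390,
    "Seat.comfort": -0.330686841,
    "Inflight.entertainment": -0.394661849,
    "On.board.service": -0.269529446,
    "Leg.room.service": -0.223657605,
    "Baggage.handling": -0.244710012,
    "Checkin.service": -0.179195921,
    "Inflight.service": -0.247252549,
    "Cleanliness": -0.322893899,
    "Departure.Delay.in.Minutes": 0.037730621,
    "Arrival.Delay.in.Minutes": 0.040685860,
}

# Keep only the variables with a negative loading, then sort them
# from smallest to largest magnitude.
negative = [(name, value) for name, value in PC1.items() if value < 0]
ranked = sorted(negative, key=lambda item: abs(item[1]))

for name, value in ranked:
    print(f"{name:35s} {abs(value):.3f}")
```

Nineteen of the 22 variables have a negative PC1 loading; gate location has the weakest and inflight entertainment the strongest.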

Question 2: How does overall satisfaction compare between loyal and disloyal customers?

We can determine if there is a difference in overall satisfaction between loyal and disloyal customers by creating a mosaic plot comparing overall satisfaction and customer type, and coloring it by its Pearson residuals. This will allow us to see if satisfaction and customer type are independent, and if not, which combinations are larger or smaller than expected.

By Pearson residuals, we can see that satisfaction and loyalty are not independent. Loyal customers have less dissatisfaction than what would be expected, and thus more satisfaction than what would be expected under the null model. Also, disloyal customers have more dissatisfaction than what would be expected, and thus less satisfaction than what would be expected under the assumption of independence.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 5587.1, df = 1, p-value < 2.2e-16

By running a Chi-squared test for independence, we get a test statistic of 5587.1 with 1 degree of freedom, and a p-value of nearly zero, giving us significant evidence to reject the null hypothesis that customer type and overall satisfaction are independent. Therefore, we can conclude that there is an association between these two variables. In conjunction with our inferences from the mosaic plot, we can see that overall satisfaction tends to be significantly higher for loyal customers, and lower for disloyal customers.
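To make the two quantities used in this section concrete, the Python sketch below computes Pearson residuals and a Yates-corrected chi-squared statistic for a 2x2 table. The counts are hypothetical toy values, not the counts behind the test above, and the function names are illustrative; this is not the authors' code.

```python
from math import sqrt


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) for every cell."""
    expected = expected_counts(table)
    return [
        [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def yates_chi_squared(table):
    """Chi-squared statistic with Yates' continuity correction (2x2 tables)."""
    expected = expected_counts(table)
    statistic = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            # Shrink each deviation by 0.5, but never past zero.
            diff -= min(0.5, diff)
            statistic += diff ** 2 / e
    return statistic


# Hypothetical counts.
# Rows: disloyal, loyal. Columns: neutral/dissatisfied, satisfied.
toy_table = [[30, 10], [20, 40]]

residuals = pearson_residuals(toy_table)
statistic = yates_chi_squared(toy_table)
print(residuals)
print(round(statistic, 3))
```

In the toy table, the positive residuals fall in the disloyal/dissatisfied and loyal/satisfied cells, which is the same pattern the mosaic plot shows for the real data.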

Question 3: How is satisfaction related to flight distances and age demographics?

In order to help airlines improve, we want to help them predict whether their customers will be satisfied. In order to do this, we will first explore whether flight distance and age influence overall satisfaction, by examining a pairs plot, the empirical cumulative distribution functions, and KS tests comparing age and flight distance between satisfied and dissatisfied passengers.

Based on this pairs plot, it seems that there is some sort of trend between age, flight distance, and satisfaction. We can see that there is a significant correlation between age and flight distance at a level of 0.001, and when we condition on satisfaction the correlations are also significant for both satisfied and dissatisfied customers. The density curves for both age and flight distance conditioned on satisfaction seem to have different shapes, so we will examine this further with eCDFs. To show that this trend is not solely due to random chance, we also perform two two-sample Kolmogorov-Smirnov tests comparing the distributions of satisfied and dissatisfied customers for age and flight distance.

Between ages 0 and about 30, we can see that the cumulative proportion of dissatisfied passengers is much larger than that of satisfied passengers. From around 30 to 60 years old, the cumulative proportion of satisfied customers increases more rapidly than it does for dissatisfied customers. At 60 years old, the cumulative proportion of dissatisfied customers begins to increase more quickly than that of satisfied customers once again, until both level off at around 70 years old.

For flight distances between 0 and 2,500 miles, we can see that the cumulative proportion of dissatisfied passengers increases much more than it does for satisfied passengers for each additional mile flown. At 2,500 miles, the cumulative proportion of dissatisfied customers begins to increase more slowly than that of satisfied customers, until both level off at around 4,000 miles.

To quantify the difference between the distributions for satisfied and dissatisfied customers, for both age and flight distance, we now perform a KS test for each of the variables.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  asp_satisfied$Age and asp_unsatisfied$Age
## D = 0.20867, p-value < 2.2e-16
## alternative hypothesis: two-sided

By running a two-sample KS test, we get a test statistic of 0.20867 and a p-value of nearly zero, giving us significant evidence to reject the null hypothesis that the ages of satisfied and dissatisfied customers are sampled from the same distribution. Therefore, we can conclude that there is a significant difference between the distributions of age based on satisfaction level, which we observed previously in our eCDF.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  asp_satisfied$Flight.Distance and asp_unsatisfied$Flight.Distance
## D = 0.29678, p-value < 2.2e-16
## alternative hypothesis: two-sided

Running another two-sample KS test, we get a test statistic of 0.29678 and a p-value of nearly zero, giving us significant evidence to reject the null hypothesis that the flight distances of satisfied and dissatisfied customers are sampled from the same distribution. Therefore, we can conclude that there is a significant difference between the distributions of flight distance based on satisfaction level, which we observed previously in our eCDF.
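The D statistic in both tests is the largest vertical gap between the two empirical cumulative distribution functions. The Python sketch below illustrates that definition on hypothetical toy samples; the sample values and function names are assumptions for illustration, and it does not compute a p-value or reproduce the authors' code.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the share of values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two eCDFs."""
    F_a, F_b = ecdf(sample_a), ecdf(sample_b)
    # The eCDFs are step functions, so the maximum gap is attained
    # at one of the observed data points.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(F_a(x) - F_b(x)) for x in points)


# Hypothetical toy ages, not taken from the data set.
ages_satisfied = [35, 41, 48, 52, 57]
ages_dissatisfied = [19, 23, 27, 44, 66]

D = ks_statistic(ages_satisfied, ages_dissatisfied)
print(D)
```

A value of D near 0 means the two eCDFs almost coincide, while a value near 1 means the samples barely overlap; the observed values of about 0.21 for age and 0.30 for flight distance sit in between.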

After performing both KS tests and observing p-values of nearly zero, we can conclude that the distributions of both age and flight distance differ by satisfaction level. Satisfaction seems to be concentrated among passengers between around 30 and 60 years old, while passengers outside this range tend to be more dissatisfied. Relatively speaking, shorter flights have much more passenger dissatisfaction, and longer flights have higher passenger satisfaction.

Question 4: Do delays impact customer satisfaction with the timing of their flights?

We want to see if having a delay impacts customer satisfaction with the timing of their flights. In order to examine this, we will look at both arrival and departure delay times, as well as the satisfaction with the convenience of the flights.

From this heat map, where the dashed reference line shows y=x, we can see that there is a linear relationship between departure and arrival delay. This is expected, since a flight that departs late will usually arrive late. Additionally, satisfaction is lower for flights whose arrival delay is larger than their departure delay (above the line). For relatively short arrival and departure delays, the timing satisfaction rating is generally between 2 and 4. For flights with large outlier values of both arrival and departure delay, the timing satisfaction rating is relatively low, which makes sense (with the odd exception of arrival and departure delays of around 1,100 minutes).
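A heat map of this kind can be built by grouping flights into two-dimensional delay bins and summarizing the timing rating within each bin. The Python sketch below shows one way to do that; it assumes the fill is the mean rating per bin and uses a 30-minute bin width, neither of which is stated in the report, and the records are hypothetical toy values.

```python
from collections import defaultdict


def mean_rating_by_delay_bin(records, bin_width=30):
    """Average the rating within each (departure bin, arrival bin) cell.

    records: iterable of (departure_delay, arrival_delay, rating) tuples.
    bin_width: width of each delay bin in minutes (assumed value).
    """
    # Each cell stores [sum of ratings, number of flights].
    cells = defaultdict(lambda: [0, 0])
    for departure, arrival, rating in records:
        key = (departure // bin_width, arrival // bin_width)
        cells[key][0] += rating
        cells[key][1] += 1
    return {key: total / count for key, (total, count) in cells.items()}


# Hypothetical flights: (departure delay, arrival delay, timing rating).
toy_records = [
    (5, 0, 4),
    (10, 20, 2),
    (45, 50, 3),
    (40, 35, 1),
    (200, 190, 1),
]

heat = mean_rating_by_delay_bin(toy_records)
for cell, mean_rating in sorted(heat.items()):
    print(cell, mean_rating)
```

Cells with equal bin indices lie on the y=x diagonal, so cells whose arrival index exceeds their departure index correspond to the region above the reference line.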

Question 5: Are higher class passengers more satisfied than lower class passengers?

Next, we wanted to know whether higher class passengers are more satisfied than lower class passengers. However, since we know that seat class has a relationship with both the distance of the flight and the age of the customers, we include those variables in our analysis for this question as well.

While the relationship between flight distance and flight class still holds (the longer the flight, the more likely the customer is to be in business class), from this faceted heat map we can see that dissatisfied customers stay in economy or economy plus classes for longer flight distances. We can also see that satisfied customers have a stronger relationship between age and class, since more of the older customers hold higher class tickets (like economy plus and business class) in the satisfied group than in the dissatisfied group.

Conclusion and Future Recommendations

The Airline Passenger Satisfaction data set provides many insights into the factors that contribute to passenger satisfaction and dissatisfaction in air travel, which can be used by airlines to improve their services and enhance their customers’ travel experience. Our analysis of the data set has revealed several important findings. We found that there were many relationships between overall passenger satisfaction and the variables tested. Throughout this report we used multiple visualization techniques to inspect and understand the data, such as PCA plots (scree plots and biplots), mosaic plots, pairs plots, eCDFs, and heat maps.

Using PCA as our first form of EDA because the data set has so many variables, we found that most of the features were in fact associated with passenger satisfaction. From here, we used a mosaic plot and a chi-squared test of independence to find that satisfaction and customer type are dependent, and that overall satisfaction tends to be significantly higher for loyal customers and lower for disloyal customers. Upon inspecting how flight distance and age are related to satisfaction using a pairs plot, eCDFs, and KS tests, we found that the distributions of both age and flight distance differ by satisfaction level. Satisfaction seems to be concentrated among passengers between around 30 and 60 years old, while passengers outside this range tend to be more dissatisfied. Relatively speaking, shorter flights have much more passenger dissatisfaction, and longer flights have higher passenger satisfaction. Next, when we analyzed how delays affect customer satisfaction with the timing of their flights using a heat map, we unsurprisingly found a strong linear relationship between arrival and departure delays, and that longer delays are associated with lower satisfaction. Lastly, in our investigation of whether higher class passengers are more satisfied than lower class passengers, using a heat map that also accounts for flight distance and age, we found that satisfied customers have a stronger relationship between age and class.

For the future we would like to continue to explore other data sets that could be used in conjunction with the observations we made to get a better picture of the airline service as a whole. Also, we would be interested in learning and applying more appropriate methods of dimension reduction that work on both categorical and numeric data, such as Factor Analysis of Mixed Data (FAMD) to see if we can get more precise and insightful results. One possible avenue for future research is to examine the impact of other factors, such as the airline’s reputation, the customer’s previous experience with the airline, and the frequency of travel on passenger satisfaction. Additionally, it would be interesting to explore the relationship between passenger satisfaction and airline profitability, as there may be a strong correlation between the two. Finally, as the airline industry is constantly evolving, it would be worthwhile to conduct similar studies periodically to see if the factors influencing passenger satisfaction have changed over time.