Aviation is an essential industry, but many people find flying frustrating. To help airlines improve their service, we analyze flight satisfaction survey data from Kaggle. This data set has nearly 130,000 individual responses for 23 variables, including 14 satisfaction ratings for different components of the flying experience and an overall satisfaction rating. It also records delay times (both arrival and departure), flight distance, customer loyalty, type of travel (business or personal), ticket class (economy, economy plus, or business), and demographic information (age and gender).
The Airline Passenger Satisfaction dataset is a comprehensive collection of customer feedback from passengers. The dataset contains information on various aspects of the passengers’ travel experience, such as flight distance, gender, age, type of travel, class, seat comfort, inflight entertainment, onboard service, cleanliness, departure delay, arrival delay, and overall satisfaction. This dataset aims to provide insights into the factors that contribute to passenger satisfaction and dissatisfaction, which can be used by airlines to improve their services and enhance their customers’ travel experience. It contains the following columns:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Economy, Economy Plus)
Flight distance: The flight distance of this journey (miles)
Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level with the convenience of the departure/arrival time
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Airline satisfaction level (satisfied, or neutral or dissatisfied)
Through our report we aim to answer the following questions:
1. Which features correspond to whether a customer is satisfied overall?
2. How does overall satisfaction compare between loyal and disloyal customers?
3. How is overall satisfaction related to flight distances and age demographics?
4. Do delays affect customers’ satisfaction with the timing of their flights?
5. Are higher class passengers more satisfied than lower class passengers?
Since the raw data set contains 25 columns, we use Principal Component Analysis (PCA), a dimension reduction technique, to understand the most important dimensions in the data. We then group the points by overall satisfaction to examine its relationship with both the components and the variables. PCA is typically applied to quantitative data, so to use it here we must first convert the qualitative variables into quantitative ones. The four categorical variables were coded as follows: for gender, male is coded as -1 and female as 1; for customer type, disloyal is coded as -1 and loyal as 1; for type of travel, personal is coded as -1 and business as 1; for class, economy and economy plus are coded as -1 and business as 1. The individual survey ratings are ordinal categorical variables, so their original coding (ratings from 0 to 5) is kept.
Note that we merge the training and testing data sets originally listed on Kaggle, omit rows with missing/NA values, and omit rows with ratings of 0. We also remove the “x” and “ID” columns since they do not provide useful information. After removing overall satisfaction from the data set (for the purposes of this specific research question), 22 features remain for PCA.
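The recoding and row filtering described above can be sketched as follows. The output in this report comes from R; this is an independent Python sketch using only the standard library, not the code used for the analysis. The category labels are taken from the column list in this report and may differ from the labels in the raw file, and `encode_row` and the example response are hypothetical names and values introduced for illustration.

```python
from typing import Optional

# Codes from the description above. The category labels follow the column
# list in this report and are assumptions about the raw file.
BINARY_CODES = {
    "Gender": {"Male": -1, "Female": 1},
    "Customer Type": {"disloyal customer": -1, "Loyal customer": 1},
    "Type of Travel": {"Personal Travel": -1, "Business Travel": 1},
    "Class": {"Economy": -1, "Economy Plus": -1, "Business": 1},
}
NUMERIC_COLUMNS = [
    "Age", "Flight distance",
    "Departure Delay in Minutes", "Arrival Delay in Minutes",
]
RATING_COLUMNS = [
    "Inflight wifi service", "Departure/Arrival time convenient",
    "Ease of Online booking", "Gate location", "Food and drink",
    "Online boarding", "Seat comfort", "Inflight entertainment",
    "On-board service", "Leg room service", "Baggage handling",
    "Check-in service", "Inflight service", "Cleanliness",
]
MISSING = (None, "", "NA")


def encode_row(row: dict) -> Optional[dict]:
    """Return the 22 numeric features for one response, or None if the
    row should be dropped (missing value or a rating of 0)."""
    encoded = {}
    for column, codes in BINARY_CODES.items():
        label = row.get(column)
        if label not in codes:  # missing or unknown category
            return None
        encoded[column] = codes[label]
    for column in NUMERIC_COLUMNS:
        value = row.get(column)
        if value in MISSING:
            return None
        encoded[column] = float(value)
    for column in RATING_COLUMNS:
        value = row.get(column)
        if value in MISSING:
            return None
        rating = int(value)
        if rating == 0:  # the report omits rows with ratings of 0
            return None
        encoded[column] = rating  # ordinal rating kept as-is
    return encoded


# Hypothetical response, made up for illustration.
example = {
    "Gender": "Female", "Customer Type": "Loyal customer",
    "Type of Travel": "Business Travel", "Class": "Economy Plus",
    "Age": 41, "Flight distance": 1200,
    "Departure Delay in Minutes": 5, "Arrival Delay in Minutes": 0,
}
example.update({column: 4 for column in RATING_COLUMNS})
encoded_example = encode_row(example)
print(encoded_example)
```

Overall satisfaction is deliberately left out of the encoded features, since it is used only to color the points after PCA.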
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0495 1.5784 1.47725 1.40293 1.34645 1.19141 1.00735
## Proportion of Variance 0.1909 0.1132 0.09919 0.08946 0.08241 0.06452 0.04613
## Cumulative Proportion  0.1909 0.3042 0.40336 0.49282 0.57523 0.63975 0.68587
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.97191 0.96533 0.90154 0.82431 0.71399 0.68603 0.65802
## Proportion of Variance 0.04294 0.04236 0.03694 0.03089 0.02317 0.02139 0.01968
## Cumulative Proportion  0.72881 0.77117 0.80811 0.83900 0.86217 0.88356 0.90324
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.63900 0.59471 0.56849 0.54144 0.52767 0.51071 0.41973
## Proportion of Variance 0.01856 0.01608 0.01469 0.01333 0.01266 0.01186 0.00801
## Cumulative Proportion  0.92180 0.93788 0.95257 0.96589 0.97855 0.99040 0.99841
##                           PC22
## Standard deviation     0.18687
## Proportion of Variance 0.00159
## Cumulative Proportion  1.00000
Based on the elbow plot, it is not clear how many dimensions we should visualize. There is no obvious point at which the plot begins to level off; relative to the other slopes, however, the most gradual decrease seems to start at around the twelfth principal component. That said, the proportion of variance explained by the eighth principal component (about 0.043) falls below the dashed horizontal line, which marks the share of variance that a single variable would contribute (1/22, or about 0.045), while the seventh component (about 0.046) still sits above it. Since the goal of PCA is to reduce the dimensionality of the data and there is no clear elbow point, it seems that we should visualize the first seven principal components to get a better understanding of the data’s variance structure.
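The comparison against the dashed line can be checked from the standard deviations printed in the summary table. The following is a minimal Python sketch (standard library only; the report's own output comes from R). It assumes the dashed line sits at 1/22, the share of variance a single variable contributes when the 22 features are scaled to unit variance; the variable names are introduced here for illustration.

```python
# Standard deviations of the 22 principal components, copied from the
# "Importance of components" table.
std_devs = [
    2.0495, 1.5784, 1.47725, 1.40293, 1.34645, 1.19141, 1.00735,
    0.97191, 0.96533, 0.90154, 0.82431, 0.71399, 0.68603, 0.65802,
    0.63900, 0.59471, 0.56849, 0.54144, 0.52767, 0.51071, 0.41973,
    0.18687,
]

# Each component's variance is its squared standard deviation.
variances = [s ** 2 for s in std_devs]
total_variance = sum(variances)
proportions = [v / total_variance for v in variances]

# Assumed reference line: the share one variable out of 22 would contribute.
threshold = 1 / len(std_devs)

# Count the components that explain more than a single variable's share.
n_above = sum(p > threshold for p in proportions)
first_two = sum(proportions[:2])

print(f"threshold = {threshold:.4f}")
print(f"components above threshold = {n_above}")
print(f"variance explained by PC1 and PC2 = {first_two:.3f}")
```

Under this assumption, the count of components above the threshold matches the seven components discussed above.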
Although we would ideally examine the first seven components, for simplicity we will only visualize the first two components (which together explain about 30.4% of the variation) and group the data points by overall satisfaction level to determine if there are any distinct clusters.
From the biplot, we can see that most of the variables (all except gender, departure delay, and arrival delay) point towards the left, indicating that passenger responses with a low first principal component score tend to have higher values of those variables. Also, the fairly clear separation between blue points (satisfied passengers) on the left and orange points (neutral or dissatisfied passengers) on the right shows that satisfied customers are associated with a lower first principal component score, and thus with high values of the variables mentioned above.
The three variables pointing to the right (gender, departure delay, and arrival delay) indicate that passenger responses with a higher first principal component score tend to have higher values of these variables, which is associated with neutral or dissatisfied passengers. Since gender is coded as -1 for male and 1 for female, this would suggest that female passengers are slightly more associated with dissatisfaction, although the gender loading on the first component (about 0.004) is so small that the association is negligible. Among the variables with negative loadings on the first principal component, four have relatively large loadings on the second principal component: inflight wifi service, ease of online booking, convenience of departure and arrival time, and gate location. In contrast to the clear trend along the first component, however, satisfied and neutral/dissatisfied passengers are not clearly separated along the second principal component, so the relationship between those four variables and overall satisfaction is unclear.
##                                            PC1           PC2
## Gender                             0.003883067 -0.0004842386
## Customer.Type                     -0.083893852  0.0230591807
## Age                               -0.081945049 -0.0147426199
## Type.of.Travel                    -0.149363861 -0.0757461097
## Class                             -0.238704589 -0.0980880770
## Flight.Distance                   -0.158646926 -0.0768253859
## Inflight.wifi.service             -0.211861303  0.4326703328
## Departure.Arrival.time.convenient -0.060389964  0.4805048964
## Ease.of.Online.booking            -0.141009700  0.5211197369
## Gate.location                     -0.054911345  0.4640398412
## Food.and.drink                    -0.269989781 -0.0941787668
## Online.boarding                   -0.298663390  0.1055810561
## Seat.comfort                      -0.330686841 -0.1062050034
## Inflight.entertainment            -0.394661849 -0.1302093693
## On.board.service                  -0.269529446 -0.0639788809
## Leg.room.service                  -0.223657605 -0.0381212411
## Baggage.handling                  -0.244710012 -0.0529184935
## Checkin.service                   -0.179195921 -0.0273064810
## Inflight.service                  -0.247252549 -0.0563983333
## Cleanliness                       -0.322893899 -0.1113875408
## Departure.Delay.in.Minutes         0.037730621 -0.0059030262
## Arrival.Delay.in.Minutes           0.040685860 -0.0060135717
Since there are many dimensions and the variables closely associated with satisfied passengers are not easy to see in the biplot, we can use the rotation matrix for a more detailed analysis. The magnitude of each coefficient in the rotation matrix tells us the strength of the relationship between the specified variable and principal component. Since the first principal component seems closely related to overall satisfaction, we list the variables with a negative coefficient on it from smallest to largest magnitude: gate location, convenience of departure and arrival time, age, customer type, ease of online booking, type of travel, flight distance, check-in service, inflight wifi service, leg room service, class, baggage handling, inflight service, on-board service, food and drink, online boarding, cleanliness, seat comfort, and lastly inflight entertainment. Given the clear relationship between the first principal component and overall satisfaction, we can infer that these variables are increasingly associated with overall flight satisfaction in the order listed (from weakest to strongest). This result makes sense, especially considering that most of these variables are individual survey questions, so we would naturally expect higher ratings in each category to be associated with higher satisfaction in general.
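The ordering can be reproduced directly from the first column of the rotation matrix. The sketch below is in Python with the standard library only (the matrix itself was produced in R); the PC1 values are copied from the table above, and `pc1_loadings` and `ordered` are names introduced here for illustration.

```python
# PC1 loadings copied from the rotation matrix above.
pc1_loadings = {
    "Gender": 0.003883067,
    "Customer.Type": -0.083893852,
    "Age": -0.081945049,
    "Type.of.Travel": -0.149363861,
    "Class": -0.238704589,
    "Flight.Distance": -0.158646926,
    "Inflight.wifi.service": -0.211861303,
    "Departure.Arrival.time.convenient": -0.060389964,
    "Ease.of.Online.booking": -0.141009700,
    "Gate.location": -0.054911345,
    "Food.and.drink": -0.269989781,
    "Online.boarding": -0.298663390,
    "Seat.comfort": -0.330686841,
    "Inflight.entertainment": -0.394661849,
    "On.board.service": -0.269529446,
    "Leg.room.service": -0.223657605,
    "Baggage.handling": -0.244710012,
    "Checkin.service": -0.179195921,
    "Inflight.service": -0.247252549,
    "Cleanliness": -0.322893899,
    "Departure.Delay.in.Minutes": 0.037730621,
    "Arrival.Delay.in.Minutes": 0.040685860,
}

# Keep the variables with a negative PC1 loading (the ones pointing left
# in the biplot) and sort them from smallest to largest magnitude.
negative = {name: value for name, value in pc1_loadings.items() if value < 0}
ordered = sorted(negative, key=lambda name: abs(negative[name]))

for name in ordered:
    print(f"{name:<35} {negative[name]: .4f}")
```

Sorting by absolute value rather than by the signed loading keeps the listing in "weakest to strongest" order regardless of the sign convention of the component.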
We can determine if there is a difference in overall satisfaction between loyal and disloyal customers by creating a mosaic plot comparing overall satisfaction and customer type, and coloring it by its Pearson residuals. This will allow us to see if satisfaction and customer type are independent, and if not, which combinations are larger or smaller than expected.
From the Pearson residuals, we can see that satisfaction and loyalty are not independent. Loyal customers show less dissatisfaction, and thus more satisfaction, than would be expected under the null model of independence. Conversely, disloyal customers show more dissatisfaction, and thus less satisfaction, than would be expected under independence.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 5587.1, df = 1, p-value < 2.2e-16
A chi-squared test for independence gives a test statistic of 5587.1 with 1 degree of freedom and a p-value below 2.2e-16, which is strong evidence against the null hypothesis that customer type and overall satisfaction are independent. We therefore conclude that there is an association between these two variables. Together with our inferences from the mosaic plot, this shows that overall satisfaction tends to be significantly higher for loyal customers and lower for disloyal customers.
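To make the mechanics of the Pearson residuals and the continuity-corrected test concrete, here is a minimal Python sketch using only the standard library (the test output above comes from R). The counts in `hypothetical_table` are made up for illustration and are not the report's data, and `chi_squared_2x2` is a hypothetical helper name. For 1 degree of freedom, the upper tail of the chi-squared distribution equals `erfc(sqrt(x / 2))`, which is what the p-value line uses.

```python
import math


def chi_squared_2x2(table):
    """Pearson residuals and Yates-corrected chi-squared test for a
    2x2 table of counts given as [[a, b], [c, d]]."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    # Expected counts under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Pearson residual: (observed - expected) / sqrt(expected).
    residuals = [
        [(table[i][j] - expected[i][j]) / math.sqrt(expected[i][j])
         for j in range(2)]
        for i in range(2)
    ]

    # Yates' correction shrinks each |observed - expected| by 0.5,
    # without letting it go below zero.
    statistic = sum(
        max(abs(table[i][j] - expected[i][j]) - 0.5, 0.0) ** 2
        / expected[i][j]
        for i in range(2) for j in range(2)
    )

    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, residuals


# Hypothetical counts, NOT the report's data.
# Rows: disloyal, loyal. Columns: neutral or dissatisfied, satisfied.
hypothetical_table = [[300, 100], [250, 350]]
statistic, p_value, residuals = chi_squared_2x2(hypothetical_table)
print(f"X-squared = {statistic:.2f}, p-value = {p_value:.3g}")
print("Pearson residuals:", residuals)
```

In this made-up table, the residual for disloyal and dissatisfied is positive and the residual for loyal and dissatisfied is negative, which is the same pattern the mosaic plot shows for the real data.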
We want to see if having a delay impacts customer satisfaction with the timing of their flights. In order to examine this, we will look at both arrival and departure delay times, as well as the satisfaction with the convenience of the flights.
From this heat map, where the dashed reference line shows y=x, we can see that there is a linear relationship between departure and arrival delay. This is expected, since a flight that departs late tends to arrive late. Satisfaction is also lower for flights above the reference line, where the arrival delay is larger than the departure delay. For relatively short arrival and departure delays, the timing satisfaction rating is generally between 2 and 4. For flights with large outlier values of both arrival and departure delay, the timing satisfaction rating is relatively low, which makes sense (with the odd exception of arrival and departure delays of around 1,100 minutes).
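One way to build the values behind such a heat map is to bin the two delays and average the timing rating within each cell. The following Python sketch (standard library only) assumes that each cell is colored by the mean rating, which the report does not state explicitly; the bin width, the records, and the function name `mean_rating_by_delay_bin` are all hypothetical.

```python
from collections import defaultdict


def mean_rating_by_delay_bin(records, bin_width=60):
    """Average the timing rating within each (departure, arrival) delay
    cell. Each record is (departure_delay, arrival_delay, rating), with
    delays in minutes."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [rating sum, count]
    for departure_delay, arrival_delay, rating in records:
        cell = (int(departure_delay // bin_width),
                int(arrival_delay // bin_width))
        sums[cell][0] += rating
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


# Hypothetical records, made up for illustration.
records = [
    (5, 0, 4),      # short delays
    (10, 20, 3),    # short delays, same cell as the record above
    (70, 65, 2),    # about an hour late both ways
    (30, 100, 1),   # arrival delay exceeds departure delay
    (400, 390, 1),  # large delays on both ends
]
cell_means = mean_rating_by_delay_bin(records)
for cell, mean_rating in sorted(cell_means.items()):
    print(cell, mean_rating)
```

Cells whose second index is larger than the first lie above the y=x reference line, where the arrival delay exceeds the departure delay.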
Next, we want to know whether higher class passengers are more satisfied than lower class passengers. However, since seat class is related to both the distance of the flight and the age of the customer, we include those variables in our analysis of this question as well.
While the relationship between flight distance and flight class still holds (the longer the flight, the more likely the customer is to be in business class), this faceted heat map shows that dissatisfied customers stay in economy or economy plus for longer flight distances. We can also see that the relationship between age and class is stronger among satisfied customers, since more of the older customers hold higher class tickets (such as economy plus and business) in the satisfied group than in the dissatisfied group.
The Airline Passenger Satisfaction data set provides many insights into the factors that contribute to passenger satisfaction and dissatisfaction in air travel, which airlines can use to improve their services and enhance their customers’ travel experience. Our analysis revealed many relationships between overall passenger satisfaction and the variables tested. Throughout this report we used multiple visualization techniques to inspect and understand the data, such as PCA plots (scree plots and biplots), mosaic plots, pair plots, eCDFs, and heat maps.
Using PCA as our first form of EDA because the data set is so large, we found that most of the features were in fact associated with passenger satisfaction. From there, we used a mosaic plot and a chi-squared test of independence to find that satisfaction and customer type are dependent: overall satisfaction tends to be significantly higher for loyal customers and lower for disloyal customers. Upon inspecting how flight distance and age are related to satisfaction using a pairs plot, eCDFs, and KS tests, we found that both are associated with satisfaction. Satisfied passengers tend to be between about 30 and 60 years old, while passengers outside this range tend to be more dissatisfied. Shorter flights show much more passenger dissatisfaction, and longer flights show higher satisfaction. Next, when we used a heat map to analyze how delays affect customers’ satisfaction with the timing of their flights, we unsurprisingly found a strong linear relationship between arrival and departure delays, and that longer delays are associated with lower satisfaction. Lastly, in our heat map investigation of whether higher class passengers are more satisfied than lower class passengers, taking flight distance and age into account, we found that the relationship between age and class is stronger among satisfied customers.
In the future, we would like to explore other data sets that could be used in conjunction with our observations to get a better picture of airline service as a whole. We would also be interested in learning and applying dimension reduction methods that are better suited to a mix of categorical and numeric data, such as Factor Analysis of Mixed Data (FAMD), to see if we can get more precise and insightful results. One possible avenue for future research is to examine the impact of other factors on passenger satisfaction, such as the airline’s reputation, the customer’s previous experience with the airline, and the frequency of travel. Additionally, it would be interesting to explore the relationship between passenger satisfaction and airline profitability, as the two may be strongly correlated. Finally, as the airline industry is constantly evolving, it would be worthwhile to conduct similar studies periodically to see if the factors influencing passenger satisfaction have changed over time.