Tartan Data Science Cup

Problem Overview

People take out loans for many reasons: to pay for college, to buy or renovate a house, to purchase a car, to refinance or consolidate existing debt, and to pay for big events (e.g., weddings or vacations), among others.

From the lender's perspective, it would be useful to know which features of the loans or borrowers are associated with differences in the likelihood that a loan is paid off.

Using (slightly altered) data from Lending Club here, your team is asked the following:

You work for a company that issues loans. Your company's data servers malfunctioned, causing some of the data to be "corrupted." Important information about the loans in the corrupted data was lost, including the repayment status of these loans.

Using the complete, uncorrupted data here, your task is to predict the repayment statuses of the corrupted loans. In particular, you are asked to predict if the loan status is in one of the following groups:

Group 0: "Good" -- The loan is fully paid off ("Fully Paid"), in its grace period ("In Grace Period"), or currently being paid back ("Current").

Group 1: "Bad" -- The loan is in default ("Default"), payment on the loan is late ("Late (16-30 days)" or "Late (31-120 days)"), or the loan has been charged off ("Charged Off").
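As a minimal sketch of how the two groups above could be encoded, the following maps each listed loan status to its group label. The status strings are copied from the group definitions; the function name `status_to_group` is a hypothetical helper, and the exact column name and spelling of statuses in the provided data are assumptions you should check against the data documentation.

```python
# Status strings taken from the Group 0 / Group 1 definitions above.
GOOD_STATUSES = {"Fully Paid", "In Grace Period", "Current"}
BAD_STATUSES = {
    "Default",
    "Late (16-30 days)",
    "Late (31-120 days)",
    "Charged Off",
}


def status_to_group(status: str) -> int:
    """Return 0 for a "Good" loan status and 1 for a "Bad" one.

    Hypothetical helper: raises ValueError on any status not listed
    in the problem statement, so unexpected values are not silently
    assigned to a group.
    """
    if status in GOOD_STATUSES:
        return 0
    if status in BAD_STATUSES:
        return 1
    raise ValueError(f"Unrecognized loan status: {status!r}")
```

Raising an error on unlisted statuses is a design choice for this sketch; it makes any status outside the seven named above visible early rather than guessing its group.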

Additionally, your boss is interested in knowing what features of the loans or borrowers correspond to a higher likelihood of being in the "Bad" group.

For every loan in the corrupted dataset, you must provide a prediction of which group that loan is in. Each prediction can be either a label, 0 (for Group 0) or 1 (for Group 1), or the probability that the loan is in Group 1.

Additional information on the data can be found here.

Rules

You may not use any data sources other than the datasets provided, including any data you find online. Exactly how you justify your answer is up to you. That said, we suggest the following:

Use graphics / data visualization

When appropriate, incorporate the results of statistical models/tests

Provide detailed descriptions of the methodology used, but be concise

Submissions

Each team should submit all of the following:

Due at 4pm: a set of predictions in the format specified here.

Due at 5pm: a 2-3 page report containing at least two graphics or tables, a detailed description of the methods used to analyze the data, and any key results that were obtained (submitted as a .pdf file)

Due at 6pm: up to 3 slides for a 5-minute research presentation (submitted as a .pdf file)

Due at 6pm: all (well-documented!) code used to analyze the data, obtain results, create graphics, etc. (any programming language/software is acceptable)

Submission constitutes permission to post (anonymized) winning team entries online.

Finalists

The 15 teams whose predictions have the lowest Brier scores will advance to the judging round.
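For reference, the Brier score of a set of binary predictions is the mean squared difference between each predicted probability and the actual outcome (0 or 1), so lower is better. The sketch below illustrates the standard definition; the function name `brier_score` and the toy values are made up for illustration, and the organizers' exact scoring code is not specified here.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predictions and 0/1 outcomes.

    y_true: sequence of actual group labels (0 or 1).
    y_prob: sequence of predictions, either 0/1 labels or the
            predicted probability of Group 1.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - y) ** 2 for y, p in zip(y_true, y_prob)]
    return sum(squared_errors) / len(squared_errors)


# Toy illustration (made-up values, not contest data).
actual = [0, 0, 1, 1]
hard_labels = [0, 0, 1, 0]              # one of four labels is wrong
probabilities = [0.1, 0.2, 0.8, 0.4]    # same ranking, softer guesses

print(brier_score(actual, hard_labels))    # 0.25
print(brier_score(actual, probabilities))  # approximately 0.1125
```

In this toy case the probability predictions score better than the 0/1 labels, because a confidently wrong label contributes a full squared error of 1 while a hedged probability contributes less.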

Of these 15, eight teams will make the finals, as determined by a group of expert judges, who will read the reports.