A predictive learning algorithm predicts an outcome based on learning from previous instances of data. For example: Given an instance of a loan application, predict if the applicant will repay the loan. The learning algorithm makes these predictions based on a training dataset, where many other instances (other loan applications) and actual outcomes (whether they repaid) are provided.
Unfortunately, as you have discovered throughout this class, the patterns found by these learning algorithms can sometimes amplify historical biases. For example, a loan repayment algorithm may discover that age plays a significant role in predicting repayment because the training dataset happened to show better repayment for one age group than for another. This raises a major problem: even if this outcome is representative of the data, there are legal precedents and laws that make it illegal to base any decision on an applicant’s age, regardless of whether age is a good predictor in the historical data.
To enable the mitigation of bias, a number of pre-determined “fairness metrics” have been proposed to help identify the bias since, unless you know what’s broke, you can’t fix it.
In this assignment, we will look at the impact of computing and applying fairness metrics to “fix” data that could be used to train algorithms associated with learning from credit-based data sets. Remember to answer ALL questions irrespective of the outcome from each step.
Step 1 – Select one of the datasets for completion of this assignment:
• German Credit Data Set – https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
• Taiwan Credit Data Set – https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
• Portuguese Bank Marketing Data Set – https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Step 2 – Explore the data by answering the following questions:
• Which dataset did you select?
• How many observations are in the dataset?
• How many variables are in the dataset?
• How many and which variables in the dataset are associated with a legally recognized protected class? For each variable associated with a protected class, which legal precedent/law does it fall under, as discussed in the lectures?
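The counting questions above can be answered with a few lines of code. The sketch below is only an illustration: it assumes the chosen dataset has been saved as a comma-separated file with a header row, and it uses a tiny made-up sample (SAMPLE_CSV, with hypothetical column names) instead of any of the real datasets, whose file formats differ.

```python
import csv
import io

# Hypothetical stand-in for a downloaded file. The column names and values
# are made up for illustration and do not come from any of the datasets.
SAMPLE_CSV = """age,credit_amount,duration_months,default
23,1200,12,0
45,5000,36,1
37,2500,24,0
61,800,6,0
"""


def summarize(csv_text):
    """Return (number of observations, list of variable names)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # one dict per observation
    return len(rows), list(reader.fieldnames)


n_obs, variables = summarize(SAMPLE_CSV)
print(f"Observations: {n_obs}")
print(f"Variables ({len(variables)}): {variables}")
```

Deciding which of the listed variables correspond to a protected class still has to be done by hand against the lecture material; the code only reports what is in the file.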
Step 3 – Based on your selected dataset, specify an outcome variable and protected attributes, and split the dataset into training and testing sets
• Select outcome variable(s) that relates to the creditworthiness of a customer and derive a formula to score each customer based on whether they are an Excellent Credit Risk (i.e. highly likely to pay back a loan) versus Bad Credit Risk (i.e. highly likely to default on loan). Select a range of scores from 0-100 where 100 is the maximum value for Excellent Credit Risk. To compute creditworthiness, you can apply any algorithm or set of calculations on the variables that makes sense to you – you can implement your own ML algorithm (which is perfectly fine), create a mathematical formula (which is the basis of all things AI/ML) or even just close your eyes, throw a dart, and pick a single variable from the dataset.
• Select a protected class attribute – i.e. choose an attribute on which the bias can occur, basically the attribute you want to test bias for.
• Define an unprivileged group and a privileged group – i.e. choose the subsets of protected attribute values that are considered unprivileged versus privileged from a fairness perspective (your unprivileged group would be your historically disadvantaged group of interest).
o For example, we might select age as our protected class attribute. In this case, we may decide to choose Older (age >= 40) as the unprivileged group and Young (age < 40) as the privileged group.
o This allows us to transform our data based on binary membership in a protected group
• Randomly split your original dataset into equally sized training and testing sets. How many members of each group (privileged versus unprivileged) are in each set?
• Provide your results indicating your selected outcome variable/conversion formula, protected class attribute, privileged group, and unprivileged group.
• Example Output:
o Outcome variable: Creditworthiness derived from History of past payments and Y
o Formula used to score members’ creditworthiness from 0 to 100 is [Some Formula]
o Protected Class Attribute: Age
o Privileged group: Young (age < 40); Number of Members in Training Set: J; Number of Members in Testing Set: K
o Unprivileged group: Older (age >= 40); Number of Members in Training Set: X; Number of Members in Testing Set: Y
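As one possible way to carry out Step 3, the sketch below scores each customer from 0 to 100, assigns binary group membership using the age example above, and splits the data in half at random. Everything in it is an assumption made for illustration: the records are synthetic, the field months_late and the bound MAX_MONTHS_LATE are hypothetical, and the linear formula is just one of many formulas the assignment allows.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic, made-up customers; "months_late" is a hypothetical variable
# standing in for whatever the chosen dataset actually provides.
customers = [
    {"age": rng.randint(20, 70), "months_late": rng.randint(0, 8)}
    for _ in range(200)
]

MAX_MONTHS_LATE = 8  # assumed upper bound of the hypothetical variable


def creditworthiness(customer):
    """Linear score: 100 = Excellent Credit Risk, 0 = Bad Credit Risk."""
    return 100.0 * (1 - customer["months_late"] / MAX_MONTHS_LATE)


def is_privileged(customer):
    """Age example from the text: Young (age < 40) is the privileged group."""
    return customer["age"] < 40


for c in customers:
    c["score"] = creditworthiness(c)
    c["privileged"] = is_privileged(c)

# Random split into equally sized training and testing sets.
shuffled = customers[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
train, test = shuffled[:half], shuffled[half:]


def group_counts(rows):
    """Count privileged and unprivileged members in a set."""
    privileged = sum(1 for r in rows if r["privileged"])
    return {"privileged": privileged, "unprivileged": len(rows) - privileged}


print("Training set:", group_counts(train))
print("Testing set: ", group_counts(test))
```

Fixing the random seed is a design choice that makes the reported group counts repeatable; a different seed gives a different but equally valid split.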
Step 4 – Graph and compute a default threshold that maximizes profit
• Using a histogram, graph the data associated with Creditworthiness (where creditworthiness is on the X-axis and the number of associated customers with that creditworthiness is on the Y-axis)
• Compute a threshold for approving a loan (based on credit risk) that tries to maximize profit. Assume that a good credit risk is associated with a creditworthiness score >=50. Highlight the threshold information on the graph.
• To compute profits, assume, in this case:
o Approved Loan/Good Credit Risk = +10 Profit
o Approved Loan/Bad Credit Risk = -5 Profit
o Declined Loan/Good Credit Risk = -3 Profit
o Declined Loan/Bad Credit Risk = 0 Profit
• What is your threshold value? What is the profit based on your threshold value? Compute how many in each group (privileged and unprivileged) received Favorable (i.e. Approved) versus Unfavorable (i.e. Declined) outcomes based on your threshold value. Create a table documenting your results. Note: A Favorable outcome is associated with an Approved Loan. An Unfavorable outcome is associated with a Declined Loan.
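The threshold search in Step 4 can be sketched as a simple sweep over candidate thresholds. The code below is an illustration under stated assumptions, not a required method: it treats a score >= 50 as a good credit risk (as the step instructs), approves a loan when the score is at or above the candidate threshold, tries every integer threshold from 0 to 100, and breaks ties by keeping the lowest threshold. The records list is made up, and the function names are hypothetical.

```python
# Payoffs from the assignment, keyed by (approved, good_risk).
PROFIT = {
    (True, True): 10,    # Approved Loan / Good Credit Risk
    (True, False): -5,   # Approved Loan / Bad Credit Risk
    (False, True): -3,   # Declined Loan / Good Credit Risk
    (False, False): 0,   # Declined Loan / Bad Credit Risk
}
GOOD_RISK_CUTOFF = 50  # score >= 50 counts as a good credit risk


def total_profit(scores, threshold):
    """Total profit if every score >= threshold is approved."""
    return sum(
        PROFIT[(score >= threshold, score >= GOOD_RISK_CUTOFF)]
        for score in scores
    )


def best_threshold(scores):
    """Lowest integer threshold in 0..100 with the highest total profit."""
    return max(range(0, 101), key=lambda t: total_profit(scores, t))


def outcome_table(records, threshold):
    """Count Favorable/Unfavorable outcomes per group.

    records is a list of (score, is_privileged) pairs.
    """
    table = {
        "privileged": {"favorable": 0, "unfavorable": 0},
        "unprivileged": {"favorable": 0, "unfavorable": 0},
    }
    for score, privileged in records:
        group = "privileged" if privileged else "unprivileged"
        outcome = "favorable" if score >= threshold else "unfavorable"
        table[group][outcome] += 1
    return table


# Made-up (score, is_privileged) pairs, for illustration only.
records = [(10, True), (30, False), (45, True), (50, False),
           (60, True), (80, True), (95, False)]
scores = [score for score, _ in records]

threshold = best_threshold(scores)
print("Threshold:", threshold)
print("Profit:", total_profit(scores, threshold))
print("Outcomes:", outcome_table(records, threshold))
```

Because good credit risk is defined directly from the score in this sketch, any threshold that separates the highest bad-risk score from the lowest good-risk score yields the same profit; with a separately observed repayment outcome, the sweep would be less trivial.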