A Confusion Matrix is a table that tallies outcomes versus predictions. There are many good explanations about what a Confusion Matrix is. For example, Wikipedia - Confusion Matrix.
This dashboard allows you to explore how a Confusion Matrix behaves.
See the article, Algorithmic Bias and the Confusion Matrix Dashboard.
Code and data are on github.
Also check out a series of YouTube videos walking through the Confusion Matrix Dashboard's features and the Apple Snacks example. Start with the promo intro: Algorithmic Bias and the Confusion Matrix Dashboard.
The Confusion Matrix Dashboard is designed to fit an HD screen (1920 x 1080 pixels).
Example
Suppose you are packaging snack boxes for 500 kids for a school picnic. Some of the kids like apples, and some prefer something else, like popcorn, a cracker, or a cookie. You have two kinds of snack box: one with an apple, and one with something else. You must decide in advance which kind of box to give to each kid, which you will then label with their name and hand out to them. When each kid opens their snack box, they will either be happy with their snack and say "Yay!", or else they will be disappointed and say "Awwww".
To help you decide, you predict which kind of snack each kid likes based on some rules of thumb. Older kids tend to like apples while younger kids do not. Taller kids like apples while shorter kids don't. Kids in Mr. Applebaum's class tend to prefer apples, while kids in Ms. Popcorn's class want something else. There is no hard and fast rule, just an educated guess.
For each kid, you give them a point score indicating the likelihood that they will want an apple. A score of 10 means you are quite sure they'll want an apple, for example a 10-year-old girl in Mr. Applebaum's class. A score of 1 means you're confident they will not want an apple, like a 6-year-old boy in Ms. Popcorn's class.
For each kid, after calculating their prediction score, you make a decision. You might set the decision threshold at the mid-point score of 5. Or, you might set the apple snack threshold higher or lower. For example, if it's important that the kids eat fruit, you'll set the threshold lower so that you'll catch more kids who prefer apples. Or, if you want to err on the side of having fewer apple slices discarded in the trash, then you'll set the threshold higher, so fewer apple snacks are handed out. In other words, your decision depends on the tradeoffs for different kinds of errors.
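As a rough sketch of this scoring-and-thresholding idea (the specific rules of thumb and point values below are invented for illustration; they are not the Dashboard's):

# Toy scoring rules for the apple snack example; the point values are made up.
def apple_score(age, height_cm, teacher):
    """Return a 1-10 score; higher means more likely to want an apple."""
    score = 5
    score += 2 if age >= 9 else -2             # older kids tend to like apples
    score += 1 if height_cm >= 130 else -1     # taller kids tend to like apples
    score += 2 if teacher == "Applebaum" else -2
    return max(1, min(10, score))

THRESHOLD = 5  # hand out an apple snack when the score meets or exceeds this

kids = [("Ana", 10, 140, "Applebaum"), ("Ben", 6, 115, "Popcorn")]
for name, age, height_cm, teacher in kids:
    score = apple_score(age, height_cm, teacher)
    decision = "apple" if score >= THRESHOLD else "something else"
    print(name, score, decision)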
At the picnic, you record the kids' reactions. For each kid, you write down the prediction score you gave them, whether you gave them an apple based on the decision threshold, and what their reaction was. This is the payoff for your decision.
In a table, the kids who actually wanted an apple (positive outcome) are plotted in red, while the kids who did not want an apple (negative outcome) are plotted in green.
The confusion matrix counts the numbers and ratios for each quadrant of the table.
                           Actual Outcome
                           Wanted an apple         Did not want an apple
Decision: apple snack      True Positive (TP)      False Positive (FP)
Decision: other snack      False Negative (FN)     True Negative (TN)
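A minimal sketch of how those quadrant counts could be tallied from recorded scores and reactions (the data values here are placeholders):

# Tally the four quadrants from (score, actually_wanted_apple) records.
def confusion_counts(records, threshold):
    tp = fp = fn = tn = 0
    for score, wanted_apple in records:
        gave_apple = score >= threshold
        if gave_apple and wanted_apple:
            tp += 1      # gave an apple, kid wanted one: "Yay!"
        elif gave_apple and not wanted_apple:
            fp += 1      # gave an apple, kid wanted something else: "Awwww"
        elif not gave_apple and wanted_apple:
            fn += 1      # gave something else, kid wanted an apple: "Awwww"
        else:
            tn += 1      # gave something else, kid was happy with it: "Yay!"
    return tp, fp, fn, tn

records = [(9, True), (7, False), (3, True), (2, False)]
print(confusion_counts(records, threshold=5))   # (1, 1, 1, 1)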
Performance
Your performance depends on several things: how well your prediction scores separate the kids who want apples from those who don't, where you set the decision threshold, and the relative costs of the two kinds of errors.
There are various ways of assessing your performance. You can focus on how many kids you predicted correctly (TP and TN). You can focus on how many kids you got wrong (FP and FN). You can focus on what proportion of kids you thought would want an apple, and really did (precision = TP / (TP + FP)). Several other measures combine these numbers in simple formulas. The ROC (Receiver Operating Characteristic) and Precision/Recall Curves show how some of the measures trade off with one another as the decision threshold changes.
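For instance, the common measures follow directly from the four counts. The sketch below uses the standard textbook formulas; it is not necessarily the exact set of measures the Dashboard reports:

def performance_measures(tp, fp, fn, tn):
    """Standard measures derived from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy":  (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall":    tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "fpr":       fp / (fp + tn) if (fp + tn) else 0.0,  # false positive rate
    }

# Sweeping the threshold and plotting (fpr, recall) pairs traces the ROC curve;
# plotting (recall, precision) pairs traces the Precision/Recall curve.
print(performance_measures(tp=40, fp=10, fn=20, tn=30))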
Confusion Matrix Behavior
Even when we understand what a Confusion Matrix is and how performance measures are calculated, it can be tricky to anticipate how they behave under different distributions of positive and negative predictions, subject to a decision threshold.
The Confusion Matrix Dashboard offers a way for us to gain insight into how different shapes of prediction distributions and different thresholds lead to different values for critical measures of performance, accuracy, and bias.
The Dashboard allows you to explore the behavior of two different forms of idealized prediction distribution, each of which you can adjust.
The dashboard also includes some real data whose predictions are made by an existing commercial algorithm, by a simple statistical algorithm, or by a more sophisticated Machine Learning algorithm. After inspecting the real data prediction distributions, you can try to approximate them with the adjustable idealized distributions.
The dashboard also allows you to import your own data from a file.
The Data Source -> Import file data -> Help panel explains how.
Algorithmic Bias
There are numerous ways to determine whether decisions might be biased. It turns out that there is no single best way to make decisions that is simultaneously accurate, fair, and clear of every potential indicator of bias that can be derived from the Confusion Matrix. In fact, several common ways to assess bias are mathematically incompatible with one another. As a result, some measure of bias will usually find support in a Confusion Matrix.
The Confusion Matrix Dashboard proposes a different measure for bias that stands apart from the Confusion Matrix. The Positive Prediction Ratio Score (PPRS) compares predictions and outcomes between a pair of subpopulations. The tan curve is the fraction of predictions that had a positive outcome, per bin, across prediction scores. The PPRS measures how closely the positive prediction ratio curves agree. This is independent of the decision threshold, whereas values in the Confusion Matrix depend entirely on the threshold.
The relatively simple mathematical formula for PPRS that is used here is described in the accompanying article, Algorithmic Bias and the Confusion Matrix Dashboard.
The rationale for PPRS is that an algorithm is fair with respect to subpopulations when it assigns prediction scores that match in terms of their probability of actual outcomes. This is known as calibration fairness. An algorithm is biased to the degree that it assigns members of two subpopulations different prediction scores when they would in fact have the same probability of a positive outcome. A low PPRS (less than about .2) indicates an unbiased prediction; a high PPRS (greater than about .7) indicates a biased prediction. The PPRS measure is, however, sensitive to sample size, sample noise, and the number of bins.
The PPRS is agnostic with regard to the sizes of the subpopulations, their overall probability of positive or negative outcome, and the shape of their distributions of positive and negative outcomes which can arise from different characteristics of different subpopulations. The PPRS is intended only to assess whether prediction scores are consistent or not, as measured by probability of outcomes (relative size of red and green bars at a prediction score bin).
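As a rough illustration of the idea only (the actual PPRS formula is the one described in the accompanying article), a positive prediction ratio curve can be computed per bin for each subpopulation and the two curves compared, here with a simple average absolute difference:

# Illustration only: per-bin positive prediction ratios for two subpopulations,
# compared with an average absolute difference. The real PPRS formula is
# described in the accompanying article and may differ from this sketch.
def positive_prediction_ratios(pos_counts, neg_counts):
    return [p / (p + n) if (p + n) else None
            for p, n in zip(pos_counts, neg_counts)]

def curve_difference(ratios_a, ratios_b):
    diffs = [abs(a - b) for a, b in zip(ratios_a, ratios_b)
             if a is not None and b is not None]
    return sum(diffs) / len(diffs) if diffs else 0.0

girls = positive_prediction_ratios([1, 2, 5, 9], [9, 8, 5, 1])
boys  = positive_prediction_ratios([1, 3, 6, 9], [9, 7, 4, 1])
print(curve_difference(girls, boys))  # small value: the curves roughly agree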
Some pre-loaded data sets illustrate the PPRS in the Apple Snack example. Girls may have different overall preferences for an apple snack than boys; various measures of bias based on the Confusion Matrix can then be triggered even while the PPRS remains low. Conversely, one form of biased prediction systematically scores girls lower than boys who have the same apple preference. This is reflected in a shifted prediction ratio curve and a high PPRS. Another form of bias is to ignore some girls' preferences and assign scores to half of them randomly. This results in a bent prediction ratio curve, and also a high PPRS when compared to boys. Explore these data sets in the Dashboard to see how the PPRS behaves. Then, prepare your own data sets with subpopulations to test for algorithmic prediction bias using the Positive Prediction Ratio Score.
You can load your own data from a file to calculate the ROC and Precision/Recall Curve, and see how the threshold affects the confusion matrix, and the scores calculated from it.
A data file should be a .json file in the following format:
{"data-set-nickname": "<nickname>",
"data-set-display-name": "<display-name>",
"notes": "<notes>",
"data-slices": {
"all-data": {"pos-outcomes": [ int, int, int, ...],
"neg-outcomes": [ int, int, int, ...]
},
"<data-slice-name>": {"pos-outcomes": [ int, int, int, ...],
"neg-outcomes": [ int, int, int, ...]
},
...
}
}
The lists of integers, pos-outcomes and neg-outcomes, are lists of counts per prediction histogram bin.
They must all be the same length within the data file.
The number of bins can range from 2 to 80.
Give your data set a nickname of the form 'my-data-set-1'. Give your data set a display name that will appear on the Data Source menu, like "My Data Set 1".
You should have one data-slice called 'all-data' that includes all samples from your data set. Then, you can have any number of slices, or subsets of the data. The Dashboard allows you to compare pairs of data slices.
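For example, a small data file in this format could be generated with a script like the one below (the nickname, slice names, and counts are placeholders):

import json

# Placeholder counts: one histogram of positive and negative outcomes per slice.
# All lists must have the same length (between 2 and 80 bins).
data_set = {
    "data-set-nickname": "my-data-set-1",
    "data-set-display-name": "My Data Set 1",
    "notes": "Example file built by hand.",
    "data-slices": {
        "all-data": {"pos-outcomes": [2, 5, 9, 14],
                     "neg-outcomes": [12, 8, 4, 1]},
        "girls":    {"pos-outcomes": [1, 2, 5, 9],
                     "neg-outcomes": [7, 4, 2, 0]},
        "boys":     {"pos-outcomes": [1, 3, 4, 5],
                     "neg-outcomes": [5, 4, 2, 1]},
    },
}

with open("my-data-set-1.json", "w") as f:
    json.dump(data_set, f, indent=2)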