As we work with clients on data analytics projects—whether for legal discovery, compliance programs, or other information governance initiatives—we routinely use technology-assisted review and machine learning tools. These analytics capabilities help us sift through huge volumes of data and better understand the content and context of these large collections. Whatever the objective in using these tools, the machine learning systems are always trained by the decisions of one or more human reviewers. And because human inputs are critical in the use of these systems, we go to great lengths to ensure the quality of these inputs—we carefully train the reviewers, we conduct QC and QA checks of their work, and we measure their error rates. We also know that, despite these efforts, human decision-making will never be perfect. So, we accept a reasonable error rate, and we incorporate processes in our workflows designed to minimize the influence of those errors on the machine learning.
But what if the human decisions we’re using to train the analytics algorithms are infected with a systemic bias of some kind? And what if that bias is not detected by the testing and measurements we typically use, which focus on machine error and sampling error? I’ve often pondered these questions, and I recently came across an article exploring one particular type of bias that offers some insight into how we might adjust our thinking about training and testing our machine learning systems. In “Prevalence-Induced Concept Change in Human Judgment,” David E. Levari and his co-authors describe a series of fascinating experiments that show just how biased simple human decision-making can be.
Prevalence-Induced Concept Change
Imagine showing a person a series of dots that range from very purple to very blue and asking that person to identify whether each dot is blue or not blue. Levari’s team constructed two series of dots: in one series, the prevalence of blue dots did not change over the course of the review; in the other, the prevalence of blue dots decreased over time. They then asked people to review the sets in sequence and identify which dots were blue and which were not.
The logical expectation is that people will make consistent and correct decisions about a dot’s color, regardless of the prevalence of that color in the series (i.e., a blue dot is blue, regardless of how often you see it). In other words, changes in the frequency with which blue dots are presented should not matter to a correct identification of a dot’s color. But Levari’s team showed that when blue dots became rare, reviewers began to see purple dots as blue. Levari described this as “prevalence-induced concept change.” The reviewers were biased toward a preconceived notion of the richness of the data set, based on the richness of the first set they encountered, and unconsciously changed their definition of “blue” to fit that bias. Even more surprising, Levari’s experiments also showed that “[t]his prevalence-induced concept change occurred even when participants were forewarned about it and even when they were instructed and paid to resist it.”
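The drift that Levari’s team observed can be illustrated with a toy simulation. This is my own construction, not a model from the paper: it assumes a reviewer whose threshold for calling a dot “blue” adapts to the average hue of the dots recently seen. When blue dots become rare, that threshold falls, and borderline purple dots start to be labeled blue.

```python
import random

def simulate_reviewer(hues, window=50, base_threshold=0.5):
    """Label each hue (0.0 = deep purple, 1.0 = deep blue) as 'blue' using
    a threshold that drifts toward the mean of recently seen hues.
    A toy model of prevalence-induced concept change, not the decision
    model used in Levari's experiments."""
    recent, labels = [], []
    for hue in hues:
        threshold = sum(recent) / len(recent) if recent else base_threshold
        labels.append(hue > threshold)
        recent.append(hue)
        if len(recent) > window:
            recent.pop(0)
    return labels

random.seed(0)
# Phase 1: blue dots are common (hues skew high); Phase 2: blue dots are rare.
phase1 = [random.uniform(0.3, 1.0) for _ in range(500)]
phase2 = [random.uniform(0.0, 0.6) for _ in range(500)]
hues = phase1 + phase2
labels = simulate_reviewer(hues)

def share_called_blue(start, stop, lo=0.4, hi=0.5):
    """Fraction of borderline hues (between lo and hi) labeled blue."""
    picks = [labels[i] for i in range(start, stop) if lo <= hues[i] <= hi]
    return sum(picks) / len(picks)

early = share_called_blue(0, 500)    # borderline dots amid many true blues
late = share_called_blue(500, 1000)  # identical borderline hues, blues now rare
```

In this sketch, `late` comes out much larger than `early`: the same borderline hues are judged “blue” far more often once truly blue dots have become scarce, mirroring the concept creep the experiments measured.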
Levari’s team also ran more subjective experiments, in which people were asked to judge whether computer-generated faces were or were not threatening, or whether proposed scientific studies were or were not ethical. Each of those experiments was designed in a fashion similar to the blue-dot experiment, with the prevalence of threatening faces or unethical studies changing over time. In each case, the team again demonstrated prevalence-induced concept change, showing that the phenomenon is persistent and applies across diverse situations.
Implications for Analytics in the Legal Industry
These excellent, broad-ranging experiments have obvious implications for our industry, especially in the use of machine learning to help identify documents for potential production in legal discovery. Our “responsive vs. non-responsive” decisions are very similar to the blue dot, threatening face, or ethical study decisions: in some instances it’s obvious whether a document is responsive (i.e., the dot is blue), but in other instances the assessment involves some subjectivity (i.e., is the face threatening?). We also change the prevalence of responsiveness over the course of a case. We purposefully give our review teams sets of documents with high richness (such as when we prioritize responsive documents for review) and sets with very low richness (such as elusion samples). Often the same review team looks at each of these sets, yet we may not take any steps to correct for the subtle bias a reviewer develops from expecting a certain richness based on prior experience.
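One modest workflow safeguard, sketched below under my own assumptions (the function names and the 50% cutoff are illustrative choices, not an industry standard), is simply to measure the richness of each batch a team reviews and flag sharp drops in prevalence between consecutive batches, since those are exactly the conditions under which Levari-style concept drift is most likely:

```python
def batch_richness(decisions):
    """Fraction of documents coded responsive (1) in one review batch."""
    return sum(decisions) / len(decisions)

def prevalence_shifts(batches, max_drop=0.5):
    """Flag each transition between consecutive batches where richness
    falls by more than `max_drop` (as a share of the prior batch's
    richness). The 50% default is an illustrative cutoff only."""
    flags = []
    for prev, cur in zip(batches, batches[1:]):
        r_prev, r_cur = batch_richness(prev), batch_richness(cur)
        flags.append(r_prev > 0 and (r_prev - r_cur) / r_prev > max_drop)
    return flags

# Example: richness falls from 0.75 to 0.25 to 0.0 across three batches,
# so both transitions are flagged as potential drift conditions.
flags = prevalence_shifts([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0]])
```

A review manager could use such a flag as a prompt to pause and re-brief the team before it begins a low-richness set such as an elusion sample.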
The bias that Levari describes has real effects in our workflows. How should we correct this bias? Should we inform reviewers about the expected richness before they begin making decisions? Would doing that reduce the effectiveness of quality control or elusion testing? Can we know that our estimates themselves are unbiased? Do we add a correction factor in our measurements? Can you fix a bias by introducing another bias?
I can’t answer all of those questions, but I do know there are ways to help control for this bias. Even though Levari’s experiments showed that the bias still occurred when participants were forewarned about it, the warning did diminish its effect. So, at the very least, we should be training our review teams and leadership on this bias and providing them with the knowledge needed to self-correct. Awareness, even if imperfect, can be a check on subtle human bias.