Turning Known Unknowns into Known Knowns: Statistical Sampling in Search Term Workflows


Here at DiscoverReady, we believe in the power of incorporating TAR and other analytics into discovery workflows—these tools can reduce volumes of electronically stored information, effectively identify relevant and important documents, and improve the efficiency and reduce the costs of discovery efforts. But these techniques aren’t right for every case. Sometimes, traditional methods such as keyword searches can be just as effective—provided these methods are subject to appropriate testing and validation.

That perspective garnered further support recently in City of Rockford, et al. v. Mallinckrodt ARD Inc., No. 3:17-cv-50107, 2018 WL 3766673 (N.D. Ill. Aug. 7, 2018), when Magistrate Judge Iain Johnston entered an order establishing a protocol in the case for the discovery of electronically stored information (ESI). Judge Johnston’s well-written order offers a thorough explanation of the principles of defensible, proportionate search and production of ESI. He includes a host of citations to important cases and useful articles and commentaries. And he manages to inject some humor into an admittedly dry subject.

Judge Johnston starts out with a famous quote from Donald Rumsfeld—

“[A]s we know there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. . .”

How does this quote relate to search and production of ESI? What is a “known unknown” in litigation discovery? Too often, the efficacy of search terms in finding responsive documents is unknown—and litigators know it’s unknown. But they would “rather not know” if their search process is leaving behind significant numbers of responsive documents that will not be produced. They fear that knowing this unknown might lead to additional costs, or uncover something adverse to their client’s position.

But in City of Rockford, the plaintiffs proposed an ESI protocol that enabled the litigants to know this unknown. Plaintiffs’ requested approach required defendants to measure the adequacy of the parties’ agreed-upon search terms by running an “elusion” test. Elusion is the proportion of the documents the search fails to retrieve that are in fact responsive. To estimate it, defendants would draw a statistically valid random sample from the “null set”—the set of documents not returned by the searches—and determine how many documents in the sample are responsive. Plaintiffs proposed that the size of the sample be calculated using a confidence level of 95% and a margin of error of 2%. Following the elusion test, and depending on the results, the parties would meet and confer to assess whether additional or different search terms should be run to improve the performance of the search and find more responsive documents.
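To put rough numbers on that proposal: at a 95% confidence level and a 2% margin of error, the required sample works out to a few thousand documents. The Python sketch below is our own illustration, not part of the court’s protocol; the worst-case responsiveness proportion of 0.5, the finite population correction, the null-set size of 150,000 documents, and the count of 30 responsive documents found in the sample are assumptions chosen only to make the arithmetic concrete.

```python
import math
from typing import Optional, Tuple

def sample_size(z: float = 1.96, margin: float = 0.02,
                population: Optional[int] = None, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion to within +/- `margin`
    at the confidence level implied by `z` (1.96 for 95%). Uses the
    worst-case proportion p = 0.5 and, when a population size is given,
    the finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

def estimate_elusion(responsive_in_sample: int, sample: int,
                     null_set_size: int) -> Tuple[float, float]:
    """Point estimate of the elusion rate and the implied number of
    responsive documents left behind in the null set."""
    rate = responsive_in_sample / sample
    return rate, rate * null_set_size

# Hypothetical null set of 150,000 documents not hit by any search term.
n = sample_size(population=150_000)   # 2,401 without the correction, 2,364 with it
rate, missed = estimate_elusion(responsive_in_sample=30, sample=n,
                                null_set_size=150_000)
print(f"review {n} documents; elusion ≈ {rate:.1%}, "
      f"≈ {missed:,.0f} responsive documents in the null set")
```

The roughly 2,400-document sample this math produces lines up with the “few thousand documents” the court later describes as a proportionate review burden.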

Defendants resisted the requirement of an elusion test, arguing that it would be unduly expensive and burdensome. Defendants took the position that unless and until plaintiffs offered some evidence that certain categories of documents had not been produced, no further steps in the search process should be required once the parties had agreed on search terms. (Judge Johnston commended the parties on crafting a collaborative protocol for arriving at a set of search terms, and noted that they were able to come to agreement on all aspects of the protocol except for this one.)

Before setting out his analysis, Judge Johnston cautioned the reader: “Don’t freak out” about the statistical sampling and measurement proposed by plaintiffs. Discovery of ESI—along with statistics to test and validate search methodologies—is not something to be scared of. Indeed, competent representation requires that attorneys become familiar and comfortable with these concepts. (We agree, Judge Johnston. Here on the blog we’ve made the case for lawyers learning statistics, and provided a foundational primer.)

Ultimately the court agreed with plaintiffs’ proposal. First, Judge Johnston found that sampling the null set would be a reasonable means of assuring that the production is complete, thereby supporting the certification required by Federal Rule of Civil Procedure 26(g). He also found that a null set sample would be proportionate under Rule 26(b)(1), rejecting defendants’ assertion—which was not supported by any specific evidence—that the sampling would be unreasonably expensive and burdensome. In this case, plaintiffs bring racketeering and antitrust claims regarding prescription drug pricing; the matter involves significant issues, an “extraordinary” amount in controversy, and defendants with substantial resources and access to the vast majority of the relevant information. In those circumstances, it is not unreasonable to require defendants to review a sample of a few thousand documents. Any burden or expense of the sampling does not outweigh the benefit of “ensuring proper and reasonable—not perfect—document disclosure.”

From our perspective, Judge Johnston’s ruling is a welcome, helpful explanation of how parties can use a reasonable, defensible protocol to validate a search methodology. Whether the search relies on keyword terms, TAR, or other analytics, we advocate the use of statistical sampling and measurement to test and validate the results. Indeed, we would have suggested some additional sampling in the City of Rockford protocol. For instance, it would have been useful to sample the entire corpus of documents at the outset to measure the “richness” of the full population. That measurement would enable the parties to estimate the percentage of responsive documents in the collection, and gauge how many responsive documents the defendants should expect to find with their searches. Also, as the parties engaged in the collaborative, iterative process of agreeing on search terms, statistical sampling and measurement could objectively assess which terms were high-performing (those that effectively found responsive documents without pulling in too much “junk”), and which terms were poor-performing (those that brought back too many false positives, or missed responsive documents).
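For readers curious what those additional measurements might look like, here is a simplified Python sketch. The documents, labels, and search terms are hypothetical, and real workflows would run the terms in a review platform rather than with naive substring matching; the point is only to show how a labeled random sample can yield a richness estimate and rough per-term precision figures.

```python
def estimate_richness(labeled_sample):
    """Richness: the share of responsive documents in a random sample of the
    full collection. `labeled_sample` is a list of (text, is_responsive) pairs."""
    return sum(1 for _, responsive in labeled_sample if responsive) / len(labeled_sample)

def score_terms(labeled_sample, terms):
    """For each term, count its hits in the sample, how many of those hits are
    responsive, and the implied precision (responsive hits / all hits)."""
    results = {}
    for term in terms:
        hits = [resp for text, resp in labeled_sample if term.lower() in text.lower()]
        responsive = sum(hits)
        results[term] = {
            "hits": len(hits),
            "responsive": responsive,
            "precision": responsive / len(hits) if hits else 0.0,
        }
    return results

# Hypothetical labeled sample drawn at random from the full corpus.
sample = [
    ("pricing strategy for the specialty drug", True),
    ("lunch menu for the cafeteria", False),
    ("rebate agreement with specialty pharmacy", True),
    ("pricing of office supplies", False),
]
print(f"estimated richness: {estimate_richness(sample):.0%}")
for term, stats in score_terms(sample, ["pricing", "rebate"]).items():
    print(term, stats)
```

In a measurement like this, a term with many hits but low precision is a candidate for narrowing, while a responsive document in the sample that no term retrieves signals a gap in the term list.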

If you’d like to learn more about DiscoverReady’s approach to statistical sampling in support of defensible, proportional discovery efforts, please reach out to us at better@discoverready.com.

Author Details
Senior Vice President, Discovery Strategy & Data Privacy/Security
A recognized thought leader in e-discovery, Maureen collaborates with the company’s clients and operations teams to develop innovative information strategies for legal discovery, compliance, and sensitive data protection. She speaks and writes frequently on significant issues in e-discovery and information governance, and participates actively in the Sedona Conference Working Groups on Electronic Document Retention and Production and Data Privacy and Security. Prior to DiscoverReady, Maureen was a partner at Paul Hastings LLP, where she represented Fortune 100 companies in complex employment litigation matters.