Everything Litigators Need to Know About Statistics in eDiscovery (But Were Afraid to Ask)
January 17th, 2014
OK, this post won’t cover EVERYTHING you need to know about statistics in e-discovery. But it should provide a simple overview of the key concepts a litigator should understand to effectively incorporate statistical sampling into a discovery program.
Statistical Sampling: A method of estimating a characteristic of a large population by examining only a subset of it. In the context of e-discovery, the population typically is a collection of documents that we want to know more about, but without having to look at every single document in the collection.
- The subset of documents to be examined is called the “sample.” To be statistically valid, the sample must be selected at random — meaning that every document in the collection has an equal chance of being selected in the sample.
- The required size of the sample depends primarily on the acceptable margin of error and the desired confidence level (defined below).
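To make the sample-size point concrete, here is a minimal sketch using the standard normal-approximation formula for a simple random sample of a proportion. The function name and the assumption of worst-case 50% prevalence are mine, not from the post; an expert would adjust for finite populations and other nuances.

```python
import math

def sample_size(z: float, margin_of_error: float, prevalence: float = 0.5) -> int:
    """Minimum sample size to estimate a proportion by simple random sampling.

    Uses the normal-approximation formula n = z^2 * p(1-p) / e^2.
    prevalence defaults to 0.5, the worst case (largest required sample).
    """
    n = (z ** 2) * prevalence * (1 - prevalence) / (margin_of_error ** 2)
    return math.ceil(n)

# 95% confidence (z ≈ 1.96) with a +/- 3% margin of error:
print(sample_size(1.96, 0.03))  # → 1068 documents
```

Note that the required sample size does not depend on the size of the collection: roughly 1,068 randomly selected documents suffice at 95% / +/- 3% whether the collection holds 100,000 documents or 10 million.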
Richness or Prevalence of Documents
Richness (or Prevalence): This refers to the percentage of documents in the population that have the characteristic we’re interested in estimating, typically responsiveness or relevance. For example, if our statistical sampling shows that approximately 30% of documents in a collection are relevant to the matter, we would say that the collection has a “richness” or “prevalence” of 30%.
Margin of Error and Confidence Interval
Margin of Error and Confidence Interval: The margin of error is a range above and below the measured value in which the true value likely lies.
- For example, let’s assume that we took a random sample from a document collection and determined that the proportion of relevant documents in the sample was 30%. We can extrapolate that the proportion of relevant documents in the collection (i.e., the “richness” of the collection) is approximately, but not exactly, 30%.
- The exact richness of the collection is unknown, but is likely to fall within a margin of error of +/- 3%. We express this by saying that the richness of the collection is estimated to be 30%, plus or minus 3%.
An alternative way of stating the estimate is by using a confidence interval, which is the range of values that is likely to contain the actual value.
- In our previous example, we would state that the richness is likely to fall within a range of 27% to 33%.
- Unlike the margin of error, the confidence interval does not have to be exactly symmetrical around the estimate, and can therefore be a more precise way of expressing the uncertainty of the estimate.
Confidence Level: The confidence level is the probability that the margin of error (or confidence interval) would contain the true value if the sampling process were to be repeated a large number of times.
- For example, a 95% confidence level means that if we drew many random samples and computed an interval from each, about 95% of those intervals would contain the true value.
- In the previous example, we would say “with a 95% confidence, the richness of the document collection falls between 27% and 33%.”
Generally speaking, as the confidence level goes up, either the sample size or the margin of error must become larger. A higher confidence level is generally better, but it comes at the price of a larger sample size and/or a wider confidence interval (a larger margin of error).
Conversely, the smaller the margin of error you want (the narrower the desired confidence interval), the larger the sample size must be, or the lower the confidence level will be. A smaller margin of error is generally better, because it reflects a more precise estimate. But that precision requires a lower confidence level (less certainty) and/or a larger sample size (higher cost and less efficiency).
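These tradeoffs are easy to see by tabulating required sample sizes across common confidence levels and margins of error. This is a rough sketch using the normal-approximation formula with worst-case 50% prevalence; the z-scores are the standard values for each confidence level.

```python
import math

def sample_size(z: float, margin_of_error: float, prevalence: float = 0.5) -> int:
    """Normal-approximation sample size: n = z^2 * p(1-p) / e^2, rounded up."""
    return math.ceil(z ** 2 * prevalence * (1 - prevalence) / margin_of_error ** 2)

# Standard z-scores for common confidence levels
levels = {"90%": 1.645, "95%": 1.96, "99%": 2.576}

for label, z in levels.items():
    for moe in (0.05, 0.03, 0.01):
        print(f"{label} confidence, +/-{moe:.0%} margin: {sample_size(z, moe):>6} docs")
```

Running this shows, for instance, that tightening the margin of error from +/- 3% to +/- 1% at 95% confidence pushes the sample from about 1,068 documents to about 9,604 — a roughly ninefold increase in review effort for a threefold gain in precision.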
When using statistical sampling in e-discovery, there are no bright-line rules around the minimum acceptable confidence level or margin of error. Rather, the operative standard is one of “reasonableness.” Every matter is different, and what is reasonable in one matter may not be in another. Factors affecting the reasonableness calculus include:
- The cost of greater precision in measurement as compared to the amount at stake and the importance of the matter (proportionality)
- The purpose for which the sampling is being performed (for example, sampling used for making representations to the court or the opposing party may need to be more precise than sampling used solely for internal purposes)
- The time and resources available for sampling
Recall and Precision
Recall and Precision: When statistical sampling is being conducted to test the efficacy of a document search or retrieval process – for instance, a keyword search, a human review to identify responsive documents, or a predictive coding tool – the sampling will provide measurements of recall and precision.
- Recall is the fraction of relevant documents in the collection identified by the process as relevant. In other words, recall measures the completeness of the search. For example, if a keyword search hits on 80% of the total number of relevant documents in the collection (missing 20%, which the search terms did not identify) we say that the search has 80% recall.
- Precision is the fraction of documents identified by the process as relevant that are in fact relevant. In other words, precision measures the accuracy of the search. For example, if a predictive coding tool identifies 10,000 documents as potentially relevant, but only 9,000 of the documents are actually relevant (1,000 of the documents are not relevant, and are “false positives”), we say that the tool’s results have 90% precision.
A high-quality search process should maximize both precision and recall – which often is challenging to do, as the concepts intrinsically are in tension with each other. Typically, the higher the recall the lower the precision, and vice versa. Put differently, if your search captures all of the relevant documents – 100% recall – you very likely are also capturing a bunch of non-relevant documents, and therefore the precision is low. And if your search has very high precision, bringing back virtually no false positives, you likely are leaving behind some relevant documents that the search didn’t identify.
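The two measures reduce to simple ratios over the counts that sampling produces. Here is a minimal sketch; the predictive coding numbers extend the example above, and the count of 2,250 missed documents (false negatives) is a hypothetical figure I added to make recall computable.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the documents the process retrieved that are actually relevant."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all relevant documents in the collection that the process found."""
    return true_positives / (true_positives + false_negatives)

# Predictive coding example: 10,000 retrieved, 9,000 of them actually relevant.
# Hypothetically, suppose sampling also estimates 2,250 relevant documents were missed.
print(precision(true_positives=9000, false_positives=1000))  # 0.9 → 90% precision
print(recall(true_positives=9000, false_negatives=2250))     # 0.8 → 80% recall
```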
These broad definitions are intentionally simplified to provide a general overview of statistics in e-discovery. In the interest of simplicity, I have glossed over some of the nuances that need to be addressed when actually conducting statistical sampling. Please heed the good advice: “Don’t try this at home.” As you spot opportunities to incorporate statistical sampling into your e-discovery efforts, be sure to engage an expert who understands those nuances and can help ensure that your statistical sampling is correct, effective and defensible. Next time, we’ll look at how these concepts are actually applied in some of the best use cases for rolling out statistical sampling to improve discovery.
Learn how DiscoverReady simplifies statistical sampling in e-discovery with Samplyzer™.