Sensitive Data in E-Discovery: Find It, Cull It, Protect It

In several prior posts here on the blog we’ve discussed the problem of sensitive data in e-discovery: As volumes of electronically stored information increase, more and more sensitive data finds its way into ESI collected for legal matters (especially as it creeps into unstructured data sources), and it’s becoming increasingly difficult for organizations and their counsel to effectively protect this information in discovery. In today’s post, we’ll explore some specific measures and recommended best practices for protecting sensitive data in e-discovery project workflows.

Scan for Sensitive Data During Processing

Some of the software tools used to process ESI for legal discovery include “scans” intended to find sensitive data. Other commercially marketed scans are also available to run against collections of data during processing. These scans typically target data that follows a predictable pattern and can therefore be matched with a “regular expression,” such as Social Security numbers, birthdates, credit card numbers, and driver’s license numbers. At DiscoverReady, we’ve performed extensive testing and analysis of the effectiveness of these scans. Generally speaking, we find that these scans perform poorly in finding sensitive data, with unacceptably low recall and precision. (For a primer on “recall” and “precision,” which are essential statistical concepts to understand, you can refer back to one of my earlier posts on the subject: “Everything Litigators Need to Know About Statistics in eDiscovery (But Were Afraid to Ask).”) In light of these findings, DiscoverReady has developed optimized versions of these scans that greatly improve their recall and precision over the “out of the box” versions. In our view, running these scans on every collection of data during processing, to find as much sensitive data as possible early in the workflow, is a best practice in e-discovery.
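To make the idea of a pattern-based scan concrete, here is a minimal sketch in Python. It is an illustration only, not the scan built into any particular processing tool and not DiscoverReady's optimized version. The two patterns and the function names (`scan_text`, `luhn_valid`) are assumptions made for this example. It shows one common way to improve precision: after the pattern matches something that looks like a credit card number, the standard Luhn checksum is applied to discard digit strings that cannot be valid card numbers.

```python
import re

# Illustrative patterns only (assumptions for this sketch). Real collections
# contain many more formats, such as SSNs written without dashes.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (data_type, matched_text) pairs found in one document's text."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        # The checksum filters out order numbers, phone lists, and similar noise.
        if luhn_valid(digits):
            hits.append(("credit_card", m.group()))
    return hits


sample = "SSN 123-45-6789, card 4111 1111 1111 1111, order 1234 5678 9012 3456"
print(scan_text(sample))
```

In this sample the sixteen-digit order number matches the card pattern but fails the checksum, so it is not reported. That kind of secondary validation is one way a scan can cut down on false hits, though it does nothing for recall: a number formatted in a way the pattern does not anticipate is still missed.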

But what if you’re not working with DiscoverReady and don’t have access to the optimized scans—should you run the “out of the box” scans? Is a “mediocre” scan for sensitive data better than none at all? Maybe not. Relying on these scans without doing some investigation of their recall could be dangerous—you may end up with a false sense of security that the scan is finding the sensitive data in your collection when in fact the scan is missing significant amounts of that information. Likewise, the poor precision of these tools could waste a lot of time and money chasing down false “hits” for sensitive data that do not help mitigate risk. So proceed with caution, and consider performing your own improvements on these tools.
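One way to investigate a scan before relying on it is to have reviewers hand-label a sample of documents and compare the scan's hits against those labels. The sketch below assumes two sets of document IDs, one flagged by the scan and one confirmed by reviewers to contain sensitive data. The function name and the IDs are hypothetical.

```python
def recall_precision(flagged: set[str], truly_sensitive: set[str]) -> tuple[float, float]:
    """Compare scan hits with reviewer labels on the same sample of documents."""
    true_positives = len(flagged & truly_sensitive)
    # Recall: what share of the truly sensitive documents did the scan find?
    recall = true_positives / len(truly_sensitive) if truly_sensitive else 0.0
    # Precision: what share of the scan's hits were actually sensitive?
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Hypothetical sample: the scan flagged four documents, while reviewers
# found five documents that truly contain sensitive data.
flagged = {"DOC-001", "DOC-002", "DOC-003", "DOC-004"}
truly_sensitive = {"DOC-002", "DOC-004", "DOC-005", "DOC-006", "DOC-007"}

recall, precision = recall_precision(flagged, truly_sensitive)
print(f"recall={recall:.0%} precision={precision:.0%}")  # recall=40% precision=50%
```

In this made-up sample the scan found only two of the five sensitive documents, and half of its hits were false. The sample has to be drawn from the whole collection, not just from the documents the scan flagged, or the recall figure will be meaningless.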

Analyze and Cull Sensitive Data

After processing, a myriad of available e-discovery analytics tools provide the next opportunity to find and protect sensitive data. For example, predictive coding, concept clustering, and other forms of TAR can be used to identify some types of sensitive data, and to group and organize documents containing sensitive data for more efficient handling. Once you’ve identified sensitive data in a document, “find more like” tools can locate similar documents in the collection. E-mail threading can help logically group communications containing sensitive information. And even ol’ fashioned keyword searches—properly tested and validated, of course—can be effective at finding sensitive data existing in the collection.

Once you’ve found and organized documents containing sensitive data, those that are irrelevant to the matter can be culled out—minimizing the volume of sensitive data moving on to the next phase of the workflow. And if you’ve fully utilized the available technology, that culling can be done quickly and efficiently across large groups of documents.

Conduct a Careful Review for Sensitive Data

You also can use technology and process to protect sensitive data after defining the set of data moving to the next phase of the matter.

First, take advantage of whatever highlighting or flagging capabilities exist in the hosting platform to call out sensitive data for anyone accessing the documents. Being able to quickly spot instances of sensitive data goes a long way toward ensuring those documents are appropriately handled.

Next, whoever is conducting a review or otherwise working in the documents—whether counsel handling the matter or a specialized document review team—must be fully educated on the types of information considered “sensitive” and how to handle that information when it’s encountered. Both substantive guidelines and a process workflow should be defined—and documented in writing—for handling sensitive data. How should sensitive data be coded? What data should be withheld? What data should be redacted? If redactions will be made, who is responsible for the redactions, and what is that workflow? What information should be produced but marked with a confidentiality or other protective designation? What are the operative provisions of the applicable protective orders and/or confidentiality agreements, and how should those provisions be reflected in the workflow? All these questions must be answered at the outset of discovery, before any review or other work with the documents takes place.

And don’t forget: Anyone reviewing documents should receive training on and demonstrate competence with the review platform. While we might assume that most lawyers practicing today are experienced in using common hosting and review platforms, that assumption is incorrect. Err on the side of over-training.

What about matters where counsel or their client do not want to conduct a traditional “review” of the document collection before making a production, but instead want to run screens to filter out privileged content and then produce whatever remains? Although strong protective orders and FRE 502(d) orders can allow for that approach while still protecting privilege, such arrangements may not appropriately protect other types of sensitive data, especially personally identifying information and personal health information. For a “screen then produce” approach to fully protect personal sensitive data, a robust, validated screen for sensitive data must be used to identify any protected content before production. And for certain types of sensitive data not amenable to an automated screen, some amount of review must be performed to prevent production of that information.

Whatever type of review takes place need not be an inefficient, expensive, manual linear review. Technology can dramatically increase overall review efficiency. For instance, coding decisions for sensitive data can be automatically propagated across duplicate documents. Redactions of sensitive data can be automated. Even in native documents like spreadsheets, redactions can be performed quickly using specialized workflows and tools.
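As a rough illustration of propagating coding across duplicates, the sketch below groups documents by a SHA-256 hash of their content and copies a reviewer's decision to every exact duplicate. Review platforms typically handle this with their own duplicate identification, so the function name, the data structures, and the choice of hash here are assumptions for the example.

```python
import hashlib
from collections import defaultdict


def propagate_coding(docs: dict[str, bytes], coded: dict[str, str]) -> dict[str, str]:
    """Copy each coding decision to all exact duplicates of the coded document.

    docs  maps document ID -> raw content (hypothetical structure)
    coded maps document ID -> reviewer decision, e.g. "redact" or "withhold"
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, content in docs.items():
        groups[hashlib.sha256(content).hexdigest()].append(doc_id)

    result = dict(coded)
    for members in groups.values():
        decisions = {coded[d] for d in members if d in coded}
        # Propagate only when reviewers agree; conflicting decisions are
        # left untouched so a person can resolve them.
        if len(decisions) == 1:
            decision = decisions.pop()
            for d in members:
                result[d] = decision
    return result


docs = {"A": b"payroll export", "B": b"payroll export", "C": b"lunch menu"}
print(propagate_coding(docs, {"A": "redact"}))  # {'A': 'redact', 'B': 'redact'}
```

Leaving conflicting duplicate groups alone is a deliberate choice in this sketch: a conflict usually signals a reviewer error that should be looked at rather than silently overwritten.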

Check for Quality, Consistency, and Gaps in Treatment of Sensitive Data

Once documents are potentially ready for production, use the technology features of the hosting platform to conduct checks aimed at finding and resolving any mistakes or inconsistencies in how sensitive data is handled. An experienced project manager can suggest appropriate checks based on the workflow used. For example, run searches to confirm that every document slated for production has an affirmative coding decision, that all duplicate documents are coded for sensitive data identically, and that all documents noted for redaction of sensitive data do in fact have redactions applied. Consider whether to segregate documents that contain non-searchable content, to confirm that no sensitive data exists in these documents that couldn’t be found with data analytics.
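The three example checks above can be expressed as simple queries over an export of the coding data. The sketch below is one hypothetical way to do that. The `Doc` fields and coding values are assumptions for illustration and would need to be mapped to whatever fields your hosting platform actually exposes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    dup_group: str                   # hypothetical duplicate-group identifier
    sensitive_coding: Optional[str]  # e.g. "none", "redact"; None means uncoded
    redaction_count: int             # number of redactions applied to the document


def qc_report(docs: list[Doc]) -> dict[str, list[str]]:
    """Return the document IDs that fail each pre-production check."""
    codings_by_group: dict[str, set] = defaultdict(set)
    for d in docs:
        codings_by_group[d.dup_group].add(d.sensitive_coding)

    return {
        # Check 1: every document has an affirmative coding decision.
        "uncoded": [d.doc_id for d in docs if d.sensitive_coding is None],
        # Check 2: all members of a duplicate group are coded identically.
        "inconsistent_duplicates": [
            d.doc_id for d in docs if len(codings_by_group[d.dup_group]) > 1
        ],
        # Check 3: documents noted for redaction actually have redactions.
        "missing_redactions": [
            d.doc_id
            for d in docs
            if d.sensitive_coding == "redact" and d.redaction_count == 0
        ],
    }


production = [
    Doc("A", "g1", "redact", 2),
    Doc("B", "g1", "none", 0),   # coded differently from its duplicate A
    Doc("C", "g2", None, 0),     # never coded
    Doc("D", "g3", "redact", 0), # flagged for redaction, none applied
]
for check, failures in qc_report(production).items():
    print(check, failures)
```

Any non-empty list is a reason to hold the production until someone has looked at those documents.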

Finally, before any production goes “out the door,” consider running one last scan/search for sensitive data elements, to confirm that any sensitive content in the production correctly belongs there, and if so, that the appropriate protections are in place for those documents. That final check can also be used for an additional measurement and validation of the recall and precision of your sensitive data scans, to provide further proof of the defensibility and reasonableness of your efforts.

Use Knowledge Gained in Discovery for Better Information Governance

Organizations frequently express surprise when the legal discovery process finds sensitive data turning up in places where it should never exist. We refer to this situation as “data exhaust” or “data destructuring”: the movement of sensitive data from protected systems of record (such as structured databases, vaults, and repositories) into unprotected systems that should not contain the information (such as file shares, network drives, e-mail, and personal computers/devices).

By identifying sensitive data exhaust or destructuring when it turns up in discovery collections, and working “upstream” in the organization to determine the source of the exhaust or the cause of a destructuring, litigation counsel can play a valuable role in improving information governance practices across the organization. Working with other stakeholders such as compliance, human resources, information management, and IT, the company can identify root causes of the information governance failures, prevent future occurrences, and mitigate the risk of compromises of sensitive data.

At DiscoverReady, we’ve been working on the sensitive data problem for many years, and we’re proud of our industry-leading solutions for protecting our clients’ most confidential information. Please contact us if you’d like to speak with us about our approach to sensitive data in e-discovery.

Maureen O'Neill