Disclaimer: To protect confidentiality, some facts and circumstances in this case study have been adjusted while maintaining the accuracy of the workflow and results.
It’s becoming harder to argue that generative AI will not play a transformative role in the future of document review. The technology can apply complex review protocols with many distinct determinations across an ocean of data in hours. It can review foreign languages without translation, review images without OCR, apply contextual knowledge of external subjects and events, and provide rationales and citations for each determination. And it’s the worst that it will ever be.
Yet powerful technology requires equally thoughtful implementation. The difference between potential and performance lies in the details: how you scope the review population, design sampling strategies, refine your prompts, structure validation, and manage disclosure. Success depends as much on these reliable processes as it does on the technology itself.
This case study examines what I would call a “traditional” first-pass review for responsiveness using Relativity aiR for Review. What follows is a high-level account of the workflow, focusing on key decisions, controls, and considerations throughout the process. Note that this is descriptive, not prescriptive, and I urge readers to ask questions and come to their own conclusions.
Case Background
We were hired by outside counsel for a corporation to manage their response to a set of production requests in a commercial contract lawsuit. The collection size was over 820,000 documents. My team worked closely with two partners and a senior attorney familiar with the case.
| Case Type | Commercial Contract Dispute |
| --- | --- |
| Client Type | Law Firm |
| Review Type | First-Pass Review for Production |
| Collection Size | 821,523 documents |
Step 1: Defining the Review Scope [2 hours]
The first step in any review, be it AI or eyes-on, is to define the review population. In this case, we narrowed an initially broad review population by running Boolean searches and applying other restrictions. Once we finalized the review set, we screened the population for documents ineligible for aiR due to file size and other exclusions, routing those files to a separate eyes-on queue in Review Center. The final eligible review set was 197,010 documents, with 4,932 documents in the “set aside” pile requiring other forms of inspection.
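For illustration, here is a minimal sketch of the kind of eligibility screen described above. The field names and size limit are hypothetical; in practice, the scoping searches and eligibility rules were applied directly in Relativity rather than in a script.

```python
# Minimal, illustrative sketch of an eligibility screen (not Relativity code).
# Field names and the size ceiling are hypothetical examples.

from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    extracted_text_size_kb: int
    has_extracted_text: bool
    hit_scoping_search: bool  # result of the Boolean scoping searches

MAX_TEXT_SIZE_KB = 750  # hypothetical size ceiling for AI review eligibility

def split_population(docs: list[Doc]) -> tuple[list[Doc], list[Doc]]:
    """Split the scoped review set into an AI-eligible queue and a
    'set aside' queue destined for eyes-on review."""
    eligible, set_aside = [], []
    for d in docs:
        if not d.hit_scoping_search:
            continue  # outside the review population entirely
        if d.has_extracted_text and d.extracted_text_size_kb <= MAX_TEXT_SIZE_KB:
            eligible.append(d)
        else:
            set_aside.append(d)
    return eligible, set_aside
```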
Consideration: There is no “too small” for aiR. While more documents yield a better ROI, there is no minimum document count from a technical standpoint. Setting aside a few mega-matters, our average project size across 22 matters is about 75,000 documents, and plenty have been under 10,000, with the smallest around 1,000. That said, for matters under 1,000 documents, the review time savings may not be worth the “iteration and validation tax.”
Step 2: Drafting the Initial Prompt [1.5 hours]
After the review scope had been defined, we met with the client to discuss the legal objectives and develop a technical approach. One of the key decisions we needed to make at this stage was whether to use aiR for Review’s “issues” or “relevance” analysis. I’ve described how I think about the differences in the chart below.
| | Relevance Analysis | Issues Analysis |
| --- | --- | --- |
| Definition | A single, free-form relevance prompt applied to each document, which outputs a 0-4 relevance ranking per document. | Classifies documents according to multiple issues, topics, or RFPs. Output is a 0-4 ranking for each distinct topic or issue. |
| Benefit | Working with a single prompt and ranking per document (e.g., for validation) offers simplicity and efficiency. | Better visibility into how each issue/topic is performing during iteration. Users can drill in on more specific results (e.g., “very relevant to Issue 9”). |
| Use for… | Straightforward review projects where ease of use is more important than fine-tuning. | Reviews involving multiple distinct criteria for relevance or responsiveness, where you’d like granular control over the process. |
Relevance Analysis and Issues Analysis in aiR for Review
In this case, the document request included 46 RFPs spanning 10 generalized topics, and any document meeting the conditions of one or more RFPs was to be considered responsive. Since the matter involved multiple distinct criteria for responsiveness, we chose issues analysis, working closely with the client to sort the 46 RFPs into 10 prompts. This was the preferred approach for a couple of reasons:
- Using issues analysis to craft an individual prompt per RFP or topic provides greater visibility into how well each topic-specific prompt performs. Similar to a search term report, we can easily spot that Issue 6 isn’t returning results, while Issue 9 is clearly too broad, so we can make modifications accordingly.
- When the review is complete, each document is given a rank from 0-4 for each issue, rather than one collective “relevance” score. This has two key benefits:
- We can easily drill in on specific issues (e.g. “Very responsive to prior knowledge”).
- We can set a distinct rank cutoff per issue, rather than a single generalized score. For example, if Issues 1-6 are cut and dried, we may define “responsive” as a score of three or higher. If Issues 8-10 are more nuanced, we may set the threshold at two or higher and include “borderline” documents (a minimal sketch of this cutoff logic follows below).
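To make the per-issue cutoff logic concrete, here is a minimal sketch. The issue numbers and thresholds are hypothetical examples; in practice, the cutoffs were applied to aiR’s rank fields in Relativity.

```python
# Illustrative sketch of per-issue rank cutoffs (not Relativity code).
# A document is responsive if any issue meets or exceeds that issue's threshold.
# Issue numbers and thresholds here are hypothetical examples.

DEFAULT_CUTOFF = 3
PER_ISSUE_CUTOFFS = {8: 2, 9: 2, 10: 2}  # nuanced issues include "borderline" (rank 2)

def responsive_issues(issue_ranks: dict[int, int]) -> list[int]:
    """Return the issues for which a document meets the rank cutoff.
    `issue_ranks` maps issue number -> 0-4 rank assigned by the AI."""
    hits = []
    for issue, rank in issue_ranks.items():
        cutoff = PER_ISSUE_CUTOFFS.get(issue, DEFAULT_CUTOFF)
        if rank >= cutoff:
            hits.append(issue)
    return hits

def is_responsive(issue_ranks: dict[int, int]) -> bool:
    return bool(responsive_issues(issue_ranks))

# Example: borderline (rank 2) on Issue 9 alone is enough to be responsive here.
print(is_responsive({1: 1, 9: 2}))  # True
print(is_responsive({1: 2, 2: 2}))  # False (Issues 1-2 require rank >= 3)
```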
Step 3: Prompt Iteration [11.5 hours]
If you’ve ever managed a document review project, you know that, over the first 2-4 weeks, the review protocol is revised on a near-daily basis. As documents are reviewed, new facts and issues come to light, and review instructions need to be clarified or modified. Developing a good attorney review protocol means adapting to what you’ve learned, anticipating edge cases, and leaving as little “gray area” as possible.
Prompt iteration is a systematic way of reaching that same goal as efficiently as possible. By cycling prompts through samples prior to running a review, we’re able to clarify our instructions and improve the performance of our prompts.
A high-level prompt iteration workflow for document review
In this case, we spent about 90 minutes drafting the initial set of prompts and then were ready to draw our first sample for iteration.
We used multiple sampling techniques, including:
- Random sampling: A simple random sample of documents from across the review set that serves as the starting point. This can also be used to estimate richness.
- Diversity sampling: A sample of documents drawn from each cluster in a cluster set, to surface edge cases and outliers that may be missed by a simple random sample.
- Threshold sampling: Sampling gray-area documents to expose issues that need clarification.
- Keyword-stratified sampling: A random sample of documents hitting on each term in a set of search terms, to ensure the sample includes documents relevant to each topic.
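As a rough illustration of how these techniques can be layered into a single iteration sample, here is a minimal sketch. The field names (cluster_id, keyword_hits) and sample sizes are hypothetical; in practice, we built these samples with Relativity’s sampling, clustering, and search term report features, and threshold sampling was added after an initial scoring pass produced gray-area ranks to over-sample.

```python
# Illustrative sketch: combine random, diversity, and keyword-stratified
# sampling into one deduplicated iteration sample. Field names and sample
# sizes are hypothetical.

import random
from collections import defaultdict

def build_iteration_sample(docs, n_random=75, per_cluster=2, per_term=1, seed=42):
    """docs: list of dicts like
       {"doc_id": ..., "cluster_id": ..., "keyword_hits": ["term a", ...]}"""
    rng = random.Random(seed)
    sample = {}  # keyed by doc_id to avoid duplicates across techniques

    # 1. Simple random sample across the review set (also usable to estimate richness).
    for d in rng.sample(docs, min(n_random, len(docs))):
        sample[d["doc_id"]] = d

    # 2. Diversity sample: a few documents from every cluster to surface outliers.
    clusters = defaultdict(list)
    for d in docs:
        clusters[d["cluster_id"]].append(d)
    for members in clusters.values():
        for d in rng.sample(members, min(per_cluster, len(members))):
            sample[d["doc_id"]] = d

    # 3. Keyword-stratified sample: at least one random document per search term.
    by_term = defaultdict(list)
    for d in docs:
        for term in d["keyword_hits"]:
            by_term[term].append(d)
    for members in by_term.values():
        for d in rng.sample(members, min(per_term, len(members))):
            sample[d["doc_id"]] = d

    return list(sample.values())
```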
Consideration: Thoughtful sampling is, in my opinion, the most important aspect of prompt iteration. Strictly random samples may either overlook critical edge cases or require unnecessarily large sample volumes to ensure coverage across all topics. This creates delays and introduces risk that the final prompt does not generalize well across the document set. A well-designed sampling strategy saves time later by surfacing ambiguity early and improving prompt performance across diverse facets of the review population.
A good sampling strategy for prompt iteration should optimize for:
- Representation from diverse segments of the population
- Edge cases and low-richness issues/topics
- Disambiguation of gray-area issues (e.g., by over-sampling near relevance thresholds)
- Minimal waste (without cutting corners)
After defining our sample sets, we began testing and iterating the prompt(s). The iteration process is outlined below.
- Run prompt against the iteration sample set(s).
- Have an attorney review the results, noting disagreements and providing clarifying instructions using a “prompt iteration notes” field on the coding pane.
- Update the prompt based on the attorney’s feedback and clarifications.
- Run the revised prompt against the sample set(s).
- Re-evaluate disagreements and repeat the process as needed.
Generally, during iteration, results improve with each cycle until we reach a point of diminishing returns.
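One way to see when returns are diminishing is to track agreement between the AI and attorney calls after each cycle, overall and per issue. The sketch below is illustrative; the data structure is hypothetical, and in Relativity this kind of comparison is typically done with saved searches or pivots on the coding and aiR result fields.

```python
# Illustrative sketch: per-cycle agreement report comparing AI calls with
# attorney calls on the iteration sample. Data structure is hypothetical.

from collections import defaultdict

def agreement_report(sample):
    """sample: list of dicts like
       {"doc_id": ..., "issue": 3, "attorney_call": True, "ai_call": False}"""
    per_issue = defaultdict(lambda: {"agree": 0, "total": 0})
    for row in sample:
        bucket = per_issue[row["issue"]]
        bucket["total"] += 1
        bucket["agree"] += int(row["attorney_call"] == row["ai_call"])

    overall_agree = sum(b["agree"] for b in per_issue.values())
    overall_total = sum(b["total"] for b in per_issue.values())
    print(f"Overall agreement: {overall_agree / overall_total:.1%}")
    for issue in sorted(per_issue):
        b = per_issue[issue]
        print(f"  Issue {issue}: {b['agree'] / b['total']:.1%} ({b['total']} calls)")
```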
Consideration: Be familiar with the risks of overfitting. If you’re using multiple iteration samples, ensure that your final prompt is “backward compatible,” in that it performs well across all samples, not just the last one you iterated on. You want the prompt to generalize well across the population, not just the sample set.
In this case, prompt iteration consisted of 13 cycles, each incorporating new information gleaned from the sample sets. In total, 206 documents were reviewed during the iteration process.
Step 4: Validation [18 hours]
While prompt iteration is meant to build “gut-level confidence,” validation is meant to quantify the effectiveness of the classifier, generally by estimating recall and precision. Notably, with a generative AI-based classifier, you can choose to validate either before or after the full review is run.
Pre-Run vs End-of-Project Validation
| Pre-Run Validation | End-of-Project Validation |
| --- | --- |
| Builds confidence before running the review | Addresses the entire process |
| Allows for recourse if validation fails | Can include non-AI workflow components |
| Allows the producing party to disclose a recall estimate prior to running the review project | More like traditional TAR validation (though “pre-run” validation is also possible with TAR) |
Prompt iteration and validation workflow for AI doc review
Consideration: “Do I need to statistically validate?” This is a question for a legal decision-maker. As with traditional TAR, responsiveness reviews often undergo some form of validation. How to validate and the rubric for success vary case by case, but both often involve statistical sampling to measure recall and precision.
In this specific case, for several reasons, the client chose to validate the prompt by estimating recall and precision prior to running the full review job. To do this, we used the following workflow:
- We drew a fixed random sample of 1,000 documents from the review set.
- A pair of attorneys reviewed the 1,000-document sample.
- The same 1,000 documents were reviewed using aiR for Review.
- We computed confidence intervals for recall and precision based on the results. We used a 95 percent confidence level and 14 percent richness estimate.
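For context, here is a minimal sketch of the validation math, assuming the attorney coding is treated as ground truth and using the Wilson score interval as one common way to compute the confidence bounds. The counts in the example are hypothetical, and this is not the exact calculation used on the matter.

```python
# Illustrative sketch of recall/precision estimation with ~95% confidence
# intervals (Wilson score interval). Counts below are hypothetical.

from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def recall_precision(tp: int, fp: int, fn: int):
    """tp/fp/fn come from comparing aiR calls against attorney coding
    on the validation sample."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "recall": (recall, wilson_interval(tp, tp + fn)),
        "precision": (precision, wilson_interval(tp, tp + fp)),
    }

# Hypothetical counts: ~14% richness in a 1,000-document sample -> ~140 responsive docs.
print(recall_precision(tp=126, fp=20, fn=14))
```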
The results were as follows. (Note that this project pre-dated the release of Relativity’s built-in validation application for aiR, which streamlines the sampling, review, and calculations involved in validation. See Relativity’s white paper on validating results in aiR for Review for more information.)
Consideration: Don’t iterate on the control set. Bear in mind the difference between validation and iteration. After the validation review, you will likely discover “disagreements” between the AI and the attorney reviewers. It may be tempting to modify the prompt to resolve those conflicts, but modifying the prompt at this stage changes the classifier, and you will need to repeat validation with a new sample.
Step 5: Disclosure
In this particular case, the client met and conferred with opposing counsel over the use of aiR for Review. We agreed to share a description of the process along with confidence intervals for recall and precision as estimated after iteration (but before the full run). Our process description and validation results were well received by opposing counsel, so we proceeded with analyzing the full population.
Each practitioner will establish their own best practices when it comes to disclosure. For me, when the approach is to disclose the use of aiR, my priorities as an AI SME are to:
- Understand the review objectives, dynamics between parties, anything that has already been agreed to, and any other relevant history.
- Educate the client on how the review process works, what to expect, and what options they have for validation and disclosure.
- Define and/or negotiate a validation protocol with an appropriate level of rigor to meet any agreed-to requirements, while protecting our client’s interests.
- Thoroughly document all steps taken throughout the review.
While disclosure of recall is common, I’d note that including precision was an uncommon and generous degree of transparency. Over the last two years consulting on aiR for Review protocols, I’ve observed a pattern of clients offering more transparency and tightening up validation protocols, in an effort to satisfy receiving parties who may be less familiar with generative AI-based review processes.
Step 6: Running the Review Job [13.5 hours]
After receiving client approval, we kicked off the full aiR for Review job. Because we had already set aside ineligible documents prior to review, the error rate was low, and most document-level errors resolved on retry.
Step 7: The Results [3.5 hours]
After the analysis completed, we reviewed the results. Among other things, we needed to decide how best to define “responsive” based on the 0-4 score for each issue. In this case, the client chose a rank cutoff of 2 for responsiveness, meaning borderline documents were included in the production for all issues. The net result was 32,172 responsive documents.
As is typical, issues varied in prevalence. In this case, based on aiR’s coding, several low-richness issues spanned less than 1 percent of the population, while the top two issues comprised more than 50 percent of the responsive documents. Using a cutoff of 2 for responsiveness, below is the prevalence of each issue across the produced documents (as classified by aiR).
| Issue | % of Produced Documents |
| --- | --- |
| Issue 1 | 34.14% |
| Issue 2 | 25.36% |
| Issue 3 | 4.43% |
| Issue 4 | 5.21% |
| Issue 5 | 14.71% |
| Issue 6 | 0.36% |
| Issue 7 | 71.07% |
| Issue 8 | 20.00% |
| Issue 9 | 18.36% |
| Issue 10 | 1.07% |
This is the payoff of issue-based tagging. Rather than a single “relevant” pile, attorneys get a structured view of the production’s composition, and can drill directly into any issue, whether it represents 71 percent of the documents or less than one percent.
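For completeness, here is a minimal sketch of how a composition table like the one above can be generated from per-document issue ranks. The field names are hypothetical; in practice, these figures come straight from pivots on the aiR rank fields.

```python
# Illustrative sketch: prevalence of each issue across the responsive set.
# Field names are hypothetical.

from collections import Counter

def issue_prevalence(docs, cutoff=2):
    """docs: list of dicts like {"doc_id": ..., "ranks": {issue_number: 0-4 rank}}.
    Returns the fraction of responsive documents hitting each issue at or above the cutoff."""
    responsive = [d for d in docs if any(r >= cutoff for r in d["ranks"].values())]
    if not responsive:
        return {}
    hits = Counter()
    for d in responsive:
        for issue, rank in d["ranks"].items():
            if rank >= cutoff:
                hits[issue] += 1
    return {issue: count / len(responsive) for issue, count in sorted(hits.items())}
```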
Retrospective
Time Investment Breakdown
Overall, this project was a success. We were able to complete the first-pass responsiveness review in under two weeks.
| Stage | Attorney Time (hrs) | AI SME Time (hrs) | Machine Time (hrs) |
| --- | --- | --- | --- |
| Initial Call | 1 | 1 | |
| Project Setup and Logistics | 0.5 | 2 | |
| Prompt Iteration | 4 | 7 | 0.5 |
| Validation | 14 | 4 | |
| Run Time | | | 13.5 |
| Packaging Results and Reporting | | 3.5 | |
| Total | 19.5 | 13.5 | 14.0 |
Highlights
- The meet-and-confer and disclosure aspects of the project went smoothly. Pre-run validation gave both the client and the receiving party confidence in the process.
- aiR achieved strong recall and precision. Although the parties didn’t set a minimum threshold for either metric, both estimates were satisfactory, with recall being especially strong.
- Using aiR allowed us to meet an aggressive production deadline.
Items of Note
- While the total time invested was significantly lower than for an exhaustive eyes-on review, AI-based review requires more SME time, both on the LSP side (e.g., prompt iteration) and on the client side (e.g., reviewing iteration and validation samples).
- aiR’s built-in validation is now available, which simplifies the validation process and reduces SME time spent on setup and computation.
- Over time, aiR is likely to support more document types and larger documents, which will minimize the “set aside” pile that requires manual review.
This case study demonstrates how aiR for Review can be applied to a first-pass responsiveness review, from defining the population and drafting prompts through validation and disclosure. The process was completed within two weeks while meeting agreed-upon validation standards. As with any review method, the process plays as much of a role in achieving positive outcomes as the algorithm. Thanks for reading, and happy prompting!
Ben Sexton is the senior vice president of innovation and strategy at JND eDiscovery and a recognized subject-matter expert in AI and e-discovery.