
Generative AI Passes the Legal Ethics Exam

We challenged leading generative AI models to 100 simulated exams.

Can Generative AI Pass the Legal Ethics Exam?

Nearly every jurisdiction in the United States requires prospective lawyers to pass the Multistate Professional Responsibility Exam (“MPRE”). Administered by the National Conference of Bar Examiners (NCBE), the MPRE is one of the two exams an aspiring attorney must pass; it tests the examinee’s knowledge and understanding of the ethical rules and standards of conduct they must follow in the practice of law.

Professional ethics form the bedrock of the legal profession, helping lawyers protect their clients’ interests, maintain the public’s confidence, and uphold principles of justice and fairness. They serve as a lawyer’s compass for navigating complex and sensitive issues and play a critical role in maintaining the rule of law.

Earlier this year, the generative AI model GPT-4 passed the bar exam, the other exam aspiring attorneys must pass to practice law in the United States. In that research, GPT-4 correctly answered nearly 75% of the bar exam’s multiple choice questions, outperforming the average human test taker by more than 7%.

Passing the bar exam is no small feat. The two-day exam is notoriously difficult, designed to evaluate an applicant’s legal knowledge and skills, including their ability to interpret complex legal texts. Generative AI’s ability to interpret nuanced legal language opens the door for using technology as a force multiplier in the legal profession like never before. One significant application is in contract review, a time-consuming and often repetitive aspect of legal work. Automating such tasks could free legal professionals to focus more on the issues that require human judgment and decision-making, rather than line-by-line manual review.

Indeed, the legal profession has long been identified as one of the industries most ripe for AI transformation. Researchers at Princeton University, the University of Pennsylvania, and New York University pegged the legal industry as the most exposed to AI disruption. An often-cited report from Goldman Sachs estimated that 44% of legal work could be automated, second only to office and administrative support jobs.

As capabilities improve, AI may take on tasks that touch on the complex ethical dilemmas lawyers face. How does AI fare in a field filled with such considerations? To better understand what the latest generative AI models are capable of, we conducted an experiment, testing GPT-4 and other generative AI models against the challenges of the legal ethics exam.

In this research brief, we demonstrate that the latest generative AI models from OpenAI and Anthropic rose to the challenge. As shown by the results we report against exam questions developed by Professor Dru Stevenson, both GPT-4 and Claude 2 can pass the legal ethics exam.

What is the legal ethics exam?

Description of the MPRE

Developed and administered by the National Conference of Bar Examiners, the Multistate Professional Responsibility Exam is required for admission to the bar in all but two US jurisdictions.

Designed to measure a candidate’s knowledge of ethical standards of the legal profession, it covers rules of professional conduct, including the American Bar Association (ABA) Model Rules of Professional Conduct, the ABA Model Code of Judicial Conduct, as well as federal and state case law.

The two-hour exam consists of 60 multiple choice questions. Of these, 50 are scored and 10 are unscored pretest questions still in development. The MPRE covers a broad range of topics, from conflicts of interest and the client-lawyer relationship to judicial conduct and a lawyer’s duties to the public and the legal system. Some topics are covered more frequently than others.

Scoring and our approximation of a passing score

The MPRE is scored on a scale ranging from 50 to 150. Passing scores are set by each jurisdiction and range between 75 and 86. The most common minimum passing score is 85, which has historically corresponded to answering approximately 60% of questions correctly (see Figure 1).

We note that the adjustment of scaled scores varies from one exam to another based on the exam’s difficulty compared to previous ones. Therefore, in line with many law schools and bar review courses, we conservatively estimate the minimum passing score range to be approximately 56-64%, depending on the test jurisdiction.

The average score on the MPRE’s initial administration was set at 100, which converts to approximately 68% correct answers. Similarly, the national mean scaled score for the November 2022 MPRE was 97.2. As such, we estimate the national mean score to be approximately 68% correct.
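
To make the conversion explicit, the sketch below assumes a simple linear relationship between scaled scores and percent correct, anchored to the two approximate reference points above (a scaled 85 corresponding to roughly 60% correct, and a scaled 100 to roughly 68%). The actual NCBE equating is exam-specific and not public, so this is only an illustration.

```python
# Rough, illustrative conversion between MPRE scaled scores and percent correct.
# Assumes a linear relationship anchored to two approximate reference points
# cited above (scaled 85 ~ 60% correct, scaled 100 ~ 68% correct); the real
# NCBE equating varies by administration and is not public.

def approx_percent_correct(scaled_score: float) -> float:
    """Linearly interpolate percent correct from an MPRE scaled score."""
    x1, y1 = 85.0, 0.60   # most common minimum passing score
    x2, y2 = 100.0, 0.68  # mean of the initial administration
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (scaled_score - x1)

if __name__ == "__main__":
    for s in (75, 85, 97.2, 100):
        print(f"scaled {s:>5} -> ~{approx_percent_correct(s):.0%} correct")
```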

Data and Methodology

Questions

To simulate the breadth and variety of questions faced by MPRE exam takers, we tested generative AI models against 500 questions developed by Professor Dru Stevenson from the University of Houston Law Center. These questions have the same format and style as the questions on the current MPRE. 

The questions used herein supplement Professor Stevenson’s book, The Glannon Guide to Professional Responsibility (3rd ed.). The Glannon Guide assists law students in their understanding of professional rules of responsibility and their preparation for the MPRE exam.

The MPRE covers a broad range of topic areas, with some topics tested more frequently than others. The 500 sample questions from Professor Stevenson thoroughly test the topics covered by the MPRE (see Figure 2). As described in “Methods,” we sampled these questions to match the frequency with which each topic is tested on the MPRE.

Methods

For this study we evaluated four state-of-the-art generative AI models: OpenAI’s GPT-4 and GPT-3.5, Anthropic’s Claude 2, and Google’s PaLM 2 Bison. These large language models were selected to represent diversity across leading AI vendors: while the underlying technology is very similar, each model was trained on a different collection of documents.

All models were accessed via their standard application programming interfaces (APIs) using the basic prompt “Answer the following multiple choice question.” This enabled a standardized “zero-shot” evaluation of each model’s legal ethics comprehension.
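
The study’s only stated instruction is the prompt quoted above; the exact request format was not published. As an illustration, the sketch below shows how a single question might be sent to one of the models via OpenAI’s chat completions API. The question structure, helper name, and model identifier are assumptions for the example, not the study’s code.

```python
# Illustrative zero-shot prompting of one model on a single MPRE-style question.
# The question format and helper names are assumptions for this sketch; the
# study's only stated instruction was
# "Answer the following multiple choice question."

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_model(question: str, choices: dict[str, str], model: str = "gpt-4") -> str:
    """Send one multiple choice question and return the model's raw answer text."""
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    prompt = (
        "Answer the following multiple choice question.\n\n"
        f"{question}\n\n{options}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers as deterministic as possible for evaluation
    )
    return response.choices[0].message.content

# Hypothetical usage:
# answer = ask_model("An attorney is asked to ...",
#                    {"A": "...", "B": "...", "C": "...", "D": "..."})
```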

The simplest approach to evaluating performance would be to have each model attempt all 500 questions in the test data. Instead, we report performance only on tests that simulated the MPRE experience described in the “Description of the MPRE” section above. In summary, from the 500 available questions we randomly selected 60 such that each subject area received its required proportion of questions; for example, each simulated exam included between 7 and 11 questions on “Conflicts of interest.” We repeated this sampling to produce 100 simulated exams (the table below includes the average number of questions in each subject area).
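
A minimal sketch of this stratified sampling appears below. The question-bank structure and per-topic counts are illustrative placeholders, not the study’s exact figures.

```python
# Illustrative stratified sampling for one simulated 60-question MPRE sitting:
# draw from each subject area the number of questions the exam typically
# allocates to that topic. The counts passed in are placeholders; the study
# allowed small per-topic ranges (e.g., 7-11 "Conflicts of interest" questions).

import random

def simulate_exam(question_bank, topic_counts, seed=None):
    """Build one simulated exam by sampling topic_counts[topic] questions per topic."""
    rng = random.Random(seed)
    exam = []
    for topic, n in topic_counts.items():
        exam.extend(rng.sample(question_bank[topic], n))
    rng.shuffle(exam)
    return exam

# Repeating the draw yields the 100 simulated exams reported in "Results":
# exams = [simulate_exam(bank, topic_counts, seed=i) for i in range(100)]
```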

Results

Overall accuracy

Based on the sampling and testing methodology described above, the overall mean accuracy for each model was as follows: GPT-4 answered 74% of questions correctly, Claude 2 answered 67%, GPT-3.5 answered 49%, and PaLM 2 answered 42%.

The figure below summarizes the average accuracy on our simulated MPRE experiments in comparison to our estimated passing score range and estimated student mean score (see “Scoring and Our Approximation of a Passing Score”). We also include plus/minus variance bars; their tightness around the average score indicates that these averages are close to each model’s true accuracy.
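
As an illustration of how the per-model averages and variance bars could be derived from the 100 simulated exam scores, the sketch below uses the population standard deviation as one plausible measure of spread; the study does not specify its exact formula, and the variable names are hypothetical.

```python
# Illustrative aggregation of per-exam accuracy into the reported averages.
# `scores` holds one value per simulated exam: the fraction of that exam's
# 60 questions the model answered correctly.

from statistics import mean, pstdev

def summarize(scores):
    """Return the mean accuracy and its spread across simulated exams."""
    return mean(scores), pstdev(scores)

# Hypothetical usage:
# gpt4_scores = [0.75, 0.72, 0.77, ...]  # 100 values, one per simulated exam
# avg, spread = summarize(gpt4_scores)
```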

Ethics exam results by topic area

The table below presents GPT-4’s performance by subject area, along with the average number of questions drawn from each area across the 100 random samples. Two of the most commonly tested subject areas, “Conflicts of interest” and “The client-lawyer relationship,” resulted in accuracy scores of 90.8% and 87.8%, respectively.

Though the overall accuracy is reasonably high, there is notable room for improvement in certain areas, including “Communications about legal services” at 71.1%, the “Different roles of the lawyer” at 71.7%, and “Safekeeping funds and other property” at 71.7%.

Results in comparison to the approximate passing threshold

As described in Scoring and Our Approximation of a Passing Score, we estimate the minimum passing score range to be 56-64%, depending on the test jurisdiction. We also estimate that the mean student test score on the MPRE is 68%.

GPT-4 performed best, answering 74% of questions correctly and outperforming the average human test-taker by an estimated 6%. Both GPT-4 and Claude 2 scored above the MPRE’s approximate passing threshold in every jurisdiction where the exam is required.

Implications

Generative AI models can apply black-letter ethical guidelines

Successfully passing the legal ethics exam is a testament to the capabilities of generative AI in understanding and applying the established rules and guidelines that form the bedrock of legal ethics. This development is not just a technological milestone; it underscores the potential of AI in assimilating vast swaths of information and providing reasoned conclusions or guidance based on codified principles. This research demonstrates for the first time that top-performing generative AI models can apply black-letter ethical rules as effectively as aspiring lawyers.

Generative AI is not perfect

Generative AI models can perform well, but not perfectly, against the rigorous demands of the legal ethics exam. For high-stakes legal and ethical decisions, good is often not good enough. That’s why it remains as important as ever for legal technology providers to test their models extensively with lawyer-led validations, coach LLMs to consistently produce professional-grade results, and augment generative AI with domain-specific content and training. Additionally, legal professionals who use generative AI should understand its capabilities and limitations and ensure that its use is adequately supervised, maintains confidentiality, and is accurate and reliable.

Legal practitioners maintain ethical responsibility

Despite the advanced capabilities of AI, the responsibility for ethical decisions within legal practice firmly remains with human professionals. AI can be a valuable tool for providing insights, spotting potential ethical red flags, and suggesting decisions based on historical and learned data. However, the ultimate judgment call should be made by lawyers. They must weigh these insights against the nuances of human experience, professional intuition, and moral considerations that transcend codes and algorithms.

The future of attorney-AI collaboration

The rapidly evolving future of legal practice is heading toward dynamic collaboration between attorneys and their AI assistants. As AI becomes an integral part of legal technology, it will usher in an era of heightened efficiency. For instance, it can expedite contract review by detecting potential risks and can advance legal research by surfacing relevant issues and precedents, tasks that traditionally consumed significant manual effort. This not only enables legal teams to focus more on decisions requiring nuanced legal judgment, but also paves the way for more cost-effective and accessible legal services.

Now, with AI demonstrating its capability to understand and apply ethical guidelines, we anticipate systems designed specifically to reinforce lawyers’ professional obligations and alert them to potential ethical issues and risks in real time. With robust validation, training, and oversight in place, this fusion of law and technology holds the promise of a legal system that is not only more efficient, but also more just and ethical.
