TL;DR:
- Large AI language models, including GPT-4-Turbo, struggle with analyzing SEC filings, per Patronus AI.
- Even with access to complete filings, GPT-4-Turbo answered only 79% of questions correctly.
- Challenges include refusals to answer and generating inaccurate data, raising concerns for automation in finance.
- Incorporating non-deterministic AI models into regulated industries requires rigorous testing and human oversight.
- Patronus AI’s FinanceBench dataset sets a performance standard for language AI in finance.
- Despite these challenges, AI shows potential to aid finance professionals, though ongoing human involvement remains essential.
Main AI News:
A recent study by Patronus AI reveals that large language models, like the one that powers ChatGPT, face significant challenges when analyzing Securities and Exchange Commission (SEC) filings. Even the most advanced AI model tested, OpenAI’s GPT-4-Turbo, when provided with the entirety of an SEC filing along with a related question, answered only 79% of the questions correctly in Patronus AI’s new test.
The issues encountered ranged from outright refusal to answer questions to the creation of inaccurate information not present in the SEC filings, a phenomenon referred to as “hallucination.” Anand Kannappan, Co-founder of Patronus AI, expressed his concerns, stating, “That type of performance rate is just absolutely unacceptable. It has to be much, much higher for it to really work in an automated and production-ready way.”
These findings underscore the challenges faced by AI models, especially in industries subject to rigorous regulation, such as finance, where integrating cutting-edge technology for tasks like customer service and research is a priority. One of the most promising applications for AI in finance is the ability to swiftly extract critical financial data and analyze narratives, and SEC filings are a treasure trove of such information.
Major players such as Bloomberg LP and JPMorgan, along with business school professors, have recently ventured into AI-driven financial solutions. However, GPT’s entry into the industry hasn’t been without hiccups. When Microsoft launched Bing Chat using OpenAI’s GPT, for instance, it showcased the chatbot summarizing an earnings press release, but astute observers quickly spotted inaccuracies and fabricated numbers in the output.
One of the main challenges in incorporating Large Language Models (LLMs) like GPT into real-world applications is their inherent non-deterministic nature. LLMs do not guarantee consistent output for the same input, necessitating rigorous testing to ensure their correct operation, relevance, and reliability.
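To illustrate the point, the short sketch below repeatedly sends the same prompt to a chat model and counts the distinct responses. It assumes the official `openai` Python client (v1-style API) with an API key in the environment; the model name is a placeholder, and this is a minimal illustration of the testing problem, not Patronus AI’s methodology.

```python
# Minimal sketch of LLM non-determinism: the same prompt can yield
# different completions across calls, so tests must account for variance.
# Assumes the openai Python package (v1 client); the model name below
# is a placeholder, not a claim about which model Patronus AI used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "What was Coca-Cola's FY2021 cost of goods sold? Answer briefly."

responses = set()
for _ in range(5):
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # nonzero temperature makes variation across calls likely
    )
    responses.add(completion.choices[0].message.content.strip())

# More than one distinct answer means the model is not producing
# consistent output for identical input.
print(f"{len(responses)} distinct response(s) across 5 identical calls")
```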
The Co-founders of Patronus AI, who previously worked at Meta (Facebook’s parent company) on AI-related problems, established their startup to automate LLM testing. They aim to give companies assurance that their AI bots won’t provide off-topic or incorrect responses, supporting more responsible AI deployment.
To create a robust testing dataset, Patronus AI compiled FinanceBench, a set of more than 10,000 questions and answers derived from SEC filings of major publicly traded companies. Each entry specifies the correct answer and where in the filing to locate it, and some questions require light mathematical or reasoning skills, setting a “minimum performance standard” for language AI in the financial sector.
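A record in such a dataset might pair a question with its gold answer and an evidence pointer into the filing. The sketch below is a hypothetical illustration based on the article’s description; the field names and values are illustrative, not the actual FinanceBench schema.

```python
# Hypothetical shape of one benchmark record, following the article's
# description (question, correct answer, and where in the filing to find it).
# Field names and the example values are illustrative, not the actual
# FinanceBench format.
from dataclasses import dataclass

@dataclass
class FilingQA:
    company: str            # e.g., "CVS Health"
    filing: str             # e.g., "10-Q, Q2 FY2022"
    question: str
    answer: str             # gold answer
    evidence_location: str  # where in the filing the answer appears

example = FilingQA(
    company="CVS Health",
    filing="10-Q, Q2 FY2022",
    question="Has CVS Health distributed dividends to common shareholders in Q2 of FY2022?",
    answer="Yes",  # placeholder gold answer for illustration
    evidence_location="Statements of shareholders' equity (illustrative pointer)",
)
```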
Here are a few sample questions from the dataset provided by Patronus AI:
- Has CVS Health distributed dividends to common shareholders in Q2 of FY2022?
- Did AMD disclose customer concentration in FY22?
- What is Coca-Cola’s FY2021 COGS % margin? Calculate it using the line items clearly shown in the income statement. (A worked example of this calculation follows the list.)
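The last question involves only light arithmetic: COGS % margin is cost of goods sold divided by net revenue, expressed as a percentage. The sketch below shows the calculation with placeholder figures; the numbers are illustrative, not Coca-Cola’s reported FY2021 line items.

```python
# COGS % margin = cost of goods sold / net revenue, as a percentage.
# The figures below are illustrative placeholders, not Coca-Cola's
# actual FY2021 income-statement values.
def cogs_margin(cogs: float, net_revenue: float) -> float:
    """Return COGS as a percentage of net revenue."""
    return 100.0 * cogs / net_revenue

net_revenue = 38_000.0  # $ millions, placeholder
cogs = 15_000.0         # $ millions, placeholder

print(f"COGS % margin: {cogs_margin(cogs, net_revenue):.1f}%")  # -> 39.5%
```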
Patronus AI evaluated four language models: OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, using a subset of 150 questions. Different configurations and prompts were tested, including “Oracle” mode, where models were given the exact source text for the answer. GPT-4-Turbo, for instance, answered correctly in “Oracle” mode 85% of the time but still provided incorrect responses 15% of the time, illustrating the difficulty of automating the precise task of locating information in filings.
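As a rough picture of what such an evaluation involves, the harness below scores a model in an “Oracle”-style setting: each prompt includes the exact source passage, and the reply is compared against the gold answer. This is a simplified sketch, not Patronus AI’s grading pipeline; `ask_model` is a stand-in for any LLM API call, and the containment check is far cruder than real answer grading.

```python
# Simplified sketch of an "Oracle"-mode evaluation: the model receives the
# exact source passage alongside each question, and its reply is checked
# against the gold answer. `ask_model` stands in for any LLM API call;
# real grading would be more forgiving than substring matching.
from typing import Callable

def evaluate_oracle(
    dataset: list[dict],              # each item: question, evidence, answer
    ask_model: Callable[[str], str],  # prompt -> model response
) -> float:
    correct = 0
    for item in dataset:
        prompt = (
            f"Source passage:\n{item['evidence']}\n\n"
            f"Question: {item['question']}\n"
            "Answer using only the passage above."
        )
        reply = ask_model(prompt)
        if item["answer"].lower() in reply.lower():  # crude containment check
            correct += 1
    return correct / len(dataset)

# Usage: accuracy = evaluate_oracle(subset_of_150, ask_model=my_llm_call)
```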
Llama 2, developed by Meta, struggled with “hallucinations,” generating incorrect answers 70% of the time and correct answers only 19% of the time when provided access to underlying documents. Anthropic’s Claude 2 performed well when given “long context,” answering 75% of the questions correctly.
Even when models performed relatively well, Patronus AI concluded that their accuracy wasn’t sufficient, especially in regulated industries where even a 5% error rate is unacceptable. Despite the challenges, the Co-founders of Patronus AI remain optimistic about the potential of language models like GPT in the finance industry, believing that continuous improvement will eventually lead to automated solutions. However, they acknowledge that, for now, human oversight remains essential for ensuring the accuracy and reliability of AI-driven workflows in finance.
In response to these findings, OpenAI has emphasized the importance of adhering to its usage guidelines, which for financial applications require qualified human review and clear disclaimers about AI usage and its limitations.
Conclusion:
The revelation that even advanced AI models face difficulties in analyzing SEC filings highlights the challenges of integrating AI into the finance industry. The inability of AI models to consistently provide accurate answers, coupled with their non-deterministic nature, underscores the need for extensive testing and human supervision in regulated sectors like finance. While AI holds promise for aiding finance professionals, its current limitations call for a cautious approach to automation in this sector.