ChatGPT Detectors Still Struggle to Distinguish Human-Written From AI-Generated Text

TL;DR:

  • The growth of ChatGPT and similar chatbots has led to the development of text detection tools.
  • OpenAI’s AI Classifier was discontinued due to low accuracy, reflecting challenges in accurate detection.
  • A comparative study of eight detection tools revealed CopyLeaks’ AI Content Detector as the most accurate.
  • GPTKit showed no false positives, while GPTZero performed poorly in all metrics.
  • ChatGPT excelled in generating English but struggled with Spanish and computer code.
  • Maintaining academic integrity will require improvements to current detection tools.
  • Another academic team’s study reported similar challenges in detecting AI-generated text.

Main AI News:

Chatbots like ChatGPT have grown substantially over the past year, giving rise to software designed to identify AI-generated text. The detection market is still evolving, and recent developments show that these programs are far from uniformly accurate; some have been discontinued altogether.

OpenAI LP, the company behind ChatGPT, quietly retired its AI Classifier tool due to its low accuracy. Educators, a significant user base for such tools, often rely on them to validate their students’ written assignments. In response to this demand, a group of university professors from Canada, Indonesia, and Ecuador conducted a comprehensive review of eight detection tools, including the now-defunct AI Classifier.

Their research, titled “Detecting LLM-Generated Text in Computing Education: A Comparative Study for ChatGPT Cases,” assessed 124 submissions written by computer science students alongside 40 papers generated by ChatGPT. To guarantee that the student submissions were genuinely human-written, the team used only papers authored before 2018, well before ChatGPT’s release. The evaluation measured each tool’s accuracy in identifying AI-generated text, its false positives (human-written text mistakenly flagged as AI-generated), and its resilience when the AI-generated text had been paraphrased or deliberately edited.
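For context on how such an evaluation works, the sketch below shows one way to score a detector against a labelled corpus, reporting the AI-detection rate and the false-positive rate the study measured. The Sample class, the evaluate_detector function, and the callable detector interface are illustrative assumptions, not code from the paper or from any of the tools tested.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Sample:
    text: str
    is_ai: bool  # ground truth: True if the text was generated by ChatGPT

def evaluate_detector(detector: Callable[[str], bool],
                      samples: Iterable[Sample]) -> dict:
    """Score a detector against labelled samples.

    `detector` is any callable that returns True when it flags a text as
    AI-generated -- a hypothetical interface, not any specific product's API.
    """
    true_positives = false_positives = 0
    ai_total = human_total = 0

    for sample in samples:
        flagged = detector(sample.text)
        if sample.is_ai:
            ai_total += 1
            true_positives += flagged          # correctly flagged AI text
        else:
            human_total += 1
            false_positives += flagged         # human text wrongly flagged

    return {
        # share of AI-generated texts that were caught
        "ai_detection_rate": true_positives / ai_total if ai_total else 0.0,
        # share of human-written texts that were wrongly flagged
        "false_positive_rate": false_positives / human_total if human_total else 0.0,
        "false_positive_count": false_positives,
    }
```

Running such a function over 124 human-written and 40 ChatGPT-generated submissions would mirror the study’s setup; identifying human text 99% of the time and AI text 95% of the time, as reported for the top performer, corresponds to a 1% false-positive rate and a 95% detection rate under this scoring.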

Among the eight tools, CopyLeaks’ AI Content Detector demonstrated the highest overall accuracy, identifying human-written text 99% of the time and AI-generated text 95% of the time. GPTKit produced no false positives, while the other tools ranged from one to 52. The Giant Language Model Test Room (GLTR) proved the most resilient against text paraphrased with QuillBot.

Although GPTZero claimed to be the first such detector, it performed poorly across all three metrics, producing 52 false positives. Some detectors combine the results of multiple underlying models; OpenAI’s classifier, for example, drew on 34 different models. The tools also differ in how they highlight AI-generated versus human-written passages, and each approach has its own strengths and weaknesses.
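As a rough illustration of that multi-model approach (a generic majority vote under assumed interfaces, not OpenAI’s actual aggregation logic), combining several per-model scores might look like this:

```python
from typing import Callable, Sequence

def ensemble_verdict(models: Sequence[Callable[[str], float]],
                     text: str,
                     threshold: float = 0.5) -> bool:
    """Flag `text` as AI-generated if most member models score it above `threshold`.

    Each model is assumed to return a probability-like score in [0, 1].
    Majority voting is only one aggregation strategy; averaging scores or
    weighting stronger models are common alternatives.
    """
    votes = sum(model(text) > threshold for model in models)
    return votes > len(models) / 2
```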

Interestingly, ChatGPT excelled at generating English text but struggled with Spanish and with computer code. The review team concluded that current detectors of LLM-generated text are not yet reliable enough to safeguard academic integrity and called for further improvements, highlighting the need for better API integration, clearer documentation of features, and support for commonly used languages beyond English.

Another academic team examined 14 different tools and reported its findings earlier in MIT Technology Review. On average, the detectors identified human-written text correctly 96% of the time; detecting AI-generated text proved more challenging, especially when the text had been modified. Because per-tool metrics were not reported, direct comparisons between the tools are difficult.

Conclusion:

The market for AI-generated text detection tools faces challenges stemming from uneven accuracy and the need for further improvement. The discontinuation of OpenAI’s AI Classifier underscores how hard it is to deliver reliable solutions for maintaining academic integrity. Businesses in this sector should focus on improving accuracy, supporting multiple languages, and providing clear documentation and seamless API integration to meet the evolving demands of educators and other users seeking trustworthy text detection.
