OpenAI: ‘Impossible to train today’s leading AI models without using copyrighted materials’

TL;DR:

  • OpenAI asserts that AI models require copyrighted content for effective training.
  • An IEEE report highlights “plagiaristic outputs” by AI models, raising legal concerns.
  • Legal experts differ on whether AI creators or users should be accountable for copyright infringement.
  • Ongoing lawsuits involve The New York Times, book authors, and software developers.
  • Copyrighted content remains essential for AI model functionality, impacting the future of AI development.

Main AI News:

OpenAI has argued that copyrighted content is indispensable for training advanced neural networks that meet modern demands. The assertion underscores the challenge the machine learning community faces as it grapples with the intricacies of copyright law. OpenAI contends that relying solely on public domain material would inevitably produce suboptimal AI software.

This debate has reached a fever pitch, with a recent IEEE report authored by Gary Marcus, an AI expert and critic, alongside digital illustrator Reid Southen, shedding light on instances of what they term “plagiaristic outputs.” These occurrences involve Midjourney and OpenAI’s DALL-E 3, two prominent AI services that turn textual prompts into images, producing remarkably similar renditions of copyrighted scenes from movies, images of renowned actors, and video game content.

The central question at the heart of this matter revolves around legality and culpability, as it remains contentious whether AI vendors or their customers can be held accountable for potential copyright infringement. The findings in the report, however, may serve to bolster legal actions against Midjourney and OpenAI.

Gary Marcus and Reid Southen assert, “Both OpenAI and Midjourney are fully capable of producing materials that appear to infringe on copyright and trademarks.” They further highlight a crucial issue: these systems do not inform users when their output appears to infringe a copyright, leaving creators and users in a legal quagmire.

Significantly, neither OpenAI nor Midjourney has disclosed the complete details of the training data used for its AI models, further complicating matters. It’s not just digital artists who are raising concerns; even media giants like The New York Times have taken legal action against OpenAI, alleging that its ChatGPT text model generates content that closely resembles the newspaper’s paywalled articles. Similar claims have been made by book authors and software developers, amplifying the urgency of addressing this complex issue.

Previous research has indicated that OpenAI’s ChatGPT can replicate training text, and litigants against Microsoft and GitHub argue that the Copilot coding assistant model reproduces code verbatim. Southen observes that Midjourney is not only allowing the creation of infringing content but also profiting from it through subscription revenue. OpenAI follows a similar model, charging subscription fees and thereby sharing in the profits. Both companies, however, have remained tight-lipped in response to requests for comment.

OpenAI recently issued a blog post addressing The New York Times lawsuit, asserting that if its neural networks produce infringing content, it is a “bug.” In its rebuttal, OpenAI emphasizes its collaboration with news organizations, the fair use defense under copyright law for training on copyrighted data, and ongoing efforts to eliminate any instances of “regurgitation.”

Legal expert Tyler Ochoa from Santa Clara University believes that the IEEE report’s findings will support copyright claims in court. However, he questions the report’s conclusion that AI models produce plagiaristic outputs without direct solicitation, highlighting that the prompts used in the report specifically mention copyrighted movies and scenes, essentially requesting such outputs. Ochoa argues that the responsibility for these outputs should rest with the individuals who prompt the AI to replicate copyrighted content.

Furthermore, Ochoa notes that AI models are more likely to reproduce specific images when multiple instances of those images exist in their training data. In this case, it is probable that the training data primarily consisted of still images distributed for publicity purposes, making it unfair to accuse AI creators of infringing copyrights.

Ultimately, the issue of whether AI models should be held accountable for reproducing copyrighted content hinges on the context of the prompts and the intentions of those who generate them. The ongoing legal battles will likely shape the future of AI development and its relationship with copyright law, as AI continues to evolve and permeate various industries.

In the midst of this legal and ethical quagmire, it becomes increasingly evident that copyrighted content plays an integral role in the efficacy of these AI models, raising profound questions about the intersection of innovation, intellectual property, and the evolving landscape of artificial intelligence. The outcome of these legal battles will undoubtedly have far-reaching implications for the AI industry and its stakeholders.

Conclusion:

The ongoing debate surrounding AI and copyright issues underscores the complex legal landscape facing the market. As AI models increasingly rely on copyrighted material for training, it is imperative that stakeholders in the AI industry monitor these legal battles closely. The outcome will shape the direction of AI development and its compatibility with intellectual property laws, ultimately influencing business strategies and legal frameworks within the market.
