AI threatens Wikipedia

TL;DR:

  • The Wikipedia community is divided over the use of large language models for content generation.
  • Concerns include the need for human review to prevent the dissemination of inaccurate or fabricated information.
  • Amy Bruckman emphasizes the importance of editing and source verification for content generated by AI models.
  • OpenAI’s ChatGPT has been found to fabricate information and faces criticism for spreading misinformation.
  • Wikimedia Foundation is exploring tools to identify bot-generated content and drafting a policy on AI usage.
  • The policy includes in-text attribution for AI-generated content.
  • Unreviewed AI-generated content is seen by some as a form of vandalism, requiring strategies similar to those used to combat traditional vandalism.
  • Issues exist between volunteers and foundation staff regarding unfinished technical migrations and community decision-making.
  • Whether large language models should train on Wikipedia content, and the biases they may introduce, is a topic of debate.
  • Responsible AI licenses like RAIL are suggested to impose restrictions and promote transparency.
  • There is concern about the inequality between languages on Wikipedia and how AI models exacerbate biases.
  • The Wikimedia Foundation acknowledges the importance of human engagement and the need for rigorous fact-checking in AI-generated content.

Main AI News:

As the influence of generative artificial intelligence continues to pervade various spheres of society, the guardians of Wikipedia find themselves at odds regarding the most prudent path forward. In a recent gathering of the community, it became evident that a schism exists concerning the use of large language models for content generation. While some argued that tools such as OpenAI’s ChatGPT could assist in drafting and summarizing articles, others urged caution.

The crux of the concern lies in the need to strike a delicate balance between machine-generated content and extensive human review, as the former has the potential to inundate lesser-known wikis with substandard material. While artificial intelligence generators undoubtedly prove valuable in producing text that bears a striking resemblance to human writing, they are not impervious to errors. These mistakes may include inaccurate information or references to nonexistent sources and academic papers. Consequently, seemingly accurate text summaries often unravel upon closer scrutiny, exposing them as entirely fabricated.

Amy Bruckman, Regents’ Professor and senior associate chair of the School of Interactive Computing at the Georgia Institute of Technology, and author of “Should You Believe Wikipedia?: Online Communities and the Construction of Knowledge,” likens large language models to the social construction of knowledge: both are only as reliable as their capacity to differentiate fact from fiction. She therefore advocates a course of action that uses these expansive language models while subjecting their output to rigorous editing and thorough source verification.

“The only recourse we have is to leverage the capabilities of large language models but edit their output and ensure someone verifies the sources,” Bruckman expressed to Motherboard, highlighting the imperative nature of maintaining the integrity of the information disseminated through Wikipedia.

Researchers swiftly discovered that OpenAI’s ChatGPT possesses a lamentable propensity for fabrication, which proves detrimental to students who naively rely solely on the chatbot for essay composition. The chatbot occasionally conjures up articles and their purported authors, exhibiting audacious confidence as it intertwines the names of lesser-known scholars with those of more prominent figures. OpenAI itself has even admitted that the model “hallucinates” when it fabricates facts, a term that has elicited criticism from certain experts in the field of artificial intelligence. Some argue that such language allows AI companies to evade accountability for the dissemination of misinformation through their tools.

“The risk for Wikipedia lies in the potential dilution of quality resulting from the inclusion of unchecked content,” Bruckman further cautioned. “I don’t perceive any issue with employing it as an initial draft, but every assertion must undergo meticulous verification.”

The Wikimedia Foundation, the esteemed nonprofit entity responsible for the operation of Wikipedia, is proactively exploring the development of tools that would facilitate the identification of bot-generated content by volunteers. Simultaneously, Wikipedia is diligently working on formulating a policy that delineates the boundaries dictating how volunteers can harness the power of large language models to produce content.

The ongoing drafting process of the policy underscores the crucial notion that individuals unfamiliar with the inherent risks associated with large language models should abstain from utilizing them to create Wikipedia content. Such use could invite legal ramifications, including libel suits and copyright violations, risks against which the nonprofit Wikimedia Foundation enjoys certain protections but individual Wikipedia volunteers do not. Furthermore, these expansive language models tend to harbor implicit biases, leading to the creation of content that tilts unfavorably against marginalized and underrepresented groups.

Within the Wikipedia community, there exists a division regarding the permissibility of allowing large language models to train on Wikipedia content. While the ethos of open access remains a cornerstone of Wikipedia’s design principles, concerns arise regarding the unregulated scraping of internet data, which could enable AI companies like OpenAI to exploit the open web for the creation of proprietary datasets. This predicament becomes particularly problematic if the very content on Wikipedia itself is generated by artificial intelligence, thus establishing a feedback loop that perpetuates potentially biased information if left unchecked.

One proposal that garnered attention on Wikipedia’s mailing list suggested the use of BLOOM, a substantial language model unveiled last year under the newly introduced Responsible AI License (RAIL). This license amalgamates an open-access approach to licensing with behavioral constraints aimed at fostering responsible AI use. Similar to certain iterations of the Creative Commons license, the RAIL license affords flexible utilization of the AI model while simultaneously imposing certain restrictions. For instance, derivative models must explicitly disclose that their outputs are AI-generated, and any endeavors built upon them must adhere to the same set of rules.

Mariana Fossatti, a coordinator affiliated with Whose Knowledge?, a global campaign dedicated to facilitating access to knowledge on the internet across diverse geographical locations and languages, contends that large language models and Wikipedia are ensnared in a feedback loop that perpetuates the introduction of further biases. Fossatti emphasizes that while a vast reservoir of knowledge exists across more than 300 languages, there exists considerable inequality among these languages. English Wikipedia, in particular, boasts a far greater wealth of content than its counterparts, and this information is actively fueling AI systems.

“We possess this immense repository of knowledge in over 300 languages,” Fossatti expressed to Motherboard, underscoring the profound disparities that exist. “However, these 300 different languages exhibit significant inequality. English Wikipedia is considerably more enriched in content than the others, and we are inadvertently nourishing AI systems with this extensive corpus of knowledge.”

In the pursuit of responsible and unbiased information dissemination, the Wikimedia Foundation, Wikipedia volunteers, and various stakeholders continue to grapple with the intricate challenges posed by the integration of large language models within the ecosystem of knowledge construction and distribution.

While the concept of using AI is not entirely foreign to Wikipedians, as automated systems have long been employed for tasks like machine translation and combating vandalism, there are veteran volunteers who harbor reservations about expanding AI utilization on the platform. However, the Wikimedia Foundation views AI as an opportunity to enhance the efforts of Wikipedia volunteers and scale their work on various projects.

In a statement to Motherboard, a spokesperson from the Wikimedia Foundation acknowledged the feedback received from volunteers and expressed their interest in exploring how AI models can contribute to bridging knowledge gaps and fostering broader access and participation. Nonetheless, the foundation firmly asserts that human engagement remains the fundamental cornerstone of the Wikimedia knowledge ecosystem. AI functions optimally as a complement to the work carried out by humans within the project.

The current iteration of the draft policy emphasizes the necessity of in-text attribution for AI-generated content. Amy Bruckman, an esteemed voice in the field, draws parallels between the challenges posed by large language models and deliberate, malicious attempts to edit Wikipedia pages. Bruckman views unreviewed AI-generated content as a form of vandalism and suggests employing the same strategies used to combat vandalism on Wikipedia to address the influx of subpar content stemming from AI.

Selena Deckelmann, the chief product and technology officer at the Wikimedia Foundation, acknowledged the existence of intricate issues between volunteers and the foundation staff regarding unfinished technical migrations that impact community decision-making. Deckelmann emphasized the need to prioritize maintenance and technical migration areas, acknowledging that certain projects may take precedence over others in order to achieve completion.

However, until these challenges are fully addressed, Bruckman emphasizes that editors and volunteers must remain vigilant. She underscores that the reliability of content hinges on the number of people who have verified it through robust citation practices. Because generative AI does not reliably cite its sources, Bruckman advocates for thorough verification of every claim.

She acknowledges that discouraging people from using AI is not a viable solution, given its widespread usage. The proverbial genie cannot be put back in the bottle. Therefore, the best course of action is to meticulously check the outputs of AI-generated content to ensure accuracy and reliability.

Conclusion:

The division within the Wikipedia community regarding the use of large language models for content generation highlights the challenges and considerations that arise when integrating artificial intelligence into established knowledge platforms and markets. While AI models present opportunities to scale work and bridge knowledge gaps, there are significant concerns surrounding accuracy, bias, and the need for human review.

As organizations and businesses navigate the use of AI in their respective markets, it is crucial to prioritize rigorous verification processes, responsible licensing, and the augmentation of human expertise to ensure the reliability and integrity of information. Additionally, addressing the disparities among languages and promoting inclusivity within AI-generated content will be essential in fostering a more equitable market landscape.

Source