Machine Translation Woes: Lost in Digital Translation

TL;DR:

  • Bill Gates foresaw the Internet uniting people worldwide across language barriers.
  • The Internet has indeed connected the world, but machine translations pose serious challenges.
  • A report by Amazon Web Services AI Lab and UC Santa Barbara finds that much translated web content is low in quality and likely machine-generated.
  • Over 6 billion web sentences were examined; more than half had been translated into multiple languages.
  • Translation quality drops as the number of target languages grows, which hits lower-resource languages hardest.
  • Machine-generated translations dominate translated content in lower-resource languages and make up a large share of all web content in those languages.
  • Training multilingual large language models on web-scraped data therefore raises concerns.
  • Much of this content is simple and low in quality, likely created to generate ad revenue.
  • Machine-generated translations can lead to humorous or embarrassing outcomes.
  • Quality control and advancements in machine translation are essential for bridging linguistic gaps.

Main AI News:

In the closing years of the last century, Bill Gates envisioned people from nearly 200 nations, speaking more than 7,000 languages, coming together in a shared conversation on the rapidly growing web. “The Internet is becoming the town square for the global village of tomorrow,” he predicted.

Over the years, the Internet has unquestionably succeeded in bringing the world closer together, profoundly enriching global communications, commerce, research, and entertainment. However, a recent report serves as a stark reminder that progress often brings its own set of challenges.

Researchers from the Amazon Web Services Artificial Intelligence Lab and the University of California, Santa Barbara, examined more than 6 billion sentences from across the web. They found that over half of these sentences had been translated into two or more languages, and that the quality of these translations was often poor. Moreover, the more languages a sentence had been translated into, sometimes as many as eight or nine, the worse the quality tended to be.

The report, titled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism,” was posted to the preprint server arXiv on January 11th. Its authors argue that the low quality of these translations suggests they were likely produced by machine translation. They also raise serious concerns about training multilingual large language models on both monolingual and bilingual data scraped from the web.

The researchers point out that not only are texts being translated by artificial intelligence, but AI is also generating content from scratch. They note that machine-generated translations are most prevalent in lower-resource languages, such as the African languages Wolof and Xhosa. They also emphasize that highly multi-way parallel translations, meaning sentences that exist in many languages at once, are notably lower in quality than their two-way parallel counterparts.
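To make the terminology concrete, the toy sketch below counts how much of a small, made-up corpus is “multi-way parallel.” It is not the researchers’ pipeline: the sample data, the function names, and the threshold of three or more languages are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy data: each dict groups the versions of one sentence,
# keyed by language code. Real web-scale data would look very different.
translation_tuples = [
    {"en": "Hello", "fr": "Bonjour"},
    {"en": "Thank you", "fr": "Merci", "wo": "Jërëjëf", "xh": "Enkosi"},
    {"en": "Good night", "de": "Gute Nacht", "fr": "Bonne nuit"},
    {"en": "Only in English"},
]


def multiway_share(tuples, min_languages=3):
    """Fraction of all sentences that sit in a tuple spanning at least
    `min_languages` languages (the threshold is an assumption)."""
    total = sum(len(t) for t in tuples)
    multi = sum(len(t) for t in tuples if len(t) >= min_languages)
    return multi / total if total else 0.0


def multiway_share_by_language(tuples, min_languages=3):
    """Same fraction, computed separately for each language."""
    total, multi = Counter(), Counter()
    for t in tuples:
        for lang in t:
            total[lang] += 1
            if len(t) >= min_languages:
                multi[lang] += 1
    return {lang: multi[lang] / total[lang] for lang in total}


if __name__ == "__main__":
    print(multiway_share(translation_tuples))
    print(multiway_share_by_language(translation_tuples))
```

In this invented sample, every Wolof and Xhosa sentence comes from a multi-way parallel tuple, while only half of the English sentences do. That mirrors, in miniature, the pattern the report describes for lower-resource languages.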

This finding underscores a pressing issue. As AI systems ingest vast amounts of web data for training, languages with limited web representation, including many spoken in African nations, face formidable obstacles to building reliable, grammatically sound large language models. With few native resources available, model builders are forced to rely heavily on the subpar translations that flood the web.

Mehak Dhaliwal, a former applied science intern at Amazon Web Services, expressed her concerns in an interview with Motherboard, stating, “We became interested in this topic because several colleagues who work in machine translation and are native speakers of low-resource languages observed that much of the internet content in their native language seemed to be machine-generated. Everyone should be aware that the content they encounter on the web may have been produced by a machine.”

The researchers also found a selection bias in the type of content that gets translated into many languages. Machine-generated, multi-way parallel translations not only dominate the translated content in lower-resource languages but also make up a substantial portion of all web content in those languages. This content tends to be simple and low in quality, and was likely created solely to generate advertising revenue. Because machine-translated text is generally less fluent and less accurate, a proliferation of such translations compounds the problem of inaccurate content, and models trained on it are more likely to hallucinate.

In the annals of machine-generated translations, there have been instances that resulted in unintentionally humorous or embarrassing interpretations. Google Translate infamously rendered “Russian Federation” as “Mordor,” the fictional dark realm from J.R.R. Tolkien’s “The Lord of the Rings,” when translating from Ukrainian into Russian. In 2020, Facebook’s translation software rendered the name of China’s President Xi Jinping as “Mr. S***hole” multiple times in English text translated from Burmese, leading to a swift apology and attribution of the mishap to a “technical error.” Even a medical prescription translation tool for Armenian speakers once provided rather unfortunate advice for a patient with a headache:

Original English: “You can take over-the-counter ibuprofen as needed for pain.”
Armenian translation (rendered back into English): “You may take anti-tank missile as much as you need for pain.”

Conclusion:

The prevalence of poor-quality machine translations on the web, particularly in languages with limited resources, poses significant challenges. As the digital landscape continues to expand, it is crucial for the market to prioritize quality control and invest in advancements in machine translation technology to ensure accurate and meaningful communication in the global online ecosystem.
