InternVL 1.5 Advances Multimodal AI with Enhanced Resolution and Bilingual Skills in Open-Source Models

  • Multimodal large language models (MLLMs) blend text and visual data processing to mimic human-like interactions.
  • Open-source MLLMs often lag behind commercial models due to limitations in handling complex visual inputs and diverse languages.
  • InternVL 1.5, developed by Shanghai AI Laboratory, SenseTime Research, and others, addresses these limitations.
  • InternVL 1.5 enhances vision comprehension with a robust vision encoder and dynamic high-resolution methodology.
  • It handles images up to 4K resolution by segmenting them into tiles, improving comprehension of intricate scenes.
  • The model is trained on a bilingual dataset, improving performance in Optical Character Recognition (OCR) and multilingual tasks.
  • InternVL 1.5 excels in OCR-centric datasets and bilingual scene comprehension, surpassing some proprietary models.

Main AI News:

The fusion of text and visual data processing within multimodal large language models (MLLMs) has been pivotal in deepening artificial intelligence’s grasp of, and interaction with, its environment. This line of research is dedicated to building systems that can interpret and respond to a blend of visual and linguistic cues, thereby emulating human-like interaction more authentically.

Yet a frequent roadblock is the limited capability of open-source models compared with their commercial counterparts. These models often fall short when processing intricate visual inputs or accommodating diverse languages, constraining their practical utility and efficacy across a spectrum of scenarios.

Traditionally, open-source MLLMs have been hindered by fixed input resolutions and a predominant focus on English-language datasets. This severely limits their usefulness on high-resolution imagery or content in other languages, impeding performance on tasks that demand nuanced visual comprehension or multilingual capability.

The latest endeavor from Shanghai AI Laboratory, SenseTime Research, Tsinghua University, Nanjing University, Fudan University, and The Chinese University of Hong Kong introduces InternVL 1.5, an open-source MLLM engineered to substantially augment the capabilities of open-source systems in multimodal comprehension. This model incorporates three pivotal enhancements aimed at bridging the performance chasm between open-source and proprietary commercial models. These enhancements include:

  1. Optimization of a robust vision encoder, InternViT-6B, via a continuous learning approach, thereby fortifying its visual comprehension capacities.
  2. Implementation of a dynamic high-resolution strategy that lets the model handle images up to 4K resolution by adapting the number and layout of image tiles to the input’s aspect ratio and resolution (see the sketch after this list).
  3. Assembly of a premium-quality bilingual dataset meticulously curated to encompass common scenarios and document images annotated with English and Chinese question-answer pairs.
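
For illustration, here is a minimal Python sketch of how such a dynamic tiling scheme can choose a tile grid for a given input. The 448-pixel tile size comes from the description later in this article; the `pick_tile_grid` helper, the `max_tiles` cap, and the tie-breaking rule are illustrative assumptions, not InternVL 1.5’s exact settings.

```python
from itertools import product

TILE = 448  # tile side length described for InternVL 1.5


def pick_tile_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """Choose a (cols, rows) grid whose aspect ratio best matches the image.

    The max_tiles cap and the tie-breaking rule are assumptions for this
    sketch, not the exact configuration used by InternVL 1.5.
    """
    image_ratio = width / height
    candidates = [
        (cols, rows)
        for cols, rows in product(range(1, max_tiles + 1), repeat=2)
        if cols * rows <= max_tiles
    ]
    # Prefer the grid whose aspect ratio is closest to the image's;
    # on ties, prefer the grid that uses more tiles (keeps more detail).
    return min(
        candidates,
        key=lambda grid: (abs(grid[0] / grid[1] - image_ratio), -(grid[0] * grid[1])),
    )


cols, rows = pick_tile_grid(3840, 2160)   # e.g. a 4K frame (16:9)
target_size = (cols * TILE, rows * TILE)  # resize target before tiling
```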

These three strides significantly strengthen the model’s performance in Optical Character Recognition (OCR) and Chinese-language tasks, enabling InternVL 1.5 to compete across a wide array of benchmarks and comparative analyses and underscoring its improved efficacy in multimodal tasks.

InternVL 1.5 adopts a segmented strategy for image processing, handling resolutions up to 4K by splitting each image into 448×448-pixel tiles, as sketched below. Because the number and arrangement of tiles adapt to the image’s aspect ratio and resolution, the model preserves fine detail and achieves a more nuanced understanding of intricate scenes and documents.
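
A complementary, purely illustrative sketch shows how the resized image could then be cropped into 448×448 tiles for the vision encoder. It uses Pillow, reuses the hypothetical `pick_tile_grid` helper from the earlier sketch, and omits details such as normalization and any global thumbnail view the real pipeline may add.

```python
from PIL import Image

TILE = 448  # tile side length described for InternVL 1.5


def tile_image(image: Image.Image, cols: int, rows: int) -> list[Image.Image]:
    """Resize the image to the chosen grid and crop it into 448x448 tiles.

    A minimal Pillow-based sketch; the real preprocessing (normalization,
    tensor conversion, any thumbnail view) is omitted.
    """
    resized = image.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]


# Usage: split a document photo into tiles that would feed the vision encoder.
img = Image.open("document_page.png")               # hypothetical input file
cols, rows = pick_tile_grid(img.width, img.height)  # helper from the sketch above
patches = tile_image(img, cols, rows)
```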

The model’s stronger linguistic ability stems from training on a diverse dataset spanning both English and Chinese and covering a wide range of scenarios and document types, which in turn lifts its performance in OCR and text-centric tasks across languages.

The model performs strongly across numerous benchmarks, excelling in particular on OCR-centric datasets and bilingual scene comprehension. InternVL 1.5 delivers state-of-the-art results, with notable gains over preceding iterations, and even surpasses certain proprietary models in specific assessments.

Conclusion:

Text-driven visual question answering attains a commendable accuracy rate of 80.6%, while document-based question answering achieves an impressive 90.9%. Across multimodal benchmarks scrutinizing models on both visual and textual comprehension, InternVL 1.5 consistently delivers competitive outcomes, frequently outpacing other open-source models and rivalling commercial counterparts.

Source