Introducing VISPROG: A Neuro-Symbolic Breakthrough in Complex Visual Task Solving with Natural Language Instructions

TL;DR:

  • Researchers introduce VISPROG, a neuro-symbolic approach for solving complex visual tasks using natural language instructions.
  • VISPROG decomposes complex tasks into simpler steps, leveraging large language models and specialized end-to-end trained models.
  • It generates complex visual programs without any task-specific training, offering flexibility and a high level of abstraction.
  • VISPROG combines modules such as language models, image processing subroutines, and computer vision models.
  • It provides interpretability, enabling users to understand and verify program logic.
  • VISPROG demonstrates versatility in tasks like compositional visual questions, NLVR, knowledge tagging, and image manipulation.
  • The system delivers significant performance improvements across all four tasks, pointing toward a new class of flexible, interpretable AI systems.

Main AI News:

The pursuit of general-purpose AI systems has driven the development of comprehensive, end-to-end trainable models, many of which aim to offer users a natural language interface. Achieving this has typically relied on massive-scale unsupervised pretraining followed by supervised multitask training, enabling these systems to handle a wide array of tasks. However, scaling these systems to an ever-growing number of complex tasks poses a significant challenge, because each new task demands a carefully curated training dataset.

In a groundbreaking study, researchers from the Allen Institute for AI explore how large language models can address this long tail of intricate tasks. They propose a novel approach that decomposes complex tasks expressed in natural language into simpler steps, each of which can be handled by a specialized end-to-end trained model or another program.

Imagine instructing a computer vision system to “Tag the seven main characters from the TV show Big Bang Theory in this image.” To carry out this instruction, the system must first understand the intent and then perform a series of steps: detecting faces, retrieving the list of Big Bang Theory’s main characters from a knowledge base, classifying each detected face against that list, and finally tagging the image with the recognized characters’ names and faces. While various vision and language systems can perform each individual step, executing the full natural language instruction end to end remains beyond the reach of current trained systems.
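
To make this concrete, here is a sketch of what such a visual program might look like. The module names (FACEDET, LIST, CLASSIFY, TAG, RESULT) follow the style of the examples in the paper, though the exact syntax shown here is illustrative rather than verbatim:

    OBJ0=FACEDET(image=IMAGE)
    LIST0=LIST(query='main characters of The Big Bang Theory',max=7)
    OBJ1=CLASSIFY(image=IMAGE,object=OBJ0,categories=LIST0)
    IMAGE0=TAG(image=IMAGE,object=OBJ1)
    FINAL_RESULT=RESULT(var=IMAGE0)

Each line assigns the output of one module to a named variable that later lines can reference.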

To address this challenge, the researchers introduce VISPROG, an innovative system that takes visual input (a single image or a collection of images) and a natural language command. VISPROG generates a sequence of instructions, referred to as a visual program, which is then executed to produce the desired result. Each line of the visual program invokes one of the system’s many supported modules, which range from off-the-shelf language models and OpenCV image processing subroutines to arithmetic and logical operators and pre-trained computer vision models. The outputs of earlier lines are consumed by subsequent modules, so intermediate results flow through the program until the final output is produced.
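
The execution model is easy to picture as a small interpreter that keeps an environment of named values and dispatches each line to a registered module. The minimal Python sketch below is our illustration of that idea rather than the authors’ implementation; the module registry, the toy modules, and the naive comma-separated argument parsing are simplifying assumptions:

    import re

    # Hypothetical module registry mapping names to Python callables.
    # Real VISPROG modules wrap models such as CLIP, or OpenCV routines.
    MODULES = {
        "EVAL": lambda expr: eval(expr),       # arithmetic/logic helper
        "UPPER": lambda text: text.upper(),    # toy stand-in module for the demo
    }

    LINE = re.compile(r"(\w+)=(\w+)\((.*)\)")

    def run_program(program, inputs):
        env = dict(inputs)  # named values visible to every later step
        for line in program.strip().splitlines():
            target, module, arg_str = LINE.match(line.strip()).groups()
            kwargs = {}
            for pair in [p.strip() for p in arg_str.split(",") if p.strip()]:
                key, value = pair.split("=", 1)
                # A bare name refers to an earlier output; quotes mark a literal.
                kwargs[key] = env[value] if value in env else value.strip("'")
            env[target] = MODULES[module](**kwargs)  # store the step's output
        return env

    program = "TEXT0=UPPER(text=CAPTION)\nANSWER0=EVAL(expr='2 + 2')"
    env = run_program(program, {"CAPTION": "a dog on a couch"})
    print(env["TEXT0"], env["ANSWER0"])  # -> A DOG ON A COUCH 4

Keeping every intermediate value in the environment is what makes later inspection of individual steps possible.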

Returning to the previous example, VISPROG invokes a face detector, GPT-3 as a knowledge retrieval system, and CLIP as an open-vocabulary image classifier within the visual program to produce the required output. This modular approach recalls Neural Module Networks (NMN), which compose differentiable neural modules in a question-specific manner to create an end-to-end trainable network for visual question answering (VQA). However, existing NMN methods either train a layout generator using weak answer supervision or rely on brittle, pre-built semantic parsers to generate deterministic module layouts. VISPROG, on the other hand, lets users construct complex programs without any training, leveraging the in-context learning ability of the GPT-3 language model with only a handful of examples. By invoking state-of-the-art models and non-neural Python subroutines, VISPROG programs also operate at a higher level of abstraction than NMNs.
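
Program generation itself can be illustrated by assembling a few-shot prompt from instruction-program pairs followed by the new instruction, which the language model then completes with a program. The example pairs below are hypothetical, and the sketch stops at prompt construction rather than assuming any particular GPT-3 API:

    # Hypothetical in-context examples pairing instructions with programs.
    IN_CONTEXT_EXAMPLES = [
        ("Replace the dog with a cat.",
         "OBJ0=SELECT(image=IMAGE,query='dog')\n"
         "IMAGE0=REPLACE(image=IMAGE,object=OBJ0,prompt='cat')\n"
         "FINAL_RESULT=RESULT(var=IMAGE0)"),
        ("How many birds are in the image?",
         "OBJ0=LOC(image=IMAGE,object='bird')\n"
         "ANSWER0=COUNT(object=OBJ0)\n"
         "FINAL_ANSWER=RESULT(var=ANSWER0)"),
    ]

    def build_prompt(instruction):
        """Assemble a few-shot prompt asking the LM to emit a visual program."""
        parts = [f"Instruction: {ins}\nProgram:\n{prog}\n"
                 for ins, prog in IN_CONTEXT_EXAMPLES]
        parts.append(f"Instruction: {instruction}\nProgram:\n")
        return "\n".join(parts)

    prompt = build_prompt("Tag the seven main characters from the TV show "
                          "Big Bang Theory in this image.")
    # The prompt would be sent to a GPT-3-style completion endpoint, and the
    # returned text is then executed, line by line, as the visual program.

Because adapting to a new task only means swapping in different in-context examples, no model weights are ever updated.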

These advantages position VISPROG as a fast, efficient, and versatile neuro-symbolic system. VISPROG also emphasizes interpretability. First, it generates programs that are easy to read, allowing users to verify their logical correctness. Second, by breaking a prediction down into intermediate steps, VISPROG lets users scrutinize the output of each step, identify flaws, and correct the logic where necessary.
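
With an interpreter like the toy one sketched earlier, this kind of inspection falls out naturally: the environment returned by run_program holds every intermediate value by name, so a user can print or visualize any step to locate a faulty module call:

    env = run_program(program, {"CAPTION": "a dog on a couch"})
    for name, value in env.items():
        print(f"{name}: {value!r}")   # inspect every intermediate output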

To visually justify its predictions, a completed program presents intermediate step outputs, such as text, bounding boxes, segmentation masks, and generated images, all linked together to illustrate the flow of information. The researchers demonstrate VISPROG’s versatility on four distinct tasks that require a combination of fundamental skills, such as image parsing, with specialized reasoning abilities involving visual manipulation and knowledge retrieval. The tasks are: answering compositional visual questions, zero-shot NLVR (Natural Language for Visual Reasoning) on image pairs, knowledge tagging of objects based on factual knowledge drawn from natural language instructions, and language-guided image editing.
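
As one illustration of how the same program style covers different tasks, an NLVR statement about an image pair can be decomposed into two VQA calls whose answers are combined by a logical expression. The module names below follow the general pattern of the paper’s examples, but the exact syntax is illustrative:

    ANSWER0=VQA(image=LEFT,question='Is there a dog in the image?')
    ANSWER1=VQA(image=RIGHT,question='Is there a dog in the image?')
    ANSWER2=EVAL(expr='{ANSWER0} and {ANSWER1}')
    FINAL_ANSWER=RESULT(var=ANSWER2)

The EVAL step substitutes the two answers into the expression and returns True only when the statement holds for both images.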

Importantly, the researchers emphasize that neither the modules nor the language model has been modified or fine-tuned in any way. Given a few in-context examples pairing natural language commands with appropriate programs, VISPROG can be adapted to a new task with minimal effort. Its ease of use and significant performance improvements, including a 2.7-point gain on the compositional VQA benchmark, robust zero-shot accuracy of 62.4% on NLVR, and strong qualitative and quantitative results on knowledge tagging and image editing, establish VISPROG as a groundbreaking neuro-symbolic system.

Conclusion:

The introduction of VISPROG represents a significant advancement for AI systems. By combining a neuro-symbolic approach with natural language instructions, VISPROG enables the construction of sophisticated visual programs for complex tasks. Its flexibility, interpretability, and performance improvements make it an attractive solution for businesses and industries seeking efficient and versatile AI systems. VISPROG sets a new standard for human-machine interaction, paving the way for more intuitive and productive applications across sectors.

Source