TL;DR:
- MIT introduces Compositional Foundation Models for Hierarchical Planning (HiP).
- HiP streamlines AI reasoning by integrating language, vision, and action for complex tasks.
- It reduces data requirements by training expert models independently.
- HiP promotes harmonized expert conclusions to ensure coherent planning.
- An iterative refinement process guarantees consistency without extensive model finetuning.
- HiP’s universal accessibility doesn’t require access to model weights.
- Promising outcomes were demonstrated in three tabletop manipulation scenarios.
Main AI News:
In the ever-evolving landscape of machine learning, MIT researchers are pioneering a transformative approach to tackle complex, long-horizon tasks. Their groundbreaking work, titled “Compositional Foundation Models for Hierarchical Planning (HiP): Integrating Language, Vision, and Action,” is poised to reshape the way AI systems navigate intricate scenarios, from preparing a simple cup of tea to solving complex planning challenges.
Hierarchical Reasoning: A Key to Efficiency
Imagine entering an unfamiliar home and facing the task of making tea. Efficiently accomplishing this seemingly routine chore involves multi-level reasoning: abstract planning (identifying the steps to heat the tea), geometric understanding (navigating the kitchen space), and precise control (maneuvering joints to lift a cup). The crux lies in ensuring that reasoning at each level aligns seamlessly with the others. In this context, the MIT study delves into the development of long-horizon task-solving bots driven by hierarchical reasoning.
The Foundation Model Paradigm
In recent times, “foundation models” have surged to the forefront of solving complex challenges in mathematics, computer vision, and natural language processing. However, extending their capabilities to address unique long-horizon decision-making problems has become a focal point. Prior research efforts have primarily relied on matched visual, linguistic, and action data, with a single neural network tasked to handle long-term tasks. Yet, the expense and complexity of collecting data across these three modalities pose significant hurdles.
A New Path Forward: Compositional Foundation Models (HiP)
Enter Compositional Foundation Models for Hierarchical Planning (HiP). The hallmark of this innovative model lies in its ability to significantly reduce the data requirements by independently training expert models on language, vision, and action data (see Figure 1). HiP leverages a powerful language model to extract subtasks from abstract language instructions. It then constructs an intricate plan by utilizing a large video diffusion model to gather vital geometric and physical information about the environment. Finally, HiP relies on a robust inverse model to translate egocentric images into actions.
Harmonizing Expert Conclusions
One of the central challenges addressed by HiP is the harmonization of conclusions derived from independently trained models. In cases where three models yield conflicting results, the naive approach of selecting the most likely outcome at each stage proves inadequate. HiP, instead, advocates for a strategy that jointly maximizes likelihood across all expert models. This approach ensures comprehensive and coherent planning.
Iterative Refinement: Ensuring Consistency
To guarantee consistency, MIT’s researchers introduce an iterative refinement technique that leverages feedback from downstream models. This technique incorporates intermediate feedback from a likelihood estimator and the action model, resulting in hierarchically consistent plans that align with the task’s objectives and the existing state and agent capabilities. Importantly, this refinement process is computationally efficient and does not require extensive model finetuning.
Universal Accessibility
What sets MIT’s approach apart is its universal accessibility. Researchers do not need access to model weights, and this strategy can be applied to all models that provide input and output API access.
Conclusion:
MIT’s HiP model represents a significant advancement in AI, enabling efficient handling of complex tasks. It reduces data needs, ensures consistent planning, and has broad applicability. This innovation holds immense potential for enhancing AI capabilities in various market sectors, offering cost-effective, versatile solutions to complex, long-term planning challenges.