Innovations in Full-Body Motion Generation: Microsoft’s HMD-NeMo Revolutionizes Mixed-Reality Experiences

TL;DR:

  • HMD-NeMo, developed by Microsoft, tackles the challenge of generating accurate full-body avatar motion in mixed-reality scenarios.
  • Existing HMD-based solutions assume the hands are always fully visible, an assumption that breaks down when egocentric hand tracking loses sight of them.
  • HMD-NeMo is a real-time neural network with temporally adaptable mask tokens (TAMT) for plausible motion in partial hand visibility scenarios.
  • The approach combines recurrent neural networks and transformers for efficient modeling.
  • It handles both Motion Controllers (MC) and Hand Tracking (HT) scenarios effectively, ensuring temporal coherence even when hands are partially out of view.
  • The training loss combines terms for data accuracy, smoothness, and auxiliary human pose reconstruction in SE(3); evaluation uses the AMASS dataset.
  • HMD-NeMo outperforms existing methods in accuracy and smoothness in motion controller scenarios, demonstrating generalizability across datasets.
  • Ablation studies highlight the significance of the spatiotemporal encoder and TAMT module.

Main AI News:

Generating accurate and plausible full-body avatar motion remains an open challenge in mixed reality. Head-Mounted Devices (HMDs) provide only sparse input signals, mainly the 6-DoF (six degrees of freedom) poses of the head and hands. Recent methods have made good progress in producing full-body motion from these inputs, but they share a common limitation: they assume the hands are always fully visible. In mixed reality this assumption often fails. When hand tracking depends on egocentric sensors, the HMD’s constrained field of view means the hands are only partially visible.

Microsoft’s Mixed Reality & AI Lab in Cambridge, UK, proposes HMD-NeMo (HMD Neural Motion Model) to address this. It is a unified neural network that generates plausible and accurate full-body motion even when the hands are only partially visible. HMD-NeMo runs online and in real time, which makes it suitable for dynamic mixed-reality scenarios.

At the core of HMD-NeMo is a spatiotemporal encoder with novel temporally adaptable mask tokens (TAMT). These tokens stand in for missing hand observations so that the model can still produce plausible motion. The encoder uses recurrent neural networks to capture temporal information efficiently and a transformer to model the relationships between the different components of the input signal.
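The idea of substituting a placeholder for a missing hand observation can be illustrated with a toy sketch. This is not the HMD-NeMo implementation: in the model, the mask tokens and the temporal state are learned by the network, whereas here a hand-written running average stands in for that state. The function name `fill_missing_hands` and the parameters `dim` and `decay` are hypothetical, introduced only for illustration.

```python
def fill_missing_hands(hand_features, dim, decay=0.5):
    """Toy stand-in for mask-token substitution (assumption, not the paper's code).

    hand_features: one entry per frame, either a list of `dim` floats
    or None when the hand is out of view.
    Returns a list with every None replaced by a placeholder that
    depends on what was seen in earlier frames.
    """
    state = None  # running summary of past frames (learned in a real model)
    filled = []
    for feat in hand_features:
        if feat is not None:
            token = list(feat)       # hand observed: use it as-is
        elif state is not None:
            token = list(state)      # hand missing: placeholder from past state
        else:
            token = [0.0] * dim      # nothing seen yet: neutral placeholder
        filled.append(token)
        # Update the running state with an exponential moving average.
        if state is None:
            state = list(token)
        else:
            state = [decay * s + (1.0 - decay) * t for s, t in zip(state, token)]
    return filled


frames = [[1.0, 1.0], None, [3.0, 3.0], None]
print(fill_missing_hands(frames, dim=2))
```

The point of the sketch is only that the placeholder changes over time with the history of observations, rather than being a single fixed token.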

The paper evaluates two scenarios: Motion Controllers (MC), where the hands are tracked with motion controllers, and Hand Tracking (HT), where the hands are tracked by egocentric hand-tracking sensors. According to the authors, HMD-NeMo is the first approach to address both scenarios within a unified framework. In the HT scenario, where hands may leave the field of view, the temporally adaptable mask tokens help maintain temporal coherence.

The model is trained with a loss function that combines terms for data accuracy, smoothness, and auxiliary tasks for human pose reconstruction in SE(3). Experiments are run on the AMASS dataset, a large collection of human motion sequences converted into 3D human meshes. Performance is measured with mean per-joint position error (MPJPE) and mean per-joint velocity error (MPJVE).
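The two metrics can be sketched in a few lines, using their common definitions: MPJPE averages the Euclidean distance between predicted and ground-truth joint positions, and MPJVE does the same for joint velocities. The finite-difference velocity and the `fps` scaling below are assumptions; the exact conventions used in the paper (units, joint set, alignment) are not stated in this article.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: list of frames; each frame is a list of (x, y, z) joints.
    """
    dists = [
        math.dist(p, t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(dists) / len(dists)


def velocities(seq, fps=1.0):
    """Finite-difference joint velocities between consecutive frames."""
    return [
        [tuple((b - a) * fps for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def mpjve(pred, target, fps=1.0):
    """Mean per-joint velocity error: MPJPE applied to the velocities."""
    return mpjpe(velocities(pred, fps), velocities(target, fps))


# Two frames, one joint: the prediction overshoots by one unit in frame 2.
target = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
print(mpjpe(pred, target), mpjve(pred, target))
```

A constant offset between prediction and ground truth raises MPJPE but leaves MPJVE at zero, which is why the two metrics are reported together: one captures accuracy, the other smoothness.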

In comparisons with state-of-the-art approaches in the motion controller scenario, HMD-NeMo achieves higher accuracy and produces smoother motion. It also generalizes well: in cross-dataset evaluations it outperforms existing methods across multiple datasets.

A series of ablation studies examines the impact of individual components, with particular focus on how well the TAMT module handles missing hand observations. These studies show that HMD-NeMo’s design choices, including the spatiotemporal encoder, contribute significantly to its performance. Taken together, the results bring mixed-reality avatars a step closer to immersive, realistic full-body motion.

Conclusion:

Microsoft’s HMD-NeMo addresses a longstanding challenge in mixed reality: generating accurate full-body motion even when the hands are only partially visible. This could improve immersion and has potential applications in gaming, simulation, training, and beyond. Its results in both the motion controller and hand tracking scenarios set a strong reference point for future work on more realistic mixed-reality avatars.
