TL;DR:
- Twelve Labs specializes in AI models for in-depth video comprehension.
- They aim to bridge natural language with video content, enabling advanced applications.
- Bias mitigation is a key concern in their models, with plans for transparency.
- Twelve Labs distinguishes itself with fine-tuning features for tailored video analysis.
- Pegasus-1, their latest model, offers comprehensive video analysis capabilities.
- The startup has attracted 17,000 developers and partnerships across various industries.
- Recent funding of $10 million from Nvidia, Intel, and Samsung Next boosts total funding to $27 million.
Main AI News:
Twelve Labs is spearheading the development of AI models with a deep understanding of video. Text-generating AI has its merits, but models that can interpret images and video as adeptly as they parse text open the door to a far broader range of powerful applications.
Twelve Labs, a San Francisco-based startup, trains AI models to, in the words of co-founder and CEO Jae Lee, “solve complex video-language alignment problems.” In an email interview with TechCrunch, Lee elaborated: “The vision of Twelve Labs is to help developers build programs that can see, listen, and understand the world as we do.”
At their core, Twelve Labs’ models map natural language to what is happening inside a video, including actions, objects, and even background sounds. This lets developers build applications that search through videos, classify scenes, extract topics, and automatically summarize clips into chapters, among other tasks. Use cases range from ad insertion and content moderation to media analytics and the automatic generation of highlight reels, blog post headlines, and tags directly from video.
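To make that developer workflow concrete, here is a minimal, hypothetical sketch of how a video-understanding API of this kind might be called. The endpoint paths, parameter names, and response fields are illustrative assumptions, not Twelve Labs’ actual interface.

```python
# Hypothetical sketch of a video-understanding API workflow.
# Endpoints, parameters, and response fields are illustrative assumptions.
import requests

API_BASE = "https://api.example-video-ai.com/v1"  # placeholder base URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}

# 1. Index a video so the platform can analyze its visuals, audio, and speech.
index_resp = requests.post(
    f"{API_BASE}/videos",
    headers=HEADERS,
    json={"url": "https://example.com/keynote.mp4"},
)
video_id = index_resp.json()["video_id"]

# 2. Search the indexed video with a natural-language query.
search_resp = requests.post(
    f"{API_BASE}/search",
    headers=HEADERS,
    json={"video_id": video_id, "query": "speaker demos the new product on stage"},
)
for hit in search_resp.json()["results"]:
    print(hit["start"], hit["end"], hit["confidence"])

# 3. Ask for chapter summaries derived from the same index.
chapters_resp = requests.post(
    f"{API_BASE}/summarize",
    headers=HEADERS,
    json={"video_id": video_id, "type": "chapters"},
)
for chapter in chapters_resp.json()["chapters"]:
    print(chapter["start"], chapter["title"])
```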
Models can, however, amplify the biases present in the data they are trained on. A video understanding model trained primarily on clips of local news, which is often sensationalized and racialized, could end up reproducing racist and sexist patterns.
Lee emphasizes that Twelve Labs works to meet internal bias and “fairness” metrics for its models before release, and the company plans to publish model-ethics-related benchmarks and datasets in the future, though no specifics were shared.
Lee distinguishes Twelve Labs from large language models like ChatGPT, noting that its platform is built specifically for processing and understanding video, integrating the visual, audio, and speech components within it. “We have really pushed the technical limits of what is possible for video understanding,” he says.
While tech giants such as Google are also building multimodal models for video understanding, Twelve Labs differentiates itself through the quality of its models and through the platform’s fine-tuning features, which let customers adapt those models for “domain-specific” video analysis.
On the model front, Twelve Labs has introduced Pegasus-1, a new multimodal model designed to understand a wide range of prompts for whole-video analysis. Pegasus-1 can be prompted to generate a detailed report about a video, or simply to return highlights with timestamps.
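As a rough illustration of that prompt-driven workflow, the sketch below shows how a generation call against a Pegasus-1-style model might look. The endpoint, parameter names, and response shape are assumptions for illustration only.

```python
# Hypothetical sketch of prompting a Pegasus-1-style model for video analysis.
# The endpoint, parameters, and response shape are illustrative assumptions.
import requests

API_BASE = "https://api.example-video-ai.com/v1"  # placeholder base URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}

# A free-form prompt: ask for a full report plus timestamped highlights.
prompt = (
    "Summarize this video as a detailed report, then list the top five "
    "highlights with start and end timestamps."
)

resp = requests.post(
    f"{API_BASE}/generate",
    headers=HEADERS,
    json={"video_id": "VIDEO_ID_FROM_INDEXING", "prompt": prompt},
)
result = resp.json()

print(result["report"])          # long-form summary of the video
for h in result["highlights"]:   # timestamped highlights
    print(h["start"], h["end"], h["text"])
```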
Lee underscores the significance of this for enterprise customers: “Enterprise organizations recognize the potential of leveraging their vast video data for new business opportunities … However, the limited and simplistic capabilities of conventional video AI models often fall short of catering to the intricate understanding required for most business use cases.” With powerful multimodal video understanding foundation models, he argues, enterprises can reach a level of video comprehension closer to human cognition without manual analysis.
Twelve Labs has been steadily gaining traction, with a user base of 17,000 developers since its private beta launch in early May. The company has also forged partnerships across diverse industries, including sports, media and entertainment, e-learning, and security, with clients such as the NFL.
Fundraising remains a pivotal part of Twelve Labs’ strategy. The company recently announced a $10 million strategic funding round backed by Nvidia, Intel, and Samsung Next, bringing its total raised capital to $27 million.
Lee describes these strategic partnerships as accelerators for the company’s research, product development, and distribution efforts, and he sees the investment driving innovation in video understanding while enabling Twelve Labs to offer powerful models tailored to diverse customer needs. In his words, “We’re moving the industry forward in ways that free companies up to do incredible things.”
Conclusion:
Twelve Labs’ unveiling of the Pegasus-1 model, along with their strategic funding, underscores the growing demand for advanced video understanding in various industries. Their commitment to bias mitigation and fine-tuning positions them as a player to watch in the evolving AI market. As AI-driven video analysis gains momentum, Twelve Labs is well-poised to drive innovation and meet the needs of diverse customers.