TL;DR:
- Large foundation models such as CLIP, Stable Diffusion, and Flamingo have driven major progress in multimodal deep learning.
- Joint text-image modeling is crucial in artificial intelligence due to its ability to generate high-resolution imagery and tackle complex problems.
- The scarcity of high-quality, large-scale annotated video and audio datasets hinders research in multimodal deep learning.
- LAION AI introduces video2dataset, an open-source tool for fast and extensive video and audio dataset curation.
- Video2dataset enables the downloading, merging, and reshaping of video datasets, applying a range of transformations along the way.
- The tool provides sharding, distributed reading, subsampling, logging, and reprocessing functionality.
- Researchers can use video2dataset to generate improved models and make advancements in video and audio applications.
Main AI News:
In recent years, the emergence of large foundation models like CLIP, Stable Diffusion, and Flamingo has revolutionized multimodal deep learning. Among the various applications, joint text-image modeling has become one of the most significant and active problems in the field of artificial intelligence. These models exhibit exceptional capabilities, generating stunning high-resolution imagery and tackling challenging downstream problems. What’s remarkable is that despite their distinct tasks and designs, these models share three key properties that contribute to their performance: a simple and stable objective function during (pre-)training, a well-investigated scalable model architecture, and, most crucially, access to large and diverse datasets.
However, as of 2023, multimodal deep learning predominantly focuses on text-image modeling, with limited attention given to other modalities such as video and audio. This raises the question: why don’t we have robust foundation models for these modalities, considering that the training techniques used are typically modality-agnostic? The answer lies in the scarcity of high-quality, large-scale annotated datasets. This lack of clean data acts as a barrier to research and development in the video domain, unlike image modeling, where established datasets like LAION-5B, DataComp, and COYO-700M, along with scalable tools like img2dataset, enable effective scaling.
To overcome this data challenge and unlock groundbreaking initiatives like high-quality video and audio creation, improved pre-trained models for robotics, and movie audio description for the blind community, researchers emphasize the need to address this problem through multimodal research and open-source contributions.
Introducing video2dataset: an innovative open-source program designed for fast and extensive curation of video and audio datasets. This powerful tool has been successfully tested on numerous large video datasets, proving its adaptability, extensibility, and ability to provide a wide range of transformations. Detailed case studies and instructions on replicating this method can be found in the repository.
Using video2dataset, researchers have achieved strong results by downloading individual video datasets, merging them, and transforming them into more manageable forms with additional features and significantly larger sample counts. For a deeper understanding of this processing chain, please refer to the examples section of the repository. The results obtained by training various models on datasets curated with video2dataset showcase the tool’s efficacy, and a forthcoming study by the researchers will discuss the new dataset and the insights derived from it in detail.
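To make the workflow concrete, here is a minimal sketch of a basic download step using video2dataset’s Python entry point. The parameter names (url_list, input_format, url_col, caption_col, output_folder, output_format) follow the project’s README, while the input file and column names are hypothetical placeholders; treat this as a sketch to verify against the repository rather than a definitive invocation.

```python
from video2dataset import video2dataset

# Download the videos listed in a CSV of URLs and captions and write them
# out as WebDataset tar shards (file and column names are placeholders).
video2dataset(
    url_list="video_urls.csv",
    input_format="csv",
    url_col="contentUrl",        # column containing the video URL
    caption_col="name",          # column containing the caption, if any
    output_folder="dataset",
    output_format="webdataset",  # tar shards loadable as a WebDataset
)
```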
Let’s delve into what video2dataset does.
Because the WebDataset format is accepted as input, video2dataset can slot seamlessly into a processing chain to reprocess previously downloaded data. For instance, you can take WebVid data downloaded in an earlier run and execute video2dataset to calculate the optical flow for each video and store it in dedicated metadata shards.
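A hedged sketch of such a reprocessing call is shown below. The stage name "optical_flow" and the shard pattern are assumptions based on the repository’s documentation and examples, so they should be checked against the current version.

```python
from video2dataset import video2dataset

# Re-read previously downloaded WebDataset shards and compute optical flow
# for each video, writing the results into dedicated metadata shards.
video2dataset(
    url_list="dataset/{00000..00009}.tar",  # shards from an earlier download
    input_format="webdataset",
    output_folder="dataset_optical_flow",
    output_format="webdataset",
    stage="optical_flow",                   # assumed stage name for this pass
)
```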
Architecture
Built upon the foundations of img2dataset, video2dataset operates by taking a list of URLs and associated metadata as input and converting it into a WebDataset that can be effortlessly loaded with a single command. The resulting WebDataset can be further reprocessed, preserving the same shard contents for additional changes. Now, let’s explore how video2dataset works in more detail.
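As an illustration of that loading step, the sketch below streams the resulting shards with the webdataset library. The shard pattern and sample keys ("mp4", "json") are assumptions that depend on how the dataset was written.

```python
import json
import webdataset as wds

# Stream samples from the tar shards produced by video2dataset; each sample
# pairs the raw video bytes with its JSON metadata.
dataset = wds.WebDataset("dataset/{00000..00009}.tar").to_tuple("mp4", "json")

for video_bytes, meta_bytes in dataset:
    meta = json.loads(meta_bytes)
    print(f"{len(video_bytes)} video bytes, caption: {meta.get('caption')}")
    break
```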
Sharding
The initial step involves partitioning the input data so that it is evenly distributed among the workers. These input shards are temporarily cached, and a reliable one-to-one mapping between input shards and their corresponding output shards enables fault-tolerant recovery: if dataset processing terminates unexpectedly, valuable time can be saved by skipping any input shard for which the corresponding output shard already exists.
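The recovery logic is conceptually simple. The sketch below illustrates the idea only (it is not video2dataset’s actual implementation): an input shard is skipped whenever its corresponding output shard already exists from a previous run.

```python
import os

def shards_to_process(input_shards, output_dir):
    """Illustrative sketch of shard-level recovery: keep only the input shards
    whose one-to-one output shard has not been written yet."""
    remaining = []
    for shard_id, shard in enumerate(input_shards):
        output_shard = os.path.join(output_dir, f"{shard_id:05d}.tar")
        if not os.path.exists(output_shard):
            remaining.append((shard_id, shard))
    return remaining
```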
Distribution and Reading
Subsequently, workers take turns reading and processing the samples contained within the shards. video2dataset offers three distribution modes: multiprocessing, pyspark, and slurm. The multiprocessing mode is well-suited for single-machine jobs, while the pyspark and slurm modes facilitate scaling across multiple machines. The reading strategy depends on the format of the incoming dataset. If the data is a table of URLs, video2dataset fetches the corresponding videos from the internet and adds them to the dataset; it supports a wide range of video platforms by leveraging yt-dlp to download videos that cannot be fetched directly. On the other hand, if the video samples come from an existing WebDataset, the corresponding dataloader reads the video bytes or frames directly from its shards.
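In practice, the distribution settings live in the job configuration. The dictionary below is a hypothetical illustration of such a section; the field names follow the repository’s example configs and should be verified against the current version.

```python
# Hypothetical distribution section of a video2dataset config: how many worker
# processes and threads to use, and which distributor to run the job with.
distribution_config = {
    "distribution": {
        "processes_count": 16,       # worker processes per machine
        "thread_count": 32,          # download threads per worker
        "distributor": "slurm",      # or "multiprocessing" / "pyspark"
    },
}
```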
Subsampling
Once a video has been read and the worker has obtained its bytes, those bytes are passed through a pipeline of subsamplers according to the job configuration. This stage offers optional downsampling of the video in terms of frame rate and resolution, clipping, scene detection, and more. Subsamplers are also used to extract and attach metadata such as resolution and compression information, synthetic captions, optical flow, and other input modalities. Adding a new transformation to video2dataset, or modifying an existing one, is a straightforward process that requires minimal changes elsewhere in the repository.
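As a rough illustration, a subsampling section of the job configuration might look like the hypothetical dictionary below, combining resolution and frame-rate downsampling with clipping. The subsampler names and argument keys are assumptions modeled on the repository’s example configs.

```python
# Hypothetical subsampling section of a video2dataset config: downsample each
# video to 224 pixels and 5 fps, then cut it into clips of 2-20 seconds.
subsampling_config = {
    "subsampling": {
        "ResolutionSubsampler": {"args": {"video_size": 224, "resize_mode": "scale"}},
        "FrameSubsampler": {"args": {"frame_rate": 5}},
        "ClippingSubsampler": {"args": {"min_length": 2.0, "max_length": 20.0}},
    },
}
```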
Logging
Throughout the entire process, video2dataset maintains detailed logs at multiple checkpoints. Completing a shard generates an associated “_stats.json” file containing essential information, including the total number of samples handled, the percentage processed successfully, and details of any errors encountered. Additionally, video2dataset offers seamless integration with Weights & Biases (wandb), enabling detailed performance reporting and metrics on successes and failures with a single argument. These capabilities prove invaluable for benchmarking and for estimating the cost of entire jobs.
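Because every completed shard gets its own “_stats.json” file, job-level statistics can be aggregated with a few lines of Python. Only the file name comes from the tool; the field names used below ("count", "successes") are assumptions and may differ in the actual stats files.

```python
import glob
import json

# Sum the per-shard stats files written by video2dataset to get an overall
# success rate for the job (field names are assumed, not guaranteed).
total, successes = 0, 0
for path in glob.glob("dataset/*_stats.json"):
    with open(path) as f:
        stats = json.load(f)
    total += stats.get("count", 0)
    successes += stats.get("successes", 0)

if total:
    print(f"{successes}/{total} samples processed successfully ({100 * successes / total:.1f}%)")
```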
Writing
Finally, video2dataset stores the processed data in output shards at user-specified locations, ready for subsequent training or reprocessing operations. The dataset can be written in several formats, each consisting of shards containing a fixed number of samples: directories, tar files, TFRecord files, and parquet files. The directories format is useful for debugging smaller datasets, while the tar files follow the WebDataset format for efficient loading.
Reprocessing
video2dataset allows previously generated output datasets to be reprocessed by reading the output shards and passing the samples through new transformations. This capability is highly advantageous for video datasets, given their often substantial size and complexity, because it lets researchers carefully downsample the data without repeatedly downloading large datasets. The examples in the repository provide a practical illustration of this reprocessing functionality, along the lines of the optical-flow sketch shown earlier.
By combining the power of multimodal deep learning with the efficiency and scalability of video2dataset, researchers can unlock new frontiers in video and audio analysis, enabling groundbreaking applications across various domains. With its extensive capabilities and user-friendly design, video2dataset stands poised to revolutionize the process of curating video and audio datasets.
Conclusion:
The introduction of video2dataset by LAION AI represents a significant advancement in the field of multimodal deep learning. By addressing the scarcity of high-quality video and audio datasets, this open-source tool opens up new possibilities for research and development. The ability to curate, merge, and reshape video datasets efficiently and at scale enhances the capabilities of deep learning models. This innovation will have a profound impact on various industries, enabling high-quality video and audio creation, improved pre-trained models for robotics, and enhanced movie audio description for the visually impaired community. Furthermore, video2dataset’s open-source nature fosters collaboration and knowledge sharing, paving the way for further advances in multimodal research. Overall, this development signifies a promising future for the market, unlocking new opportunities for AI-driven video and audio applications.