Streamlining ML Workflows: Metaflow Revolutionizes Data Science at Netflix

TL;DR:

  • Netflix’s machine learning infrastructure team developed Metaflow to simplify non-data science tasks for data scientists.
  • Metaflow standardizes processes, allowing data scientists to focus on ML activities using Python or R.
  • It automates access to data and compute, streamlines ML operations (MLOps), and ensures reproducibility.
  • Metaflow enables cost optimization by leveraging different cloud instance types within ML workflows.
  • It supports popular ML frameworks and provides a GUI and decorators for interaction.
  • Metaflow has gained widespread adoption among top companies and has an active open-source community.
  • Outerbounds, co-founded by Metaflow’s creator, offers a managed version with enhanced security and performance.
  • Outerbounds focuses on reducing cloud spending and aims to enable on-prem GPU utilization.

Main AI News:

In the realm of data science, the core focus is on exactly that: science. However, data scientists often find themselves sidetracked by additional tasks like building data pipelines and managing compute resources for machine learning (ML) training. This diversion from their primary domain can be quite frustrating, hindering their productivity and job satisfaction.

To address this challenge and ensure the happiness of its 300 data scientists, Netflix’s machine learning infrastructure team, led by Savin Goyal, developed a novel framework in 2017. This framework, known as Metaflow, effectively abstracts away non-data science activities, allowing data scientists to dedicate more time to their primary work. Metaflow was subsequently released as an open-source project by Netflix in 2019 and has gained significant adoption across various domains.

Datanami recently had the opportunity to interview Goyal about the motivations behind creating Metaflow, its functionalities, and the benefits that an enterprise version, developed at his startup Outerbounds, brings to customers.

According to Goyal, the raw ingredients of machine learning include data storage, data management, compute orchestration, and MLOps concerns such as versioning, experiment tracking, and model deployment. However, he emphasizes that data scientists often struggle to move smoothly between the separate tools that handle each of these. This is precisely where Metaflow steps in.

Metaflow plays a pivotal role by standardizing processes and tasks, allowing data scientists to fully focus on machine learning activities using familiar languages such as Python or R. Goyal describes Metaflow as a means to make data scientists “full stack” practitioners.

Netflix, as an enterprise, recognizes the significance of providing a common platform that enhances the productivity of its data scientists. Goyal emphasizes the importance of overcoming internal system complexity while delivering value through machine learning. With Metaflow, Netflix’s data scientists, who refer to themselves as machine learning engineers, no longer need to concern themselves with connecting to various internal data sources or obtaining access to ample computing resources. Metaflow automates these aspects, empowering data scientists to concentrate on training and inference pipelines at scale on Netflix’s AWS-based cloud platform.

Furthermore, Metaflow offers MLOps capabilities that enable data scientists to document their work effectively. Through features like code snapshots, Metaflow helps make work reproducible, a critical property often lacking in traditional machine learning. Goyal highlights the significance of reproducibility in facilitating collaboration and knowledge sharing within teams.

In addition to its reproducibility and automation features, Metaflow enables users to optimize costs by mixing and matching different cloud instance types within a given ML workflow. Goyal provides an example where a data scientist wants to train an ML model using a large dataset hosted in Snowflake. Initially, a memory-intensive analysis process is performed, followed by training models on GPUs. Finally, deploying the model for inference requires fewer resources. Metaflow allows the separation of these workflow stages into different instance types, resulting in reduced costs.
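The cost argument in Goyal's example can be made concrete with a little arithmetic. The sketch below is plain Python, not Metaflow code: the instance catalog, the prices, the stage sizes, and the helper names (`cheapest_fit`, `per_stage_cost`, `single_instance_cost`) are all hypothetical and chosen only for illustration. It compares running every stage on one machine large enough for the whole workflow against giving each stage the cheapest machine that fits it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    memory_gb: int
    gpus: int
    hourly_usd: float


@dataclass(frozen=True)
class Stage:
    name: str
    memory_gb: int
    gpus: int
    hours: float


# Hypothetical catalog: names and prices are invented, not real cloud pricing.
CATALOG = [
    InstanceType("small-cpu", 8, 0, 0.5),
    InstanceType("highmem-cpu", 256, 0, 3.0),
    InstanceType("gpu", 64, 1, 10.0),
    InstanceType("highmem-gpu", 256, 1, 12.0),
]

# Hypothetical stages mirroring the example: analysis, GPU training, deployment.
STAGES = [
    Stage("analyze", 200, 0, 2.0),
    Stage("train", 32, 1, 4.0),
    Stage("deploy", 4, 0, 1.0),
]


def cheapest_fit(memory_gb, gpus):
    """Return the cheapest catalog entry that satisfies the requirements."""
    candidates = [
        inst for inst in CATALOG
        if inst.memory_gb >= memory_gb and inst.gpus >= gpus
    ]
    return min(candidates, key=lambda inst: inst.hourly_usd)


def per_stage_cost(stages):
    """Cost when each stage runs on the cheapest instance that fits it."""
    return sum(
        s.hours * cheapest_fit(s.memory_gb, s.gpus).hourly_usd for s in stages
    )


def single_instance_cost(stages):
    """Cost when one instance, sized for the largest needs, runs every stage."""
    inst = cheapest_fit(
        max(s.memory_gb for s in stages), max(s.gpus for s in stages)
    )
    return sum(s.hours for s in stages) * inst.hourly_usd


if __name__ == "__main__":
    print("per-stage instances:", per_stage_cost(STAGES))        # 46.5
    print("single large instance:", single_instance_cost(STAGES))  # 84.0
```

With these made-up numbers, splitting the stages costs 46.5 against 84.0 for a single oversized instance, because the GPU is only paid for during training and the large memory only during analysis.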

Another significant aspect of Metaflow is its flexibility. Data scientists can seamlessly work with their preferred ML frameworks, including TensorFlow, PyTorch, scikit-learn, XGBoost, and more. While a graphical user interface (GUI) is available, the primary interaction with Metaflow occurs through decorators in Python or R code. At runtime, these decorators determine the code execution flow, empowering data scientists who possess existing data science knowledge and seek a solution that puts them in control while abstracting away infrastructure concerns.
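To illustrate how decorators can drive execution flow at runtime, here is a minimal stand-in written in plain Python. It is not Metaflow's implementation or API: `MiniFlow`, its `run` method, and the `step` decorator below are hypothetical names defined in the sketch itself, loosely modeled on the step-and-transition style Metaflow uses. The data scientist writes ordinary methods, marks them as steps, and names the next step; the small runner takes care of ordering.

```python
def step(func):
    """Mark a method as a workflow step (illustrative stand-in only)."""
    func.is_step = True
    return func


class MiniFlow:
    """Toy runner: starts at `start` and follows each step's declared successor."""

    def next(self, target):
        # Record which step should run after the current one.
        self._next = target

    def run(self):
        executed = []
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not marked with @step")
            self._next = None  # a step that sets no successor ends the flow
            current()
            executed.append(current.__name__)
            current = self._next
        return executed


class TrainFlow(MiniFlow):
    @step
    def start(self):
        self.data = [1, 2, 3, 4]
        self.next(self.train)

    @step
    def train(self):
        # Stand-in for model training: just compute the mean.
        self.model = sum(self.data) / len(self.data)
        self.next(self.end)

    @step
    def end(self):
        self.summary = f"mean={self.model}"


if __name__ == "__main__":
    flow = TrainFlow()
    print(flow.run())    # ['start', 'train', 'end']
    print(flow.summary)  # mean=2.5
```

The point of the design is that the flow class contains only data science logic. Anything the runner needs to know, such as which methods are steps and in what order they run, is carried by the decorator and the transition calls rather than by infrastructure code.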

Since its initial open-source release in 2019, Metaflow has witnessed extensive adoption by numerous companies, including Goldman Sachs, Autodesk, Amazon, S&P Global, Dyson, Intel, Zillow, Merck, WarnerMedia, DraftKings, and even CNN, which reported an 8x performance boost in models put into production over time.

On GitHub, the Metaflow project has amassed an impressive 7,000 stars, placing it among the top projects in this domain. The project’s Slack channel is vibrant, with approximately 3,000 active members. Metaflow has also expanded its compatibility beyond AWS to include Microsoft Azure, Google Cloud, Kubernetes, and hosted clouds from Oracle and Dell.

In 2021, Goyal co-founded Outerbounds with Ville Tuulos, a former Netflix colleague, and Oleg Avdeev, previously associated with MLOps vendor Tecton. At Outerbounds, Goyal and his team continue to spearhead the development of the open-source Metaflow project. Recently, Outerbounds launched a hosted version of Metaflow, enabling users to quickly get started on AWS. Because it controls the infrastructure deployment, Outerbounds can offer guarantees regarding security, performance, and fault tolerance in its managed offering that go beyond what is possible with the open-source version.

Outerbounds places great emphasis on reducing cloud spending, particularly with the current scarcity and high costs of GPUs. The company aims to eventually enable customers to leverage on-premises GPUs, provided there is a connection to their chosen hyperscaler. Goyal emphasizes the importance of maximizing data throughput and ensuring optimal resource utilization, going beyond what cloud providers typically offer.

With Metaflow and its hosted offering from Outerbounds, data scientists can streamline their ML workflows, significantly enhancing their productivity while minimizing infrastructure concerns. By abstracting away non-data science activities, empowering collaboration, and optimizing costs, Metaflow has become a valuable tool for data scientists across various industries.

Conclusion:

The development and widespread adoption of Metaflow by Netflix and numerous companies demonstrate the demand for streamlining ML workflows and empowering data scientists. Metaflow’s ability to simplify non-data science tasks, standardize processes, and optimize costs not only improves data scientists’ productivity and job satisfaction but also fosters collaboration and knowledge sharing within teams. The success of Metaflow in the market underscores the growing importance of tools that enable efficient and reproducible ML workflows, with potential implications for the future of data science and the broader market.
