BioAutoMATED: Revolutionizing Biologists’ Access to Machine Learning

TL;DR:

  • Scientists at the Wyss Institute have developed BioAutoMATED, an AutoML platform tailored for biologists.
  • BioAutoMATED enables biologists to leverage machine learning without requiring ML expertise.
  • The platform can process nucleic acids, peptides, and glycans as input data and provides comparable performance to other AutoML platforms.
  • BioAutoMATED combines three existing AutoML tools, generating standardized output results for easy comparison.
  • It helps biologists recognize patterns, ask better questions, and quickly find answers within a single framework.
  • The platform has been successfully applied to RNA, peptide, and glycan sequences, yielding valuable insights.
  • Future integration of BioAutoMATED into the AutoML landscape could extend its capabilities beyond biological sequences.

Main AI News:

In the realm of scientific research, data has become abundant, thanks to the plummeting costs of sequencing technology and the exponential growth of available computing power. However, mining through vast volumes of data to extract valuable insights is akin to searching for a microscopic needle in a colossal haystack. While machine learning (ML) and other artificial intelligence (AI) tools can expedite the data analysis process, they often remain inaccessible to non-ML experts. Addressing this challenge, a team of scientists at the renowned Wyss Institute for Biologically Inspired Engineering at Harvard University and MIT has unveiled a groundbreaking solution—an all-encompassing automated machine learning (AutoML) platform tailored explicitly for biologists with limited to no ML experience.

Named BioAutoMATED, the platform accepts nucleic acids, peptides, or glycans as input data and performs comparably to existing AutoML platforms while demanding minimal user input. It is described in a paper recently published in Cell Systems and is available for download on GitHub.

BioAutoMATED addresses the needs of individuals who lack the proficiency to construct customized ML models. It caters to those who often ponder questions like, “I possess this remarkable dataset, but can ML even be applied to it? How do I transform it into an ML model? The intricacies of ML impede my progress with this dataset, so how can I overcome this hurdle?” According to co-first author Jackie Valeri, a graduate student in the lab of Wyss Core Faculty member Jim Collins, Ph.D., “We aimed to simplify the process of harnessing the power of ML and AutoML for biologists and other domain experts, enabling them to address fundamental questions and unravel meaningful insights in the realm of biology.”

The idea for BioAutoMATED did not originate in the laboratory. Instead, it was conceived during a lunchtime conversation among the team members. Valeri, along with co-first authors Luis Soenksen, Ph.D., and Katie Collins, realized that despite the Wyss Institute’s standing as a world-class hub for biological research, only a select few experts possessed the capability to construct and train ML models that could profoundly benefit their work. Soenksen, a Postdoctoral Fellow at the Wyss Institute and a seasoned entrepreneur in the science and technology sector, expressed the team’s commitment to addressing this disparity.

“We recognized the necessity of rectifying this disparity to position the Wyss Institute at the forefront of the AI biotech revolution. Moreover, we desired the development of these tools to be driven by biologists, for biologists,” Soenksen explained. While the concept of AI being the future is now widely acknowledged, this realization was not as apparent four years ago when the team conceived the idea. However, as Soenksen pointed out, the team’s goal expanded beyond serving the Wyss Institute, recognizing the vast potential of their creation.

Existing AutoML systems have simplified the process of generating ML models from datasets. Nonetheless, they often exhibit limitations. For instance, each AutoML tool focuses exclusively on a particular type of model, such as neural networks, during the search for an optimal solution. Consequently, the resulting models are restricted to a narrow range of possibilities, disregarding the potential advantages of alternative model types. Additionally, most AutoML tools are not specifically designed to handle biological sequences as input data.

While some tools leverage language models for analyzing biological sequences, they lack automation features and are challenging to utilize. In their pursuit of constructing a robust all-in-one AutoML platform for biology, the team modified three existing AutoML tools, each utilizing a distinct approach to model generation: AutoKeras, which seeks optimal neural networks; DeepSwarm, which employs swarm-based algorithms to explore convolutional neural networks; and TPOT, which explores non-neural networks using various methods such as genetic programming and self-learning. BioAutoMATED amalgamates the output results from all three tools, enabling users to compare them effortlessly and determine which type of model yields the most valuable insights from their data.
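The idea of standardized, side-by-side comparison can be sketched in a few lines of Python. This is a minimal illustration and not BioAutoMATED's code: the three candidate "models" are hypothetical toy functions standing in for the best model found by each search, the sequences and labels are made-up toy data, and R² is an assumed choice of shared metric.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = mean(y_true)
    ss_tot = sum((y - mu) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def compare_models(
    models: Dict[str, Callable[[str], float]],
    seqs: Sequence[str],
    y_true: Sequence[float],
) -> List[Tuple[str, float]]:
    """Score every candidate on the same data with the same metric, best first."""
    scores = [(name, r_squared(y_true, [m(s) for s in seqs])) for name, m in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data (illustrative only): the label is simply the GC fraction of each sequence.
seqs = ["AAAA", "GCGC", "ATGC", "GGGA"]
labels = [0.0, 1.0, 0.5, 0.75]

# Hypothetical stand-ins for the best model returned by three different searches.
candidates = {
    "gc_fraction": lambda s: (s.count("G") + s.count("C")) / len(s),
    "g_fraction": lambda s: s.count("G") / len(s),
    "constant_baseline": lambda s: 0.5625,  # always predicts the mean label
}

ranking = compare_models(candidates, seqs, labels)
for name, score in ranking:
    print(f"{name:>18}: R^2 = {score:.3f}")
```

Because every candidate is scored on identical data with an identical metric, the ranking reflects the models themselves and not differences in how each tool reports its results.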

The team ensured that BioAutoMATED could seamlessly handle DNA, RNA, amino acid, and glycan sequences of any length, type, or biological function. The platform automatically pre-processes the input data and generates models capable of predicting biological functions solely based on sequence information. Additionally, the platform incorporates several features to help users gauge the need for additional data gathering to enhance output quality, comprehend which sequence features were given more emphasis by the models (indicating potential biological significance), and design novel sequences for future experiments.
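The article does not say how the platform pre-processes sequences. One common approach, shown here as a sketch under that assumption, is to one-hot encode each sequence and zero-pad shorter ones so that every sequence in a batch has the same shape. The function names and the four-letter DNA alphabet below are illustrative choices, not BioAutoMATED's API.

```python
from typing import List, Sequence

DNA_ALPHABET = "ACGT"  # assumed alphabet; RNA would use "ACGU"


def one_hot_encode(seq: str, length: int, alphabet: str = DNA_ALPHABET) -> List[List[int]]:
    """Return a length x len(alphabet) matrix, zero-padded or truncated on the right."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = []
    for pos in range(length):
        row = [0] * len(alphabet)
        if pos < len(seq):
            ch = seq[pos].upper()
            if ch in index:  # symbols outside the alphabet (e.g. "N") stay all-zero
                row[index[ch]] = 1
        matrix.append(row)
    return matrix


def encode_batch(seqs: Sequence[str], alphabet: str = DNA_ALPHABET) -> List[List[List[int]]]:
    """Encode all sequences, padding each to the length of the longest one."""
    length = max(len(s) for s in seqs)
    return [one_hot_encode(s, length, alphabet) for s in seqs]


batch = encode_batch(["ACG", "TTGCA"])
print(len(batch), len(batch[0]), len(batch[0][0]))  # sequences, positions, symbols
```

Peptides or glycans would need a different alphabet (for example, the twenty standard amino acids), but the same padding logic applies.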

Taking BioAutoMATED for a spin, the team first used it to explore how altering the sequence of a specific RNA segment known as the ribosome binding site (RBS) affects the efficiency of ribosome binding and subsequent protein translation in E. coli bacteria. By feeding their sequence data into BioAutoMATED, the team identified a model generated by the DeepSwarm algorithm that accurately predicted translation efficiency. This model performed as well as models created by professional ML experts, yet it was generated in just 26.5 minutes and required only ten lines of input code from the user, whereas other models can require more than 750 lines. BioAutoMATED also facilitated the identification of critical areas within the sequence that influenced translation efficiency and enabled the design of new experimentally testable sequences.
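The article does not describe how these critical areas are identified. One widely used technique is in silico mutagenesis: substitute each position in turn and measure how much the model's prediction changes. The sketch below assumes that approach; `toy_score` is a hypothetical stand-in for a trained model, not anything produced by BioAutoMATED.

```python
from typing import Callable, List


def mutation_scan(seq: str, score: Callable[[str], float], alphabet: str = "ACGU") -> List[float]:
    """Mean absolute change in score over all single-base substitutions at each position."""
    base = score(seq)
    importance = []
    for i, original in enumerate(seq):
        deltas = [
            abs(score(seq[:i] + alt + seq[i + 1:]) - base)
            for alt in alphabet
            if alt != original
        ]
        importance.append(sum(deltas) / len(deltas))
    return importance


def toy_score(seq: str) -> float:
    """Hypothetical model: predicts 1.0 if the sequence contains 'GG', else 0.0."""
    return 1.0 if "GG" in seq else 0.0


importance = mutation_scan("AGGA", toy_score)
print(importance)
```

Positions whose substitution changes the prediction the most are the ones the model relies on, which makes them natural candidates for follow-up experiments.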

Expanding their investigations, the team then turned to peptide and glycan sequence data, using BioAutoMATED to address specific questions about these sequences. The system accurately identified the amino acids within a peptide sequence that most influence an antibody’s ability to bind the drug ranibizumab (Lucentis). Furthermore, based on their sequences, BioAutoMATED classified different types of glycans into immunogenic and non-immunogenic groups. The team also optimized RNA-based toehold switch sequences, facilitating the design of new toehold switches for experimental testing with minimal input coding from the user.

Katie Collins, currently a graduate student at the University of Cambridge who contributed to the project while an undergraduate at MIT, summarized the significance of BioAutoMATED, stating, “Ultimately, we demonstrated that BioAutoMATED empowers individuals to recognize patterns in biological data, pose more insightful questions, and promptly find answers, all within a single framework, without necessitating expertise in ML.”

As with other ML tools, any predictions made with BioAutoMATED should be validated experimentally in the laboratory whenever feasible. Looking ahead, the team envisions integrating BioAutoMATED into the ever-expanding array of AutoML tools and extending it beyond biological sequences to other sequence-like objects such as fingerprints.

While machine learning and artificial intelligence tools have existed for some time, it is only through recent developments in user-friendly interfaces that their popularity has surged, as exemplified by ChatGPT. Jim Collins, the Termeer Professor of Medical Engineering & Science at MIT, emphasized the team’s hopes that BioAutoMATED will empower the next generation of biologists, enabling them to unravel the intricacies of life more rapidly and effortlessly.

Don Ingber, M.D., Ph.D., Wyss Founding Director and the Judah Folkman Professor of Vascular Biology at Harvard Medical School and Boston Children’s Hospital, as well as the Hansjörg Wyss Professor of Bioinspired Engineering at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS), commended the Collins team’s advancement. He stated, “Enabling non-experts to utilize these platforms is crucial for fully unlocking the potential of ML techniques to solve longstanding problems in biology and beyond. This breakthrough by the Collins team represents a significant stride toward AI becoming an essential collaborator for biologists and bioengineers.”

The paper’s additional authors include George Cai from the Wyss Institute and Harvard Medical School, former Wyss Institute members Pradeep Ramesh, Rani Powers, Nicolaas Angenent-Mari, and Diogo Camacho, as well as Felix Wong and Timothy Lu from MIT.

Conclusion:

The development of BioAutoMATED represents a significant breakthrough in the field of biologically inspired engineering. By providing biologists with an accessible and comprehensive AutoML platform, it empowers them to leverage the power of machine learning in their research without requiring specialized ML expertise. This innovation opens up new possibilities for biologists to uncover valuable insights from vast amounts of data, accelerating the pace of biological discoveries.

With the potential for further integration and expansion into other sequence-like objects, BioAutoMATED could revolutionize not only biology but also other fields that rely on similar data analysis challenges. This advancement underscores the increasing importance of user-friendly AI tools in driving scientific progress and has the potential to shape the future of the market by democratizing access to machine learning for a wide range of professionals and industries.
