Arc Institute has unveiled Evo 2, the largest artificial intelligence (AI) model in biology to date.
The powerful tool is designed to comprehend and manipulate the genetic code across all domains of life, representing a significant leap forward in the ability to understand and engineer biological systems.
Built on NVIDIA’s DGX Cloud AI platform, Evo 2 has been trained on a dataset of more than 9.3 trillion nucleotides from more than 128,000 whole genomes and metagenomic data, encompassing bacterial, archaeal, and phage genomes, as well as information from humans, plants and other eukaryotic species.
The tool can process genetic sequences up to one million nucleotides long, enabling it to understand relationships between distant parts of a genome.
Evo 2 achieved more than 90 percent accuracy in predicting which mutations in the breast cancer-associated gene BRCA1 are benign versus potentially pathogenic. It can also identify patterns in gene sequences across disparate organisms that would take experimental researchers years to uncover and is capable of designing new genomes as long as those of simple bacteria.
Collaborative effort
The project was led by Arc Institute with researchers from Stanford University, UC Berkeley and UC San Francisco. Training was carried out in collaboration with NVIDIA and used more than 2,000 NVIDIA H100 GPUs.
To accelerate scientific progress, the Evo 2 code is being made publicly accessible through Arc’s GitHub repository and will be integrated into the NVIDIA BioNeMo framework.
A user-friendly interface called Evo Designer will accompany the release, and the team is sharing training data, code and model weights.
The versatility of Evo 2 opens up a wide range of potential applications in healthcare, biotechnology and beyond. In disease research, the model could accelerate the identification of genetic causes of human diseases and the development of new medicines.
For targeted therapies, Evo 2 could aid in designing genetic elements for more precise gene therapies, potentially reducing side effects. In protein engineering, the model’s ability to predict how mutations affect protein function could revolutionise protein design for various applications.
Evo 2 is also expected to serve as a foundation for more specific AI models in biology, similar to an operating system kernel.
The research team has taken steps to address potential ethics and safety risks by excluding pathogens that infect humans and other complex organisms from Evo 2’s base dataset. Evo 2 is designed not to return productive answers to queries about these pathogens.
“Evo 2 represents a major milestone for generative genomics. By advancing our understanding of these fundamental building blocks of life, we can pursue solutions in healthcare and environmental science that are unimaginable today,” said Patrick Hsu, Co-founder and Core Investigator of Arc Institute and Assistant Professor of Bioengineering at the University of California, Berkeley.
Image: Arc Institute
