Machine learning helps scientists analyze very large datasets, including libraries of potential drug compounds. MIT researchers have now added a feature to machine-learning algorithms that improves their predictive ability.
The method lets computational models account for uncertainty in the data they analyze. Using it, the MIT team identified several promising compounds that target a protein the bacteria that cause tuberculosis need to survive.
According to Bonnie Berger, the Simons Professor of Mathematics and leader of the Computation and Biology group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), the approach has been used mainly by computer scientists and has not been widely adopted in biological research, but it could reshape protein design and many other areas of biology.
"This technique represents a known subfield of machine learning that hasn't been fully applied to biology yet," Berger explains. "We're witnessing a paradigm shift that should fundamentally transform how biological exploration is conducted moving forward."
The study, published today in Cell Systems, was co-authored by Berger and Bryan Bryson, an assistant professor of biological engineering at MIT and Ragon Institute member, with MIT graduate student Brian Hie serving as the lead author.
Enhanced Predictive Capabilities
Machine learning is a computational approach in which algorithms learn to make predictions from data they have already seen. In recent years, biologists have begun using these techniques to search large databases of potential drug compounds for molecules that interact with specific biological targets.
One limitation of these methods is that they perform poorly on data that differ substantially from their training sets. The algorithms struggle to evaluate molecules unlike those they have already encountered.
To address this challenge, the researchers used a technique called a Gaussian process to assign uncertainty values to the training data. As a result, when the models process the training data they also take into account how reliable each prediction is.
For instance, if the input data include predictions of how strongly a molecule binds to a target protein, along with the uncertainty of those predictions, the model can use that information to predict protein-target interactions it has not seen before. The model also estimates the certainty of its own predictions. For new data, its predictions may be less certain for molecules that differ greatly from the training examples, and researchers can use that information to decide which molecules to test experimentally.
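The idea that a prediction becomes less certain as a molecule moves away from the training examples can be illustrated with a minimal Gaussian process regression sketch. This is not the study's model: the one-dimensional "feature", the made-up binding scores, the RBF kernel, and the `noise` and `length_scale` values are all assumptions chosen only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential similarity between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def gp_predict(x_train, y_train, x_new, noise=1e-2, length_scale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_train = rbf_kernel(x_train, x_train, length_scale)
    k_train += noise * np.eye(len(x_train))  # assumed measurement noise
    k_cross = rbf_kernel(x_new, x_train, length_scale)
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_cross @ alpha
    v = np.linalg.solve(chol, k_cross.T)
    var = 1.0 - np.sum(v**2, axis=0)  # the RBF prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical 1-D molecular feature and made-up binding scores.
x_train = np.array([[0.0], [0.5], [1.0], [1.5]])
y_train = np.sin(x_train).ravel()

# One query close to the training data, one far away from it.
x_new = np.array([[0.75], [6.0]])
mean, std = gp_predict(x_train, y_train, x_new)

for x, m, s in zip(x_new.ravel(), mean, std):
    print(f"x = {x:4.2f}  predicted score = {m:6.3f}  uncertainty = {s:5.3f}")
```

In this toy setup the query at 0.75 sits among the training points and gets a small standard deviation, while the query at 6.0 falls back to roughly the prior uncertainty, which is the signal a researcher could use when deciding what to test.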
Another advantage of the approach is that it needs only a small amount of training data. In this study, the MIT team trained the model on just 72 small molecules and their interactions with more than 400 protein kinases. They then used the algorithm to analyze nearly 11,000 small molecules from the ZINC database, a public repository containing millions of chemical compounds. Many of these molecules differed substantially from the training examples.
Using this approach, the researchers identified molecules with very strong predicted binding affinities for the protein kinases included in the model. These included three human kinases and one kinase from Mycobacterium tuberculosis called PknB, which is essential for bacterial survival but is not targeted by current first-line TB antibiotics.
The team then tested the top candidates experimentally to see whether they actually bound to their targets, and found that the model's predictions were highly accurate. Among the molecules to which the model assigned the highest certainty, approximately 90% proved to be genuine hits, substantially more than the 30-40% success rate typical of existing machine-learning models used in drug screening.
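One simple way to turn predictions and their uncertainties into a testing order is to penalize each predicted affinity by its uncertainty. The sketch below uses a lower-confidence-bound style score, which is an assumption made here for illustration and not necessarily the ranking rule used in the study; the molecule names, numbers, and `beta` weight are all hypothetical.

```python
import numpy as np

# Hypothetical candidates with made-up predictions (not data from the study).
names = np.array(["mol_A", "mol_B", "mol_C", "mol_D"])
pred_affinity = np.array([0.90, 0.95, 0.60, 0.80])  # higher = stronger predicted binding
pred_std = np.array([0.05, 0.60, 0.05, 0.10])       # higher = less certain

def conservative_score(mean, std, beta=1.0):
    """Penalize each prediction by beta times its uncertainty."""
    return mean - beta * std

scores = conservative_score(pred_affinity, pred_std)
order = np.argsort(-scores)  # best score first
ranked = names[order].tolist()
print(ranked)  # ['mol_A', 'mol_D', 'mol_C', 'mol_B']
```

Here `mol_B` has the highest raw prediction but also the largest uncertainty, so it drops to the bottom of the list. Ranking on the raw prediction alone would have put it first.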
The researchers also trained a conventional machine-learning algorithm on the same data, but without incorporating uncertainty, and used it to analyze the same 11,000-molecule library. "Without uncertainty quantification, the model becomes thoroughly confused and proposes bizarre chemical structures as kinase-interacting compounds," Hie notes.
The team then tested their most promising PknB inhibitors against Mycobacterium tuberculosis grown in bacterial culture media and found that they significantly inhibited bacterial growth. The inhibitors were also effective in human immune cells infected with the bacterium.
Adaptable Learning Framework
Another important feature of the approach is its adaptability: researchers can feed new experimental data back into the model and retrain it, continually improving its predictions. Even a small amount of additional data can substantially improve the model's performance.
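The retraining loop described here can be sketched with a toy Gaussian process: fit the model, add a handful of new measurements in a poorly covered region, refit, and compare the uncertainty at a query point. Everything in this sketch is an assumption for illustration, including the `measure` stand-in for a lab assay, the kernel, the noise level, and the choice of 10 new points.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential similarity between the rows of a and b."""
    sq_dists = (a[:, None, :] - b[None, :, :]) ** 2
    return np.exp(-0.5 * sq_dists.sum(axis=-1) / length_scale**2)

def fit_predict(x_train, y_train, x_new, noise=1e-2):
    """Fit a zero-mean GP to the training set and predict at x_new."""
    k_train = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_new, x_train)
    mean = k_cross @ np.linalg.solve(k_train, y_train)
    var = 1.0 - np.sum(k_cross * np.linalg.solve(k_train, k_cross.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

def measure(x):
    """Stand-in for a lab assay; a made-up function, not real data."""
    return np.sin(x).ravel()

# Initial training set, far from the region we want to predict in.
x_train = np.array([[0.0], [0.5], [1.0], [1.5]])
y_train = measure(x_train)
x_query = np.array([[6.0]])

_, std_before = fit_predict(x_train, y_train, x_query)

# Add 10 new hypothetical measurements near the query and refit.
x_extra = np.linspace(5.5, 6.5, 10)[:, None]
x_train = np.vstack([x_train, x_extra])
y_train = np.concatenate([y_train, measure(x_extra)])

mean_after, std_after = fit_predict(x_train, y_train, x_query)

print(f"uncertainty before retraining: {std_before[0]:.3f}")
print(f"uncertainty after retraining:  {std_after[0]:.3f}")
```

Because the new measurements sit close to the query, the refit model reports a much smaller standard deviation there than the original model did.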
"You don't require extensive datasets for each retraining iteration," Hie explains. "The model can be updated with perhaps just 10 new examples—something biologists can readily generate."
The study is the first in many years to propose new molecules that target PknB, giving drug developers useful starting points for designing drugs that target the kinase. "We've now furnished them with new leads beyond what has been previously published," Bryson states.
The researchers also showed that the same machine-learning approach could improve the fluorescence output of green fluorescent protein, which is commonly used to label molecules inside living cells. Berger, who is now using the technique to analyze mutations that drive tumor development, believes it could benefit many other areas of biological research.
Funding for this research was provided by the U.S. Department of Defense through the National Defense Science and Engineering Graduate Fellowship; the National Institutes of Health; the Ragon Institute of MGH, MIT, and Harvard; and MIT's Department of Biological Engineering.