Duchenne muscular dystrophy (DMD) stands as a devastating genetic disorder primarily affecting young males, characterized by progressive muscle deterioration that ultimately leads to cardiac or respiratory failure. Typically manifesting in children by age five, this relentless condition robs patients of their mobility by approximately twelve years old. Despite medical advances, individuals battling DMD currently face a life expectancy averaging merely 26 years.
The medical community celebrated a landmark achievement in 2016 when Cambridge-based Sarepta Therapeutics unveiled a revolutionary treatment directly targeting the genetic mutation responsible for DMD. This innovative therapy employs antisense phosphorodiamidate morpholino oligomers (PMO) – sophisticated synthetic molecules designed to penetrate the cell nucleus and modify the dystrophin gene, enabling production of a crucial protein typically absent in DMD patients. However, as Carly Schissel, a PhD candidate in MIT's Department of Chemistry, explains, 'PMO by itself faces a significant limitation: poor cellular uptake efficiency.'
To overcome this cellular barrier, scientists have turned to cell-penetrating peptides (CPPs) – molecular facilitators that can be attached to therapeutic agents, significantly enhancing their ability to traverse both cellular and nuclear membranes. The critical challenge that has perplexed researchers for years involves identifying the optimal peptide sequences capable of maximizing drug delivery efficiency.
In a groundbreaking development, MIT researchers have engineered a systematic solution that seamlessly integrates experimental chemistry with artificial intelligence to identify novel, non-toxic, and highly-active peptides. These newly discovered sequences, when attached to PMO, dramatically improve drug delivery efficiency. This pioneering approach promises to accelerate the development of gene therapies not only for DMD but potentially for numerous other genetic disorders.
The culmination of their innovative research has been published in the prestigious journal Nature Chemistry, with the landmark paper led by lead authors Carly Schissel and Somesh Mohapatra, PhD students from MIT's Department of Chemistry and Department of Materials Science and Engineering, respectively. The study was supervised by senior authors Rafael Gomez-Bombarelli, the Jeffrey Cheah Career Development Professor in the Department of Materials Science and Engineering, and Bradley Pentelute, professor of chemistry. The research team also included Justin Wolfe, Colin Fadzen, Kamela Bellovoda, Chia-Ling Wu, Jenna Wood, Annika Malmberg, and Andrei Loas.
'Generating new peptide sequences computationally isn't particularly challenging – the real difficulty lies in accurately predicting their effectiveness,' explains Gomez-Bombarelli. 'Our key breakthrough involves leveraging machine learning algorithms to establish meaningful connections between peptide sequences – especially those incorporating non-natural amino acids – and their experimentally measured biological activity.'
Dream Data
Unlike conventional approaches, CPPs represent relatively compact molecular chains composed of five to 20 amino acids. While individual CPPs can enhance drug delivery, researchers discovered that linking multiple peptides together creates a synergistic effect, dramatically improving therapeutic transport. These extended chains, containing 30 to 80 amino acids, are classified as miniproteins.
Before developing predictive models, the experimental team first constructed a comprehensive dataset by strategically combining 57 different peptides, ultimately creating a library of 600 miniproteins, each attached to PMO. Using specialized assays, the team quantified how effectively each miniprotein could transport its therapeutic cargo across cellular barriers.
The decision to evaluate each sequence's activity with PMO already attached proved crucial. Since different drugs can alter CPP sequence performance, existing data often proves unreliable. By generating all data in a single laboratory environment using identical equipment and personnel, the researchers established a gold standard for consistency in machine learning datasets.
One ambitious project objective involved creating a versatile model capable of processing any amino acid type. While only 20 amino acids naturally occur in humans, hundreds more exist elsewhere – essentially serving as an expanded toolkit for drug development. Rather than using traditional one-hot encoding, which would require model rebuilding when adding new amino acids, the team implemented topological fingerprinting – essentially creating unique molecular barcodes for each sequence. 'Even if the model hasn't encountered a particular sequence before, we can represent it as a barcode consistent with patterns the model already recognizes,' explains Mohapatra, who spearheaded the project's development efforts.
The team trained a convolutional neural network using their miniprotein library, with each of the 600 examples labeled according to its cellular penetration activity. Initially, the model proposed miniproteins rich in arginine – an amino acid that damages cell membranes – an obviously undesirable characteristic. To address this issue, researchers implemented an optimization strategy that discouraged arginine incorporation, preventing the model from taking shortcuts.
Ultimately, the ability to interpret predictions proposed by the model proved essential. 'A black-box approach rarely suffices, as models might fixate on incorrect correlations or imperfectly exploit certain phenomena,' Gomez-Bombarelli notes.
In this study, researchers could overlay model-generated predictions with the barcodes representing sequence structure. 'This process highlights specific regions that the model identifies as most influential for high activity,' Schissel explains. 'While not perfect, it provides focused areas for further exploration – information that will certainly guide our empirical design of new sequences in the future.'
Delivery Boost
The machine learning model ultimately generated sequences that outperformed all previously known variants, with one particular sequence demonstrating a remarkable 50-fold improvement in PMO delivery. Through experimental validation using mouse models, the researchers confirmed their predictions while establishing the miniproteins' non-toxic nature.
While the long-term impact on patients remains to be seen, enhanced PMO delivery offers multiple potential benefits: reduced drug exposure could minimize side effects, less frequent dosing might improve quality of life (PMO currently requires weekly intravenous administration), and treatment costs could decrease significantly. Notably, PMO isn't the only therapeutic that stands to benefit – additional experiments demonstrated that these model-generated miniproteins could successfully transport various functional proteins into cells.
Recognizing the disconnect between computational researchers and experimental chemists, Mohapatra has made the model publicly available on GitHub, complete with tutorials for experimentalists working with their own sequence libraries. To date, researchers worldwide have adopted this model, repurposing it to enhance drug development across numerous therapeutic areas.
This pioneering research received support from the MIT Jameel Clinic, Sarepta Therapeutics, the MIT-SenseTime Alliance, and the National Science Foundation.