As happens every so often, there is a new push into virtual cells (or foundation cell models, or whatever name people use nowadays). Neither I nor anyone I know can quite describe what that means, but at a high level it feels like a virtuous goal. To me, the cell is the fundamental unit of life and one of the most fascinating things in the world. I remember seeing this video early in my career and being mesmerized.
You can’t watch this video and not have a million questions about how this happens. How does the neutrophil know where the bacterium is? How does it even move towards it directionally? When it clearly comes across a second bacterium, why and how does it stay with the original one it was chasing? How does the bacterium know to run away (or is the neutrophil just pushing it away)? Is there a model that could have predicted the trajectory of this video from where it started? If so, what would we need to understand about the system to build that model? Does the genome itself tell us about this type of cell behavior, or is there something else we need to consider? Could we use this model to engineer a cell to seek out and destroy whatever we so desired, in whatever medium (e.g., a cell that seeks and degrades microplastics in the ocean)? What other emergent behaviors of cells could we hope to predict and engineer if we had virtual cell models? Are these even the types of cell behaviors a virtual cell model should predict?
I’ve had a tough time answering these questions in my career. Importantly, are these models really a framework to encode our understanding and its consequences, or a framework and toolset to build what could be? Over the years, I’ve concluded that there isn’t much difference, and that the only useful test of a model is its ability to guide us to understand or build something non-trivially mysterious.
The revolution we are seeing in computational protein models is a good example of what great progress starts to look like: first predicting and understanding existing protein structures, and now very quickly moving towards novel functions like binding. What will it take for something like this to happen for cells? The short story is I have no idea. The longer answer is not an answer, but at least an exposition of my experience over the last >25 years of trying to get there. I don’t think we are close, but I do think a lot has changed, and it’s an exciting time to try again in new ways.
My intro into virtual cell models
My first exposure to the computational modeling of cells came from meeting Adam Arkin in 1998. He was just starting a new lab at LBNL in the space of Melvin Calvin, who had just passed away. Adam had flyered the campus looking for researchers interested in the computational/mathematical modeling of biological systems. I emailed him, set up a meeting, and while I didn’t quite understand what he was doing, I was fascinated by the problems he was studying and spent the next three years immersing myself.
He had just published some really interesting work with Harley McAdams at Stanford that examined an early question in developmental biology. Phage lambda is a virus that infects E. coli. Early in its infection cycle, the virus makes a choice. Should it make a ton of copies of itself while killing the cell (lysis), or should it integrate into the cell and go dormant for a while until conditions change (lysogeny)? Figuring out how lambda chooses one or the other was a galvanizing problem for early molecular biology, and is described beautifully in Mark Ptashne’s 1986 book, A Genetic Switch. That said, while we knew mechanistically how lysis or lysogeny played out, it was still unclear how the choice was being made. Was it random? Something else?
Adam and Harley wanted to test the idea that lambda was playing dice; that small fluctuations in the timing of proteins binding to a single piece of DNA are what lead the cell one way or the other; and that these fluctuations are the natural consequence of having small numbers of individual protein and DNA molecules. It’s like playing red or black on a roulette wheel: mutually exclusive outcomes that are determined because the ball can only fall randomly into one slot or the other. But how do we computationally model this?
The Chemical Master Equation describes the probabilistic time evolution of a chemical system, but it is difficult to solve. Simplifications like the Fokker-Planck equation can be used to build continuous, differentiable approximations, but these often break down when the number of molecules is low and stochasticity dominates, as is the case with a single DNA molecule. They had heard a talk from Dan Gillespie of the Naval Air Weapons Station at China Lake, who had been quietly building ways to simulate the Chemical Master Equation exactly, trajectory by trajectory (work that went pretty much unnoticed for a long while). They used this approach to show that random fluctuations could not only explain the lysis-lysogeny switch, but also explain other data, such as how the balance shifts as the multiplicity of infection (the number of phage simultaneously infecting a cell) changes. It was elegant work, only made possible by new algorithms and the increasing power of high-performance parallel computing, and I was hooked on the idea that we could expand on this framing to examine other interesting questions in bacterial development. It is worth noting, though, that this story could be wrong, but that’s not the point (see Footnote 1).
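For readers who haven’t run into it, Gillespie’s direct method is simple enough to sketch in a few lines: draw the time to the next reaction from an exponential whose rate is the total propensity, then pick which reaction fired in proportion to its propensity. The toy system below (a single gene producing and degrading one protein, with made-up rate constants) is purely illustrative and is not the lambda model from the paper.

```python
import numpy as np

def gillespie_ssa(x0, stoich, propensities, t_max, rng=None):
    """Minimal Gillespie direct-method SSA.

    x0           : initial molecule counts (1D array)
    stoich       : (n_reactions, n_species) state-change matrix
    propensities : function mapping state -> array of reaction propensities
    t_max        : simulation end time
    """
    rng = np.random.default_rng() if rng is None else rng
    t, x = 0.0, np.array(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_max:
        a = propensities(x)
        a_total = a.sum()
        if a_total <= 0:                         # nothing can fire; system is frozen
            break
        t += rng.exponential(1.0 / a_total)      # time to the next reaction
        j = rng.choice(len(a), p=a / a_total)    # which reaction fires
        x += stoich[j]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Toy example: one gene making protein P, which also degrades.
# Reactions: gene -> gene + P (rate k_tx);  P -> 0 (rate k_deg * P)
k_tx, k_deg = 0.5, 0.01
stoich = np.array([[+1],   # production of P
                   [-1]])  # degradation of P
props = lambda x: np.array([k_tx, k_deg * x[0]])
times, traj = gillespie_ssa(x0=[0], stoich=stoich, propensities=props, t_max=1000.0)
```

Run two such trajectories from the same starting state and they diverge; at the molecule counts typical of a single phage genome and a handful of regulatory proteins, that divergence is the whole point.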
I started looking at all sorts of other decisions bacteria make. One of the best-studied ones I worked on is how Bacillus subtilis responds to stress. In response to various stressors (hunger, attack, temperature), the cells coordinate across the population a variety of, again often mutually exclusive, responses. Some produce degradative enzymes (subtilisins, which are industrially useful in laundry detergents), some go dormant, while others go through apoptosis (killing themselves to help feed the rest) or sporulate (producing dormant but hardy cells that can wait it out). My first job was to go to the library, photocopy and read hundreds of papers, encode this knowledge into pathway diagrams and then mathematical formalisms, and build code and ways to simulate them. There were other systems we looked at too, including Caulobacter’s differentiation circuit, Myxococcus xanthus fruiting body formation, and more. This was the early work that turned into the DARPA BioSpice project (which took inspiration from SPICE), an early attempt to build a simulation framework of cells from sequence, to expression, to protein networks, to behaviors. It was an attempt, but in retrospect I think it can be said it was too early, we were still missing a lot, and it wasn’t the outcome we had dreamed about. I do think, though, that it remains the dream I hear when people say virtual cell models.
It’s this framing of the problem of a virtual cell that I got excited by: mechanistic models that can compute the emergence of interesting cellular behaviors from sequence. The clarity and importance of the idea were always compelling. DNA is the information carrier of life and ultimately encodes all those really interesting emergent cellular behaviors. That’s an immense power, and one that I wanted to understand and master. Molecular biology and genetics were great foundations to stand on, but now we had to do the work of integrating them together in computational models to understand how this emergent cellular behavior comes about. If we could do that, we could then engineer such behaviors to our heart’s content. Well, that’s not actually true; even a perfectly predictive model will never let us know what’s ultimately possible, which is a sad fate (see Footnote 2).
Anyways, there was a growing discontent that these mechanistic models weren’t actually working that well. While we could build mechanistic models, and we could measure individual parameters, these models often failed to predict the effects of perturbations or give deeper insights into how cells functioned. Drew Endy and Roger Brent wrote a piece discussing these issues and paths forward in 2001, when I was entering grad school. Drew moved to MIT in January 2002, and I joined his lab soon afterwards.
I set about trying to address some of the issues we thought might be going on by going back to the simplest organism we could think of: bacteriophage T7. It was likely the simplest and best-understood organism we could study, and past attempts to model its life cycle left much to be desired. We built some new, more mechanistic simulators and models of the phage life cycle, better ways to measure expression levels genome-wide and how they evolve over development, and put it all together … and it was still underwhelming. While we made some progress, it was nowhere close to the ideal of saying: “Hey, we understand T7; we understand the mapping between gene expression and the developmental program of this phage, let’s move on to the next most complex organism.” We weren’t even close to getting there (you can read my thesis here if interested). If it wasn’t going to work for T7, what was it going to work for?
I think this was true for the field at large too. Thousands of modeling papers cropped up. Looking back, I think most of them are probably wrong (overfit) or just not that interesting (great R^2, no insights). It’s notable that these models didn’t predict that we were missing key players in many of the systems being studied, players discovered in the ensuing years. At the same time, the field of high-throughput genomic measurements was starting to explode as well. While we continued to accumulate data and identify players in these networks, the systems-wide understanding was lacking. Sydney Brenner wrote an essay about these issues, in his inimitable voice, called Sequences and Consequences. A few choice quotes from it:
We now have unprecedented means of collecting data at the deepest molecular level of living systems and we have relatively cheap and accessible computer power to store and analyse this information. There is, however, a general sense that understanding all this information has lagged far behind its accumulation, and that the sheer quantity of new published material that can be accessed only by specialists in each field has produced a complete fragmentation of the science. No use will be served by regretting the passing of the golden years of molecular genetics when much was accomplished by combining thought with a few well-chosen experiments in simple virus and bacterial systems; nor is it useful to decry the present approach of ‘low input, high throughput, no output’ biology which dominates the pages of our relentlessly competing scientific journals. We should welcome with open arms everything that modern technology has to offer us but we must learn to use it in new ways. Biology urgently needs a theoretical basis to unify it and it is only theory that will allow us to convert data to knowledge.
Everybody understood that getting the sequence would be really easy, only a question of 3M Science—enough Money, Machines and Management. Interpreting the sequence to discover the functions of its coding and regulatory elements and understanding how these are integrated into the complex physiology of a human being was always seen as a difficult task, but since it is easier to go on collecting data the challenge has not really been seriously taken up. I am sure that there will be many readers who will deny this and claim that there already is a way of confronting this problem through a new branch of biological research called Systems Biology. This is precisely the main target of my article; I want to show that the claims of radical systems biology cannot, in reality, be met and that it will not be possible to generate unifying theories on that basis. There is a watered-down version of systems biology which, to my mind, does nothing more than give a new name to physiology, the study of function and the practice of which, in a modern experimental form, has been going on since at least the beginnings of the Royal Society in the seventeenth century.
I for one felt this malaise. What were we doing? What came of all this work? Why did we not feel closer to getting where we wanted to be? Even now, 15 years after he wrote this and despite even more progress in technology, I don’t think we are much closer to our goal of building virtual cell models. The more detailed our maps and understanding have gotten, the more mysterious these emergent behaviors have become.
This frustration led me down a different road in building computable models. It started with a second, parallel approach to understanding T7. We thought: could we build a virus that still had the behaviors we were interested in (development), but was simpler to model computationally and ultimately a better starting point for further exploration? That’s what launched the idea of refactoring the T7 genome, to try to make it easier to model and manipulate. It was my first foray into experimental biology and into thinking about computable models differently.
The map is not the territory, the territory is the territory
There are ~10^9 bacterial cells per mL of culture. If every cell could be programmed with a different sequence and I could read out all the functional consequences, then in theory I could do 10^9 simulations at a time; not in a computational model, but a physical one. I got obsessed with the idea and worked to make this happen over many years. It required:
DNA Synthesis Technologies - This encodes the design or hypothesis we want to evaluate. I worked to get de novo gene synthesis to be cheap by leveraging DNA produced off of DNA microarrays (reviewed here).
DNA Editing Technologies - It was also important to get that designed DNA installed in the right context in library formats. We worked on new ways to edit genomes across many organisms, helping develop several technologies including TALEs, CRISPR, serine integrases, MAGE, etc. Importantly, focusing on getting things working in pooled library formats was imperative.
Multiplexed Measurements - Finally, and most interestingly, how do we get relevant information about each design in a pooled experiment? We wanted to take advantage of the scale/density of cells to act as a platform for this simulation at scale. The big unlock here was next-gen sequencing. If we could connect a functional readout to a sequencing readout, we could test all these hypotheses/designs in a pooled way (a minimal sketch of what that looks like follows this list).
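To make “connecting a readout to a sequencing readout” concrete, here is a minimal, purely hypothetical sketch: count each design’s barcode in the DNA and RNA sequencing libraries, then use the RNA/DNA read ratio as a per-design expression proxy. The file layout, barcode position, and function names are assumptions for illustration, not any particular pipeline we used.

```python
from collections import Counter
import gzip

def count_barcodes(fastq_path, barcode_to_design, start=0, length=20):
    """Count reads per design by matching a barcode at a fixed offset
    (hypothetical layout) in each read of a gzipped FASTQ file."""
    counts = Counter()
    with gzip.open(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
                bc = line.strip()[start:start + length]
                if bc in barcode_to_design:
                    counts[barcode_to_design[bc]] += 1
    return counts

def rna_dna_ratio(rna_counts, dna_counts, designs, pseudo=1.0):
    """Per-design expression proxy: RNA reads normalized by DNA reads,
    with pseudocounts so designs missing from one library don't divide by zero."""
    return {d: (rna_counts.get(d, 0) + pseudo) / (dna_counts.get(d, 0) + pseudo)
            for d in designs}
```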
I spent the last 15 or so years working on these problems, and it finally started to come together. One of the first times it felt like this approach was working was a project with Dan Goodman on trying to understand why rare codons are enriched at the N-terminal region of most organisms’ genes. To disentangle the various hypotheses that people had put forth to explain these observations, we designed tens of thousands of synthetic genes that systematically altered codon usage and amino acid content in a controlled manner, so as to distinguish between the different causal hypotheses. We built these reporters in a single pool using microarray-derived oligo libraries and cloned them into a custom reporter construct. Using new multiplexed methods we developed to measure DNA, RNA, and protein levels in a pooled way, we were able to measure the consequences of the designs for the whole library in a single experiment. We showed that rare N-terminal codons increase expression ~14-fold on average compared to common ones. Because the constructs were designed for it, we were able to build statistical models of how each codon choice affected expression, and showed that rare codons increase expression by reducing mRNA secondary structure, not through hypothesized factors such as tRNA pool availability or ribosome speed. This is a consequence of the structure of the genetic code in organisms with GC content of 50% and above: rare codons there tend to be AT-rich, and AT-rich sequence near the start of a message forms less secondary structure.
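To make the “statistical models” part concrete, here is a minimal sketch of the kind of regression one could fit to such a library. The featurization (one-hot codon identity over an N-terminal window), the 11-codon window, and the ridge penalty are illustrative assumptions, not the actual analysis from the paper.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import Ridge

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]   # all 64 codons
CODON_IDX = {codon: i for i, codon in enumerate(CODONS)}

def featurize(seq, n_codons=11):
    """One-hot encode codon identity at each of the first n_codons positions."""
    x = np.zeros(n_codons * 64)
    for pos in range(n_codons):
        codon = seq[3 * pos: 3 * pos + 3]
        x[pos * 64 + CODON_IDX[codon]] = 1.0
    return x

def fit_codon_effects(sequences, log_expression, n_codons=11):
    """Ridge regression of measured log expression on per-position codon identity.
    Returns a (position, codon) matrix of estimated effects on expression."""
    X = np.stack([featurize(s, n_codons) for s in sequences])
    model = Ridge(alpha=1.0).fit(X, np.asarray(log_expression))
    return model.coef_.reshape(n_codons, 64)
```

The same design matrix can be extended with computed features (e.g., predicted 5′ mRNA folding energy) to ask whether the codon effects disappear once structure is accounted for, which is the spirit of the causal comparison described above.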
In effect, the territory itself was what we used to navigate; a physical model (see Footnote 3) we could compute on at scale, to answer really interesting questions about the causal mechanisms that underlie statistical genomic correlations. I trusted these experimental models much more than my computational ones, and I was able to compute on them at similar scales. It’s also something that is unique to biology. How different would engineering planes be if we could build millions of planes with different designs and test them all in wind tunnels? In fact, in no other physical medium is this possible. I can’t build a million different robot designs and test them all in the physical world. I can’t do a million different chemical syntheses and test them all easily (don’t come after me, DEL people, you are still using DNA). But I can do that in biology. As such, biology might be the best way to understand how to engineer the physical world, because its scale is unmatched in any other area and most similar to the scale of computational approaches themselves.
I spent many years expanding these ideas in my lab across bacterial gene regulation, yeast complex traits, identifying causal variants in GWAS, sequence determinants of mRNA splicing, and bacterial and human protein function. It definitely was exciting, but I still felt pretty distant from the original hopes for cell models that I brought up at the top of this post. While this is now a good approach for looking at individual components and mechanisms, we are still pretty far from those really cool cellular behaviors that I still do not really understand nor know how to program. This too felt like a capitulation on the original problem I was hoping to solve.
Over the last 6 years at Octant, I’ve been working on taking what I’ve learned thus far and applying it to the more practical question of building drugs. I guess that’s partly for personal reasons, being impacted by disease as one grows older, but also more existential ones: I didn’t think I’d live to see the day where I’d really get to the virtual cell models I cared about, so I should spend my time on something more useful.
What we are doing does borrow from these ideas though. There are an amazing number of emergent cellular behaviors that we are realizing can be modulated by small molecules (e.g., expression modulators, correctors, biased signalers, induced proximity, etc) and I think there is an underlying platform that lets us build models (experimental and computational) at scale across this chemical biology interface to build great (and weird) new drugs. Maybe I’ll write about this in a future post.
All this said, I do still continue to think about virtual cell models and follow some of the great work in the space. I think the excitement is warranted, and the more people that get into the field from the outside the better. Sometimes being ignorant of approaches people took in the past (and advice from them) is the best way to move forward on a problem that has been intractable, so caveat emptor.
Good reasons to attempt the problem again: I think it’s pretty obvious that progress in protein foundation models is reason for optimism, and that there might be real breakthroughs in algorithmic learning to solve problems. These algorithms are super-human in capabilities, and that’s exciting, and maybe it will spill over to other biological phenomena. At the same time, our ability to produce data in cells has grown by many orders of magnitude due to the explosion of functional genomics techniques, the ability to write/edit DNA, and the ability to do multiplexed assays and automation. These points alone make it worth the investment at a societal level to reattempt building virtual cell models.
The metrics aren’t the goal. Most people know of Goodhart’s law, and I think we all sometimes get caught up in it when building cell models. In protein folding, the metrics on predicting novel protein structures were clearly what we went after, and they have been helpful towards some of the goals. But the real goals are still elusive. Can I design an enzyme that is stable in hydrophobic solvent? Can I build the amazing molecular machines responsible for converting light to chemical energy? I would posit that the measure of progress we should focus on is how good we are at getting to what we want. We are seeing the protein foundation model field evolve towards this, in building designed binders for instance. The point isn’t to predict the gene expression response of a particular cell type to a perturbation, but to be able to do some of the amazing things cells do and to explain how they do them. That’s the progress we should all be focused on.
Measure what matters. Is gene expression even the right thing to predict for building virtual cell models? Prediction of gene expression based on sequence, cell type/state and perturbation is likely where everyone is going, because it’s the easiest thing to measure. That said, so many of the amazing things cells do seem independent of good control of gene expression levels (e.g., the chemotaxis video up front). I’m continually amazed at how well cells work despite massive noise in gene expression; you can mess up a lot and the behavior is still there. We have thousands of cell types, each expressing several thousand genes, and there are only ~20,000 genes. My hunch is that progress in understanding and engineering really interesting cellular behaviors won’t come from the precise prediction and understanding of how sequence controls gene expression.
So what does matter? I’m not sure. Is it chemiosmotic or electrochemical potential maps? What about phase separation, molecular machines, compartmentalization and transport, post-translational modifications, protein-protein interactions, chemical transformations, redox and pH state, et al.? Are these maps part of the virtual cell? If it’s just expression, can we hope to do anything really amazing without these other maps?

Mechanism Matters: My guess for where we ultimately want to go is that the point of these models is to uncover the causal mechanisms through which these behaviors arise. I’ll leave this story here, from the early days of AI:
There are so many cellular behaviors that are obviously mysterious and wonderful, and our job, as Minsky puts it, is to figure out ways we can plan to do them (or design an organism to do them). My guess is that a first step is using these models to figure out mechanistically how cells do the mysterious things we can currently observe, before being able to reliably engineer cells to do similarly mysterious things. I do want to caveat this though: I’m not sure this point is true; it’s just a hunch. We were pretty good at engineering steam engines before figuring out entropy and thermodynamics (and in fact discovering the science required the engineering first).
Experimental Design: The area I’m perhaps most excited about is the use of virtual cell models to guide experimental design at scale. It goes back to the codon usage example above. I think there are algorithmic ways to think about the exploration of this vast space to help constrain mechanism or build what you want. It’s my best guess at how we will make progress beyond just learning on observational data. There are many ways to think about this: build the sequences whose predictions are most uncertain and measure them, enumerate many different mechanistic models and build the designs that best differentiate them, multi-armed bandit problems, active learning, design of experiments… I don’t really care; let’s do them all and see what wins. I’m hoping we get to explore some of these ideas at Octant in our collaborative OpenADMET work.
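As one concrete (and purely hypothetical) version of the first idea, an uncertainty-sampling loop might look like the sketch below: train an ensemble on the designs you have already measured, then send the candidate designs the ensemble disagrees on most back into the pooled experiment. The model choice, batch size, and function names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def propose_next_batch(measured_X, measured_y, candidate_X,
                       batch_size=96, n_models=10, seed=0):
    """Rank unmeasured candidate designs by ensemble disagreement
    (predictive variance) and return the indices of the most uncertain batch."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap-resample the measured data so each model sees a different slice.
        idx = rng.integers(0, len(measured_y), len(measured_y))
        model = RandomForestRegressor(
            n_estimators=100, random_state=int(rng.integers(1_000_000)))
        model.fit(measured_X[idx], measured_y[idx])
        preds.append(model.predict(candidate_X))
    variance = np.var(np.stack(preds), axis=0)       # disagreement per candidate
    return np.argsort(variance)[::-1][:batch_size]   # most uncertain designs first
```

Measure the proposed batch, append the results to the training set, and repeat; the same loop accommodates the other strategies (bandits, model discrimination) by swapping out the acquisition rule.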
Dreams of a biological future: Biology is pretty amazing. It’s like an alien technology that landed in our laps, and can do things that we could only dream of at scale. It terraforms at planetary scales. It computes with power efficiencies and capabilities we only dream about having. It’s conscious. It has the atomic precision to build materials with performance, efficiency and scale we can’t match. Its control of chemical and energetic transformations is just nuts. Its control of movement is remarkable. Its ability to sense and interact with the physical world through sight, smell and touch is unparalleled. The mastery of these capabilities will teach us new ways to interact with and shape the physical world. That said, it’s clear it will be a new field of engineering, not one built on bad analogies to practices in current forms of engineering (e.g., modularity), but something wholly new that we are yet to discover. We are really just getting started, and I hope I get to see it.
I always dreamed of being the Willy Wonka of fruit, getting to express my creativity in the form of deliciousness. I don’t think I’ll live to see that day, but I hope someone does. I do think virtual cell models will be a small part of that journey, and even that, as I explain here, still seems hazy and distant. I hope we can accelerate our pace, and I will always welcome and encourage new people and resources trying to get to that goal. So if you read this piece as a “get-off-my-lawn” piece, it’s not that. It’s more of a “please don’t repeat the same mistakes I made, and if there are ways I can help you, please let me know” piece. And if you succeed, I want to work on making the perfect mango.
Footnotes:
An interesting paper from work done by Francois St Pierre and Drew Endy (I had a front-row seat in the lab to much of it) posited that the switch actually senses the volume of the cell at the time the phage infects it, and that the decision is a deterministic result of that mechanism. It just looks random because cell size is distributed across the population and can change depending on things like growth rate.
The vastness of sequence space is hard to grok, and ultimately we will never explore, much less understand, even a fraction of what’s possible. For example, consider a small bacterial genome like E. coli’s (4 million bases). There are 4^4,000,000 possibilities, which is about 10^2,400,000. Imagine we built an oracle of all phenotypes out of every atom in the known universe (the Eddington number, ~10^80), where each atom could compute a phenotype in the smallest known unit of time (the Planck time, ~5 × 10^-44 s), and we let it compute for the age of the universe (~4.4 × 10^17 s). Even then, we would explore only an infinitesimally small portion of the possibilities (10^141 out of 10^2,400,000). Clearly, evolution has explored an even more infinitesimally small portion than this imaginary oracle would. This is a bit sad. It means we will never have the complete picture of what’s possible in a bacterial genome (or really anything beyond a few hundred bases), in the same way we will never explore all of the Library of Babel. Another implication is that life is likely possible in a large fraction of this space; otherwise it would never have been possible to evolve it in the first place. Finally, I think it implies that if we are ever able to build a perfect oracle, it would likely require complete mechanistic understanding of the system.
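If you want to check the arithmetic (using just the numbers quoted above), a few lines suffice:

```python
from math import log10

log_seq_space = 4_000_000 * log10(4)      # 4^4,000,000 ≈ 10^2,408,240
atoms = 1e80                              # Eddington number
ticks = 4.4e17 / 5e-44                    # age of universe / Planck time ≈ 8.8e60 steps
log_oracle = log10(atoms) + log10(ticks)  # ≈ 141

print(f"sequence space ≈ 10^{log_seq_space:,.0f}; oracle evaluations ≈ 10^{log_oracle:.0f}")
```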
Physical models are cool. One of my favorites is the Mississippi River Basin Model built by the Army Corps of Engineers during World War II. It was used to model floods in the Mississippi River and to decide where to place interventions (like dams) to prevent future ones. My ideas around them really started to take shape through following and interacting with Michael Elowitz and his approaches to studying biology, which inspire me to this day.