AlphaFold, A Tentative Review
A few days ago I wrote a half-review, half-cry-for-help in understanding AlphaFold. Since then some nice people gave me a few hints and some criticism. More importantly, someone pointed out to me that I needn't struggle with the code since there's a book-length supplement explaining how it works.
In roughly 2 years, AlphaFold went from niche curiosity at the intersection of molecular biology and ML to Nature paper. I abused the original AlphaFold as an example of how ML can outperform expert models, so I'm curious how this latest iteration works.
i - What it isn't
Before I go into what this new approach is, I should go into what this new approach isn't.
AlphaFold, in its current iteration, is not a generic machine learning model tweaked and prepared to work on structural predictions. It's a hugely complex pipeline designed specifically for the task at hand. It uses elements from the kind of models that work in language and vision problems, but every single bit is highly customized for this one problem.
This makes me somewhat sad. The original model was basically something that a pair of good students who know machine learning and molecular biology could build together.
"Hey, we took this resnet and added a few extra bits and input/output formatting, and it wipes the floor with 40+ years of complex theory-loaded modelling in terms of accuracy and compute."
It was a showcase of the power of automatic differentiation when applied to an equation that was friendly to this method and contained a lot of parameters.
With the new model, I struggled just to understand it, together with a friend who knows molecular biology. We spent an hour reading the paper and another hour discussing it, then I spent 4 more re-reading the paper, the supplement and a helpful review.
Is this good? I don't know. Part of me feels like having specialized systems for important problems is good. Part of me feels like the original magic is kinda lost in this approach.
Speaking of important problems...
The other thing to note is that AlphaFold predicts in-vitro protein structure. This is something that can be said about any protein structure prediction model. But it's still worth saying.
I'm unable to find any conclusive information as to how different these structures are from those of proteins in the body. In part, because figuring out anything about in-vivo protein structure is very difficult and relies on assumptions based on in-vitro structures.
Proteins can be "trapped" in an "intermediary" structure (i.e. not the lowest energy state possible) by other proteins or lipids. Lack of chaperones or mutated chaperones can result in more misfolding and failure to discard misfolded proteins.
There's also a significant difference between the medium in which folding happens in-vitro versus in-vivo. But the little information I could find on this seems to indicate this has little or no effect on structure.
It's also unclear to me if some misfolded proteins fulfil different functions, or if they can all be thought of as either garbage or fulfilling the same function less efficiently.
What people mean by "structure of a protein" is:
The approximate structure determined for a genetic sequence turned into a protein by an organism not "meant" to generate it (e.g. yeast or E. coli), then painstakingly aggregated to form a giant crystal, often enough using various impurities interlaced with the actual proteins.
After digging into this subject more, the only thing I can say is that I am unsure and confused as to how accurate these representations are. But the answers I got seem to indicate a "common" range of opinion, from in-vitro folding being negligibly different to it being significantly different but in a way that shouldn't matter for most applications.
On a personal level, I am somewhat sceptical based on the history of DNA function and of every single physical model involving "the definitive lowest energy state". But I think it's fair to say the expert consensus here says I'm a dumbass.
The one criticism of AlphaFold vis-a-vis all of this is that it's not based on other underlying models for which we have higher certainty. In theory, a "perfect" physics simulation for protein structure could not only serve to predict structure but could predict it in a variety of environments and detect mistakes in existing observations. In practice, this seems impossible. More so, AlphaFold seems like it could actually address some of those problems, at least as well as other real models.
ii - What it is
AlphaFold takes the sequence of a given protein as input and then uses some database lookups to generate an MSA representation and a pair(wise) representation.
The MSA data represents the original protein and presumed variations of that protein across multiple species. So you've got the protein in humans expressed as a function of similar proteins in yeast, parrots and fish. Then the protein in yeast is expressed as a function of similar proteins in humans, parrots and fish... etc. The "most relevant" sequences are picked through a domain-specific algorithm and then they get randomly sampled to reduce input size.
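To make the "randomly sampled to reduce input size" part concrete, here is a minimal sketch of how I picture it. The function name `subsample_msa`, the `max_rows` parameter and the toy sequences are my own inventions for illustration, not names from the AlphaFold code, and the real pipeline does its domain-specific relevance filtering before anything like this happens.

```python
import random

def subsample_msa(msa, max_rows, seed=0):
    """Keep the query (row 0) and randomly sample from the other aligned rows.

    `msa` is a list of equal-length aligned sequences, with '-' marking gaps.
    This is a hypothetical helper, not AlphaFold's actual sampling code.
    """
    if not msa:
        raise ValueError("MSA must contain at least the query sequence")
    query, others = msa[0], msa[1:]
    rng = random.Random(seed)
    # Never ask for more rows than exist, and always leave room for the query.
    n_extra = max(0, min(max_rows - 1, len(others)))
    return [query] + rng.sample(others, n_extra)

# Toy alignment: the first row plays the role of the "original" protein.
msa = [
    "MKTAYIAK",  # query
    "MKSAYLAK",
    "MRTAY-AK",
    "MKTGYIAR",
    "-KTAYIAK",
]
small = subsample_msa(msa, max_rows=3)
```

The only design choice worth pointing out is that the query row is always kept, since everything else is interpreted relative to it.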
The pair representation consists of a 2D template that reflects a 3D structure with some (1.25 Å) degree of accuracy at the level of distances between different amino acids (the 3D structure in comparison is based on the coordinates for individual atoms). This 2D structure is constructed based on a template derived from various PDB similarity lookups.
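As a rough illustration of what "a 2D template that reflects a 3D structure" means, the sketch below turns a handful of 3D coordinates into a matrix of pairwise distances and then buckets those distances. The 1.25 Å bin width is taken from the number above; everything else (one point per amino acid, the function names, the toy coordinates) is an assumption of mine, not the actual featurization.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residues (one 3D point per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def bin_distances(dmat, bin_width=1.25):
    """Replace each distance with the index of the bin it falls into."""
    return [[int(d // bin_width) for d in row] for row in dmat]

# Three toy residues, roughly 3.8 Å apart along a bent chain.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
dmat = distance_matrix(coords)
bins = bin_distances(dmat)
```

The resulting matrix is symmetric with zeros on the diagonal, and it no longer depends on how the protein is rotated or translated in space, which is presumably part of why a distance-based 2D view is convenient.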
This seems somewhat(?) sketchy if we want to use this method for predicting more novel proteins with no known close analogues. Happily enough, the most relevant proteins do have well-known analogues, and they do train the network with empty templates and show that it still works decently.
The fact that the database lookups are non-differentiable and use theory-laden assumptions I take as a minus. But then again, the very first AlphaFold also (if I recall correctly) used a more primitive version of the MSA lookup. So not that much has changed here.
These representations are fed into two modules. One is formed of "evoformer" blocks that output a single MSA representation and a pair representation. Both of these representations are then fed to the next block and fed into a "recycling" process that makes the whole differentiable part of the model recurrent. This evoformer-based module is actually an ensemble, which means the result is a combination of the outputs of its component evoformer-based models.
The structural prediction module determines the 3D structure (coordinates of all non-hydrogen atoms) from the single MSA and the pairwise representation. This 3D structure is also recycled.
The whole cycle repeats a bunch of times and we end up with a final 3D structure.
The interesting (?) feature of the evoformer transformer block is that it's structured in such a way to encourage/force certain constraints between the pairwise representation and the MSA representation, as well as constraints regarding the pair representation outputs. Though I'm still unclear enough on how this works to explain it in more detail.
The structure prediction block seems more straightforward, with the only caveat being that it generates a backbone representation (angles of certain structural elements) and an improved single-sequence MSA representation (single-sequence meaning the MSA features of the "original" protein rather than that + that of analogues in other reference organisms). Then it combines these two to form the final 3D structure.
The recycling step takes the final 3D structure and uses it to update the pair representation and it takes the single MSA. There's a linear layer used to compute the updates to the pair representation and two layer norms, one applied to the last dimensions of this update, one to the MSA. Finally, the evoformer-generated MSA and the evoformer-generated and predicted 3D structure updated pair representations are concatenated to the inputs.
In the first pass, all zero vectors are used instead.
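Since the recycling is the part I found hardest to hold in my head, here is a toy sketch of the control flow only. Both `evoformer_stub` and `structure_stub` are made-up stand-ins that do trivial arithmetic on lists of numbers; they have nothing to do with the real modules. The only things the sketch tries to mirror are the ones described above: zeros on the first pass, the previous outputs fed back in on later passes, and the predicted structure being used to update the recycled pair representation.

```python
def evoformer_stub(msa_in, pair_in, prev_single, prev_pair):
    """Hypothetical stand-in: mixes the fixed inputs with the recycled values."""
    single = [0.5 * (m + p) for m, p in zip(msa_in, prev_single)]
    pair = [[0.5 * (x + y) for x, y in zip(row_in, row_prev)]
            for row_in, row_prev in zip(pair_in, prev_pair)]
    return single, pair

def structure_stub(single, pair):
    """Hypothetical stand-in: produces one 'coordinate' per residue."""
    return [s + sum(row) / len(row) for s, row in zip(single, pair)]

def predict(msa_in, pair_in, n_recycle=3):
    n = len(msa_in)
    # First pass: all-zero vectors take the place of the recycled outputs.
    prev_single = [0.0] * n
    prev_pair = [[0.0] * n for _ in range(n)]
    structure = None
    for _ in range(n_recycle + 1):
        single, pair = evoformer_stub(msa_in, pair_in, prev_single, prev_pair)
        structure = structure_stub(single, pair)
        # Recycle: keep the single representation, and update the pair
        # representation using distances from the predicted structure.
        prev_single = single
        prev_pair = [[pair[i][j] + abs(structure[i] - structure[j])
                      for j in range(n)] for i in range(n)]
    return structure

msa_in = [1.0, 2.0]
pair_in = [[0.0, 1.0], [1.0, 0.0]]
first_pass = predict(msa_in, pair_in, n_recycle=0)
recycled = predict(msa_in, pair_in, n_recycle=3)
```

Note that the original inputs are passed in again on every pass; only the recycled values change between iterations.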
The loss function is only applied to the final output. The loss doesn't concern itself with upholding global constraints (i.e. heavy/inf penalties for structures that have coordinates that are physically impossibly or close-to-impossibly close). It only concerns itself with optimizing the local structure, allowing for structures that are nonsensical from a global perspective but fairly accurate on an atom-by-atom basis. Global constraints upon the structure are imposed afterwards using a separate library (Amber), which, as far as I can tell, is used to find a nearby structure that respects those global constraints.
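To show what I mean by coordinates being "close-to-impossibly close", here is a small sketch that flags pairs of points sitting nearer to each other than some minimum distance. This is not what Amber does (Amber relaxes the structure, it doesn't just detect problems), and the 3.0 Å threshold, the one-point-per-residue simplification and the name `find_clashes` are all assumptions I picked for illustration.

```python
import math
from itertools import combinations

def find_clashes(coords, min_dist=3.0):
    """Return index pairs (i, j) whose points are closer than `min_dist`.

    Assumes one 3D point per residue; the threshold is illustrative only.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) < min_dist
    ]

# The third point sits almost on top of the first one.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.5, 0.0, 0.0)]
clashes = find_clashes(coords)
```

A structure can look fine locally, residue by residue, and still return a non-empty list here, which is the kind of global problem the post-processing step is there to clean up.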
iii - The rest
I'm unsure what the accuracy numbers signify, since I'm not familiar with the error metrics in those domains, and distances expressed in angstroms hold little to no significance for me. As far as I understand, this wipes the floor with all other competition in CASP14, even including previous AlphaFold versions.
As to whether these structures will be useful, I guess that remains to be seen. DeepMind has released a lot of them, so that's a good start. And the code is open source, so presumably, anyone could generate their own... potentially with better accuracy, especially for larger or otherwise more complex proteins. Based on the experiments, there are hints that increased network size and/or more recycling runs alone might yield further minor improvements.
I was most impressed by the experiment with distillation, not to reduce network size, but rather to improve accuracy on harder proteins.
The network generated predictions on a database of proteins somewhat dissimilar to the training database. Then they used various masking techniques and trained another network to generate those same predictions. Then the accuracy of the distilled network was higher on various other known structures (that is to say, higher than the accuracy of the network which generated the original predictions based on which distillation was done).
This seems like a signal that the networks learn non-superficial rules about structure. At least if you believe the analogous claim can be made about e.g. text or images when distillation improves performance.
Another interesting note is that based on the backbone and the 3D structure they are able to apply a function called "predicted local-distance difference test (pLDDT)" in order to generate confidence for these predictions. Experimentally this confidence seems to "predict" the accuracy of the network on various subclasses of proteins. So it's potentially a good measure for identifying proteins (or at least classes of proteins) for which we could assume structures are reliable in lieu of having any known structures for those protein types.
For some proteins (usually simpler ones?) a single iteration is enough to get to the "best" possible structure the model can generate (or very close to it). For other (usually more complex?) sequences the structure is improved significantly in all iterations. I think you could concoct another "model understands the problem deeply" story from this.
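For context, pLDDT is the model's own guess at how well it would score on lDDT, a metric that compares local distances in a predicted structure against a reference one. Below is my simplified understanding of the underlying lDDT score, written as a sketch: it uses one point per residue, a 15 Å inclusion radius and the four tolerance thresholds commonly quoted for the metric. The real metric works on all atoms and has extra rules that I'm ignoring, so treat the function as an approximation.

```python
import math

def lddt(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified lDDT: fraction of local reference distances preserved.

    For every pair of points closer than `cutoff` in the reference, check
    whether the predicted distance differs by less than each threshold.
    """
    preserved = 0
    total = 0
    n = len(ref)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # only local neighbourhoods count
            diff = abs(d_ref - math.dist(pred[i], pred[j]))
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
    return preserved / total if total else 0.0

ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.6, 0.0, 0.0)]  # last point off by 3 Å
perfect = lddt(ref, ref)
partial = lddt(pred, ref)
```

Because the score only looks at distances, it doesn't require superimposing the two structures first, which is what makes it "local".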
Finally, I find it curious (hopeful?) that the model generates all of these fairly accurate intermediate representations. In that, it seems to address the "black box" problem one may have had with previous approaches. It seems like you could ask this model a lot of the interesting questions that could be asked of a physics-simulation based optimization.
The model itself seems to be surprisingly small in terms of "lines of code" and component modules. But it's both complicated and uses a mixture of TF, JAX and Haiku, which made it impossible for me to understand the code globally. I can only hope that in the future research derives a neuroscience-based heuristic by which I can relax my local understanding of various modules into a coherent global understanding of the model.
For me personally, this leap marks the end of my interest in the current protein folding paradigm. The bastards solved it using a very complex model, and from now on it's going to be 0.x% improvements and theory-laden models of ever-increasing complexity.
I may come back to the problem in 5 or 10 years to see if it can be elegantly solved by a "domain agnostic" architecture that's very popular at that time, much like people are doing nowadays with the bleeding edge of 5 years ago.
Until then, I do hope you find this review at least mildly interesting if not useful.