Predicting equilibrium distributions for molecular systems with deep learning

Deep neural networks have been demonstrated to predict accurate molecular structures from descriptors $\mathcal{D}$ for many molecular systems1,5,6,9,10,11,12. Here, DiG aims to take one step further and predict not only the most probable structure but also diverse structures with their probabilities under the equilibrium distribution. To tackle this challenge, inspired by the heating-annealing paradigm, we break the difficulty of this problem down into a series of simpler problems. The heating-annealing paradigm can be viewed as a pair of reciprocal stochastic processes on the structure space that simulate the transformation between the system-specific equilibrium distribution and a system-independent simple distribution $p_{\mathrm{simple}}$. Following this idea, we use an explicit diffusion process (forward process; Fig. 1b, orange arrows) that gradually transforms the target distribution of the molecule, $q_{\mathcal{D},0}$, as the initial distribution, towards $p_{\mathrm{simple}}$ over a time period $\tau$. The corresponding reverse diffusion process then transforms $p_{\mathrm{simple}}$ back to the target distribution $q_{\mathcal{D},0}$. This is the generation process of DiG (Fig. 1b, blue arrows). The reverse process is performed by incorporating updates predicted by deep neural networks from the given $\mathcal{D}$, which are trained to match the forward process. The descriptor $\mathcal{D}$ is processed into node representations $\mathcal{V}$, describing the features of each system-specific individual element, and a pair representation $\mathcal{P}$, describing inter-node features. The $\{\mathcal{V},\mathcal{P}\}$ representation is the direct input from the descriptor part to the Graphormer model10, together with the geometric structure input $\mathbf{R}$, to produce a physically finer structure (Supplementary Information sections B.1 and B.3). Specifically, we choose $p_{\mathrm{simple}} := \mathcal{N}(\mathbf{0},\mathbf{I})$, the standard Gaussian distribution in the state space, and take the forward diffusion process to be the Langevin diffusion process targeting this $p_{\mathrm{simple}}$ (the Ornstein-Uhlenbeck process)40,41,42. A time dilation scheme $\beta_t$ (ref. 43) is introduced for approximate convergence to $p_{\mathrm{simple}}$ after a finite time $\tau$. The result is written as the following stochastic differential equation (SDE):

$$\mathrm{d}\mathbf{R}_{t}=-\frac{\beta_{t}}{2}\mathbf{R}_{t}\,\mathrm{d}t+\sqrt{\beta_{t}}\,\mathrm{d}\mathbf{B}_{t}$$

(1)

where $\mathbf{B}_t$ is the standard Brownian motion (a.k.a. the Wiener process). Choosing this forward process leads to a $p_{\mathrm{simple}}$ that is more concentrated than a heated distribution, so high-density samples are easier to draw, and the form of the process enables efficient training and sampling.
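For concreteness, the discretized forward process is simple to simulate. The sketch below (PyTorch) applies the per-step transition $\mathbf{R}_i = \sqrt{1-\beta_i}\,\mathbf{R}_{i-1} + \sqrt{\beta_i}\,\boldsymbol{\epsilon}$ that discretizes equation (1) and is consistent with the $\alpha_i$ and $\sigma_i$ defined later in this section; the linear noise schedule and step count are illustrative assumptions, not the paper's time-dilation scheme (ref. 43).

```python
import torch

def forward_diffuse(R0: torch.Tensor, betas: torch.Tensor) -> torch.Tensor:
    """Simulate the discretized forward process of equation (1).

    R0    : (n_atoms, 3) structure drawn from the target distribution q_{D,0}
    betas : (N,) discretized noise schedule (illustrative values below)
    """
    R = R0.clone()
    for beta in betas:
        # One-step OU transition: R_i = sqrt(1 - beta_i) R_{i-1} + sqrt(beta_i) eps
        R = torch.sqrt(1.0 - beta) * R + torch.sqrt(beta) * torch.randn_like(R)
    return R  # approximately p_simple = N(0, I) once the schedule has decayed R0 away

# Illustrative schedule (an assumption; DiG uses the time-dilation scheme of ref. 43)
betas = torch.linspace(1e-4, 2e-2, steps=1000)
R_tau = forward_diffuse(torch.randn(10, 3), betas)
```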

Following stochastic process theory (see, for example, ref. 44), the reverse process is also a stochastic process, written as the following SDE:

$$\mathrm{d}\mathbf{R}_{\bar{t}}=\frac{\beta_{\bar{t}}}{2}\mathbf{R}_{\bar{t}}\,\mathrm{d}\bar{t}+\beta_{\bar{t}}\,\nabla \log {q}_{\mathcal{D},\bar{t}}(\mathbf{R}_{\bar{t}})\,\mathrm{d}\bar{t}+\sqrt{\beta_{\bar{t}}}\,\mathrm{d}\mathbf{B}_{\bar{t}}$$

(2)

where $\bar{t} := \tau - t$ is the reversed time, $q_{\mathcal{D},\bar{t}} := q_{\mathcal{D},t=\tau-\bar{t}}$ is the forward-process distribution at the corresponding time and $\mathbf{B}_{\bar{t}}$ is the Brownian motion in reversed time. Note that the forward and corresponding reverse processes, equations (1) and (2), are inspired by, but are not exactly, the heating and annealing processes. In particular, there is no concept of temperature in the two processes. The temperature $T$ mentioned in the PIDP loss below is the temperature of the real target system and is not related to the diffusion processes.

From equation (2), the only obstacle that impedes simulating the reverse process to recover $q_{\mathcal{D},0}$ from $p_{\mathrm{simple}}$ is the unknown $\nabla\log q_{\mathcal{D},\bar{t}}(\mathbf{R}_{\bar{t}})$. Deep neural networks are therefore used to construct a score model $\mathbf{s}^{\theta}_{\mathcal{D},t}(\mathbf{R})$, which is trained to predict the true score function $\nabla\log q_{\mathcal{D},t}(\mathbf{R})$ of each instantaneous distribution $q_{\mathcal{D},t}$ of the forward process. This formulation is called a diffusion-based generative model and has been demonstrated to generate high-quality samples of images and other content27,28,45,46,47. As our score model is defined on molecular conformation space, we use our previously developed Graphormer model10 as the neural-network backbone of DiG, to leverage its capabilities in modelling molecular structures and to generalize across a range of molecular systems. Note that the score model aims to approximate a gradient, which is a set of vectors that are equivariant with respect to the input coordinates; we therefore designed an equivariant vector output head for the Graphormer model (Supplementary Information section B.4).

With the $\mathbf{s}^{\theta}_{\mathcal{D},t}(\mathbf{R})$ model, a sample $\mathbf{R}_0$ from the equilibrium distribution of a system $\mathcal{D}$ can be drawn by simulating the reverse process in equation (2) on $N+1$ steps that uniformly discretize $[0,\tau]$ with step size $h=\tau/N$ (Fig. 1b, blue arrows), thus

$$\begin{array}{ll}&\mathbf{R}_{N} \sim {p}_{\mathrm{simple}},\\ &\mathbf{R}_{i-1}=\frac{1}{\sqrt{1-\beta_{i}}}\left(\mathbf{R}_{i}+\beta_{i}\,\mathbf{s}_{\mathcal{D},i}^{\theta}(\mathbf{R}_{i})\right)+\mathcal{N}(\mathbf{0},\beta_{i}\mathbf{I}),\quad i=N,\cdots,1,\end{array}$$

where the discrete step index $i$ corresponds to time $t = ih$, and $\beta_i := h\beta_{t=ih}$. Supplementary Information section A.1 provides the derivation. Note that the reverse process does not need to be ergodic: DiG models the equilibrium distribution using the instantaneous distribution at the instant $t=0$ (or $\bar{t}=\tau$) of the reverse process, not a time average. As the $\mathbf{R}_N$ samples can be drawn independently, DiG can generate statistically independent $\mathbf{R}_0$ samples from the equilibrium distribution. In contrast to MD or MCMC simulations, DiG sample generation does not suffer from the rare events that link different states and can thus be far more computationally efficient.
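Expressed as code, the generation procedure is a direct loop over the update above. The sketch below is a minimal PyTorch rendering; the `score_model(descriptor, R, i)` signature is an assumption standing in for the trained Graphormer score head $\mathbf{s}^{\theta}_{\mathcal{D},i}$.

```python
import torch

@torch.no_grad()
def sample(score_model, descriptor, n_atoms: int, betas: torch.Tensor) -> torch.Tensor:
    """Simulate the reverse process (Fig. 1b, blue arrows) to draw one R_0 sample.

    betas : (N + 1,) with betas[i] = h * beta_{t = i h}; index 0 is unused.
    """
    N = betas.shape[0] - 1
    R = torch.randn(n_atoms, 3)                     # R_N ~ p_simple = N(0, I)
    for i in range(N, 0, -1):
        score = score_model(descriptor, R, i)       # s^theta_{D,i}(R_i)
        R = (R + betas[i] * score) / torch.sqrt(1.0 - betas[i])
        R = R + torch.sqrt(betas[i]) * torch.randn_like(R)  # N(0, beta_i I) term
    return R                                        # one equilibrium-distribution sample
```

Because each call starts from an independent $\mathbf{R}_N$, repeated calls yield the statistically independent samples described above.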

DiG can be trained using conformation data sampled over a range of molecular systems. However, collecting sufficient experimental or simulation data to characterize the equilibrium distribution of various systems is extremely costly. To address this data-scarcity issue, we propose a pre-training algorithm, called PIDP, that effectively optimizes DiG on an initial set of candidate structures that need not be sampled from the equilibrium distribution. The supervision comes from the energy function $E_{\mathcal{D}}$ of each system $\mathcal{D}$, which defines the equilibrium distribution $q_{\mathcal{D},0}(\mathbf{R}) \propto \exp\left(-\frac{E_{\mathcal{D}}(\mathbf{R})}{k_{\mathrm{B}}T}\right)$ at the target temperature $T$.

The key idea is that the true score function $\nabla\log q_{\mathcal{D},t}$ of the forward process in equation (1) obeys a partial differential equation, known as the Fokker-Planck equation (see, for example, ref. 48). We then pre-train the score model $\mathbf{s}^{\theta}_{\mathcal{D},t}$ by minimizing the following loss function, which enforces the equation to hold:

$$\begin{array}{rc}&\mathop{\sum}\limits_{i=1}^{N}\frac{1}{M}\mathop{\sum}\limits_{m=1}^{M}\left\Vert \frac{\beta_{i}}{2}\left(\nabla\left(\mathbf{R}_{\mathcal{D},i}^{(m)}\cdot \mathbf{s}_{\mathcal{D},i}^{\theta}(\mathbf{R}_{\mathcal{D},i}^{(m)})\right)\right.+\nabla{\left\Vert \mathbf{s}_{\mathcal{D},i}^{\theta}(\mathbf{R}_{\mathcal{D},i}^{(m)})\right\Vert}^{2}+\nabla\left(\nabla\cdot \mathbf{s}_{\mathcal{D},i}^{\theta}(\mathbf{R}_{\mathcal{D},i}^{(m)})\right)\right)\\ &\left.-\frac{\partial}{\partial t}\mathbf{s}_{\mathcal{D},i}^{\theta}\left(\mathbf{R}_{\mathcal{D},i}^{(m)}\right)\right\Vert^{2}+\frac{\lambda_{1}}{M}\mathop{\sum}\limits_{m=1}^{M}{\left\Vert \frac{1}{{k}_{\mathrm{B}}T}\nabla {E}_{\mathcal{D}}\left(\mathbf{R}_{\mathcal{D},1}^{(m)}\right)+\mathbf{s}_{\mathcal{D},1}^{\theta}\left(\mathbf{R}_{\mathcal{D},1}^{(m)}\right)\right\Vert}^{2}\end{array}$$

Here, the second term, weighted by $\lambda_1$, matches the score model at the final generation step to the score of the energy function, and the first term implicitly propagates the energy-function supervision to intermediate time steps (Fig. 1b, upper row). The structures $\{\mathbf{R}_{\mathcal{D},i}^{(m)}\}_{m=1}^{M}$ are points on a grid spanning the structure space. Because these structures are used only to evaluate the loss function on discretized points, they do not have to obey the equilibrium distribution (as is required of structures in a training dataset), so the cost of preparing them can be much lower. As the structure spaces of molecular systems are often very high dimensional (for example, thousands of dimensions for proteins), a regular grid would have intractably many points. Fortunately, the space of actual interest is only a low-dimensional manifold of physically reasonable structures (structures with low energy) relevant to the problem. This allows us to train the model only on these relevant structures as $\mathbf{R}_0$ samples; $\mathbf{R}_i$ samples are then produced by passing the $\mathbf{R}_0$ samples through the forward process. See Supplementary Information section C.1 for an example of acquiring relevant structures for protein systems.

We also leverage stochastic estimators, including Hutchinson's estimator49,50, to reduce the complexity of calculating high-order derivatives of high-dimensional vector-valued functions. Note that, for each step $i$, the corresponding model $\mathbf{s}^{\theta}_{\mathcal{D},i}$ receives a training loss independent of the other steps and can be directly back-propagated. In this way, the supervision on each step improves optimization efficiency.
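As an illustration of the stochastic estimators involved, the sketch below implements Hutchinson's estimator of the divergence $\nabla\cdot\mathbf{s}(\mathbf{R})$ with PyTorch autograd; `s_fn` here is any differentiable vector field, so no specific model signature is assumed.

```python
import torch

def divergence_hutchinson(s_fn, R: torch.Tensor, n_probes: int = 1) -> torch.Tensor:
    """Estimate div s(R) = tr(ds/dR) without forming the full Jacobian.

    Uses E_v[v^T (ds/dR) v] with Gaussian probes v (Rademacher probes also work).
    Pass create_graph=True to autograd.grad if the estimate is itself differentiated,
    as when it appears inside a training loss such as PIDP.
    """
    R = R.detach().requires_grad_(True)
    est = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(R)
        sv = (s_fn(R) * v).sum()                    # scalar v . s(R)
        (jv,) = torch.autograd.grad(sv, R)          # (ds/dR)^T v
        est = est + (jv * v).sum()                  # v^T (ds/dR) v
    return est / n_probes
```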

In addition to using the energy function for information on the probability distribution of the molecular system, DiG can also be trained with molecular structure samples that can be obtained from experiments, MD or other simulation methods. See Supplementary Information section C for data collection details. Even when the simulation data are limited, they still provide information about the regions of interest and about the local shape of the distribution in these regions; hence, they are helpful to improve a pre-trained DiG. To train DiG on data, the score model $\mathbf{s}^{\theta}_{\mathcal{D},i}(\mathbf{R}_i)$ is matched to the corresponding score function $\nabla\log q_{\mathcal{D},i}$ demonstrated by the data samples. This can be done by minimizing $\mathbb{E}_{q_{\mathcal{D},i}(\mathbf{R}_i)}{\left\Vert \mathbf{s}^{\theta}_{\mathcal{D},i}(\mathbf{R}_i)-\nabla\log q_{\mathcal{D},i}(\mathbf{R}_i)\right\Vert}^2$ for each diffusion time step $i$. Although a precise calculation of $\nabla\log q_{\mathcal{D},i}$ is impractical, the loss function can be equivalently reformulated into a denoising score-matching form51,52

$$\frac{1}{N}\mathop{\sum}\limits_{i=1}^{N}{\mathbb{E}}_{{q}_{\mathcal{D},0}(\mathbf{R}_{0})}{\mathbb{E}}_{p({\boldsymbol{\epsilon}}_{i})}{\left\Vert {\sigma}_{i}\,\mathbf{s}_{\mathcal{D},i}^{\theta}({\alpha}_{i}\mathbf{R}_{0}+{\sigma}_{i}{\boldsymbol{\epsilon}}_{i})+{\boldsymbol{\epsilon}}_{i}\right\Vert}^{2}$$

where $\alpha_i := \prod_{j=1}^{i}\sqrt{1-\beta_j}$, $\sigma_i := \sqrt{1-\alpha_i^2}$ and $p(\boldsymbol{\epsilon}_i)$ is the standard Gaussian distribution. The expectation under $q_{\mathcal{D},0}$ can be estimated using the simulation dataset.

We remark that this score-predicting formulation is equivalent (Supplementary Information section A.1.2) to the noise-predicting formulation28 in the diffusion-model literature. Note that this loss allows direct estimation and back-propagation for each $i$ at a constant (with respect to $i$) cost, again recovering efficient step-specific supervision (Fig. 1b, bottom).
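In code, one training step of this denoising score-matching objective can be estimated at constant cost by sampling a single step index $i$. A minimal sketch, again assuming the illustrative `score_model(descriptor, R, i)` signature:

```python
import torch

def dsm_loss(score_model, descriptor, R0: torch.Tensor,
             alphas: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    """One-sample estimate of the denoising score-matching loss.

    alphas[i] = prod_{j<=i} sqrt(1 - beta_j) and sigmas[i] = sqrt(1 - alphas[i]**2),
    stored for i = 0..N; a uniformly sampled i estimates the (1/N) sum over steps.
    """
    N = alphas.shape[0] - 1
    i = int(torch.randint(1, N + 1, ()))            # random diffusion step
    eps = torch.randn_like(R0)                      # eps_i ~ N(0, I)
    R_i = alphas[i] * R0 + sigmas[i] * eps          # closed-form forward perturbation
    return ((sigmas[i] * score_model(descriptor, R_i, i) + eps) ** 2).sum()
```

Averaging this estimate over a minibatch of data structures $\mathbf{R}_0$ and back-propagating to $\theta$ gives the step-specific supervision described above.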

The computation of many thermodynamic properties of a molecular system (for example, free energy or entropy) also requires the density function of the equilibrium distribution, which is another aspect of the distribution beyond the ability to draw samples. DiG allows for this by tracking the distribution change along the diffusion process45:

$$\begin{array}{l}\log {p}_{\mathcal{D},0}^{\theta}(\mathbf{R}_{0})=\log {p}_{\mathrm{simple}}\left(\mathbf{R}_{\mathcal{D},\tau}^{\theta}(\mathbf{R}_{0})\right)\\ \qquad\qquad\qquad\;\;-\displaystyle\int\nolimits_{0}^{\tau}\frac{\beta_{t}}{2}\,\nabla\cdot \mathbf{s}_{\mathcal{D},t}^{\theta}\left(\mathbf{R}_{\mathcal{D},t}^{\theta}(\mathbf{R}_{0})\right)\,\mathrm{d}t-\frac{D}{2}\displaystyle\int\nolimits_{0}^{\tau}{\beta}_{t}\,\mathrm{d}t\end{array}$$

where $D$ is the dimension of the state space and $\mathbf{R}^{\theta}_{\mathcal{D},t}(\mathbf{R}_0)$ is the solution to the ordinary differential equation (ODE)

$$\mathrm{d}\mathbf{R}_{t}=-\frac{\beta_{t}}{2}\left(\mathbf{R}_{t}+\mathbf{s}_{\mathcal{D},t}^{\theta}(\mathbf{R}_{t})\right)\mathrm{d}t$$

(3)

with initial condition $\mathbf{R}_0$, which can be solved using standard ODE solvers or more efficient specialized solvers (Supplementary Information section A.6).
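A fixed-step Euler integration of equation (3), combined with the Hutchinson divergence sketch above, gives a minimal (if crude) illustration of this density evaluation; the more efficient solvers of Supplementary Information section A.6 would replace the Euler loop.

```python
import math
import torch

def log_density(score_model, descriptor, R0: torch.Tensor,
                betas_t: torch.Tensor, h: float) -> torch.Tensor:
    """Evaluate log p^theta_{D,0}(R0) by tracking the change along the ODE, eq. (3).

    betas_t : (N,) values of beta_t on the uniform time grid; h = tau / N.
    Reuses divergence_hutchinson from the Hutchinson sketch above.
    """
    R, D = R0.clone(), R0.numel()
    div_int = 0.0                                    # accumulates int (beta_t/2) div s dt
    beta_int = 0.0                                   # accumulates int beta_t dt
    for i, beta in enumerate(betas_t):
        s_fn = lambda X, i=i: score_model(descriptor, X, i)
        div_int = div_int + 0.5 * beta * h * divergence_hutchinson(s_fn, R)
        with torch.no_grad():
            R = R - 0.5 * beta * h * (R + s_fn(R))   # Euler step of equation (3)
        beta_int = beta_int + beta * h
    # log p_simple at the transported point, for p_simple = N(0, I)
    log_p_simple = -0.5 * (R ** 2).sum() - 0.5 * D * math.log(2.0 * math.pi)
    return log_p_simple - div_int - 0.5 * D * beta_int
```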

There is a growing demand for the design of materials and molecules that possess desired properties, such as intrinsic electronic band gap, elastic modulus and ionic conductivity, without going through a forward searching process. DiG provides a feature to enable such property-guided structure generation, by directly predicting the conditional structural distribution given a value $c$ of a microscopic property.

To achieve this goal, we only need to adapt the score function in the data-generating process of equation (2) from $\nabla\log q_{\mathcal{D},t}(\mathbf{R})$ to $\nabla_{\mathbf{R}}\log q_{\mathcal{D},t}(\mathbf{R}\,|\,c)$. Using Bayes' rule, the latter can be reformulated as $\nabla_{\mathbf{R}}\log q_{\mathcal{D},t}(\mathbf{R}\,|\,c)=\nabla\log q_{\mathcal{D},t}(\mathbf{R})+\nabla_{\mathbf{R}}\log q_{\mathcal{D}}(c\,|\,\mathbf{R})$, where the first term can be approximated by the learned (unconditioned) score model; that is, the new score model is

$$\mathbf{s}_{\mathcal{D},i}^{\theta}(\mathbf{R}_{i}\,|\,c)=\mathbf{s}_{\mathcal{D},i}^{\theta}(\mathbf{R}_{i})+{\nabla}_{\mathbf{R}_{i}}\log {q}_{\mathcal{D}}(c\,|\,\mathbf{R}_{i})$$

Hence, only a $q_{\mathcal{D}}(c\,|\,\mathbf{R})$ model is additionally needed45,46, which is a property predictor or classifier that is much easier to train than a generative model.
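A sketch of this guidance term: given any differentiable `property_model` returning the scalar $\log q_{\mathcal{D}}(c\,|\,\mathbf{R})$ (for example, a Gaussian log-likelihood around a property regressor's prediction), the conditional score is the unconditioned score plus an autograd gradient. Both model signatures here are illustrative assumptions.

```python
import torch

def guided_score(score_model, property_model, descriptor,
                 R_i: torch.Tensor, i: int, c: torch.Tensor) -> torch.Tensor:
    """Conditional score s(R_i | c) = s(R_i) + grad_R log q_D(c | R_i)."""
    R = R_i.detach().requires_grad_(True)
    log_q = property_model(descriptor, R, i, c)     # scalar log q_D(c | R), assumed
    (guidance,) = torch.autograd.grad(log_q, R)     # grad_R log q_D(c | R)
    return score_model(descriptor, R_i, i) + guidance
```

Substituting `guided_score` for the unconditioned score in the sampling loop above yields property-conditioned structures without any conditional training data.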

In a normal workflow for ML inverse design, a dataset meeting the conditional distribution must first be generated, and an ML model is then trained on this dataset to predict the structure distribution. The ability to generate structures from a conditional distribution without requiring a conditional dataset places DiG at an advantage over such workflows in terms of both efficiency and computational cost.

Given two states, DiG can approximate a reaction path that corresponds to reaction coordinates or collective variables, and find intermediate states along the path. This exploits the fact that, if $\mathbf{s}^{\theta}_{\mathcal{D},i}$ is well learned, the distribution-transformation process described in equation (1) is equivalent to the process in equation (3), which is deterministic and invertible and hence establishes a correspondence between the structure space and the latent space. We can therefore uniquely map the two given states in structure space to the latent space, approximate the path in the latent space by linear interpolation and then map the path back to structure space (see the sketch below). Because the latent distribution is Gaussian, which has convex contours, the linearly interpolated path passes through high-probability (low-energy) regions, giving an intuitive guess of the real reaction path.
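A minimal sketch of this procedure, reusing the illustrative `score_model` signature and a fixed-step Euler integration of equation (3) in both directions:

```python
import torch

@torch.no_grad()
def interpolate_path(score_model, descriptor, R_a: torch.Tensor, R_b: torch.Tensor,
                     betas_t: torch.Tensor, h: float, n_points: int = 10):
    """Guess a reaction path between states R_a and R_b via the latent space."""

    def to_latent(R):                     # integrate eq. (3) from t = 0 to tau
        for i, beta in enumerate(betas_t):
            R = R - 0.5 * beta * h * (R + score_model(descriptor, R, i))
        return R

    def to_structure(Z):                  # integrate eq. (3) back from tau to 0
        for i, beta in reversed(list(enumerate(betas_t))):
            Z = Z + 0.5 * beta * h * (Z + score_model(descriptor, Z, i))
        return Z

    z_a, z_b = to_latent(R_a.clone()), to_latent(R_b.clone())
    # Linear interpolation in the Gaussian latent space stays in high-density regions
    return [to_structure((1 - lam) * z_a + lam * z_b)
            for lam in torch.linspace(0.0, 1.0, n_points)]
```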

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
