This website contains information regarding the paper Modular Flows: Differential Molecular Generation.

TL;DR: We propose generative graph normalizing flow models, based on a system of coupled node ODEs, that repeatedly reconcile locally toward globally aligned densities for high quality molecular generation

Please cite our work if you find it useful:

@misc{https://doi.org/10.48550/arxiv.2210.06032,
  doi = {10.48550/ARXIV.2210.06032},
  url = {https://arxiv.org/abs/2210.06032},
  author = {Verma, Yogesh and Kaski, Samuel and Heinonen, Markus and Garg, Vikas},
  keywords = {Machine Learning (cs.LG), Emerging Technologies (cs.ET), Biomolecules (q-bio.BM), Machine Learning (stat.ML), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Biological sciences, FOS: Biological sciences},
  title = {Modular Flows: Differential Molecular Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Problem of Molecular Generation

Generating new molecules is fundamental to advancing critical applications such as drug discovery and material synthesis. A key challenge of molecular generative models is to be able to generate valid molecules, according to various criteria for molecular validity or feasibility. It is a common practice to use external chemical software as rejection oracles to reduce or exclude invalid molecules, or do validity checks as part of autoregressive generation [1,2,3] . An important open question has been whether generative models can learn to achieve high generative validity intrinsically, i.e., without being aided by oracles or performing additional checks. We circumvent the issues with novel physics-inspired co-evolving continuous-time flows that induces useful inductive biases for a highly complex combinatorial setting. Our method is inspired by graph PDEs, that repeatedly reconcile locally toward globally aligned densities.

Continuous Normalizing Flows

Normalizing flow have seen widespread use for density modeling, generative modeling, etc which provides a general way of constructing flexible probability distributions. It is defined by a parameterized invertible deterministic transformation from a base distribution $\mathcal{Z}$ (e.g., Gaussian distribution) to real-world observational space $X$ (e.g. images and speech). When the dynamics of transformation is governed by an ODE, the method is known as Continous Normalizing Flows (CNFs). The process starts by sampling from a base distribution $\mathbf{z}_0 \sim p_0(\mathbf{z}_0)$, then solving the IVP $\mathbf{z}(t_0) = \mathbf{z}_0$, $\dot{\mathbf{z}}(t) = \frac{\partial \mathbf{z}(t)}{\partial t} = f(\mathbf{z}(t),t;\theta)$, where ODE is defined by the parametric function $f(\mathbf{z}(t),t;\theta)$ to obtain $\mathbf{z}(t_1)$ which constitutes our observable data. Then, using the instantaneous change of variables formula change in log-density under this model is given as:

$$\frac{\partial \log p_t(\mathbf{z}(t))}{\partial t} = -\texttt{tr} \left( \frac{\partial f}{\partial \mathbf{z}(t)} \right)$$

Given a datapoint $\mathbf{x}$, we can compute both the point $\mathbf{z}_{0}$ which generates $\mathbf{x}$, as well as $\log p_1(\mathbf{x})$ by solving the initial value problem which integrates the combined dynamics of $\mathbf{z}(t)$ and the log-density of the sample resulting in the computation of $\log p_{1}(\mathbf{x})$.

Modular Flows

Representation

We represent molecule as a graph $G = (V,E)$, where each vertex takes value from an alphabet on atoms: $v \in \mathcal{A} = \{ \texttt{C},\texttt{H},\texttt{N},\texttt{O},\texttt{P},\texttt{S},\ldots \}$; while the edges $e \in \mathcal{B} = \{1,2,3\}$ abstract the type of bond (i.e., single, double, or triple). We assume the following decomposition of the graph likelihood, over vertices conditioned on the edges and given the latent representations,

$$p(G) := p(V | E,\{ z\}) = \prod_{i=1}^M \texttt{Cat}(v_i | \sigma(\mathbf{z}_i))$$

We can obtain an alternative representation by decomposing a moleculer graph into a tree, by contracting certain vertices into a single node such that the molecular graph $G$ becomes acyclic. We followed a similar decompositon as JT-VAE[4], but restrict these clusters to ring-substructures, in addition to the atom alphabet. Thus, we obtain an extended alphabet vocabulary as $\mathcal{A}_{\mathrm{tree}} = \{ \texttt{C},\texttt{H},\texttt{N}, \ldots, \texttt{C}_{1},\texttt{C}_{2},\ldots \}$, where each cluster label $\texttt{C}_{r}$ corresponds to the some ring-substructure in the label vocabulary $\chi$

Differential Modular Flows

Based on the general recipie of normalizing flows, we propose to model the node scores $\mathbf{z}_{i}$ as a Continuous-time Normalizing Flow (CNF)[7] over time $t \in \mathrm{R}_+$. We assume the initial scores at time $t=0$ follow an uninformative Gaussian base distribution $\mathbf{z}_i(0) \sim \mathcal{N}(0,I)$ for each node $i$. Node scores evolve in parallel over time by a differential equation,

$$\dot{\mathbf{z}}_{i}(t) := \frac{\partial \mathbf{z}_i(t)}{\partial t} = f_\theta\big( t, \mathbf{z}_i(t), \mathbf{z}_{\mathcal{N}_i}(t),\mathbf{x}_{i}, \mathbf{x}_{\mathcal{N}_i} \big), \qquad i = 1, \ldots, M$$

where $\mathcal{N}_{i} = \{ \mathbf{z}_{j} : (i,j) \in E \}$ is the set of neighbor scores at time $t$, $\mathbf{x}$ is the spatial information (2D/3D), and $\theta$ are the parameters of the flow function $f$ to be learned.

By collecting all node differentials we obtain a modular joint, coupled ODE, which is equivalent to a graph PDE [9,10], where the evolution of each node only depends on its immediate neighbors.

$$\dot{\mathbf{z}}_{i}(t) = \begin{pmatrix} \dot{\mathbf{z}}_{i}(t)_1(t) \\ \vdots \\ \dot{\mathbf{z}}_{i}(t)_M(t) \end{pmatrix} = \begin{pmatrix} f_\theta\big( t, \mathbf{z}_1(t), \mathbf{z}_{\mathcal{N}_1}(t),\mathbf{x}_{i}, \mathbf{x}_{\mathcal{N}_i} \big) \\ \vdots \\ f_\theta\big( t, \mathbf{z}_M(t), \mathbf{z}_{\mathcal{N}_M}(t),\mathbf{x}_{i}, \mathbf{x}_{\mathcal{N}_i} \big) \end{pmatrix} $$

Equivariant local differential

The goal is to have a function $f_{\theta}$ such that it satisfies natural equivariances and invariances of molecules like translation, rotational, reflection equivariances. Therefore, we chose to use E(3)-Equivariant GNN (EGNN)[11] which satisfies all the above criteria.

Training Objective

We reduce the learning problem to maximizing the score cross-entropy $\mathrm{E}_{\hat{p}_{\mathrm{data}}(\mathbf{z}(T))}[\log p_\theta(\mathbf{z}(T))]$, where we turn the observed set of graphs $\{G_{n}\}$ into a set of scores $\{\mathbf{z}_{n}\}$ by using one-hot encoding

$$\mathbf{z}_n (G_n; \epsilon) = (1-\epsilon)~\mathrm{onehot}(G_n) ~+~ \dfrac{\epsilon}{|\mathcal{A_s}|} \textbf{1}_{M(n)} \textbf{1}_{|\mathcal{A_s}|}^{\top}~,$$

where $\mathrm{onehot}(G_{n})$ is a matrix ($M(n) \times |\mathcal{A_{s}}|$), such that $G_{n}(i, k)$ = 1 if $v_{i} = a_{k} \in \mathcal{A_{s}}$, that is if the vertex $i$ is labeled with atom $k$, and 0 otherwise; $\textbf{1}_{q}$ is a vector with $q$ entries each set to 1; $\mathcal{A_{s}} \in \{\mathcal{A}, \mathcal{A}_{\rm tree} \}$; and $\epsilon \in [0,1]$ is added to model the noise in estimating the posterior $p({\mathbf{z}(T)|G})$. This is due to short-circuiting the inference process from $G$ to $\mathbf{z}(T)$ skipping the intermediate dependencies, as shown in the plate diagram.

We exploit the non-reversible composition of the argmax and softmax to transition from continous space to discrete graph space, but short-circuit in reverse direction as shown in the figure below. This indeed allows to keep the forward and backward flows aligned. We thus maximize an objective over $N$ training graphs,

$$\texttt{argmax}_\theta \qquad \mathcal{L} = \mathcal{E}_{\hat{p}_{\mathrm{data}}(\mathbf{z})} \log p_\theta(\mathbf{z}) \approx \frac{1}{N} \sum_{n=1}^N \log p_T\big( \mathbf{z}(T) = \mathbf{z}_n \big)$$

Molecule Generation

We generate novel molecules by sampling an initial state $\mathbf{z}(0) \sim \mathcal{N}(0,I)$ based on structure, and running the modular flow forward in time until $\mathbf{z}(T)$. This procedure maps a tractable base distribution $p_0$ to some more complex distribution $p_T$. We follow argmax to pick the most probable label assignment for each node as shown below.

Results

Density Estimation

We demonstrated the power of our method on learning highly discontinous patterns on 2D grid graphs. We considered patterns corresponding to two-variants of chess-board pattern as $4 \times 4$, where every node has opposite value to its neighbors and $16 \times 16$ grid where blocks of $4 \times 4$ nodes have uniform values, but opposite across blocks. At last, we also considered alternate stripes pattern over $20 \times 20$ grid.

Molecular Experiments

We trained the model on QM9[6] and ZINC250K[5] dataset, where molecules are in kekulized form with hydrogens removed by the RDkit[8] software. We adopt common quality metrics to evaluate molecular generation as,

Validity: Fraction of molecules that satisfy chemical valency rule
Uniqueness: Fraction of non-duplicate generations
Novelty: Fraction of molecules not present in training data
Reconstruction: Fraction of molecules that can be reconstructed from their encoding

Apart from these metrics, we also evaluated our method on MOSES metrics. These are:

FCD: measures diversity and chemical and biological property alignment
SNN: quantifies closeness of generated molecules to true molecule manifold
Frag: measures distance between the fragment frequencies generated and reference
IntDiv: diversity by computing pairwise similarity of the generated molecules

Some of the generated molecules via $\texttt{ModFlow}$ are also shown above. We visually evaluate the generated structures via out method via properties distribution. We utilize kernel density estimation of these distributions to visualize these distributions. We use

Molecular Weight: Sum of the individual atomic weights of a molecule.
LogP: Ratio of concentration in octanol-phase to aqueous phase, also known as the octanol-water partition coefficient.
Synthetic Accessibility Score (SA): Estimate describing the synthesizability of a given molecule
Quantitative Estimation of Drug-likeness (QED): Value describing likeliness of a molecule as a viable candidate for a drug

Property-targeted Molecular Optimization

We performed Property-targeted Molecular Optimization, to search for molecules, having a better chemical properties. Specifically, we choose quantitative estimate of drug-likeness (QED) as our target chemical property, which measures the potential of a molecule to be characterized as a drug. We used a pre-trained ModFlow model $f$, to encode a molecule $\mathcal{M}$ and get the embedding $Z = f(\mathcal{M})$, and further used linear regression to regress these embeddings to the QED scores and interpolated in the latent space space of a molecule along the direction of increasing QED. This is done via gradient ascend method, $Z' = Z + \lambda*\frac{dy}{dZ}$ where $y$ is the QED score and $\lambda$ is the length of the search step. The above method is conducted for $K$ steps, and the new embedding $Z'$ is decoded back to molecule space via reverse mapping $\mathcal{M}' = f^{-1}(\mathcal{Z}')$.

The above figures represent the molecules decoded from the learned latent space with linear regression for successful molecular optimization.

Ablation Studies

We performed ablation experiments to gain further insights about $\texttt{ModFlow}$. Specifically, we conducted ablation study to quantify the effect of incorporating the symmetries in our model as E(3) Equivariant vs Not Equivariant, where we compare the results to a 3-layer GCN and investigated whether including 3D coordinate information 2D vs 3D, improves the model and evaluate the benefit of including the geometric information.

Conclusion

We propose Physics-inspired co-evolving continuous-time flows, inspired by graph PDEs as $\texttt{ModFlow}$, where multiple flows interact locally according to a modular coupled ODE system.

The coupled dynamics results in accurate modeling of graph densities and high quality molecular generation without any validity checks or correction.

Interesting avenues open up, including the design of (a) more nuanced mappings between discrete and continuous spaces, and (b) extensions of modular flows to (semi-)supervised settings.

References

Youzhi Luo, Keqiang Yan, and Shuiwang Ji. Graphdf: A discrete flow model for molecular graph generation,2021
Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. Graphaf: a flow-based autoregressive model for molecular graph generation
Mariya Popova, Mykhailo Shvets, Junier Oliva, and Olexandr Isayev. Molecularrnn: Generating realistic molecular graphs with optimized properties,2019
Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation
John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012
Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014
Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models
Greg Landrum et al. Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling, 2013
Valerii Iakovlev, Markus Heinonen, and Harri Lähdesmäki. Learning continuous-time pdes from sparse data with graph neural networks.
Ben Chamberlain, James Rowbottom, Maria I Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi Grand: Graph neural diffusion. In International Conference on Machine Learning, pages 1407–1418. PMLR, 2021
Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, 2021

Modular Flows Differential Molecular Generation

In Neural Information Processing Systems (NeurIPS), 2022

Yogesh Verma, Samuel Kaski, Markus Heinonen, Vikas Garg