This site's LaTeX support is a pain, so I won't be updating paper notes here anymore.
Architecture: the backbone is still a Transformer. The upper half handles the 2D data: node degrees, edges, and shortest paths; the lower half handles the 3D data: inter-node distances. Node-level information is added into the node features, serving as feature augmentation; node-pair/edge-level information is added to the attention as a bias term, serving as a relative positional encoding.
This paper only adds 3D information on top of Ying C, Cai T, Luo S, et al. Do transformers really perform badly for graph representation?[J]. Advances in Neural Information Processing Systems, 2021, 34: 28877-28888 (the 3D data had not yet been released when that NeurIPS '21 paper was published). Verdict: a rather thin contribution.
Background
Molecular data can be represented in different forms: as a string (much like a chemical formula), as a 2D graph with atoms as nodes and chemical bonds as edges, or as the positions of the atoms in 3D space. This paper therefore wants a single model that takes both 2D and 3D data as input and produces meaningful representations.
molecules can naturally be characterized using different chemical formulations.
In this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations.
TRANSFORMER-M
Input: the 3D data provides the Cartesian coordinates (three dimensions) of every atom.
Two channels handle the 2D and 3D data respectively. The 2D channel adds the shortest-path encoding and the edge encoding as biases to the attention, and adds the degree encoding to the node features; the 3D channel adds the encoding of pairwise Euclidean distances to the attention bias, and the encoding of each node's summed distances to all other nodes to the node features.
Positional encoding matters a great deal for non-sequential data.
Shortly, many works realized that positional encoding plays a crucial role in extending standard Transformer to more complicated data structures beyond language.
Transformer layer
$$
\boldsymbol{A}^h\left(\boldsymbol{X}^{(l)}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{X}^{(l)} \boldsymbol{W}_Q^{l, h}\left(\boldsymbol{X}^{(l)} \boldsymbol{W}_K^{l, h}\right)^{\top}}{\sqrt{d}}\right)
$$
$$
\begin{gathered}
\hat{\boldsymbol{X}}^{(l)}=\boldsymbol{X}^{(l)}+\sum_{h=1}^H \boldsymbol{A}^h\left(\boldsymbol{X}^{(l)}\right) \boldsymbol{X}^{(l)} \boldsymbol{W}_V^{l, h} \boldsymbol{W}_O^{l, h} ; \\
\boldsymbol{X}^{(l+1)}=\hat{\boldsymbol{X}}^{(l)}+\operatorname{GELU}\left(\hat{\boldsymbol{X}}^{(l)} \boldsymbol{W}_1^l\right) \boldsymbol{W}_2^l
\end{gathered}
$$
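A minimal PyTorch sketch of this layer, collapsing the $H$ heads into a single head for brevity; all names (`transformer_layer`, the `W_*` arguments) are mine, not from the paper:

```python
import torch
import torch.nn.functional as F

def transformer_layer(X, W_Q, W_K, W_V, W_O, W_1, W_2):
    """One layer as in the equations above. X: (n, d) node features."""
    d = W_Q.shape[1]
    A = F.softmax(X @ W_Q @ (X @ W_K).T / d ** 0.5, dim=-1)  # attention matrix A(X)
    X_hat = X + A @ X @ W_V @ W_O                            # attention sub-layer + residual
    return X_hat + F.gelu(X_hat @ W_1) @ W_2                 # FFN sub-layer + residual
```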
Encoding pair-wise relations in E
The shortest-path encoding and the edge encoding are both $n \times n$ matrices, where each entry describes the relation between a pair of nodes.
Denote $\Phi^{\text{SPD}}$ and $\Phi^{\text{Edge}}$ as the matrix form of the SPD encoding and edge encoding, both of which are of shape $n\times n$.
Shortest path:
$$\text{SP}_{ij}=(\vec{e}_1,\vec{e}_2,\ldots,\vec{e}_N)$$
Edge encoding:
$$\Phi^{\text{Edge}}_{ij}=\frac{1}{N}\sum_{n=1}^{N} \vec{e}_n(w_{n})^T$$
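A rough sketch of how these two $n \times n$ bias matrices could be assembled; the embedding table, the per-position weights `w`, and the precomputed shortest-path edge lists are my assumptions:

```python
import torch
import torch.nn as nn

MAX_DIST = 20                               # assumed cap on shortest-path distance
spd_embed = nn.Embedding(MAX_DIST + 1, 1)   # one learnable scalar per distance value

def pairwise_2d_bias(spd, path_edges, w):
    """
    spd:        (n, n) long tensor of shortest-path distances
    path_edges: path_edges[i][j] is an (N, d_e) tensor of edge features
                along the shortest path from node i to node j
    w:          (L_max, d_e) learnable per-position weights w_n
    """
    phi_spd = spd_embed(spd.clamp(max=MAX_DIST)).squeeze(-1)   # (n, n)
    n = spd.shape[0]
    phi_edge = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            e = path_edges[i][j]                               # (N, d_e)
            if e.shape[0] > 0:
                # average of dot products e_n . w_n along the path
                phi_edge[i, j] = (e * w[: e.shape[0]]).sum(-1).mean()
    return phi_spd, phi_edge
```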
Encoding pair-wise relations in R
$\Phi^{\text{3D Distance}}$ is also an $n \times n$ matrix, where each entry reflects the spatial relation between a pair of nodes.
$$ {\Phi^{\text{3D Distance}}_{ij}}=\operatorname{GELU}\left(\boldsymbol{\psi}_{(i, j)} \boldsymbol{W}_D^1\right) \boldsymbol{W}_D^2 $$
where
$$ \boldsymbol{\psi}_{(i, j)}=\left[\psi_{(i, j)}^1 ; \ldots ; \psi_{(i, j)}^K\right]^{\top}, \quad \boldsymbol{W}_D^1 \in \mathbb{R}^{K \times K}, \quad \boldsymbol{W}_D^2 \in \mathbb{R}^{K \times 1} $$
with $K$ Gaussian basis functions:
$$ \psi_{(i, j)}^{k}=-\frac{1}{\sqrt{2 \pi}\left|\sigma^k\right|} \exp \left(-\frac{1}{2}\left(\frac{\gamma_{(i, j)}\left\|\mathbf{r}_i-\mathbf{r}_j\right\|+\beta_{(i, j)}-\mu^k}{\left|\sigma^k\right|}\right)^2\right), \quad k=1, \ldots, K $$
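A sketch of the Gaussian basis following the formula above; the pair-dependent scalars `gamma`, `beta` and the learnable `mu`, `sigma` are assumed to be given:

```python
import math
import torch

def gaussian_basis(dist, gamma, beta, mu, sigma):
    """
    dist:         (n, n) Euclidean distances ||r_i - r_j||
    gamma, beta:  (n, n) pair-dependent affine parameters
    mu, sigma:    (K,) learnable centers and widths
    Returns psi of shape (n, n, K).
    """
    x = (gamma * dist + beta).unsqueeze(-1)                 # (n, n, 1), broadcasts against (K,)
    coeff = -1.0 / (math.sqrt(2 * math.pi) * sigma.abs())   # (K,)
    return coeff * torch.exp(-0.5 * ((x - mu) / sigma.abs()) ** 2)
```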
The three encodings above are added to the attention as bias terms:
$$ \boldsymbol{A}(\boldsymbol{X})=\operatorname{softmax}\left(\frac{\boldsymbol{X} \boldsymbol{W}_Q\left(\boldsymbol{X} \boldsymbol{W}_K\right)^{\top}}{\sqrt{d}}+\underbrace{\Phi^{\mathrm{SPD}}+\Phi^{\text {Edge }}}_{2 \mathrm{D} \text { pair-wise channel }}+\underbrace{\Phi^{3 \mathrm{D} \text { Distance }}}_{\text {3D pair-wise channel }}\right) $$
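Putting it together, reusing `phi_spd`, `phi_edge`, and `psi` from the sketches above (`W_D1`, `W_D2` are the learnable projections from the $\Phi^{\text{3D Distance}}$ definition):

```python
import torch.nn.functional as F

# (n, n, K) basis values -> (n, n) 3D distance bias, per the formula above
phi_3d = (F.gelu(psi @ W_D1) @ W_D2).squeeze(-1)
logits = X @ W_Q @ (X @ W_K).T / d ** 0.5
A = F.softmax(logits + phi_spd + phi_edge + phi_3d, dim=-1)  # 2D + 3D pair-wise channels
```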
Encoding atom-wise structural information in E
The degree encoding is an $n \times d$ matrix: a learnable embedding table maps each node's degree to a vector, and a threshold is typically applied so that all degrees above the cap share the same embedding vector.
$$ \Psi^{\text{Degree}}=[\Psi^{\text{Degree}}_{1},\Psi^{\text{Degree}}_{2},\ldots,\Psi^{\text{Degree}}_{n}] $$
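A minimal sketch of the capped degree embedding; `MAX_DEGREE` and the hidden size `d` are assumed values:

```python
import torch
import torch.nn as nn

d = 64                                       # assumed hidden size
MAX_DEGREE = 16                              # assumed cap: larger degrees share one vector
deg_embed = nn.Embedding(MAX_DEGREE + 1, d)

def degree_encoding(degrees):
    """degrees: (n,) long tensor of node degrees -> (n, d) matrix Psi^Degree."""
    return deg_embed(degrees.clamp(max=MAX_DEGREE))
```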
Encoding atom-wise structural information in R
The sum-of-distances encoding is an $n \times d$ matrix:
$$ \Psi^{\text{Sum of 3D Distance}}_{i}=\sum_{j\in [n]}{\boldsymbol{\psi}_{(i,j)}}\boldsymbol{W}_D^3 $$
$$ \boldsymbol{X}^{(0)}=\boldsymbol{X}+\underbrace{\Psi^{\text {Degree }}}_{2 \mathrm{D} \text { atom-wise channel }}+\underbrace{\Psi^{\text {Sum of 3D Distance }}}_{3 \mathrm{D} \text { atom-wise channel }} $$
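The two atom-wise channels then add into the initial node features; a sketch reusing `psi` and `degree_encoding` from above, with an assumed `W_D3` of shape `(K, d)`:

```python
# psi: (n, n, K) Gaussian basis values; W_D3: (K, d) learnable projection.
psi_sum = psi.sum(dim=1) @ W_D3                     # (n, d): sum over j, then project
X0 = X + degree_encoding(degrees) + psi_sum         # input X^(0) to the first layer
```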
The 2D and 3D data are complementary to some extent: the 2D graph only records bond types, while the 3D geometry carries fine-grained bond lengths and angles.
For example, the 2D graph structure only contains bonds with bond type, while the 3D geometric structure contains fine-grained information such as lengths and angles. As another example, the 3D geometric structures are usually obtained from computational simulations like Density Functional Theory (DFT) (Burke, 2012), which could have approximation errors. The 2D graphs are constructed by domain experts, which to some extent, provide references to the 3D structure.
EXPERIMENTS
During training, a data format (2D, 3D, or 2D+3D) is randomly sampled for each instance according to a pre-defined distribution, somewhat like a dropout mechanism.
we provide three modes for each data instance: (1) activate the 2D channels and disable the 3D channels (2D mode); (2) activate the 3D channels and disable the 2D channels (3D mode); (3) activate both channels (2D+3D mode). The mode of each data instance during training is randomly drawn on the fly according to a pre-defined distribution.
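A sketch of the mode sampling; the uniform distribution below is a placeholder, since the paper only specifies that the distribution is pre-defined:

```python
import random

MODES = ["2D", "3D", "2D+3D"]
PROBS = [1 / 3, 1 / 3, 1 / 3]        # assumed; the actual distribution is a hyperparameter

def sample_mode():
    """Draw a mode per data instance on the fly, dropout-style channel masking."""
    mode = random.choices(MODES, weights=PROBS)[0]
    return mode in ("2D", "2D+3D"), mode in ("3D", "2D+3D")   # (use_2d, use_3d)
```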
Besides the supervised objective, training also uses a self-supervised objective.
Besides, we also use a self-supervised learning objective called 3D Position Denoising.
During training, if a data instance is in the 3D mode, we add Gaussian noise to the position of each atom and require the model to predict the noise from the noisy input.
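A sketch of the 3D position denoising objective; the noise scale and the `model(...)` interface are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, pos, noise_std=0.2):
    """pos: (n, 3) atom coordinates; model maps noisy coordinates to a per-atom
    (n, 3) noise prediction. Both noise_std and this interface are assumed."""
    noise = torch.randn_like(pos) * noise_std
    return F.mse_loss(model(pos + noise), noise)
```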
PCQM4MV2 PERFORMANCE (2D)
Molecular property prediction; a regression task.
the 2D-3D joint training with shared parameters indeed helps the model learn more chemical knowledge.
PDBBIND PERFORMANCE (2D & 3D)
A regression task.
Dataset:
one of the most widely used datasets for structure-based virtual screening
PDBBind dataset consists of protein-ligand complexes as data instances, which are obtained in bioassay experiments associated with the pKa (or $-\log K_d$, $-\log K_i$) affinity values.
The task requires models to predict the binding affinity of protein-ligand complexes
It is worth noting that data instances of the PDBBind dataset are protein-ligand complexes, while our model is pre-trained on simple molecules, demonstrating the transferability of Transformer-M.
QM9 PERFORMANCE (3D)
A regression task.
Dataset:
QM9 is a quantum chemistry benchmark consisting of 134k stable small organic molecules.
Each molecule is associated with 12 targets covering its energetic, electronic, and thermodynamic properties.
In particular, Transformer-M performs best on HOMO, LUMO, and HOMO-LUMO predictions. This indicates that the knowledge learned in the pre-training task transfers better to similar tasks.
ABLATION STUDY
Impact of the pre-training tasks
It can be seen that the joint pre-training significantly boosts the performance on both datasets. Besides, the 3D Position Denoising task is also beneficial, especially on the QM9 dataset.