Method of MiLoPYP

(Exploration module) Self-supervised pattern mining and exploration module

Goal:

learn an embedding from the input set of tomograms

facilitate the identification of abundant protein species

three steps:

3D tomogram pre-processing

2D CNN feature representation learning based on self-supervised contrastive learning

visualization of learned features

Preprocessing, coordinate generation and filtering

applying Gaussian denoising and histogram equalization, enhancing contrast

Obtaining 3D candidate coordinates

Start from the Gaussian-filtered tomogram T_i.
For every T_i, smoothing with progressively higher standard deviations yields n Gaussian-smoothed tomograms.
The difference between two Gaussian-smoothed tomograms (difference of Gaussians, DoG) is then calculated.
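
The DoG step can be sketched as follows. This is a minimal NumPy illustration, not the MiLoPYP implementation: the helper names, the sigma values, the zero padding and the choice to subtract adjacent smoothing levels are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_smooth(volume, sigma):
    """Separable Gaussian smoothing with zero padding.

    Assumes the kernel is not longer than any axis of the volume.
    """
    out = volume.astype(float)
    k = gaussian_kernel_1d(sigma)
    for axis in range(out.ndim):
        out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out

def difference_of_gaussians(volume, sigmas):
    """Smooth with increasing sigmas and subtract adjacent levels (assumed pairing)."""
    smoothed = [gaussian_smooth(volume, s) for s in sigmas]
    return [a - b for a, b in zip(smoothed[:-1], smoothed[1:])]

# Toy volume with a single bright voxel: the DoG response peaks at that voxel.
tomo = np.zeros((17, 17, 17))
tomo[8, 8, 8] = 1.0
dogs = difference_of_gaussians(tomo, sigmas=[1.0, 1.5, 2.0])
```

With n smoothing levels this yields n - 1 DoG volumes; blob-like densities show up as local extrema in them.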

LoG vs DoG

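Discrete Laplacian operator: take the 1D Laplace equation (second derivative equal to zero) and extend it to higher dimensions. In 1D the solution is linear, so the value at a point equals the average of its two neighbors. The same reasoning carries over to n dimensions: the value of the function at any point equals the average of its 2n equidistant neighbors (left/right, front/back, up/down, and so on).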
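
The averaging property can be checked numerically. The sketch below is my own illustration (the function name is hypothetical): it computes the interior discrete Laplacian on a unit grid and applies it to x^2 - y^2, which satisfies the Laplace equation.

```python
import numpy as np

def discrete_laplacian(f):
    """Interior discrete Laplacian: sum of the 2n axis neighbors minus 2n times the center."""
    n = f.ndim
    core = (slice(1, -1),) * n
    lap = -2.0 * n * f[core]
    for axis in range(n):
        lo, hi = list(core), list(core)
        lo[axis] = slice(0, -2)    # neighbor at index - 1 along this axis
        hi[axis] = slice(2, None)  # neighbor at index + 1 along this axis
        lap = lap + f[tuple(lo)] + f[tuple(hi)]
    return lap

x, y = np.meshgrid(np.arange(6.0), np.arange(6.0), indexing="ij")
harmonic = x ** 2 - y ** 2           # satisfies the Laplace equation
lap = discrete_laplacian(harmonic)   # zero at every interior grid point
```

A zero Laplacian means the sum of the 2n neighbors equals 2n times the center value, i.e. the center is the average of its neighbors.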

blog.csdn.net

https://blog.csdn.net/wp1351553202/article/details/80876796

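How the SIFT algorithm works:

SIFT algorithm explained in detail - Alliswell_WP - cnblogs (博客园)

The four steps of SIFT, followed by feature matching and filtering: step 1, build the scale space, i.e. the difference-of-Gaussians (DoG) pyramid; step 2, detect extrema in the scale space, then localize and filter them precisely; step 3, assign an orientation to each feature point, after which every feature point carries three pieces of information (position, scale, orientation); step 4, compute the feature descriptors.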

https://www.cnblogs.com/Alliswell-WP/p/SIFT.html

Laplacian operator:

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/376436061

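DoG (difference of Gaussians) explained:

Image algorithms: Difference of Gaussian (DoG) - CSDN blog

The difference of Gaussians (DoG) is the difference of two Gaussian functions. Convolving an image with a Gaussian gives a low-pass filtered (denoised) version of the image; the Gaussian here is the same normal distribution function used in a Gaussian low-pass filter. DoG is also an approximation of the Laplacian of Gaussian (LoG): the feature response at a given scale is obtained by subtracting the images of two adjacent Gaussian scale spaces, which gives the DoG response image.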

https://blog.csdn.net/sss_369/article/details/84674639

DoG and LoG operators

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/49447503

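Non-maximum suppression (NMS): find the local maximum within a neighborhood and discard the other values in that neighborhood (removes redundancy, keeps the best candidate).

Pixels whose gradient is not large enough are suppressed, and only the largest gradient is kept.

With DoG and NMS, a set of 3D coordinates can be identified in the 3D tomogram.

Only coordinates whose value exceeds the mean voxel intensity by more than 0.5*std are kept. (The reason is not stated; my guess is that image regions containing biological features usually have values above the average pixel value, so this acts as a simple filter.)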
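
The intensity filter can be sketched like this. It is an illustration under assumptions: the function name is hypothetical, and I assume the threshold is evaluated on the preprocessed tomogram itself and that coordinates index the array directly.

```python
import numpy as np

def filter_by_intensity(volume, coords, n_std=0.5):
    """Keep coordinates whose voxel value exceeds mean + n_std * std of the volume."""
    coords = np.asarray(coords, dtype=int)
    threshold = volume.mean() + n_std * volume.std()
    values = volume[coords[:, 0], coords[:, 1], coords[:, 2]]
    return coords[values > threshold]

# Toy example: one bright voxel passes the filter, a background voxel does not.
volume = np.zeros((4, 4, 4))
volume[1, 1, 1] = 10.0
kept = filter_by_intensity(volume, [(1, 1, 1), (0, 0, 0)])
```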

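Once the final 3D coordinates are available, the 2D coordinate of each one is computed in the tilt image at angle theta.

Contrast is generally highest between -15 and 15 degrees, so 2D coordinates are only computed within this range.

Each 3D coordinate therefore yields a set of 2D coordinates at the different theta angles. The corresponding patches are extracted from the tilt-series images and averaged to give the final 2D tilt series input.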
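
One way to picture the 3D-to-2D mapping is a rotation about the tilt axis followed by a projection along z. The sketch below assumes the tilt axis is the y axis through the tomogram center and ignores alignment shifts; the notes do not give the geometry, so treat both function names and the convention as hypothetical.

```python
import math

def project_to_tilt(coord, theta_deg, center):
    """Project a 3D coordinate (x, y, z) onto the tilt image at angle theta.

    Assumption: tilt axis = y axis through `center`, no in-plane shifts.
    """
    x, y, z = (c - c0 for c, c0 in zip(coord, center))
    t = math.radians(theta_deg)
    x_proj = x * math.cos(t) + z * math.sin(t)
    return (x_proj + center[0], y + center[1])

def tilt_coordinates(coord, center, tilt_angles, max_abs_tilt=15.0):
    """2D coordinates for the tilt angles within +/- max_abs_tilt degrees."""
    return {a: project_to_tilt(coord, a, center)
            for a in tilt_angles if abs(a) <= max_abs_tilt}

# Example: a tilt series from -60 to 60 degrees in 3 degree steps.
coords_2d = tilt_coordinates((10, 5, 3), (0, 0, 0), range(-60, 61, 3))
```

Patches cropped around these 2D positions would then be averaged to form the 2D input.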

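In addition, for each 3D coordinate we also extract a slice of the tomogram at that position.

(For low-quality tilt series, or data with heavy overlap along z, only one of the two inputs is used.)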

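NMS is a critical step: it selects the most prominent point among neighboring candidates.

To avoid selecting neighboring voxels repeatedly, NMS requires a user-defined radius: once a point has been selected, all other points within this radius are discarded.

The user should set this radius according to particle density. For densely populated tomograms, a smaller radius is usually needed.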
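
The radius-based NMS described above can be sketched as a greedy loop over candidates sorted by score. This is a generic illustration, not MiLoPYP's code; the function name and the use of a per-candidate score (for example the DoG response) are assumptions.

```python
import numpy as np

def radius_nms(coords, scores, radius):
    """Greedy NMS: visit candidates from highest to lowest score and keep one
    only if it lies farther than `radius` from every candidate already kept."""
    coords = np.asarray(coords, dtype=float)
    order = np.argsort(scores)[::-1]
    kept = []
    for idx in order:
        if all(np.linalg.norm(coords[idx] - coords[j]) > radius for j in kept):
            kept.append(int(idx))
    return kept

# Two candidates 1 voxel apart compete; the third is far away and survives.
kept = radius_nms([[0, 0, 0], [1, 0, 0], [10, 0, 0]], [1.0, 0.9, 0.8], radius=3)
```

A smaller radius keeps more candidates, which is why dense tomograms call for a smaller value.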

Overall architecture and contrastive learning loss function

2D CNNs for feature representation learning

self-supervised contrastive learning - Siamese neural network design

Siamese neural networks

[Deep learning] Another simple yet remarkable architecture: Siamese neural networks

"Siamese" means "twin". A Siamese neural network is a network built from two sub-networks that share weights.

https://jason-chen-1992.weebly.com/home/siamese-network

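Why use a Siamese neural network?

A classical neural network would be used for classification, regression, clustering or dimensionality reduction. For proteins in EM images we do not know how many protein classes are present, and the data are imbalanced (some proteins or components have very few training samples), so classical approaches are awkward to apply. A Siamese network simplifies the problem: it only judges whether two inputs are alike, so once a particle has been compared in the network, similar particles can be recognized as similar the next time they appear. This is one-shot learning.

In this way, biological macromolecules with similar structures end up close together in feature space, while structurally dissimilar macromolecules are separated.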
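
To make "pull similar inputs together, push different ones apart" concrete, here is a generic InfoNCE-style contrastive loss in NumPy. The notes do not state the exact loss used by the exploration module, so this is only an illustrative stand-in; the function name and the temperature value are assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: row i of z_a and row i of z_b are two views of the
    same particle (positive pair); all other rows act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 128))                       # 8 particles, 128-d embeddings
loss_matched = info_nce(emb, emb + 0.01 * rng.normal(size=emb.shape))
loss_shuffled = info_nce(emb, np.roll(emb, 1, axis=0))  # wrong pairing
```

The loss is low when the two views of each particle are the most similar pair in the batch and high when the pairing is wrong.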

Model training, inference and data augmentation

SGD

Transfer learning: initialize the parameters with ResNet18 weights pretrained on ImageNet

learning rate: cosine decay schedule

training set: 5 to 10 tomograms

training time: 1 x V100 32 GB, 300 epochs, < 2 hours

After training, the model can be applied to the whole dataset
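
A cosine decay schedule, as listed in the settings above, can be written in a few lines. The base learning rate is not given in the notes, so the value used in the example is a placeholder.

```python
import math

def cosine_decay_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate that follows half a cosine from base_lr down to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder base_lr = 0.1 over 300 epochs.
schedule = [cosine_decay_lr(epoch, 300, base_lr=0.1) for epoch in range(301)]
```

The rate starts at base_lr, reaches half of it midway, and ends at min_lr.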

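Random augmentation is applied between the two inputs

horizontal and vertical flipping

intensity jittering, varying the brightness and contrast, which prevents the model from learning gray-value features

random resized crop, which reduces sensitivity to particle size

corner dropout, which randomly sets pixel values at the image corners to 0 so that the model focuses on the more important particle features at the image center

rotations (multiples of 90 degrees): rotations are restricted to multiples of 90 degrees to keep interpolation artifacts from affecting shape learning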
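
Most of the augmentations listed above are easy to sketch in NumPy. The code below is an illustration only: the function names, jitter strength and dropout size are assumptions, and the random resized crop is left out because it needs an interpolation routine.

```python
import numpy as np

def flip_and_rotate(img, rng):
    """Random horizontal/vertical flips plus a rotation by a multiple of 90 degrees.
    These operations only permute pixels, so no interpolation is involved."""
    if rng.random() < 0.5:
        img = img[::-1, :]
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return np.rot90(img, k=int(rng.integers(0, 4)))

def intensity_jitter(img, rng, strength=0.2):
    """Random contrast scaling around the mean plus a random brightness offset."""
    contrast = 1.0 + rng.uniform(-strength, strength)
    brightness = rng.uniform(-strength, strength) * img.std()
    return (img - img.mean()) * contrast + img.mean() + brightness

def corner_dropout(img, rng, max_frac=0.25):
    """Zero a random-sized rectangle anchored at a randomly chosen corner."""
    out = np.array(img, dtype=float)
    h, w = out.shape
    dh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    dw = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    rows = slice(0, dh) if rng.random() < 0.5 else slice(h - dh, h)
    cols = slice(0, dw) if rng.random() < 0.5 else slice(w - dw, w)
    out[rows, cols] = 0.0
    return out

def augment(img, rng):
    return corner_dropout(intensity_jitter(flip_and_rotate(img, rng), rng), rng)
```

Calling `augment` twice on the same patch with a shared `np.random.default_rng()` would give the two views fed to the two branches of the network.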

Dimensionality reduction, visualization and module output

2D grid visualization:

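1. Dimensionality reduction

UMAP is used to reduce the 128-dimensional feature space to 2D

UMAP dimensionality reduction algorithm:

UMAP dimensionality reduction: principles and worked example - CSDN blog

Dimensionality reduction is not only for data visualization: it can also identify key structure in high-dimensional space and preserve it in a low-dimensional embedding, mitigating the "curse of dimensionality". The article explains the inner workings of Uniform Manifold Approximation and Projection (UMAP) and provides a Python example.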

https://blog.csdn.net/deephub/article/details/121302082

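Manifold dimensionality reduction in machine learning: the UMAP algorithm

UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction technique that maps a high-dimensional dataset to a low-dimensional representation while preserving as much of the original structure and topology as possible. It is particularly suited to visual analysis and to preprocessing for machine learning.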

https://blog.marquis.eu.org/posts/895d4074/

UMAP dimensionality reduction: principles and worked example - Tencent Cloud Developer Community

https://cloud.tencent.com/developer/article/1901726

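K-nearest-neighbors is used with k = 40. A smaller k resolves finer local structure; a larger k emphasizes the overall structure

2. Labeling

The exact number of classes does not need to be known at this stage; the aim is for similar proteins to be grouped together and different proteins to be kept apart. Over-clustering followed by label reassignment is therefore used

(1) Standard k-means, with k much larger than the actual number of classes

(2) The k cluster centers are re-clustered into h classes, where h is user-defined and c << h < k (c being the actual number of classes)

(3) The learned embeddings are now assigned to h classes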
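
The over-clustering and reassignment steps can be sketched as two rounds of k-means. The notes do not say which algorithm re-clusters the k centers, so using k-means again is an assumption, as are the function names and the deterministic farthest-point initialization chosen to keep the example reproducible.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def overcluster_and_merge(embeddings, k, h):
    """Step 1: k-means with a large k. Step 2: cluster the k centers into h classes.
    Step 3: every embedding inherits the class of its fine cluster center."""
    fine_labels, centers = kmeans(embeddings, k)
    center_labels, _ = kmeans(centers, h)
    return center_labels[fine_labels]
```

For example, `overcluster_and_merge(embeddings, k=12, h=3)` on three well-separated groups of points returns one shared label per group.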

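In the interactive session, the user can select the classes of interest

Selecting a class yields the labels of the corresponding 3D coordinates

Ideally we would select all particles of the same class for downstream processing such as averaging. However, as the selected subregion grows, precision drops.

A refinement stage is therefore needed to obtain a sizeable set of accurate coordinates

The refinement stage uses a small number of labeled samples (the patches are taken directly from the interactive session, so no further manual labeling is needed)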

(localization module) Few-shot particle localization module

1. 3D tomogram preprocessing: Gaussian denoising, histogram equalization

2. few-shot learning-based neural-network training

3. extraction of particle coordinates

(1) using NMS, for globular-shaped particles

(2) using polynomial fitting, for fiber-like proteins or microtubules

Only a small number of annotations is needed; they can come from the exploration module above or be picked manually

Framework for particle detection

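semi-supervised protein localization (semi-supervised learning)

Unlike general object detection, which encloses each object in a bounding box, what matters most in EM is the center coordinate of each protein.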

Feature extraction and center localization modules

1. encoder-decoder feature extraction backbone

2. protein center localization module

3. a debiased voxel-level contrastive feature regularization module

The xy slices carry the most useful information; the xz and yz slices are comparatively limited.

The extracted features are passed to the center localization module and to the contrastive feature regularization module

The center localization module uses a new objective function, the positive-unlabeled focal loss, which reduces overfitting when training on the positive data

The contrastive feature regularization module uses a debiased InfoNCE loss: it learns to maximize feature representation similarity for voxels belonging to the same protein class and to minimize similarity between protein and non-protein voxels.

Together, these two modules allow accurate training even when the training set has few annotations.

feature extraction backbone：

2D UNet-based per-slice feature extraction backbone

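3D convolutional layer that fuses the per-slice features into a single 3D representation

The user can adjust the number of encoder and decoder blocks:

for small proteins, at most 4 encoder blocks are recommended

for large proteins, 5 encoder blocks are recommended

The overall network still follows a Siamese design; the inputs are the tomogram and an augmented tomogram

The feature extraction backbone produces the pair M and augmented M

This pair is then used to generate a pair of heat maps and a pair of projected feature maps

The center localization module is a 3D convolutional layer with kernel size 1*1*1

Input is the tomogram, output is the heat map

This step can be viewed as per-voxel classification: the input is a voxel and the output is its score (0-1)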
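
A 1*1*1 convolution with one output channel is just a weighted sum over the feature channels at each voxel, which is why it behaves like a per-voxel classifier. The NumPy sketch below illustrates this; the sigmoid used to map the result to 0-1 and the function name are my assumptions.

```python
import numpy as np

def pointwise_conv_heatmap(features, weight, bias=0.0):
    """features: (C, D, H, W) feature map; weight: (C,) kernel of a 1x1x1
    convolution with a single output channel. Returns a (D, H, W) score map."""
    logits = np.einsum("c,cdhw->dhw", weight, features) + bias
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid maps each voxel score to (0, 1)
```

Each output voxel depends only on the feature vector at the same position, so no spatial context is mixed in at this stage.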

If p(v) denotes the underlying data distribution from which v_i,j,k is sampled, then p(v) can be decomposed as:

p(v) = pi_p * p_p(v) + pi_n * p_n(v)

p_p(v) is the positive class-conditional probability of protein voxels

p_n(v) is the negative class-conditional probability of background voxels

pi_p and pi_n = 1 - pi_p are the class prior probabilities

Prior probability and conditional probability

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/89531065

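[...] is the loss function

Once every voxel is labeled, this is a binary classification problem that can be solved with the standard positive-negative (PN) learning approach

P-N learning: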

blog.csdn.net

https://blog.csdn.net/crazyice521/article/details/52873694#:~:text=PN学习对分类器,条件，停止训练。

blog.csdn.net

https://blog.csdn.net/eternity1118_/article/details/51499368

TP, FP, TN and FN explained in plain terms:

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/498846393

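P-U learning:

When only a few voxels are labeled as positive and all other data are unlabeled, the problem becomes a P-U (positive-unlabeled) problem.

Because the unlabeled data contain both 0s and 1s, naively treating the unlabeled part as 0 and running a conventional supervised learning algorithm underestimates the probability of positives (Ward et al. 2009; Yang et al. 2012). Simply excluding the unlabeled data, however, leaves a training set with only positive samples and no negative ones, so well-established supervised learning methods cannot be applied directly.

Machine learning: the PU-Learning algorithm - CSDN blog

The post introduces the basic concepts, workflow and methods of PU learning, briefly discusses two-step PU learning and unbiased PU learning, and shows how to implement a simple PU-learning classifier in Python.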

https://blog.csdn.net/weixin_39753819/article/details/137141332

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/82556263

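PU learning and semi-supervised learning - CSDN blog

Semi-supervised learning lets a learner exploit unlabeled samples automatically, without outside interaction, to improve its performance. Using unlabeled samples requires assumptions that link the data distribution they reveal to the class labels; in essence, "similar samples have similar outputs". Semi-supervised learning can be further divided into pure semi-supervised learning and transductive learning; the former assumes that the unlabeled samples in the training data are not the data to be predicted.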

https://blog.csdn.net/zmm0628/article/details/119082573

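Empirical risk minimization:

Machine learning theory: Empirical Risk Minimization - CSDN blog

The theory studies the relationship between a model's error on the training set and its generalization error. How many samples are needed and what accuracy can be reached when training a model both have a theoretical basis. Topics: bias/variance tradeoff, training error and generalization error, empirical risk minimization.

https://blog.csdn.net/qq_35865125/article/details/87738989

Cost function, loss function, risk function, objective function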

blog.csdn.net

https://blog.csdn.net/xq151750111/article/details/121863185

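1. For globular-shaped particle detection:

The ground-truth heat map is splatted with Gaussian kernels, so the labels are not necessarily binary

Positive labels fall into two categories:

TP (true positive) has label 1 and is the center of each Gaussian kernel (the protein center)
SP (soft positive) has a label between 0 and 1 and lies around the center

Unlabeled voxels have label -1

2. For tubular-shaped particles:

There are only positive voxels and unlabeled voxels, with no soft positives

The positive labels therefore consist of TP only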
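
The labeling scheme for globular particles can be sketched as follows. This is my own illustration: the function name, the Gaussian width `sigma` and the `cutoff` below which a voxel counts as unlabeled are assumptions, not values from the paper.

```python
import numpy as np

def splat_heatmap(shape, centers, sigma=2.0, cutoff=0.1):
    """Ground-truth label volume: 1 at annotated centers (TP), values in (0, 1)
    around them (soft positives), and -1 for unlabeled voxels."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    heat = np.zeros(shape)
    for cz, cy, cx in centers:
        d2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        heat = np.maximum(heat, np.exp(-d2 / (2.0 * sigma ** 2)))
    return np.where(heat >= cutoff, heat, -1.0)

labels = splat_heatmap((9, 32, 32), [(4, 10, 10), (4, 20, 22)])
```

For tubular particles the same volume would contain only 1 (annotated voxels) and -1 (everything else).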

Debiased contrastive regularization and final training loss

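Instead of using the M obtained directly from the feature extraction backbone, a projection head (containing one 3D convolution layer) is applied to obtain the projected feature map F and augmented F.

If [...], then [...] and its projection should encode [...] and particle-related features

For a partially annotated tomogram, f can be decomposed into positive classes and unlabeled classes

The voxel-level contrastive regularization loss is therefore decomposed into a positive supervised term and an unlabeled self-supervised debiased contrastive term.

Among the feature vectors of the annotated proteins (the training data), for each feature vector the remaining feature vectors naturally serve as its augmented pairs

Unlabeled feature vectors (including their augmented versions) are treated as negatives

However, the unlabeled set also contains positives, so the original supervised contrastive loss (including the one used earlier by the exploration module, where only similarity is compared and there is no notion of positives) would introduce bias. The authors therefore use a modified debiased supervised contrastive loss.

For an unlabeled feature vector, the only known positive is its augmented pair; everything else is treated as negative

However, the number of classes among the unlabeled feature vectors is unknown.

The unlabeled voxels are therefore split into 3 groups: pseudo-positive (voxel probability > 0.9), pseudo-negative (< 0.3) and neutral (the rest)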
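
The three-way split of unlabeled voxels is a simple thresholding of the predicted probabilities. A minimal sketch, with a hypothetical function name and integer group codes chosen for the example:

```python
import numpy as np

def pseudo_groups(prob, pos_thr=0.9, neg_thr=0.3):
    """Return 1 for pseudo-positive (> pos_thr), -1 for pseudo-negative
    (< neg_thr) and 0 for neutral voxels."""
    prob = np.asarray(prob, dtype=float)
    groups = np.zeros(prob.shape, dtype=int)
    groups[prob > pos_thr] = 1
    groups[prob < neg_thr] = -1
    return groups

groups = pseudo_groups([0.95, 0.5, 0.1, 0.9, 0.3])
```

Values exactly at a threshold fall into the neutral group because both comparisons are strict.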

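The two main goals of the loss function

The contrastive term maximizes the similarity of features from the same class and minimizes the similarity of features that belong to different groups

The heat map loss term raises the predicted protein probability when the predicted position is close to the true center position

NMS is applied to the predicted heat map to remove redundant detections; it is implemented as 3D max pooling with a user-defined kernel size

For tubular-shaped macromolecules the kernel size is fixed at 3, because the polynomial fitting step needs more coordinates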
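
NMS by max pooling keeps a voxel only if it equals the maximum of its neighborhood. The NumPy sketch below illustrates the idea; the function name and the score threshold are assumptions, and the kernel size is assumed to be odd.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def maxpool_nms(heatmap, kernel=3, threshold=0.5):
    """Return coordinates of voxels that are the maximum of their kernel-sized
    neighborhood and exceed `threshold`."""
    pad = kernel // 2
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    windows = sliding_window_view(padded, (kernel,) * heatmap.ndim)
    local_max = windows.max(axis=tuple(range(-heatmap.ndim, 0)))  # stride-1 max pooling
    peaks = (heatmap == local_max) & (heatmap > threshold)
    return np.argwhere(peaks)

heat = np.zeros((8, 8, 8))
heat[2, 2, 2], heat[2, 2, 3], heat[6, 6, 6] = 0.9, 0.7, 0.8
peaks = maxpool_nms(heat, kernel=3)
```

A larger kernel suppresses more neighbors per peak, so a small kernel such as 3 leaves more coordinates for the later fitting step.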

Post-processing for tubular-shaped macromolecules

This step is specific to tubular-shaped macromolecules

grouping of extracted coordinates

second-order polynomial fitting

resampling based on the fitted polynomial

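1. Grouping

An undirected graph is built from the extracted particles; each vertex is one coordinate

The user sets a distance; an edge is created whenever the Euclidean distance between two vertices is below this distance

The connected components form the groups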
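
The grouping step amounts to finding connected components of a distance-thresholded graph. A small union-find sketch (the function name is hypothetical, and the all-pairs loop is chosen for clarity rather than speed):

```python
import numpy as np

def group_coordinates(coords, max_distance):
    """Connect points closer than max_distance and return the connected
    components as sorted lists of point indices."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(coords[i] - coords[j]) < max_distance:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

groups = group_coordinates([[0, 0, 0], [1, 0, 0], [2, 0, 0], [20, 0, 0], [21, 0, 0]], 1.5)
```

Points 0 and 2 are 2 voxels apart, but they end up in the same group because both are linked to point 1.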

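2. Fitting

A second-order polynomial is fitted to each connected component

xy and yz are fitted separately, and the fitting residual and the maximum curvature are computed

If both the residual and the curvature are below the user-set values, the component is considered a candidate tubular-shaped macromolecule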
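
The fitting test can be sketched with `np.polyfit`. Several details here are assumptions: y is taken as the independent variable (so the fits are x(y) and z(y)), the residual is the root-mean-square error, and the curvature is evaluated at the sampled points. The function name is hypothetical, and each group needs at least three points.

```python
import numpy as np

def fit_filament(coords, max_residual, max_curvature):
    """coords: (N, 3) array of (x, y, z). Fits x(y) and z(y) with second-order
    polynomials and checks the residual and maximum curvature of both fits."""
    coords = np.asarray(coords, dtype=float)
    x, y, z = coords.T
    fits, ok = [], True
    for values in (x, z):
        coeffs = np.polyfit(y, values, deg=2)   # [a, b, c] for a*y^2 + b*y + c
        residual = np.sqrt(np.mean((np.polyval(coeffs, y) - values) ** 2))
        slope = 2 * coeffs[0] * y + coeffs[1]
        # curvature of a plane curve u(y): |u''| / (1 + u'^2)^(3/2)
        curvature = np.max(np.abs(2 * coeffs[0]) / (1 + slope ** 2) ** 1.5)
        ok = ok and residual < max_residual and curvature < max_curvature
        fits.append(coeffs)
    return bool(ok), fits
```

A smooth, gently bent set of points passes both checks, while a zigzag fails on the residual.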

3. Resampling

Each fitted polynomial is resampled: the minimum and maximum of y are computed, and points are sampled between them with step = 2

This generates more coordinates
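
Resampling can be sketched as evaluating the fitted polynomials on a regular grid along y. The function name is hypothetical, and the coefficient layout (highest power first, as used by `np.polyval`) is an assumption of the example.

```python
import numpy as np

def resample_along_y(coeffs_x, coeffs_z, y_values, step=2.0):
    """Sample y from min to max with the given step and evaluate the fitted
    polynomials x(y) and z(y). Returns an (M, 3) array of (x, y, z)."""
    y_new = np.arange(np.min(y_values), np.max(y_values) + 1e-9, step)
    return np.stack([np.polyval(coeffs_x, y_new), y_new,
                     np.polyval(coeffs_z, y_new)], axis=1)

# x(y) = 0.01*y^2 + 2 and z(y) = 0.5*y + 1, sampled for y between 0 and 45.
points = resample_along_y([0.01, 0.0, 2.0], [0.0, 0.5, 1.0], [0.0, 45.0])
```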

Comparison with state-of-the-art approaches

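To compare the localization module with other particle-picking methods, EMPIAR-10304 and EMPIAR-10499 are used.

1. Comparison with template matching

The localization module achieves a higher F1 score (the accuracy measure used here)

2. Comparison with other deep learning methods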

DeepFinder, DeePiCt, TomoTwin

EMPIAR-10304

EMPIAR-10987

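Comparison using different numbers of neighbors. As expected, fewer neighbors reveal more fine local detail, while a larger number of neighbors makes the plot reflect the overall structure. This only affects the visualization; the vector representation of each sub-volume is unchanged.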

Model training, inference, evaluation and ablation studies

training sample: the input is not the whole tomogram but 64*64*5 subtomograms, so training time does not depend on the input size

Adam optimizer, LR = 0.001, reduced to 1/10 of its value every 5 epochs

training for 10 epochs takes 3-5 min

inference speed: < 1 min per full tomogram on a single V100

lambda_1 = 0.1, lambda_2 = 0.5, lambda_3 = 0.1
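
The learning-rate rule in the settings above (start at 0.001, divide by 10 every 5 epochs) corresponds to a step decay. A one-function sketch with a hypothetical name:

```python
def step_decay_lr(epoch, base_lr=1e-3, step_size=5, gamma=0.1):
    """Learning rate multiplied by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Over the 10 training epochs the rate is 1e-3 for epochs 0-4 and 1e-4 for epochs 5-9.
rates = [step_decay_lr(epoch) for epoch in range(10)]
```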

When defining TP, small positional variations are taken into account: a detected particle position that lies within a given radius of a ground-truth position also counts as a positive match. (But wouldn't this inflate TP?)
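
Radius-based matching and the resulting F1 score can be sketched as below. The one-to-one matching rule (each ground-truth particle can be matched only once) is my assumption about how double counting would be avoided; the notes do not say how the paper handles it, and the function name is hypothetical.

```python
import numpy as np

def match_and_score(pred, truth, radius):
    """Greedy one-to-one matching of predictions to ground truth within `radius`.
    Returns (precision, recall, f1)."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 3)
    truth = np.asarray(truth, dtype=float).reshape(-1, 3)
    unmatched = list(range(len(truth)))
    tp = 0
    for p in pred:
        if not unmatched:
            break
        d = np.linalg.norm(truth[unmatched] - p, axis=1)
        j = int(np.argmin(d))
        if d[j] <= radius:
            tp += 1
            unmatched.pop(j)   # this ground-truth particle cannot be matched again
    precision = tp / len(pred) if len(pred) else 0.0
    recall = tp / len(truth) if len(truth) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

truth = [[0, 0, 0], [10, 0, 0], [20, 0, 0]]
pred = [[1, 0, 0], [0.5, 0, 0], [10, 1, 0], [50, 0, 0]]
precision, recall, f1 = match_and_score(pred, truth, radius=2)
```

With one-to-one matching, the second prediction near the first ground-truth particle counts as a false positive instead of a second TP, which limits the inflation raised in the question above.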

Ablation studies: remove one module at a time and observe how the model accuracy changes.

Contribution of localization module to detection accuracy

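To assess how much the protein-specific particle localization module improves performance, results obtained with the exploration module alone are compared with those obtained with the exploration + localization modules.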