MiLoPYP: self-supervised molecular pattern mining and particle localization in situ

MiLoPYP: two-step dataset-specific contrastive learning-based framework

Fast molecular pattern mining

Accurate protein localization

Background

Current solutions:

Template matching

prone to model bias and computationally intensive
especially when dealing with large datasets and multiple protein species

Fully supervised methods

CNN-based approaches
require extensive data annotation and generalize poorly to previously unseen datasets

unsupervised clustering-based approach, DISCA

classify subtomograms by modeling data distributions using expectation-maximization
suffers from training instability
assumes a discrete number of clusters
requires that large continuous structures such as membranes be separately annotated.

TomoTwin, a representation learning-based approach, fully supervised

maps each voxel and its surrounding content into a high-dimensional space; the relative location of macromolecules is determined by measuring similarities between the embeddings.
requires extensive training (tens of thousands of simulated tomograms)
has limited success on non-globular and unseen proteins
computationally intensive, prone to distortions (missing-wedge)

MiLoPYP

cellular content exploration

embed subtomograms into a high-dimensional representation in a self-supervised fashion using contrastive representation learning.
subtomograms ⇒ embedding space

similar subtomograms lie close to each other, allowing the user to identify frequently occurring proteins.
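A minimal sketch of how a contrastive objective pulls similar subtomograms together in embedding space. It uses the InfoNCE loss as one common choice; these notes do not record which loss MiLoPYP actually uses, so the function name `info_nce_loss` and the `temperature` parameter are assumptions for illustration only.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss for two batches of embeddings of shape (N, D).

    Row i of z_a and row i of z_b are two views of the same subtomogram
    (a positive pair); every other row in the batch acts as a negative.
    This is an illustrative choice of loss, not the method from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is low when each embedding is most similar to its own counterpart and high when the pairing is scrambled, which is what drives similar inputs toward nearby points.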

protein-specific particle localization

identify specific proteins across entire datasets
using few-shot particle detection that only requires sparse labels for training, computationally efficient.
Training data are obtained by interactively selecting a subset of points from the output of the previous step, eliminating the need for time-consuming manual labeling.
To minimize the effects of the missing wedge and reduce computational costs

use averages of 2D projections from the raw tilt series and averages of consecutive tomographic slices (instead of 3D tomograms)
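A small sketch of the slice-averaging idea, covering only the tomographic-slice half of the input (not the tilt-series projections). The axis order (z, y, x), the function name `slab_average`, and the `half_width` parameter are assumptions, not details taken from the paper.

```python
import numpy as np

def slab_average(tomogram, z_center, half_width):
    """Average up to 2*half_width + 1 consecutive z-slices around z_center.

    tomogram is a 3D array indexed (z, y, x). The slab is clipped at the
    volume boundaries, so fewer slices are averaged near the top and bottom.
    """
    z0 = max(z_center - half_width, 0)
    z1 = min(z_center + half_width + 1, tomogram.shape[0])
    return tomogram[z0:z1].mean(axis=0)
```

Working on a 2D average instead of the full 3D subvolume reduces the amount of data the network has to process per position.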

In addition to detecting globular-shaped targets

accurately identifies membrane-attached proteins and fiber-like proteins.

Results

Overview of workflow

deep-learning framework with two modules: cellular content mining and exploration, and protein-specific particle localization

The cellular content exploration module

an embedding for each subtomogram and corresponding patches from the aligned tilt series

2D grid visualization

3D tomogram visualization

embeddings in 3D space

embeddings?

In the cellular content exploration module, instead of using a naive sliding-window approach centered around each voxel

MiLoPYP uses a Difference of Gaussians (DoG) pyramid to identify coordinates of interest
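A rough sketch of DoG-based candidate detection at a single scale on a 2D image; a pyramid would repeat this over several values of `sigma`. The parameter values (`sigma`, `ratio`, `threshold`) and all function names are assumptions, and the real implementation may differ.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel truncated at about 3 sigma, normalized to sum to 1."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur; assumes the image is larger than the kernel."""
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

def dog_candidates(image, sigma=2.0, ratio=1.6, threshold=0.02):
    """Return (row, col) positions of strict local maxima of the DoG response."""
    dog = gaussian_blur(image, sigma) - gaussian_blur(image, sigma * ratio)
    # Pad with -inf so that border pixels can still be compared to 8 neighbors.
    padded = np.pad(dog, 1, mode="constant", constant_values=-np.inf)
    is_peak = np.ones(dog.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + dog.shape[0],
                              1 + dx:1 + dx + dog.shape[1]]
            is_peak &= dog > neighbor
    rows, cols = np.nonzero(is_peak & (dog > threshold))
    return list(zip(rows.tolist(), cols.tolist()))
```

Compared with a sliding window centered on every voxel, only the returned candidate positions need to be embedded.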

Train the network on subvolumes and their augmented counterparts; no ground-truth labels are needed.
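A hedged sketch of how a subvolume could be paired with augmented counterparts for label-free training. The specific augmentations (in-plane 90-degree rotations and additive Gaussian noise) are assumptions chosen for simplicity; the notes do not list the ones MiLoPYP applies.

```python
import numpy as np

def augment(subvolume, rng, noise_std=0.1):
    """Return one randomly augmented copy of a (z, y, x) subvolume."""
    k = int(rng.integers(0, 4))  # number of 90-degree turns
    rotated = np.rot90(subvolume, k=k, axes=(1, 2))  # rotate within the x-y plane
    noise = rng.normal(scale=noise_std, size=rotated.shape)
    return rotated + noise

def make_positive_pair(subvolume, rng, noise_std=0.1):
    """Two independent augmentations of the same subvolume form a positive pair."""
    return augment(subvolume, rng, noise_std), augment(subvolume, rng, noise_std)
```

Because both views come from the same subvolume, the pairing itself serves as the supervision signal.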

group proteins with similar shapes.

2D grid visualization: 2D feature vectors

3D tomogram visualization:

maps the structural diversity present in the dataset by assigning a distinct color to each voxel in the tomogram, based on its normalized 2D representation

voxels with similar colors are structurally similar.
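One possible way to turn normalized 2D embedding coordinates into colors, as a sketch. The color scheme below (first coordinate to red, second to green, blue as a complement) is a hypothetical choice; the actual color map used by MiLoPYP is not recorded in these notes.

```python
import numpy as np

def embedding_to_rgb(embedding_2d):
    """Map (N, 2) embedding coordinates to (N, 3) RGB colors in [0, 1]."""
    emb = np.asarray(embedding_2d, dtype=float)
    lo = emb.min(axis=0)
    span = emb.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on a degenerate axis
    uv = (emb - lo) / span  # each coordinate normalized to [0, 1]
    # Hypothetical scheme: u drives red, v drives green, blue is the complement.
    blue = 1.0 - 0.5 * (uv[:, 0] + uv[:, 1])
    return np.column_stack([uv[:, 0], uv[:, 1], blue])
```

Points that are close in the 2D representation receive nearly identical colors, so structurally similar regions show up as uniformly colored areas.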

3D embedding:

embeddings are assigned discrete labels using an over-clustering algorithm

colored according to embedding coordinates

Users can interactively select specific regions of the embedding space and map them to original tomogram positions.
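The selection step can be sketched as a simple lookup, assuming each embedding row is stored alongside the tomogram coordinates it came from. The rectangular selection and the function name `select_region` are illustrative assumptions; the interactive tool may support other selection shapes.

```python
import numpy as np

def select_region(embedding_2d, coords_xyz, lower, upper):
    """Return tomogram coordinates whose embedding falls inside a box.

    embedding_2d: (N, 2) embedding coordinates.
    coords_xyz:   (N, 3) tomogram positions, row-aligned with embedding_2d.
    lower, upper: length-2 bounds of the selected box in embedding space.
    """
    emb = np.asarray(embedding_2d, dtype=float)
    inside = np.all((emb >= lower) & (emb <= upper), axis=1)
    return np.asarray(coords_xyz)[inside]
```

The returned positions are the kind of sparse labels that the refinement step can then be trained on.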

Since the DoG has low precision, a refinement step is necessary.

refinement step: semi-supervised manner

heat map that represents the likelihood of a given protein being present at each voxel in the tomogram.

Non-maximum suppression (NMS), post-processing and thresholding
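A minimal sketch of thresholding followed by greedy non-maximum suppression on a 3D heat map. The Euclidean `min_distance` criterion and the function name `pick_peaks` are assumptions; the exact post-processing in MiLoPYP may differ.

```python
import numpy as np

def pick_peaks(heatmap, threshold, min_distance):
    """Greedy NMS: keep the highest-scoring voxels that are far enough apart.

    heatmap:      3D array of per-voxel likelihoods.
    threshold:    voxels at or below this score are discarded.
    min_distance: minimum Euclidean distance (in voxels) between kept peaks.
    """
    candidates = np.argwhere(heatmap > threshold)
    scores = heatmap[tuple(candidates.T)]
    kept = []
    # Visit candidates from highest to lowest score.
    for idx in np.argsort(-scores):
        p = candidates[idx]
        if all(np.linalg.norm(p - q) >= min_distance for q in kept):
            kept.append(p)
    return [tuple(int(v) for v in p) for p in kept]
```

In practice `min_distance` would be set relative to the particle size so that one particle yields one coordinate.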

the output from the refinement step can be used for subsequent SPT refinement.

Accurate detection and localization of proteins imaged in vitro

tilt series from purified Escherichia coli 70S ribosome EMPIAR-10304

2D grid visualization: 70S ribosomes, gold fiducials, background elements, contamination

3D tomogram visualization

Select cluster manually to train the refinement module.

The resolution reaches 5 Å.

F1 score: measures the quality of particle picking.
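For reference, a sketch of how an F1 score for particle picking can be computed by matching picked coordinates to ground-truth coordinates. The greedy nearest-neighbor matching and the distance `tolerance` are assumptions; the paper's exact matching criterion is not recorded in these notes.

```python
import math

def f1_score(picked, truth, tolerance):
    """F1 score of picked coordinates against ground-truth coordinates.

    A pick counts as a true positive if it lies within `tolerance` of a
    ground-truth particle that has not already been matched.
    """
    unmatched = list(truth)
    tp = 0
    for p in picked:
        best = min(unmatched, key=lambda t: math.dist(p, t), default=None)
        if best is not None and math.dist(p, best) <= tolerance:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(picked) if picked else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

F1 is the harmonic mean of precision and recall, so it penalizes both false picks and missed particles.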

Detection of molecular patterns and particle identification in situ

map proteins within crowded cellular environments using tomograms from Mycoplasma pneumoniae bacterial cells. (EMPIAR-10499)

2D grid visualization: cell membrane, 70S ribosomes, smaller-sized particles and vesicles (limited copy number in dataset)

select 195 70S ribosomes ⇒ 23285 particles picked; 3D refinement and reconstruction in nextPYP yield a 5.4 Å structure.

benchmarking:

crowded sample 2: RuBisCo from Chlamydomonas reinhardtii (EMPIAR-10694)

relatively homogeneous

select 86 annotations ⇒ finally identify 36345 particles.

11.0 Å RuBisCo structure

accurately detect targets in situ with minimal user intervention

Identification of native membrane proteins attached to viral particles

more diverse shapes, SARS-CoV-2 virus particles (EMPIAR-10453)

membrane-attached spikes

free-floating spikes

144 membrane-attached spikes ⇒ 23388 membrane-bound spikes.

open state: 5.6 Å from 9194 particles

closed state: 9.3 Å from 3740 particles

consistent with findings reported in the original study where approximately 1000 virions were manually selected

Simultaneous detection of multiple protein species from cellular lamellae

tubular-shaped complexes, thinned lamellae of mouse mammary gland epithelial EpH4 cells (EMPIAR-10987)

membrane, fibers, microtubules and 80S ribosomes

137 ribosomes ⇒ 6068 80S ribosomes, 8.6 Å

102 microtubules ⇒ 1761 particles, 37 Å

Accurate detection and localization of large membrane proteins in cells

large membrane proteins in FIB/SEM-milled Saccharomyces cerevisiae (EMPIAR-11658)

revealed the presence of complexes bound to mitochondrial cristae

compared to other datasets, this one required more particles to be selected (247)

4105 particles in total; 13 Å ATPase structure from 2577 particles

can detect large membrane proteins present at low concentration.

Discussion

the lack of computational tools to effectively sort through the intrinsic complexity of crowded cellular environments captured by CET.

MiLoPYP

unique features

molecular pattern mining

computationally efficient

requires minimal user intervention

limitations

may struggle to distinguish proteins of similar shape or size, e.g., gold fiducials and ribosomes in EMPIAR-10304

class imbalance: rare proteins may be blended with more abundant proteins of similar shape; this may be tackled by downstream SPT analysis using 3D classification.

MiLoPYP could be combined with 3D segmentation based on deep-learning.

MiLoPYP’s exploration module ⇒ biologically relevant regions

improved pattern mining performance and higher detection accuracy

Summary

offers a convenient tool to map the interior of cells and find the location of multiple protein species within their native environment

effectively map entire sets of tomograms

facilitating the interpretation, discovery and selection of target macromolecules

in addition to globular-shaped macromolecules, it can accurately detect membrane-bound and tubular-shaped complexes, making it a versatile tool for in situ molecular pattern mining

computationally efficient

incorporated into nextPYP