MiLoPYP: self-supervised molecular pattern mining and particle localization in situ

 
 
Method of MiLoPYP
 
MiLoPYP: a two-step, dataset-specific framework based on contrastive learning
  • fast molecular pattern mining
  • accurate protein localization
 

Background

 
Current solutions:
  • Template matching
    • prone to model bias and computationally intensive
    • especially when dealing with large datasets and multiple protein species
  • Fully supervised methods
    • CNN-based approaches
    • require extensive data annotation and generalize poorly to previously unseen datasets
  • Unsupervised clustering-based approach (DISCA)
    • classifies subtomograms by modeling data distributions using expectation-maximization
    • suffers from training instability
    • assumes a discrete number of clusters
    • requires that large continuous structures such as membranes be annotated separately
  • TomoTwin, a fully supervised representation learning-based approach
    • maps each voxel and its surrounding content into a high-dimensional space; the relative location of macromolecules is determined by measuring similarities between the embeddings
    • requires extensive training (tens of thousands of simulated tomograms)
    • has limited success on non-globular and unseen proteins
    • computationally intensive and prone to distortions (missing wedge)
 
MiLoPYP
  • cellular content exploration
    • embed subtomograms into a high-dimensional representation in a self-supervised fashion using contrastive representation learning.
    • subtomograms ⇒ embedding space
      • similar subtomograms lie close to each other, which lets the user identify frequently occurring proteins
  • protein-specific particle localization
    • identify specific proteins across entire datasets
    • uses few-shot particle detection that requires only sparse labels for training and is computationally efficient
    • training data are obtained by interactively selecting a subset of points from the output of the previous step, eliminating the need for time-consuming manual labeling
    • to minimize the effects of the missing wedge and reduce computational costs
      • uses averages of 2D projections from the raw tilt series and averages of consecutive tomographic slices (instead of 3D tomograms)
    • in addition to detecting globular-shaped targets
      • accurately identifies membrane-attached proteins and fiber-like proteins
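The contrastive representation learning behind the exploration module can be illustrated with a small sketch. The notes do not say which loss MiLoPYP uses, so the InfoNCE-style loss below is an assumption, and the names `cosine`, `info_nce` and `temperature` are hypothetical. It only shows the general idea: the loss is small when an embedding is closer to its augmented counterpart than to embeddings of other subtomograms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed loss, not taken from the paper).

    anchor    -- embedding of a subtomogram
    positive  -- embedding of an augmented view of the same subtomogram
    negatives -- embeddings of other subtomograms
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]


anchor = [1.0, 0.0]
loss_similar = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_dissimilar = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing such a loss pulls views of the same subtomogram together in embedding space, which is what makes similar particles end up close to each other.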
 

Results

Overview of workflow

 
A deep-learning framework with two modules: cellular content mining and exploration, and protein-specific particle localization.
The cellular content exploration module learns an embedding for each subtomogram and its corresponding patches from the aligned tilt series. The embeddings can be inspected through:
  • 2D grid visualization
  • 3D tomogram visualization
  • embeddings in 3D space
 
embeddings?
 
In the cellular content exploration module, instead of using a naive sliding-window approach centered around each voxel, MiLoPYP uses a Difference of Gaussians (DoG) pyramid to identify crucial coordinates of interest.
  • DoG pyramid to identify crucial coordinates of interest
  • Train the network by pairing each subvolume with its augmented counterparts; no ground-truth labels are needed
  • Group proteins with similar shapes
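The DoG step above can be sketched as follows. This is a simplified illustration, not MiLoPYP's implementation: it works on a single 2D slice, uses one pair of scales rather than a full pyramid, assumes bright blobs on a dark background, and stays in plain Python to avoid dependencies. All function names and parameter values (`dog_candidates`, `sigma_small`, `sigma_large`, `threshold`) are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(row, kernel):
    """Convolve one row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out


def blur_2d(img, sigma):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(r, kernel) for r in img]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dog_candidates(img, sigma_small=1.0, sigma_large=2.0, threshold=0.01):
    """Return (y, x) positions that are strict local maxima of the DoG response."""
    fine = blur_2d(img, sigma_small)
    coarse = blur_2d(img, sigma_large)
    dog = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(fine, coarse)]
    height, width = len(dog), len(dog[0])
    peaks = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            value = dog[y][x]
            if value <= threshold:
                continue
            neighbors = [dog[y + dy][x + dx]
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                         if (dy, dx) != (0, 0)]
            if all(value > n for n in neighbors):
                peaks.append((y, x))
    return peaks


# Toy image: one bright blob in the middle of an empty 15 x 15 slice.
image = [[0.0] * 15 for _ in range(15)]
image[7][7] = 1.0
candidates = dog_candidates(image)
```

Restricting embedding extraction to such candidate coordinates is what replaces the per-voxel sliding window; a real implementation would work in 3D and would typically use an array library.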
 
2D grid visualization: 2D feature vectors
3D tomogram visualization:
  • maps the structural diversity present in the dataset by assigning a distinct color to each voxel in the tomogram, based on its normalized 2D representation
  • regions of similar color are structurally homogeneous
3D embedding:
  • embeddings are assigned discrete labels using an over-clustering algorithm
  • colored according to embedding coordinates
Users can interactively select specific regions of the embedding space and map them to original tomogram positions.
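One way to picture the color assignment used in the 3D tomogram visualization is the sketch below. The notes only say that colors come from the normalized 2D representation; the specific mapping here (angle around the center to hue, distance from the center to saturation) is an assumption chosen for illustration, and `embedding_to_rgb` is a hypothetical name.

```python
import colorsys
import math


def embedding_to_rgb(points):
    """Map 2D embedding coordinates to RGB triples in [0, 1].

    Coordinates are min-max normalized, then converted to a color:
    hue encodes the direction from the center of the embedding,
    saturation encodes the distance from the center.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def normalize(value, lo, hi):
        # Degenerate axis: place every point at the center.
        return 0.5 if hi == lo else (value - lo) / (hi - lo)

    colors = []
    for x, y in points:
        u = normalize(x, x_lo, x_hi) - 0.5
        v = normalize(y, y_lo, y_hi) - 0.5
        hue = (math.atan2(v, u) / (2 * math.pi)) % 1.0
        saturation = min(1.0, 2 * math.hypot(u, v))
        colors.append(colorsys.hsv_to_rgb(hue, saturation, 1.0))
    return colors


colors = embedding_to_rgb([(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)])
```

With a mapping like this, points on opposite sides of the embedding get clearly different hues, while the center of the embedding maps to white.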
 
Because DoG-based detection has low precision, a refinement step is necessary.
 
Refinement step (semi-supervised):
  • produces a heat map that represents the likelihood of a given protein being present at each voxel in the tomogram
  • Non-maximum suppression (NMS), post-processing and thresholding
  • the output from the refinement step can be used for subsequent SPT refinement.
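The thresholding and non-maximum suppression applied to the heat map can be sketched as a greedy, distance-based procedure. This is a generic illustration under stated assumptions, not the code used by MiLoPYP: heat-map peaks are assumed to be already extracted as scored 3D points, and `nms_peaks`, `min_distance` and `score_threshold` are hypothetical names.

```python
import math


def nms_peaks(scored_points, min_distance, score_threshold):
    """Greedy non-maximum suppression on scored 3D coordinates.

    scored_points   -- list of (score, (z, y, x)) tuples
    min_distance    -- minimum allowed distance between kept particles
    score_threshold -- candidates scoring below this value are discarded
    """
    kept = []
    # Visit candidates from the highest score to the lowest.
    for score, position in sorted(scored_points, reverse=True):
        if score < score_threshold:
            break
        # Keep a candidate only if no stronger neighbor was already kept.
        if all(math.dist(position, other) >= min_distance
               for _, other in kept):
            kept.append((score, position))
    return kept


candidates = [
    (0.9, (10, 10, 10)),
    (0.8, (11, 10, 10)),  # too close to the first one, suppressed
    (0.7, (30, 30, 30)),
    (0.2, (50, 50, 50)),  # below the threshold, discarded
]
particles = nms_peaks(candidates, min_distance=5, score_threshold=0.5)
```

A natural choice for `min_distance` would be on the order of the particle diameter, so that a single particle does not produce several picks.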
 

Accurate detection and localization of proteins imaged in vitro

 
Tilt series of purified Escherichia coli 70S ribosomes (EMPIAR-10304)
 
 
2D grid visualization: 70S ribosomes, gold fiducials, background elements, contamination
3D tomogram visualization
 
Clusters are selected manually to train the refinement module.
The resolution reaches 5 Å.
 
F1 score: measures the quality of particle picking.
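For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for particle picking, assuming a predicted coordinate counts as a true positive when it lies within a tolerance radius of a not-yet-matched ground-truth coordinate. That matching rule, the greedy assignment, and the names `f1_score` and `tolerance` are assumptions for illustration; the notes do not describe the criterion used in the paper.

```python
import math


def f1_score(predicted, ground_truth, tolerance):
    """F1 score of picked coordinates against reference coordinates.

    Each prediction is greedily matched to the nearest unmatched
    ground-truth point that lies within `tolerance`.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for point in predicted:
        best, best_distance = None, tolerance
        for reference in unmatched:
            distance = math.dist(point, reference)
            if distance <= best_distance:
                best, best_distance = reference, distance
        if best is not None:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


picked = [(0, 0, 0), (10, 10, 10), (50, 50, 50)]
reference = [(1, 0, 0), (10, 11, 10), (90, 90, 90), (70, 70, 70)]
score = f1_score(picked, reference, tolerance=2)
```

In this toy case two of the three picks match a reference particle and two of the four reference particles are found, so precision is 2/3 and recall is 1/2.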
 

Detection of molecular patterns and particle identification in situ

 
Map proteins within crowded cellular environments using tomograms of Mycoplasma pneumoniae bacterial cells (EMPIAR-10499).
 
 
2D grid visualization: cell membrane, 70S ribosomes, smaller-sized particles and vesicles (limited copy number in dataset)
 
 
195 70S ribosomes selected for training ⇒ 23285 particles picked; 3D refinement and reconstruction in nextPYP gave a 5.4 Å structure.
 
benchmarking:
 
crowded sample 2: RuBisCo from Chlamydomonas reinhardtii (EMPIAR-10694)
relatively homogeneous
86 annotations selected ⇒ 36345 particles identified.
11.0 Å RuBisCo structure
accurately detect targets in situ with minimal user intervention
 

Identification of native membrane proteins attached to viral particles

more diverse shapes, SARS-CoV-2 virus particles (EMPIAR-10453)
  • membrane-attached spikes
  • free-floating spikes
144 membrane-attached spikes ⇒ 23388 membrane-bound spikes.
open state: 5.6 Å, 9194 particles
closed state: 9.3 Å, 3740 particles
consistent with findings reported in the original study where approximately 1000 virions were manually selected
 

Simultaneous detection of multiple protein species from cellular lamellae

 
tubular-shaped complexes, thinned lamellae of mouse mammary gland epithelial EpH4 cells (EMPIAR-10987)
membrane, fibers, microtubules and 80S ribosomes
137 ribosomes ⇒ 6068 80S ribosomes, 8.6 Å
102 microtubules ⇒ 1761 particles, 37 Å
 

Accurate detection and localization of large membrane proteins in cells

Large membrane proteins in FIB/SEM-milled Saccharomyces cerevisiae (EMPIAR-11658)
revealed the presence of complexes bound to mitochondrial cristae
compared to other datasets, this one required more particles to be selected (247)
4105 particles in total; 13 Å ATPase structure from 2577 particles
can detect large membrane proteins present at low concentration.
 

Discussion

Problem: there is a lack of computational tools to effectively sort through the intrinsic complexity of crowded cellular environments captured by CET.
MiLoPYP
unique features
  • molecular pattern mining
  • computationally efficient
  • requires minimal user intervention
limitations
  • may struggle to distinguish proteins of similar shape or size, such as gold fiducials and ribosomes in EMPIAR-10304
  • class imbalance problem: rare proteins may be blended with more abundant proteins of similar shape; this may be tackled by downstream SPT analysis using 3D classification
 
MiLoPYP could be combined with 3D segmentation based on deep-learning.
MiLoPYP’s exploration module ⇒ biologically relevant regions
improved pattern mining performance and higher detection accuracy
 

Summary

  • offers a convenient tool to map the interior of cells and find the location of multiple protein species within their native environment
  • effectively map entire sets of tomograms
  • facilitating the interpretation, discovery and selection of target macromolecules
  • in addition to globular-shaped macromolecules, it can accurately detect membrane-bound and tubular-shaped complexes, making it a versatile tool for in situ molecular pattern mining
  • computationally efficient
  • incorporated into nextPYP