Sampling to Capture Single-Cell Heterogeneity

Paper:

This is the source code associated with the paper "Sampling to Capture Single-Cell Heterogeneity" by Rajaram et al., slated to appear in Nature Methods. The paper provides a rational strategy for determining how many samples are required to capture the heterogeneity of a specimen.

Code Overview

The version here is a snapshot of the code at the time of paper acceptance. To obtain the latest version, with additions and corrections, please visit: https://github.com/AltschulerWu-Lab/SamplingForHeterogeneity

The code supplied here provides 3 pieces of functionality:

  1. MATLAB source code to generate the main figures of the paper.
  2. Automatic generation of the source data for the figures, supplied as Excel files with the publication. Please note that the algorithm involves random sampling, so there will be minor random fluctuations between the xls data points (generated by this code and also deposited as source data) and the points seen in the published figures.
  3. R code to calculate the KS-Prime (KS'), a novel measure for comparing the distribution arising from the whole specimen to that from a sample. The KS' is inspired by the Kolmogorov-Smirnov statistic, and attempts to improve its sensitivity at the tails of distributions.
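
For reference, the classical two-sample Kolmogorov-Smirnov statistic that inspired the KS' can be written as below. The symbols F_sample and F_whole are notation chosen here (not taken from the paper) for the empirical cumulative distribution functions of a feature in a sample and in the whole specimen. The definition of the KS' itself is given in the paper and is not reproduced here.

```latex
D = \sup_{x} \left| F_{\text{sample}}(x) - F_{\text{whole}}(x) \right|
```

Both cumulative distributions approach 0 at the far left and 1 at the far right, so the gap between them is necessarily small at the tails. This is the limitation the KS' attempts to address.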

How to use:

The main parameters and data file locations are specified in the file GetParams.m. These will need to be changed, as explained in the code, so that it can access the data. Then, from the root code directory, run each of the figure files Fig1.m (lung cancer tissue), Fig2abc.m (liver cancer tissue) and Fig2c.m (cell-culture) to generate the corresponding figures. The data for Fig1.m can be downloaded at: Zenodo (link to be activated at publication), while the other data can be provided by the corresponding author upon reasonable request.

Analysis pipeline mapped to code

While the sampling methodology described in the paper is completely general, our applications are all to image data. We consider images from tissue and from cell culture. The analysis pipelines for these differ, but both involve the following basic steps:

  1. Generating "feature" values from the images: This itself involves two steps, identification of single cells (segmentation) and characterizing the identified cells (feature extraction), which are performed differently for our two imaging data types.
    • Tissue: Instead of identifying single cells, we identify nuclei based on the DAPI expression, and quantify marker expression by the average intensity in pixels belonging to a nucleus. The code to perform this for the lung-cancer tissue is in Fig1.m, and a very similar procedure is used for the liver cancer samples.
    • Cell-Culture: Here full nuclear and cytoplasmic segmentation is performed and a large panel of features is extracted for each cell. We use our standard lab pipeline for this, which has been published previously (e.g. in Kang et al, Nature Biotech 2016).
  2. Sub-Sampling: The process of randomly selecting cell subsets of varying size whose feature distribution is compared to the whole.
    • Tissue: Here, we randomly place virtual non-overlapping cores on the tissue (code implementation illustrated in Functions/Engine/GenerateCorePositionsNew.m). Then, for each set of cores, we identify the cells belonging to the cores and pool their feature values, as demonstrated in Fig1.m.
    • Cell-Culture: Here sub-sampling is performed at the well level. Since wells are imaged separately, the feature extraction already partitions cells into different wells. We then pool cells together as described in Functions/Engine/Generate_Well_KSPs.m.
  3. Comparing distributions: Distributions are compared based on the KS-Prime statistic, as shown in Functions/Engine/KSP_Calculator.m and Functions/Engine/CalculateDistDiff.m. Although not used in the paper, R code to calculate the KS-Prime is in R_KSP_Code/KSP.R, and an example of sub-sampling and comparing distributions is in R_KSP_Code/Test_RKSP.R.
  4. Effect of number of samples: Given the KS-Prime scores, their distributions are calculated for each choice of the number of samples. Combined with a provided KS-Prime threshold for good sampling, these distributions give confidence estimates. This procedure is implemented for each dataset in its corresponding figure file: Fig1.m (lung cancer tissue), Fig2abc.m (liver cancer tissue) and Fig2c.m (cell-culture).
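
Steps 2 to 4 can be illustrated with a minimal, self-contained Python sketch. It is not the repository's MATLAB or R code. It works on synthetic data, the names `ks_statistic` and `sampling_confidence` are hypothetical, and the threshold of 0.1 is an arbitrary placeholder. The classical Kolmogorov-Smirnov statistic is used as a stand-in for the KS-Prime, whose actual definition lives in Functions/Engine/KSP_Calculator.m and R_KSP_Code/KSP.R. The sketch uses wells as the sampling unit; for tissue, the analogous unit would be a virtual core.

```python
import bisect
import random


def ks_statistic(sample, reference):
    """Classical two-sample KS statistic (a stand-in for KS-Prime, NOT KS-Prime itself):
    the largest gap between the two empirical CDFs."""
    a = sorted(sample)
    b = sorted(reference)
    gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def sampling_confidence(wells, n_wells, threshold, n_trials=100, seed=0):
    """Fraction of random draws of `n_wells` wells whose pooled cells
    stay within `threshold` of the whole-specimen distribution."""
    rng = random.Random(seed)
    whole = [value for well in wells for value in well]
    hits = 0
    for _ in range(n_trials):
        # Step 2: sub-sample wells without replacement and pool their cells.
        chosen = rng.sample(range(len(wells)), n_wells)
        pooled = [value for i in chosen for value in wells[i]]
        # Step 3: compare the pooled sample to the whole.
        if ks_statistic(pooled, whole) <= threshold:
            hits += 1
    # Step 4: confidence estimate for this number of samples.
    return hits / n_trials


# Synthetic data (assumption): 12 wells of 100 cells, each well with its own shift
# so that individual wells differ from the whole.
data_rng = random.Random(1)
wells = []
for _ in range(12):
    well_shift = data_rng.gauss(0.0, 0.5)
    wells.append([data_rng.gauss(well_shift, 1.0) for _ in range(100)])

confidence = {k: sampling_confidence(wells, k, threshold=0.1) for k in (1, 3, 6, 12)}
for k, value in confidence.items():
    print(f"{k:2d} wells: fraction of draws within threshold = {value:.2f}")
```

Pooling all 12 wells reproduces the whole specimen exactly, so that case always passes the threshold. The smaller sample sizes show how the confidence depends on the number of wells drawn.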