Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Centrality estimation

Gabriele Girelli edited this page Aug 23, 2018 · 2 revisions

The worfklow behind gpseqc_estimate is the following:

  1. Identify & sort chromosomes.
    Identifies chromosomes present in the current analysis and their size.
  2. Remove outliers and mask input bed files.
  3. Generate bins.
    Generates a temporary bed file containing the bins.
  4. Group cutsites (intersect).
    Intersects (with bedtools) an empty group bed file with the input ones. Performed only if grouping is requested (-g).
  5. Normalize over last condition.
    Performes over-night normalization as described in 2.e.I, only if requested (-n).
  6. Prepare cutsite domain.
    Prepares single/grouped-cutsite domains based on the selected approach (see 2.c).
  7. Assign reads to bins (intersect).
    Intersects (with bedtools) the single/grouped (filtered) cutsites with the empty bins.
  8. Calculate bin statistics.
    Calculates (with datamash) sum, mean, sample standard deviation and count of each bin.
  9. Combine condition into a single table.
    Generates combined table, with chromosome, start, end, number of reads per condition, number of reads per bin, average reads per bin, sample standard deviation of number of reads per bin, number of considered cutsites per bin and condition ID.
  10. Estimate centrality.
    Estimates centrality (with gawk).
  11. Mask centrality track.
  12. Rank bins.
    Sorts estimates independently.
  13. Rescale centrality scores.
  14. Write output.
    Writes output: combined, estimated and ranked.

Binning

Centrality can be estimated for any kind of genomic regions, which for simplicity are defined in gpseqc_estimate as bins. These bins can be:

  • Chromosome-wide (default)
  • Sub-chromosome non-overlapping
  • Sub-chromosome overlapping
  • A custom set of bins (be careful not to use bins of different size)

Grouping

The centrality scores are based on the fact that GPSeq captures restriction events falling nearby a cutsite. Thus, during the analysis cutsites are grouped together if the sequencing depth is not high enough to reach a single-cutsite resolution (i.e., if the cutsites in a condition are not enough populated with reads). In this case, the unit of measure of GPSeq are not the single-cutsites, but the grouped sites instead.

Given a reachable resolution R (in bp) at the current sequencing depth, we can group cutsites in non-overlapping R bp regions. The formulas for a group are the same as for a genomic region; for example:

NRdef

where s_i is the i-th restricted cutsite in g. Also, a group g containing k cutsites, can be defined as a region:

gdef

Normalization

As the digestion step is affected by both centrality and accessibility of the cutsite, the analysis needs to remove the influence of accessibility to provide a more accurate centrality estimate.

As restriction times higher than a few hours appear to produce similar rim profiles (from image data analysis), we assume that such high conditions represent a saturation scenario where the enzyme has explored the whole nuclear space and restricted cutsites based only on their accessibility.

Thus, the absolute number or the probability of restriction events can be normalized by dividing it by the probability of restriction in the ON condition.

Specifically, we set the over-night as last condition (D_n) and define the accessibility of a region as its probability of being digested over-night:

access

Note that, if the normalization is applied, the last condition is excluded from further analysis and we consider as D_n the previous condition.

Cutsite domains

For more details on the ways of considering different cutsites, check out the Cutsite domain page.

Centrality scores

For more details on the available centrality scores, check out the Centrality scores page.