Centrality estimation

The worfklow behind gpseqc_estimate is the following:

Identify & sort chromosomes.
Identifies chromosomes present in the current analysis and their size.
Remove outliers and mask input bed files.
Generate bins.
Generates a temporary bed file containing the bins.
Group cutsites (intersect).
Intersects (with bedtools) an empty group bed file with the input ones. Performed only if grouping is requested (-g).
Normalize over last condition.
Performes over-night normalization as described in 2.e.I, only if requested (-n).
Prepare cutsite domain.
Prepares single/grouped-cutsite domains based on the selected approach (see 2.c).
Assign reads to bins (intersect).
Intersects (with bedtools) the single/grouped (filtered) cutsites with the empty bins.
Calculate bin statistics.
Calculates (with datamash) sum, mean, sample standard deviation and count of each bin.
Combine condition into a single table.
Generates combined table, with chromosome, start, end, number of reads per condition, number of reads per bin, average reads per bin, sample standard deviation of number of reads per bin, number of considered cutsites per bin and condition ID.
Estimate centrality.
Estimates centrality (with gawk).
Mask centrality track.
Rank bins.
Sorts estimates independently.
Rescale centrality scores.
Write output.
Writes output: combined, estimated and ranked.

Binning

Centrality can be estimated for any kind of genomic regions, which for simplicity are defined in gpseqc_estimate as bins. These bins can be:

Chromosome-wide (default)
Sub-chromosome non-overlapping
Sub-chromosome overlapping
A custom set of bins (be careful not to use bins of different size)

Grouping

The centrality scores are based on the fact that GPSeq captures restriction events falling nearby a cutsite. Thus, during the analysis cutsites are grouped together if the sequencing depth is not high enough to reach a single-cutsite resolution (i.e., if the cutsites in a condition are not enough populated with reads). In this case, the unit of measure of GPSeq are not the single-cutsites, but the grouped sites instead.

Given a reachable resolution (in bp) at the current sequencing depth, we can group cutsites in non-overlapping bp regions. The formulas for a group are the same as for a genomic region; for example:

NRdef

where s_i is the -th restricted cutsite in . Also, a group containing cutsites, can be defined as a region:

gdef

Normalization

As the digestion step is affected by both centrality and accessibility of the cutsite, the analysis needs to remove the influence of accessibility to provide a more accurate centrality estimate.

As restriction times higher than a few hours appear to produce similar rim profiles (from image data analysis), we assume that such high conditions represent a saturation scenario where the enzyme has explored the whole nuclear space and restricted cutsites based only on their accessibility.

Thus, the absolute number or the probability of restriction events can be normalized by dividing it by the probability of restriction in the ON condition.

Specifically, we set the over-night as last condition ( D_n ) and define the accessibility of a region as its probability of being digested over-night:

access

Note that, if the normalization is applied, the last condition is excluded from further analysis and we consider as D_n the previous condition.

Cutsite domains

For more details on the ways of considering different cutsites, check out the Cutsite domain page.

Centrality scores

For more details on the available centrality scores, check out the Centrality scores page.

Introduction
Background
- Centrality estimation
  - Cutsite domain
  - Centrality scores
- Ranks comparison
  - Distances
Installation
Usage
- Estimate centrality
- Compare ranks
Output
- gpsqec_estimate
- gpsqec_compare
Known issues
Contributing
- Contributing Guidelines
- Code of conduct

Provide feedback

Saved searches