-
Notifications
You must be signed in to change notification settings - Fork 2
Centrality estimation
The worfklow behind gpseqc_estimate
is the following:
-
Identify & sort chromosomes.
Identifies chromosomes present in the current analysis and their size. - Remove outliers and mask input bed files.
-
Generate bins.
Generates a temporary bed file containing the bins. -
Group cutsites (intersect).
Intersects (withbedtools
) an empty group bed file with the input ones. Performed only if grouping is requested (-g
). -
Normalize over last condition.
Performes over-night normalization as described in 2.e.I, only if requested (-n
). -
Prepare cutsite domain.
Prepares single/grouped-cutsite domains based on the selected approach (see 2.c). -
Assign reads to bins (intersect).
Intersects (withbedtools
) the single/grouped (filtered) cutsites with the empty bins. -
Calculate bin statistics.
Calculates (withdatamash
) sum, mean, sample standard deviation and count of each bin. -
Combine condition into a single table.
Generatescombined
table, with chromosome, start, end, number of reads per condition, number of reads per bin, average reads per bin, sample standard deviation of number of reads per bin, number of considered cutsites per bin and condition ID. -
Estimate centrality.
Estimates centrality (withgawk
). - Mask centrality track.
-
Rank bins.
Sorts estimates independently. - Rescale centrality scores.
-
Write output.
Writes output:combined
,estimated
andranked
.
Centrality can be estimated for any kind of genomic regions, which for simplicity are defined in gpseqc_estimate
as bins. These bins can be:
- Chromosome-wide (default)
- Sub-chromosome non-overlapping
- Sub-chromosome overlapping
- A custom set of bins (be careful not to use bins of different size)
The centrality scores are based on the fact that GPSeq captures restriction events falling nearby a cutsite. Thus, during the analysis cutsites are grouped together if the sequencing depth is not high enough to reach a single-cutsite resolution (i.e., if the cutsites in a condition are not enough populated with reads). In this case, the unit of measure of GPSeq are not the single-cutsites, but the grouped sites instead.
Given a reachable resolution (in bp) at the current sequencing depth, we can group cutsites in non-overlapping
bp regions. The formulas for a group are the same as for a genomic region; for example:
where is the
-th restricted cutsite in
. Also, a group
containing
cutsites, can be defined as a region:
As the digestion step is affected by both centrality and accessibility of the cutsite, the analysis needs to remove the influence of accessibility to provide a more accurate centrality estimate.
As restriction times higher than a few hours appear to produce similar rim profiles (from image data analysis), we assume that such high conditions represent a saturation scenario where the enzyme has explored the whole nuclear space and restricted cutsites based only on their accessibility.
Thus, the absolute number or the probability of restriction events can be normalized by dividing it by the probability of restriction in the ON condition.
Specifically, we set the over-night as last condition () and define the accessibility of a region as its probability of being digested over-night:
Note that, if the normalization is applied, the last condition is excluded from further analysis and we consider as the previous condition.
For more details on the ways of considering different cutsites, check out the Cutsite domain page.
For more details on the available centrality scores, check out the Centrality scores page.
GPSeqC v2.3.3
is published under the MIT License - Copyright (c) 2017-18 Gabriele Girelli