Skip to content

Commit

Permalink
Added heterozygosity article, fixed He comparision in benchmarking
Browse files Browse the repository at this point in the history
  • Loading branch information
kfarleigh committed Feb 21, 2024
1 parent 9d60a84 commit 22ef885
Show file tree
Hide file tree
Showing 7 changed files with 168 additions and 27 deletions.
16 changes: 8 additions & 8 deletions R/Heterozygosity.R
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
#'
#' @param data Character. String indicating the name of the vcf file, geno file or vcfR object to be used in the analysis.
#' @param pops Character. String indicating the name of the population assignment file or dataframe containing the population assignment information for each individual in the data. This file must be in the same order as the vcf file and include columns specifying the individual and the population that individual belongs to. The first column should contain individual names and the second column should indicate the population assignment of each individual. Alternatively, you can indicate the column containing the individual and population information using the individual_col and population_col arguments.
#' @param statistic Character. String or vector indicating the statistic to calculate. Options are any of: all; all of the statistics; Ho, observed heterozygosity; He, expected heterozygosity; PHt, proportion of heterozygous loci; StHe, heterozygosity standardized by the average expected heterozygosity; StHo, heterozygosity standardized by the average observed heterozygosity; IR, internal relatedness; HL, homozygosity by locus.
#' @param statistic Character. String or vector indicating the statistic to calculate. Options are any of: all; all of the statistics; Ho, observed heterozygosity; He, expected heterozygosity; PHt, proportion of heterozygous loci; Hs_exp, heterozygosity standardized by the average expected heterozygosity; Hs_obs, heterozygosity standardized by the average observed heterozygosity; IR, internal relatedness; HL, homozygosity by locus.
#' @param missing_value Character. String indicating missing data in the input data. It is assumed to be NA, but that may not be true (is likely not) in the case of geno files.
#' @param write Boolean. Whether or not to write the output to files in the current working directory. There will be one or two files for each statistic. Files will be named based on their statistic such as Ho_perpop.csv or Ho_perloc.csv.
#' @param prefix Character. Optional argument. String that will be appended to file output. Please provide a prefix if write is set to TRUE.
Expand All @@ -23,7 +23,7 @@
#'
#' \href{https://onlinelibrary.wiley.com/doi/10.1111/j.1365-294X.2006.03111.x}{Aparicio, J. M., Ortego, J., & Cordero, P. J. (2006)}. What should we weigh to estimate heterozygosity, alleles or loci?. Molecular Ecology, 15(14), 4659-4665.
#'
#' \bold{Heterozygosity standardized by expected (StHe) and observed heterozygosity (StHo):}
#' \bold{Heterozygosity standardized by expected (Hs_exp) and observed heterozygosity (Hs_obs):}
#'
#' \href{https://academic.oup.com/evolut/article/53/4/1259/6757148}{Coltman, D. W., Pilkington, J. G., Smith, J. A., & Pemberton, J. M. (1999)}. Parasite‐mediated selection against Inbred Soay sheep in a free‐living island populaton. Evolution, 53(4), 1259-1267.
#'
Expand Down Expand Up @@ -220,10 +220,10 @@ Heterozygosity <- function(data, pops, statistic = 'all', missing_value = NA, wr
ExpHet_res_avg <- mean(ExpHet_res_avg)

St_He <- PropHt(Dat)/ExpHet_res_avg
rownames(St_He) <- 'StHe'
rownames(St_He) <- 'Hs_exp'
return(St_He)
}
if("StHe" %in% statistic | statistic == "all"){
if("Hs_exp" %in% statistic | statistic == "all"){
StHe_res_perind <- lapply(Dat_perpop, StHe)
StHe_res_perind <- mapply(rbind, StHe_res_perind, "Pop"=names(StHe_res_perind), SIMPLIFY=F)

Expand All @@ -241,10 +241,10 @@ Heterozygosity <- function(data, pops, statistic = 'all', missing_value = NA, wr
ObsHet_res_avg <- stats::na.omit(ObsHet_res_perloc)
ObsHet_res_avg <- mean(ObsHet_res_avg )
St_Ho <- PropHt(Dat)/ObsHet_res_avg
rownames(St_Ho) <- 'StHo'
rownames(St_Ho) <- 'Hs_obs'
return(St_Ho)
}
if("StHo" %in% statistic | statistic == "all"){
if("Hs_obs" %in% statistic | statistic == "all"){
StHo_res_perind <- lapply(Dat_perpop, StHo)
StHo_res_perind <- mapply(rbind, StHo_res_perind, "Pop"=names(StHo_res_perind), SIMPLIFY=F)

Expand Down Expand Up @@ -450,10 +450,10 @@ Heterozygosity <- function(data, pops, statistic = 'all', missing_value = NA, wr
Output <- list(Obs_Het_res_avg, ObsHet_res_perloc, ExpHet_res_avg, ExpHet_res_perloc,
PropHt_res_perind, StHe_res_perind, StHo_res_perind, IR_perind, HL_perind)

names(Output) <- c("Ho_perpop", "Ho_perloc", "He_perpop", "He_perloc", "PHt", "StHe", "StHo", "IR", "HL")
names(Output) <- c("Ho_perpop", "Ho_perloc", "He_perpop", "He_perloc", "PHt", "Hs_exp", "Hs_obs", "IR", "HL")

# Set list of possible statistics
Stat <- c("Ho", "He", "PHt", "StHe", "StHo", "IR", "HL")
Stat <- c("Ho", "He", "PHt", "Hs_exp", "Hs_obs", "IR", "HL")
Stat_idx <- c(1,1,2,2,3,4,5,6,7)

if(length(statistic) == 1 && statistic == "all"){
Expand Down
2 changes: 1 addition & 1 deletion docs/pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ articles:
PopGenHelpR_createQmatrix: PopGenHelpR_createQmatrix.html
PopGenHelpR_heterozygosity: PopGenHelpR_heterozygosity.html
PopGenHelpR_vignette: PopGenHelpR_vignette.html
last_built: 2024-02-20T21:35Z
last_built: 2024-02-21T17:02Z
urls:
reference: https://kfarleigh.github.io/PopGenHelpR/reference
article: https://kfarleigh.github.io/PopGenHelpR/articles
Expand Down
2 changes: 1 addition & 1 deletion docs/search.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions man/Heterozygosity.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 9 additions & 9 deletions vignettes/articles/PopGenHelpR_benchmarking.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,8 @@ We will compare the performance of `PopGenHelpR` with other R packages available

- [hierfstat](https://cran.r-project.org/web/packages/hierfstat/index.html) ([Goudet, 2005](https://doi.org/10.1111/j.1471-8286.2004.00828.x))

- [adegenet](https://cran.r-project.org/web/packages/adegenet/index.html) ([Jombart, 2008](https://doi.org/10.1093/bioinformatics/btn129))



### Let's Begin
Expand Down Expand Up @@ -151,7 +153,7 @@ Estimate's are very similar between `PopGenHelpR` and `mmod`, we will move onto

## *Expected* and *Observed Heterozygosity* Comparison

We will compare `PopGenHelpR` and `hierfstat`. Again, both use the same formula's from Nei (1987), so we expect similar if not identical results. `hierfstat` uses it's own format, so we will convert the data before calculations. Luckily we can convert the genind object from *Jost's D* comparisons.
We will compare `PopGenHelpR`, `hierfstat`, and `adegenet`. Again, the packages use the same formula's, so we expect similar if not identical results. `hierfstat` uses it's own format, so we will convert the data before calculations. Luckily we can convert the genind object from *Jost's D* comparisons.

```{r Heterozygosity, echo=TRUE, eval=TRUE}
Hstat <- genind2hierfstat(Genind)
Expand All @@ -163,11 +165,11 @@ PGH_He <- Heterozygosity(data = HornedLizard_VCF, pops = HornedLizard_Pop, stati
PGH_Ho <- Heterozygosity(data = HornedLizard_VCF, pops = HornedLizard_Pop, statistic = "Ho")
Hstat_hets <- basic.stats(Hstat)
Hstat_He <- colMeans(Hstat_hets$Hs)
Hstat_Ho <- colMeans(Hstat_hets$Ho)
PGH_He$He_perpop$Expected.Heterozygosity-Hstat_He
He_adnet <- Hs(Genind)
PGH_He$He_perpop$Expected.Heterozygosity-He_adnet
PGH_Ho$Ho_perpop$Observed.Heterozygosity-Hstat_Ho
```

Expand All @@ -180,19 +182,17 @@ Please reach out to Keaka Farleigh (farleik@miamioh.edu) if you have any questio

## References

Farleigh, K., Vladimirova, S. A., Blair, C., Bracken, J. T., Koochekian, N., Schield, D. R., ... & Jezkova, T. (2021). The effects of climate and demographic history in shaping genomic variation across populations of the Desert Horned Lizard (Phrynosoma platyrhinos). Molecular Ecology, 30(18), 4481-4496.
Farleigh, K., Vladimirova, S. A., Blair, C., Bracken, J. T., Koochekian, N., Schield, D. R., ... & Jezkova, T. (2021). The effects of climate and demographic history in shaping genomic variation across populations of the Desert Horned Lizard (Phrynosoma platyrhinos). Molecular Ecology, 30(18), 4481-4496.

Goudet, J. (2005). hierfstat, a package for R to compute and test hierarchical F‐statistics. Molecular ecology notes, 5(1), 184-186.

Jost, L. (2008). GST and its relatives do not measure differentiation. Molecular ecology, 17(18), 4015-4026.

Knaus, B. J., & Grünwald, N. J. (2017). vcfr: a package to manipulate and visualize variant call format data in R. Molecular ecology resources, 17(1), 44-53.

Nei, M. (1987). Molecular evolutionary genetics. Columbia university press.
Knaus, B. J., & Grünwald, N. J. (2017). vcfr: a package to manipulate and visualize variant call format data in R. Molecular ecology resources, 17(1), 44-53.

Nei, M. (1972). Genetic distance between populations. The American Naturalist, 106(949), 283-292.

Pembleton, L. W., Cogan, N. O., & Forster, J. W. (2013). St AMPP: An R package for calculation of genetic differentiation and structure of mixed‐ploidy level populations. Molecular ecology resources, 13(5), 946-952.
Pembleton, L. W., Cogan, N. O., & Forster, J. W. (2013). St AMPP: An R package for calculation of genetic differentiation and structure of mixed‐ploidy level populations. Molecular ecology resources, 13(5), 946-952.

Weir, B. S., & Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. evolution, 1358-1370.

Expand Down
11 changes: 7 additions & 4 deletions vignettes/articles/PopGenHelpR_createQmatrix.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ Let's generate some q-matrices now that you know what they are! We will create q

### sNMF

We will start with sNMF because it is impletmented in the R package [LEA](https://bioconductor.org/packages/release/bioc/html/LEA.html) (Frichot & Francois, 2015). After running sNMF (see [this tutorial](https://bioconductor.org/packages/release/bioc/vignettes/LEA/inst/doc/LEA.pdf) if you need help) you just need to use the `Q` function.
We will start with sNMF because it is implemented in the R package [LEA](https://bioconductor.org/packages/release/bioc/html/LEA.html) (Frichot & Francois, 2015). After running sNMF (see [this tutorial](https://bioconductor.org/packages/release/bioc/vignettes/LEA/inst/doc/LEA.pdf) if you need help) you just need to use the `Q` function.

```{r sNMF, echo=TRUE, eval=FALSE}
# If I have a sNMF project named sNMFobject with K number of ancestral populations (genetic clusters), and my best run is run 1 (determined as the run with the lowets cross-entropy)
Expand All @@ -53,8 +53,11 @@ This will output a file with the .Q extension, which contains ancestry coefficie
Please email Keaka Farleigh (farleik@miamioh.edu) if you need help generating a q-matrix or with anything else.

## References
- Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9), 1655-1664.
- Frichot, E., & François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925-929.
- Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., & François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics, 196(4), 973-983.

Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9), 1655-1664.

Frichot, E., & François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925-929.

Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., & François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics, 196(4), 973-983.


Loading

0 comments on commit 22ef885

Please sign in to comment.