Readme update (#102)
* Update mkdocs.yml
* Update Cloud computing readthedocs
* Update Simulation readthedocs
tjstruck authored Apr 11, 2024
1 parent 84839ac commit 810a049
Showing 4 changed files with 126 additions and 35 deletions.
2 changes: 1 addition & 1 deletion docs/paper-resources/data-preperation.md
@@ -40,4 +40,4 @@ Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Ke
Li H (2021) Twelve years of SAMtools and BCFtools. GigaScience 10:giab008.

Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 38:e164.
64 changes: 45 additions & 19 deletions docs/userguide/cloud.md
@@ -2,71 +2,97 @@

## Using Work Queue for distributed inference with dadi-cli

`dadi-cli`'s subcommands `InferDM` and `InferDFE` have built-in options to work with [Cooperative Computing Tools (`CCTools`)](https://cctools.readthedocs.io/en/stable/about/) `Work Queue`, which facilitates launching independent optimizations across multiple machines through several workload managers.
To use `Work Queue`, users can use conda to install the required packages:

``` bash
conda install -c conda-forge dill ndcctools
```
Or go to the [CCTools Documentation](https://cctools.readthedocs.io/en/stable/install/) for more installation instructions.
Note that CCTools is only available for Mac and Linux computers.

This example has been tested for submitting jobs to a `SLURM Workload Manager`.
First we want to submit a factory to SLURM:
```bash
slurm_submit_workers -M dminf -P pwfile -p '--time=00:10:00 --nodes=1 --ntasks=1' 5 --workers-per-cycle=0 --cores=1
```

`dminf` is the project name and `pwfile` is a file containing a password, both of which are needed for `dadi-cli` use. They can be passed into `dadi-cli` with the `--work-queue` flag, where users pass in the project name and then the password file. `--workers-per-cycle` can be set to zero, as `dadi-cli`'s `--optimizations` argument will determine the total number of workers requested from the factory. `--cores` controls how many CPUs each worker uses and can be set to 1, as each worker will perform a single optimization.

Next users will want to submit jobs from `dadi-cli`:
```bash
dadi-cli InferDM --fs 1KG.YRI.CEU.20.syn.fs --model split_mig \
  --lbounds 1e-3 1e-3 0 0 0 --ubounds 10 10 1 10 0.5 --force-convergence 10 \
  --output 1KG.YRI.CEU.20.split_mig \
  --work-queue dminf pwfile
```

`dadi-cli` will request a number of workers equal to the number of optimizations you request. The `check-convergence` and `force-convergence` options work with `Work Queue` as well.

If users want to use another batch system with Work Queue, there are similar commands for Condor, SGE, PBS, and Torque found under [Worker Submission Scripts](https://cctools.readthedocs.io/en/latest/man_pages/).
Users can also submit a [`work_queue_factory`](https://cctools.readthedocs.io/en/latest/man_pages/work_queue_factory/), which automates the launching of workers but can require more CPUs when requesting a large number of optimizations with `dadi-cli`.
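
As a sketch, a factory that submits SLURM workers for the project above could be started with something like the following (the `--min-workers`/`--max-workers` values are illustrative, and this assumes the SLURM batch type is available in the local CCTools build):

```bash
# Maintain between 1 and 5 SLURM workers, each with a single core, for the dminf project.
work_queue_factory -T slurm -M dminf -P pwfile --min-workers 1 --max-workers 5 --cores=1
```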

## Terraform cloud computing for dadi-cli

### Setup Terraform

The [`dadi-cli` GitHub source code](https://github.com/xin-huang/dadi-cli) comes with a folder called `terraform` ([here](https://github.com/xin-huang/dadi-cli/tree/master/terraform)), which contains scripts users can use to launch Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances to remotely run `dadi-cli` and `Work Queue`.
Users will need to install [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) and the [AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
If users have not already signed up for AWS and gotten an access key ID and secret access key, more information can be found [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-prereqs.html).
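
Once the AWS CLI is installed, one common way to store the access key pair locally is `aws configure` (a minimal sketch; the region shown is only an example):

```bash
aws configure
# AWS Access Key ID [None]: <your access key ID>
# AWS Secret Access Key [None]: <your secret access key>
# Default region name [None]: us-west-2
# Default output format [None]: json
```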

Users will need to create an SSH key for Terraform to connect to the AWS instance:
```bash
ssh-keygen -f ssh-key
```
The above command will create a private SSH Key file "ssh-key" and a public SSH Key file "ssh-key.pub".

### Terraform variables file

Users will need to edit the Terraform variables file, the ["dadi.auto.tfvars" template](https://github.com/xin-huang/dadi-cli/blob/master/terraform/dadi.auto.tfvars), to set up Terraform to connect to AWS and run `dadi-cli` and Work Queue.

In the ".tfvars" file, to setup AWS users can change choose the [instance_type](https://aws.amazon.com/ec2/instance-types/) and the region, and add the content of the public SSH Key file (lines 4, 7, and 10 of the template).
If users want to run `dadi-cli`, set `run = true` (line 15) and fill in the "parameters" (line 22) with the `dadi-cli` subcommand (`dadi-cli` command minus `dadi-cli` portion) the user wants to run.
Users will want to include any data they will use in the "uploads" folder, which will be placed in the directory that `dadi-cli` is executed from.
If users want to fit a model to a frequency spectrum, "experimental-data.fs", they place in the "uploads" folder, they would fill the following for "parameters" in the ".tfvars" file:
```bash
parameters = "InferDM --fs uploads/experimental-data.fs --model two_epoch --p0 1 1 --ubounds 10 10 --lbounds 10e-3 10e-3 --grids 30 40 50 --output terra.two_epoch.demog.params --optimizations 2 --nomisid"
```
To get results, users will need to SSH into the AWS instances Terraform launches. An easy way to SSH into the AWS instance is, from inside the "terraform" folder, to run:
```bash
ssh dadi@$(terraform output -raw public_ip) -i ssh-key
```
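
To copy results back to a local machine, `scp` can be used in the same way (a sketch, assuming the output file from the ".tfvars" example above sits in the remote user's working directory):

```bash
scp -i ssh-key dadi@$(terraform output -raw public_ip):terra.two_epoch.demog.params .
```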

If users want to run work_queue_factory on an AWS instance, set `run = true` (line 27), and fill in the `project_name` (line 31) and `workqueue_password` (line 34). This can be run independently if users want Terraform to launch an AWS instance to be a dedicated work queue factory.

If users named the SSH Key something besides "ssh-key", or if it is in a different directory than the "terraform" folder, line 129 in "main.tf", `private_key = "${file("ssh-key")}"`, will need to be edited with the correct path and file name.
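
Once the variables file and SSH key are in place, the instance can be launched from inside the "terraform" folder with the standard Terraform workflow (a minimal sketch):

```bash
cd terraform
terraform init     # download the AWS provider plugins
terraform apply    # review the plan, then type "yes" to launch the instance
terraform destroy  # tear everything down when finished
```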

Users might get an error if the requested region has too many instances running:
```console
Error: error creating EC2 VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
```


## Cacao cloud computing for dadi-cli

Another resource for cloud computing with `dadi-cli` is the University of Arizona CyVerse's [Cacao](http://cacao.jetstream-cloud.org/), which provides a convenient GUI for launching instances to run `dadi-cli` and work queue factories. Cacao is built on Jetstream2, and users will need an account with [Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS)](https://access-ci.org/) and to register for an allocation.

A step-by-step guide for getting started on Cacao can be found [here](https://docs.jetstream-cloud.org/ui/cacao/getting_started/#1-login-to-cacao). For researchers that need more out of Cacao/Jetstream2, an overview of ACCESS can be found [here](https://allocations.access-ci.org/get-started-overview) and information on allocating resources for Jetstream2 can be found [here](https://docs.jetstream-cloud.org/alloc/overview/).

Once the user has access to Cacao, they can go to "Deployments" > "Add Deployment" > "launch a DADI OpenStack instance" and choose a region.
If users want the instance to automatically run `dadi-cli` after it launches, they will need to fill in the `dadi-cli` subcommand in "Parameters".

Notably, while direct frequency spectrum uploads are limited, dadi-cli can interpret raw text data from HTTPS links, such as links to raw text files uploaded to a [GitHub repository](https://github.com/) or [GitHub Gist](https://gist.github.com/).
Flexibility is paramount, as users can deploy instances to function exclusively as a Work Queue factory or run in tandem with dadi-cli.
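
For example, the "Parameters" field of a deployment could hold a subcommand that reads the spectrum from a raw-text URL (a hypothetical sketch; the URL, model, bounds, and starting values are placeholders):

```bash
InferDM --fs https://raw.githubusercontent.com/USER/REPO/main/example.fs --model two_epoch --p0 1 1 --ubounds 10 10 --lbounds 1e-3 1e-3 --grids 30 40 50 --output two_epoch.demog.params --optimizations 2 --nomisid
```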

Users can access results via the Cacao deployment's Webshell or Webdesktop. If users provide a public ssh-key under [credentials](https://cacao.jetstream-cloud.org/credentials), they can also ssh into the instance with:
```
ssh USERNAME@PUBLIC_IP -i SSHKEY
```
Where `USERNAME` is the username for Cacao, `PUBLIC_IP` is the public IP of the deployment, and `SSHKEY` is the file that contains the private ssh-key information paired with the public key used for the deployment.


## Snakemake

Finally, as a command-line tool, dadi-cli is straightforward to integrate within workflow managers like [Snakemake](https://snakemake.readthedocs.io/en/stable/).
For example, we provide a [Snakemake workflow](https://github.com/xin-huang/dadi-cli-analysis/tree/main/workflows) that fits DFE models to all populations within the 1000 Genomes Project data.
This also allows for efficient cloud computing through Snakemake across diverse platforms, including Google Cloud Life Sciences and Azure Batch.
93 changes: 79 additions & 14 deletions docs/userguide/simulation.md
@@ -1,29 +1,94 @@
# Simulation

`dadi-cli` can simulate frequency spectra based on dadi demography or DFE code, using the `SimulateDM` and `SimulateDFE` subcommands, or based on [Demes](https://popsim-consortium.github.io/demes-spec-docs/main/introduction.html) YAML files, using the `SimulateDemes` subcommand.

For example, users can simulate the AFS of a single population with a two-epoch demographic model using the following command:
```bash
dadi-cli SimulateDM --model two_epoch \
  --sample-sizes 20 --p0 10 0.1 --nomisid \
  --output two_epoch.simDM.fs
```
Here, the `--p0` argument specifies the values for the two demographic parameters in the two-epoch model, and the `--nomisid` argument tells dadi-cli to exclude the parameter for ancestral allele misidentification from the simulation.

A file with the simulated frequency spectrum, `two_epoch.simDM.fs`, will be produced.
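
As a quick sanity check, the simulated spectrum can be visualized with `dadi-cli`'s `Plot` subcommand (a sketch, assuming the `--fs` and `--output` options described in the Plotting guide):

```bash
dadi-cli Plot --fs two_epoch.simDM.fs --output two_epoch.simDM.fs.pdf
```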

If users want to generate caches and simulate a DFE based on a simulated demography, they can include `--inference-file`, which will produce an additional file named after the text passed to `--output`; for example, the command
```bash
dadi-cli SimulateDM --model three_epoch \
  --sample-sizes 20 --p0 10 5 0.02 0.1 \
  --nomisid --output three_epoch.simDM.fs --inference-file
```
will produce the frequency spectrum `three_epoch.simDM.fs` and the optimization file `three_epoch.simDM.fs.SimulateDM.pseudofit`.

Users can also simulate the AFS with a DFE model, if they have a cache file from the `GenerateCache` subcommand. For example, if users had the cache from the [DFE Inference guide](https://dadi-cli.readthedocs.io/en/latest/userguide/dfe/#generating-caches-for-dfe-inference), they can run:
```bash
dadi-cli SimulateDFE --cache1d 1KG.YRI.CEU.20.split_mig.sel.single.gamma.spectra.bpkl \
  --pdf1d lognormal --ratio 2.31 --p0 2 4 --nomisid \
  --output lognormal.split_mig.simDFE.fs
```
This will produce a frequency spectrum file based on a lognormal DFE, `lognormal.split_mig.simDFE.fs`.

Users can also simulate a demographic frequency spectrum from a Demes YAML file. To simulate with Demes, users will need to install it:
```
pip install demes
```

An [example Demes file](https://github.com/popsim-consortium/demes-python/tree/main/examples) `gutenkunst_ooa.yml` is below:
```yaml
description: The Gutenkunst et al. (2009) OOA model.
doi:
- https://doi.org/10.1371/journal.pgen.1000695
time_units: years
generation_time: 25
demes:
- name: ancestral
  description: Equilibrium/root population
  epochs:
  - {end_time: 220e3, start_size: 7300}
- name: AMH
  description: Anatomically modern humans
  ancestors: [ancestral]
  epochs:
  - {end_time: 140e3, start_size: 12300}
- name: OOA
  description: Bottleneck out-of-Africa population
  ancestors: [AMH]
  epochs:
  - {end_time: 21.2e3, start_size: 2100}
- name: YRI
  description: Yoruba in Ibadan, Nigeria
  ancestors: [AMH]
  epochs:
  - start_size: 12300
- name: CEU
  description: Utah Residents (CEPH) with Northern and Western European Ancestry
  ancestors: [OOA]
  epochs:
  - {start_size: 1000, end_size: 29725}
- name: CHB
  description: Han Chinese in Beijing, China
  ancestors: [OOA]
  epochs:
  - {start_size: 510, end_size: 54090}
migrations:
- {demes: [YRI, OOA], rate: 25e-5}
- {demes: [YRI, CEU], rate: 3e-5}
- {demes: [YRI, CHB], rate: 1.9e-5}
- {demes: [CEU, CHB], rate: 9.6e-5}
```
Users can simulate the AFS of YRI with the above Demes file and the following command:
```bash
dadi-cli SimulateDemes --demes-file gutenkunst_ooa.yml \
  --pop-ids YRI --sample-sizes 30 --output ooa.YRI.30.fs
```
A file, `ooa.YRI.30.fs`, with the spectrum will be made.
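
The simulated spectrum can then be fed back into `dadi-cli`'s inference subcommands (a sketch; the model, bounds, and starting values are illustrative):

```bash
dadi-cli InferDM --fs ooa.YRI.30.fs --model two_epoch --p0 1 1 \
  --ubounds 10 10 --lbounds 1e-3 1e-3 --grids 40 50 60 \
  --output ooa.YRI.30.two_epoch.demog.params --optimizations 5 --nomisid
```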

Users can learn more about making Demes YAML files [here](https://popsim-consortium.github.io/demes-spec-docs/main/tutorial.html).
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -39,5 +39,5 @@ nav:
- Models: 'userguide/models.md'
- Plotting: 'userguide/plot.md'
- Publication Resources:
- Data Preperation: 'paper-resources/data-preperation.md'
- References: 'references.md'
