Merge pull request #205 from seqeralabs/main-fusion-docs-audit
Fusion docs overhaul
Showing 18 changed files with 625 additions and 406 deletions.
---
title: Get started
description: "Use the Fusion v2 file system in Seqera Platform and Nextflow"
date: "23 Aug 2024"
tags: [fusion, storage, compute, file system, posix, client]
---

Use Fusion directly in Seqera Platform compute environments, or add Fusion to your Nextflow pipeline configuration.

### Seqera Platform

Use Fusion directly in the following Seqera Platform compute environments:

- [AWS Batch](https://docs.seqera.io/platform/latest/compute-envs/aws-batch)
- [Azure Batch](https://docs.seqera.io/platform/latest/compute-envs/azure-batch)
- [Google Cloud Batch](https://docs.seqera.io/platform/latest/compute-envs/google-cloud-batch)
- [Amazon EKS](https://docs.seqera.io/platform/latest/compute-envs/eks)
- [Google GKE](https://docs.seqera.io/platform/latest/compute-envs/gke)

See the Platform compute environment page for your cloud provider for Fusion configuration instructions and optimal compute and storage recommendations.

### Nextflow

:::note
Fusion requires Nextflow `22.10.0` or later.
:::

Fusion integrates with Nextflow directly and does not require any installation or changes to pipeline code. It only requires the use of a container runtime or a container computing service such as Kubernetes, AWS Batch, or Google Cloud Batch.

#### Nextflow installation

If you already have Nextflow installed, update to the latest version with this command:

```bash
nextflow self-update
```

Otherwise, install Nextflow with this command:

```bash
curl -s https://get.nextflow.io | bash
```
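
Before enabling Fusion, you can confirm that your installed version meets the `22.10.0` minimum. A minimal sketch of a version check using `sort -V`; the `version_ge` helper and the example version string are illustrative, not part of Nextflow:

```bash
# Succeeds when the first version is greater than or equal to the second,
# using sort -V for natural version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Substitute the version reported by `nextflow -version` here:
if version_ge "23.04.1" "22.10.0"; then
  echo "version OK for Fusion"
else
  echo "too old; run 'nextflow self-update'"
fi
```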

#### Fusion configuration

To enable Fusion in your Nextflow pipeline, add the following snippet to your `nextflow.config` file:

```groovy
fusion.enabled = true
wave.enabled = true
tower.accessToken = '<your Platform access token>' // Optional
```

:::tip
A Platform access token is not mandatory. However, it is required to access private repositories, and it grants higher service rate limits than anonymous use.
:::
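
If you prefer not to hardcode the token in `nextflow.config`, Nextflow also reads it from the `TOWER_ACCESS_TOKEN` environment variable, for example:

```bash
# Export the Platform access token so Nextflow picks it up at runtime
# instead of reading tower.accessToken from nextflow.config:
export TOWER_ACCESS_TOKEN=<your Platform access token>
```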
---
title: AWS Batch
description: "Use Fusion with AWS Batch and S3 storage"
date: "23 Aug 2024"
tags: [fusion, storage, compute, aws batch, s3]
---

Fusion simplifies and improves the efficiency of Nextflow pipelines in [AWS Batch](https://aws.amazon.com/batch/) in several ways:

- No need to use the AWS CLI tool for copying data to and from S3 storage.
- No need to create a custom AMI or custom containers to include the AWS CLI tool.
- Fusion uses an efficient data transfer and caching algorithm that provides much faster throughput than the AWS CLI and does not require a local copy of data files.
- By replacing the AWS CLI with a native API client, transfers are much more robust at scale.

### Platform AWS Batch compute environments

Seqera Platform supports Fusion in Batch Forge and manual AWS Batch compute environments.

See [AWS Batch](https://docs.seqera.io/platform/latest/compute-envs/aws-batch) for compute and storage recommendations and instructions to enable Fusion.

### Nextflow CLI

:::tip
The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in parallel to and from object storage, using the container-local temporary directory (`/tmp`). To achieve optimal performance, set up an SSD volume as the temporary directory.

Several AWS EC2 instance types include one or more NVMe SSD volumes. These volumes must be formatted before use. See [SSD instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) for details. Seqera Platform automatically formats and configures NVMe instance storage with the "Fast instance storage" option when you create an AWS Batch compute environment.
:::
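
When you manage the compute environment yourself, the instance-store volume can be formatted and mounted before jobs run, for example from an EC2 user-data script. A minimal sketch, assuming a single NVMe device at `/dev/nvme1n1` and a `/scratch` mount point, both of which vary by instance type and setup:

```bash
# Sketch only: format and mount an NVMe instance-store volume as the
# scratch area bound to the container's /tmp. The device name
# /dev/nvme1n1 is an assumption; check `lsblk` on your instance.
mkfs.ext4 /dev/nvme1n1      # format the instance-store volume
mkdir -p /scratch           # directory later bound into containers
mount /dev/nvme1n1 /scratch
chmod 1777 /scratch         # world-writable with sticky bit, like /tmp
```

The `/scratch` path here corresponds to the `/path/to/ssd` host path used in the `process.containerOptions` setting below.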

1. Add the following to your `nextflow.config` file:

   ```groovy
   process.executor = 'awsbatch'
   process.queue = '<YOUR AWS BATCH QUEUE>'
   process.scratch = false
   process.containerOptions = '-v /path/to/ssd:/tmp' // Required for SSD volumes
   aws.region = '<YOUR AWS REGION>'
   fusion.enabled = true
   wave.enabled = true
   ```

   Replace `<YOUR AWS BATCH QUEUE>` and `<YOUR AWS REGION>` with your AWS Batch queue and region.

1. Run the pipeline with the usual run command:

   ```bash
   nextflow run <YOUR PIPELINE SCRIPT> -w s3://<YOUR-BUCKET>/work
   ```

   Replace `<YOUR PIPELINE SCRIPT>` with your pipeline Git repository URI and `<YOUR-BUCKET>` with your S3 bucket.