Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression
This repository contains data, code, and figures generated for the manuscript:
Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). [Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression](https://www.biorxiv.org/content/10.1101/2023.12.11.571168). bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168
The preprint is available on bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.11.571168
💡 General tutorials with example data can be found on the kallisto bustools website:
When interpreting the presence of RdRP-like sequences / virus IDs, keep in mind that there will likely be many RdRP-like sequences introduced by contamination of laboratory reagents. A (non-comprehensive) list of virus IDs observed in blank sequencing data is available here.
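One way to act on this caveat is to cross-reference the virus IDs detected in a sample against the IDs observed in blanks. The sketch below is illustrative only: the function name `flag_blank_ids` and both input files are hypothetical, and each file is assumed to be plain text with one virus ID per line. Because the blank list is non-comprehensive, an ID that is not flagged is not thereby confirmed as genuine.

```shell
# Hypothetical helper (not part of this repository): label each detected
# virus ID by whether it also appears in a list of IDs observed in blanks.
# Assumption: both inputs are plain-text files with one virus ID per line.
flag_blank_ids() {
  detected="$1"   # IDs detected in your sample
  blanks="$2"     # IDs observed in blank sequencing data
  awk '
    # First file on the command line (the blank list): remember every ID.
    FILENAME == ARGV[1] { blank[$1] = 1; next }
    # Second file (detected IDs): print the ID and its label, skipping empty lines.
    NF { print $1 "\t" (($1 in blank) ? "seen_in_blank" : "not_in_blank_list") }
  ' "$blanks" "$detected"
}
```

Calling `flag_blank_ids detected_ids.txt blank_ids.txt` (both file names are placeholders) prints one tab-separated line per detected ID, which can then be used to decide which IDs deserve closer scrutiny.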
The Notebooks folder contains notebooks that reproduce all of our analyses, from pre-processing of the raw data through final figure generation. The notebooks are organized by figure (following the bioRxiv preprint) and can be run directly in Google Colab.
Since the figure order was updated between the bioRxiv preprint and the subsequent publication of the manuscript in Nature Biotechnology, the Notebooks_Nature_Biotech folder links to the appropriate notebooks based on the figure numbering in the Nature Biotechnology version.
Large intermediary files that are generated/used in these notebooks are stored on Caltech Data and can be accessed under the DOIs 10.22002/krqmp-5hy81 and 10.22002/k7xqw-88d74.
Click here to view the interactive Krona plot showing all viruses expressed above the QC threshold in macaque cells that passed quality control, broken down by animal, timepoint, taxonomy, and fraction of positive cells occupied by each virus. Code to reproduce the Krona plot
The precomputed_refs folder contains precomputed reference indices for detecting viral RNA in sequencing data (through alignment to the optimized PalmDB) with the human (or mouse) genome and transcriptome masked.
A description of kallisto, bustools, and kb-python including tutorials for their use can be found here: https://www.nature.com/articles/s41596-024-01057-0
# 1. Install kb-python (optional: install gget to fetch the host genome and transcriptome)
pip install kb-python gget
# 2. Download optimized PalmDB reference files
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
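Before building the index, it can be worth confirming that the two downloads are consistent with each other. The sketch below is a hypothetical helper, not part of this repository. It assumes that the sequence ID is the first whitespace-delimited token after `>` in each FASTA header, and that the t2g file is tab-separated with the sequence ID in its first column (the usual kallisto transcript-to-gene layout).

```shell
# Hypothetical helper (not part of this repository): print every FASTA
# record ID that has no matching row in the t2g file.
# Assumptions: the FASTA ID is the first token after '>' in a header line,
# and the t2g file is tab-separated with the sequence ID in column 1.
missing_t2g_ids() {
  fasta="$1"
  t2g="$2"
  awk -F '\t' '
    # First file on the command line (the t2g): remember every ID in column 1.
    FILENAME == ARGV[1] { seen[$1] = 1; next }
    # Second file (the FASTA): inspect header lines only.
    /^>/ {
      split(substr($0, 2), parts, "[ \t]")   # drop ">" and keep the first token
      if (!(parts[1] in seen)) print parts[1]
    }
  ' "$t2g" "$fasta"
}
```

Running `missing_t2g_ids palmdb_rdrp_seqs.fa palmdb_clustered_t2g.txt` should print nothing if every sequence has a mapping; any ID it does print would be worth investigating before continuing.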
# 3. Create reference index (+ optional masking of the host, here human, genome using the D-list)
# Single-thread runtime: 1.5 h; Max RAM: 4.4 GB; Size of generated index: 593 MB
# Without D-list: Single-thread runtime: 3.5 min; Max RAM: 3.9 GB; Size of generated index: 592 MB
kb ref \
--aa \
-k 55 \
--d-list $(gget ref --ftp -w dna,cdna homo_sapiens) \
-i index.idx --workflow custom \
palmdb_rdrp_seqs.fa
# 4. Align sequencing reads
# Single-thread runtime: 1.5 min / 1 million sequences; Max RAM: 2.1 GB
kb count \
--aa \
-k 55 \
-i index.idx -g palmdb_clustered_t2g.txt \
--parity single \
-x default \