Tutorial

This tutorial will walk you through the installation and usage of ViroProfiler. ViroProfiler is containerized with Docker, and can be used with multiple container engines, including Docker, Singularity, Podman, Shifter and Charliecloud (see docs). In this tutorial, we will use Singularity as an example.

Install ViroProfiler

ViroProfiler is built on Nextflow, which means it is very easy to install with a single command:

# Install ViroProfiler and set up the database
nextflow run deng-lab/viroprofiler -r main -profile singularity --mode "setup"

By default, database will be downloaded to the $HOME/viroprofiler directory. You can change database path using the --db parameter. For example, if you want to download the database to /db/path directory, you can run the following command,

nextflow run deng-lab/viroprofiler -r main -profile singularity --mode "setup" --db "/db/path"

It is recommended to set the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options when using the Singularity profile, so that singularity images can be stored and re-used from a central location for future pipeline runs.

If you are using Docker, please replace the -profile singularity with -profile docker.

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:

nextflow pull deng-lab/viroprofiler

Run the ViroProfiler pipeline

Prepare input files

Samplesheet input

To execute the pipeline users must provide sequencing reads as input. You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline.

The samplesheet is a comma-separated file with 3 columns. The first column is the sample name, the second column is the full path to the first read file, and the third column is the full path to the second read file. The sample name can be any string, but it is recommended to use the sample ID. The sample name will be used as the prefix of the output files. The following is an example of a samplesheet with three samples:

sample,fastq_1,fastq_2
sampleID1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
sampleID2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
sampleID3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz

Column	Description
`sample`	Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`).
`fastq_1`	Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".
`fastq_2`	Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".

An example samplesheet has been provided with the pipeline.

Use the --input parameter to specify its location, or set input in the params.yml file.

The sequencing reads can be raw reads or cleaned reads. By default, ViroProfiler will assume the input reads are raw reads. If you want to use cleaned reads, you can set parameter --reads_type "clean" in command line, or set reads_type: "clean" in the params.yml file. If the input reads are cleaned reads, the pipeline will skip the cleaning step (removing adapters, low quality reads and contaminant reads).

If you already have assembled contigs, you can skip the assembly step by setting --input_contigs parameter to specify the path to the contigs file. The contigs file should be in FASTA format.

Run the pipeline

The typical command for running the pipeline is as follows:

nextflow run deng-lab/viroprofiler \
    --input samplesheet.csv \
    -profile singularity

This will launch the pipeline with the singularity configuration profile. See Selecting NF profiles for more information about profiles.

You may create a config file to customize the parameters of the pipeline and use -c to load the config. Please check custom.config for an example. You may also specify the parameters in a file and use -params-file to load the parameters. Please check params.yml for an example. Then the command for running the pipeline is as follows:

nextflow run deng-lab/viroprofiler \
    --input samplesheet.csv \
    -profile singularity \
    -c custom.config \
    -params-file params.yml

Note that the pipeline will create the following files in your working directory:

work                # Directory containing the nextflow working files
output              # Output folder (can be modified with `--outdir` parameter)
.nextflow_log       # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.

Reproducible data analysis

For reproducibility, we recommend using a specific version of ViroProfiler. You can always run a specific version of ViroProfiler by specifying the version number. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the deng-lab/viroprofiler releases page and find the latest version number (eg. v0.2). Then specify this when running the pipeline with -r (one hyphen) - eg. -r v0.2. This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. For example, to run version v0.2 of the pipeline:

nextflow run deng-lab/viroprofiler -r v0.2 -profile singularity

If the pipeline fails, you can resume the pipeline from the last successful step by adding -resume to the command. For example:

nextflow run deng-lab/viroprofiler -r v0.2 -profile singularity -resume

Description of pipeline options and parameters

To get a list of pipeline options and parameters, run the pipeline with the --help flag:

nextflow run deng-lab/viroprofiler --help

# get full list of options and parameters
nextflow run deng-lab/viroprofiler --help --show_hidden_params

Tip

All these parameters are configurable through a configuration file. We encourage users to use the configuration file since it will keep your execution cleaner and more readable. See a config example.

Core Nextflow options

NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).

`-profile`

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. See profile for more information.

`-resume`

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.

You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.

`-c`

Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.

Custom pipeline parameters

Input/output options

Parameter	Default	Description
`--input`	NA	Input samplesheet describing all the samples to be analysed
`--output`	output	Name of directory to store output values
`--db`	${HOME}/viroprofiler	Path containing required ViroProfiler databases

On/Off processes

Parameter	Default	Description
`--use_dram`	false	Use DRAM or not
`--use_abricate`	false	Use abricate or not
`--use_decontam`	false	Remove host contamination from reads or not
`--use_eggnog`	false	Use eggnog-mapper to annotate proteins or not
`--use_iphop`	false	Use iPhOP to predict host or not
`--use_kraken2`	false	Use kraken2 to classify reads or not
`--use_phamb`	false	Use phamb to bin contigs or not

Other parameters

Parameter	Default	Description
`--prot_cluster_min_similarity`	0.7	Minimum similarity of protein seqs in the same cluster
`--prot_cluster_min_coverage`	0.9	Minimum similarity of protein seqs in the same cluster
`--binning`	null	Which binning tool to use, `vRhyme`, `phamb` or `false`
`--binning_minlen_contig`	2000	Congits shorter than this value will not be used for binning
`--binning_minlen_bin`	2000	Bin size shorter than this value will be removed from down-stream analyses
`--dvf_qvalue`	0.1	q-value used in `DeepVirFinder`
`--virsorter2_groups`	"dsDNAphage"	Viral category detected by `VirSorter2`, could be any combination of `dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae`
`--contig_minlen_vcontact2`	10000	Contigs/Bins short than this value will not be used in `vConTACT2`
`--pc_inflation`	1.5	Protein cluster inflation value used in `vConTACT2`
`--vc_inflation`	1.5	Viral cluster inflation value used in `vConTACT2`
`--taxa_db_source`	"NCBI"	Taxonomy database, could be either `NCBI` or `ICTV`
`--replicyc`	"replidec"	Viral replication cycle prediction method, could be either `replidec` or `bacphlip`

Max job request options

Set the top limit for requested resources for any single job. If you are running on a smaller system, a pipeline step requesting more resources than are available may cause the Nextflow to stop the run with an error. These options allow you to cap the maximum resources requested by any single job so that the pipeline will run on your system.

Note

Note that you can not increase the resources requested by any job using these options. For that you will need your own configuration file. See the nf-core website for details.

Parameter	Default	Description
`--max_cpus`	4	Maximum number of CPUs that can be requested for any single job
`--max_memory`	20.GB	Maximum amount of memory that can be requested for any single job
`--max_time`	120.h	Maximum amount of time that can be requested for any single job

Outputs

A glimpse over the main outputs produced by ViroProfiler is given at output section.