Screen for mycoplasma contamination in DNA-seq and RNA-seq data: Part 2

If you're culturing cells, avoiding contamination is an extremely important priority. Mycoplasma are common contaminant that can change the behaviour of cells and don't cause any obvious morphological changes, so routine testing is key. As bioinformaticians we should be incorporating myco screening into all our workflows, but the available tools haven't been that great.

Even the shell script I wrote in a previous blog post isn't that great. It suffers from a few problems in miscounting reads, usability and leaving many files in the working directory. So it is high time for a refresh.

With this refresh, the idea was to make a command line tool that folks could easily incorporate into their workflows, would process single- and paired-end data, be quick and be concise in its output.

The result is `contam`. To give you an idea of what it can do, here is the help page:

contam is a script for the quantification of contaminant sequences in Illumina short read sequence data including genomic and transcriptome samples.

Usage: contam [-h] [-v] <-r HOST_REFERENCE_GENOME_FOLDER> <-c CONTAMINANTS_FOLDER> [-n NUM_READS] [-t NUM_THREADS] <-1 FASTQ_FILE_1> <-2 FASTQ_FILE_2> <-b FASTQ_BASE_NAME>

  -h  Help. Display this message and quit.

  -v  Version. Print version number and quit.

  -r  Reference sequence (host) folder. Must contain one gzip compressed fasta file with '.fa.gz' suffix.

  -c  Contaminant sequence folder. May contain one or many compressed fasta files.

  -n  Number of read pairs to analyze. Default is 1000000.

  -t  Number of parallel threads. Default is 2.

  -1  First fastq file corresponding to read 1 of the pair.

  -2  Second fastq file corresponding to read 2 of the pair. Omit for single-end sequencing runs.

  -b  Basename of the sequence dataset. This is the name of the folder where results will be saved.

Output: For each sequence dataset, a folder is created which contains the intermediate files
and final results, called "result.tsv".
The result.tsv file contains four columns, the contaminant name (abbreviated to 15 characters),
The number of reads that mapped to the host genome only, the number of reads that mapped to
the contaminant genome only and the number of reads that mapped to both.
Each line in the file corresponds to a different contaminant.

It can be run on the command line like this:

./contam -r ref -c contam_folder -n 10000 -t 8 -1 test_R1.fq.gz -2 test_R2.fq.gz -b test

Contam will map `n` reads to the host and contaminant genomes. You can screen as many potential contaminants as you like. Contam will count the number of reads that align uniquely to the host, the contaminants and those that map to both genomes. The tabular results look like this, where each potential contaminant genome is shown.

Contam example output

One of the issues with sharing shell scripts like this is that they aren't really portable. The script may not work properly on non-Ubuntu Linux distributions nor other Unix like OSs. There's Docker, which is useful in circumventing this limitation, but Docker requires root permission in most cases and researchers on shared HPC systems might not be able to convince their sysadmins about installing Docker.

For this reason, I have written an Apptainer based solution. If you haven't heard of it before, read my previous post on how to save your analytical environment as an Apptainer image. The benefit of using apptainer here is that it bundles the script and dependencies in a relatively small package that can be executed without root permission. Just look how easy it is to get started:

# download app

wget https://ziemann-lab.net/public/contam/contam.sif

# run it
apptainer run --bind $PWD:/contam contam.sif -h

Sure, apptainer itself needs to be installed, but the documentation is quite extensive and can be done multiple ways, for example through apt, from source and via conda.

The process of bulding an Apptainer app involves writing a definition file which begins with a base OS, then installs the dependencies, in this case skewer, bwa and samtools are required. It then fetches the contam script and makes sure it is available in the environment path. It sounds simple, but took a few hours of hair-pulling to get right. If you're keen to build a bash based command line tool of your own in this way, you can use the def file in the GitHub repository as a template.

Finally, the image is built using Apptainer and I made the image (77MB) available on my lab's website.

Here is a realistic example of what execution of the apptainer would look like:

apptainer run --bind $PWD:/contam contam.sif -r ref -c contam_folder -n 1000000 -t 8 -1 test_R1.fq.gz -2 test_R2.fq.gz -b test

In single end mode, typical execution time of 1 million 151 bp reads over 8 threads takes ~56 seconds. For paired end sequencing it can take ~95s. Indexing the host genome is the most time-consuming part, taking ~2 hrs for human.

For kicks, I tested the tool with some known positive data from NCBI SRA that were described in a previous paper in Nucleic Acids Research. Specifically, SRA accession SRP004637, which was published in Nature Biotech in 2011. The series consists of 58 runs and various experiments. I used the contam tools to screen 1 million reads from each single-end fastq file and got some pretty interesting results. Astonishingly, there were 10 datasets with over 20% mycoplasma (M. hyorhinis) reads suggesting a severe infection likely to invalidate any of the results derived from those particular runs. Of the other 48 datasets, seven of them had between 1% and 7% of reads mapping uniquely to M. hyorhinis, which is a concern. The other datasets weren't affected. Concerningly the average proportion of reads mapped to the human genome was only 61%, with 24 samples having less than 70% of reads mapped to the host indicating more potential quality control problems.

Mycoplasma contamination present in GEO Series GSE25183

Mycoplasma hyorhinis detected in GSE25183.

This work shows that contam is a useful command line tool for screening fastq files for known contaminants. Here we tested for mycoplasma, but as it is a general purpose tool, many other types of contaminants can be screened in parallel. We also demonstrate how Apptainer can be used to create lightweight containerised rootless tools for bioinformatics and other tasks.

I would like thank Dr Danielle Hiam from Deakin University for extensive testing and excellent suggestions to improve the contam tool!

Contam is available from GitHub: https://github.com/markziemann/contam/tree/main

Search This Blog

Genome Spot

Screen for mycoplasma contamination in DNA-seq and RNA-seq data: Part 2

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Generate an RNA-seq count matrix with featureCounts