Screen for mycoplasma contamination in DNA-seq and RNA-seq data

M. haemofelis infection in blood smear. Courtesy of Wikipedia.
If you work in a lab dealing with mammalian cell cultures, then you've probabaly heard of Mycoplasma, these are obligate intracellular parasitic bacteria that not only cause human infection and disease, but also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments like streptomycin and penicillin. Mycoplasma detection is also not straight-forward, as these bugs are not visible with the light microscopes used by most researchers. PCR and Elisa tests can be used but there many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination, furthermore the study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In this post, I'll walk you through my effort to quantify Mycoplasma contamination in two fastq files, one previously identified to have high level contamination (from the NAR paper) and one from our own lab with unknown mycoplasma status.

First I'll need to download the two fastq files from EBI.

axel -n5 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488569/SRR488569.fastq.gz
axel -n5 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR944/SRR944282/SRR944282.fastq.gz

Next, here is the script I'll use. You'll see that it consists of 2 functions. The first function checks whether the myco genomes have already been downloaded and if necessary downloads them. It then performs a bit of formatting to the reference sequences and indexes them ready for BWA. The second function aligns the fastq reads to each myco genome and them to the full collection of 6 mycos together. The reason for this will be clear later. We only keep the aligned reads to save on disk space. The myco aligned reads are then mapped back to the host genome to see whether they could be from the host.

Before you try this at home, note I'm using the following:
  • Ubuntu 14.04 LTS with BASH
  • Enough RAM to align to the human genome with BWA (8 GB minimum)
  • BWA version 0.7.13
  • SamTools 1.3 (using htslib 1.3) <<-- this is a relatively mew one and the syntax for "sort" is different
  • bamtofastq (aka bamToFastq) from BedTools Version: v2.17.0
  • Single end reads

#!/bin/bash

#The mycoplasma genomes download function
dlmyco(){
wget -N ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_14_collection/acholeplasma_laidlawii_pg_8a/dna/Acholeplasma_laidlawii_pg_8a.GCA_000018785.1.29.dna.genome.fa.gz
echo '>Alaidlawii' > Alaidlawii.fa
zcat Acholeplasma_laidlawii_pg_8a.GCA_000018785.1.29.dna.genome.fa.gz \
| grep -v '>' >> Alaidlawii.fa

wget -N ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_19_collection/mycoplasma_fermentans_pg18/dna/Mycoplasma_fermentans_pg18.GCA_000209735.1.29.dna.genome.fa.gz
echo '>Mfermentans' > Mfermentans.fa
zcat Mycoplasma_fermentans_pg18.GCA_000209735.1.29.dna.genome.fa.gz \
| grep -v '>' >> Mfermentans.fa

wget -N ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_15_collection/mycoplasma_hominis_atcc_23114/dna/Mycoplasma_hominis_atcc_23114.GCA_000085865.1.29.dna.genome.fa.gz
echo '>Mhominis' > Mhominis.fa
zcat Mycoplasma_hominis_atcc_23114.GCA_000085865.1.29.dna.genome.fa.gz \
| grep -v '>' >> Mhominis.fa

wget -N ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_5_collection/mycoplasma_hyorhinis_sk76/dna/Mycoplasma_hyorhinis_sk76.GCA_000313635.1.29.dna.genome.fa.gz
echo '>Mhyorhinis' > Mhyorinis.fa
zcat Mycoplasma_hyorhinis_sk76.GCA_000313635.1.29.dna.genome.fa.gz \
| grep -v '>' >> Mhyorinis.fa

wget -N ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_33_collection/mycoplasma_arginini_7264/dna/Mycoplasma_arginini_7264.GCA_000367785.1.29.dna.genome.fa.gz
echo '>Marginini' > Marginini.fa
zcat Mycoplasma_arginini_7264.GCA_000367785.1.29.dna.genome.fa.gz \
| grep -v '>' >> Marginini.fa

wget -N ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000420105.1_ASM42010v1/GCF_000420105.1_ASM42010v1_genomic.fna.gz
echo '>Morale' > Morale.fa
zcat GCF_000420105.1_ASM42010v1_genomic.fna.gz \
|grep -v '>' >> Morale.fa

rm Myco.fa 2>/dev/null
cat *fa > Myco.fa
for i in *fa ; do
  bwa index $i
done
}
export -f dlmyco

myco(){
#set -x
MYCO=Myco.fa

###The reference genome path needs to be set
REF=/path/to/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
FQZ=$1
ZIP=`echo $FQZ | rev | cut -d '.' -f1 | rev`
echo $ZIP
if [ "$ZIP" == "gz" ] ; then UNZIP=zcat
elif [ "$ZIP" == "bz2" ] ; then UNZIP=bzcat
else echo "error zip not identified" ; exit 1
fi

FQ=`echo $FQZ | rev | cut -d '.' -f2- | rev`

#align reads to the myco
for FA in *fa ; do
$UNZIP $FQZ \
| bwa mem -t`nproc` $FA - \
| samtools view -F 4 -uSh - \
| samtools sort -o $FQ.$FA.bam -

samtools index $FQ.$FA.bam
#identify placed reads

samtools view -q20 $FQ.$FA.bam \
| cut -f3 \
| sort \
| uniq -c > $FQ.$FA.stats

#map reads back to host genome
bamToFastq -i $FQ.$FA.bam -fq /dev/stdout \
| bwa mem -t4 $REF - \
| samtools view -F 4 -uSh - \
| samtools sort -o $FQ.$FA.ref.bam -

#generate stats file
samtools index $FQ.$FA.ref.bam
samtools view -q20 $FQ.$FA.ref.bam \
| cut -f3 \
| sort \
| uniq -c > $FQ.$FA.ref.stats
done
}
export -f myco

##only download, format and index if required
if [ ! -f Myco.fa.bwt ] ; then
  dlmyco
fi

## You can increase the number of parallel jobs if you have

## enough memory
parallel -u -j1 myco ::: `ls | egrep '(fq|fastq)' | egrep '(gz$|bz2$)'`



At the end of all this data crunching you be left with a bunch of .bam files and a bunch of .stats files.

The bam files contain the mapped reads and we can query them quite simply using a mapQ20 threshold:

for BAM in *bam ; do
  CNT=`samtools view -c -q20 $BAM`
  echo $BAM $CNT
done



This way we can count the number of reliably mapped reads to each myco genome and the number of these that map back to the human genome. The two datasets I chose had similar numbers of reads. SRR488569 has 20,216,671 and SRR944282 has 20,877,558. As you can see in the first table,  SRR944282 has a high number of mycoplasma reads as it was identified previously in the NAR paper. On the other hand, SRR488569 from our lab showed much lower numbers of myco reads.

* All myco is the mapping of reads to a library of the 6 myco genomes, the number of mapped reads here is smaller as the mapQ filtering is likely to exclude a bunch of reads that map to conserved genes.

During the running of the script, I also generated a bunch of .stats files, which are just a summary of the placement of mapQ20 reads (from the "All myco" alignment). This info should in theory reveal which species is the most abundant. We can look at the data relatively simply:

head *.stats

** The breakdown shown here gives a slightly different representation as compared to the 1st table, but this one is likely to be more accurate.

In conclusion, here I demonstrate that a simple application of BWA allows us to screen for mycoplasma contamination and should be a routine analysis for all DNA-seq/RNA-seq data sets, especially those from cell cultures. You should correlate these results with the results of PCR or Elisa tests to establish a normal background level so that bona fide contamination can be more easily detected.

EDIT: For precision assignment of taxonomic labels to short DNA sequences (like metagenomic studies), try Kraken. It requires at least 160 GB of disk space & 75 GB so will need to give it a try on a bigger server... Updates soon! Thanks @torstenseemann 

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

A selection of useful bash one-liners

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc