Posts

Showing posts with the label ChIP-seq

Extract TSS regions from a GTF file with GTFtools

Since stumbling upon GTFtools recently, I found that it has other another interesting use - to generate coordinate sets around transcriptional start sites (TSSs). This is really important for ChIP-seq analysis when we want to compare for example the strength of enrichment of histone modificaions at TSSs and compare it to RNA expression. Using GTFtools, it is a one line command to extract these positions:

gtftools.py -t Homo_sapiens.GRCh38.94.gtf.tss.bed -w 1000  Homo_sapiens.GRCh38.94.gtf

Where "-t" is the output file flag, "-w" is the desired TSS distance to cover, in this case +/- 1000 bp, and the last argument is the input gtf file which needs to be Ensembl or Gencode (other ones don't work due to differences in formatting) 
If I had to do this without GTFtools, it would end up being more complicated, as TSS positions (exon 1 starts) would need to be extracted from the GTF file separately for the top and bottom strands and then merged.

Minitalk: Understanding gene regulation in complex disease with deep sequencing

Image
Today I gave a presentation on experiment design and use of ChIP-seq and MBD-seq to understand gene regulation. The target audience consisted of biomedical scientists with little background in genomics but were curious to incorporate deep sequencing into their studies.

Link to the slides HERE.


As always I love getting feedback - so leave your questions and comments below!

Introducing the ENCODE Gene Set Hub

Image
TL;DR We curated a bunch of ENCODE data into gene sets that is super useful in pathway analysis (ie GSEA).
Link to gene sets and data: https://sourceforge.net/projects/encodegenesethub/
Poster presentation: DOI:10.13140/RG.2.2.34302.59208

Now for the longer version. Gene sets are wonderful resources. We use them to do pathway level analyses and identify trends in data that lead us to improved interpretation and new hypotheses. Most pathway analysis tools like GSEA allow us to use custom gene sets, this is really cool as you can start to generate gene sets based on your own profiling work and that of others.

There is huge value in curating experimental data into gene sets, as the MSigDB team have demonstrated. But overall, these data are under-shared. Even our group is guilty of not sharing the gene sets we've used in papers. There have been a few papers where we've used gene sets curated  from ENCODE transcription factor binding site (TFBS) data to understand which TFs were drivi…

Count ChIP-seq reads across a promoter with featureCounts

When analyzing ChIP-seq data, we sometimes like to focus on promoters as a proxy of gene activation especially for histone marks that are normally associated with gene activation such as H3K9 acetylation or H3K4 tri-methylation.

FeatureCounts has emerged as a competitor to HTSeq and BedTools MultiCov for counting reads across features (ie, exons, genes, promoters). FeatureCounts is great for RNA-seq because it can natively read GTF annotation files, but can't read BED format (that we use a lot in ChIP-seq analysis).

In order to make featureCounts work, we need to extract the TSS coordinates and convert to a BED-like format that it can read (SAF format). In the below script, I extract the positions of the TSSs using a grep search followed by stripping only the neccessary information from the GTF, then using awk and BedTools to expand the region  around the TSS by 3kbp.

#!/bin/bash
#Generate an SAF file for TSS intervals from a GTF

#Specify some parameters
GTF=Homo_sapiens.GRCh38.78.gtf

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

Image
In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.

The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).

Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.

>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC

T…

Data analysis step 9: Leverage ENCODE data for enhanced pathway analysis

Image
So far in this series of posts we've analysed the effect of azacitidine on AML3 cell gene expression using publicly available RNA-seq data. Our pathway analysis showed many interesting trends, but unravelling all the different mechanisms from this point can be really hard. In order to identify the major players at the chromatin level, it can be useful to integrate transcription factor binding data and see whether targets of a particular transcription factor are differentially regulated in a pathway analysis. The problem with this analysis in the past was that ChIP-seq datasets were in varying formats on GEO and processing these into a standardised format would be too time consuming. With the advent of the ENCODE Project, there is now a large body of transcription factor binding data in a uniform format (link), that is being mined in many creative ways. In our group, we used this approach extensively (here, here and here).

In this post, we will mine ENCODE transcription factor bind…