CLIPSeqTools

Suite to analyse CLIP-Seq datasets

Description

CLIPSeqTools is a collection of command line applications used for the analysis of CLIP-Seq datasets. CLIP-Seq stands for UV cross-linking and immunoprecipitation coupled with high-throughput sequencing.

CLIPSeqTools has applications for a wide range of analyses that give an in-depth view of the analysed dataset. Examples of such analyses are genome read coverage, distribution of reads on genic elements, motif enrichment, relative position of reads of two datasets, and differential gene counts.

CLIPSeqTools is organised into 4 toolboxes, each of which performs a specific set of analyses:

  1. clipseqtools

    Application to analyse a single CLIP-Seq library.

  2. clipseqtools-compare

    Application to compare two CLIP-Seq libraries. (Can be used after clipseqtools is run on each dataset).

  3. clipseqtools-plots

    Helper application to create plots for the output of clipseqtools and clipseqtools-compare. (Note: Usually the plotting functions are called from the analysis scripts themselves using the --plot option).

  4. clipseqtools-preprocess

    Application to process a FastQ file into files that are compatible with clipseqtools. (Among other things, it aligns the reads on the reference genome and annotates the alignments with genic, repeat masker and phastCons conservation information).

Installation

CLIPSeqTools is a Perl module and should be compatible with any Unix-style operating system that has the Perl programming language installed. If you are working on a Mac or a Linux operating system, chances are you already have Perl installed.

Although the installation is straightforward for people who have some experience with command line installations, it can be slightly cumbersome for people with no such experience. In that case, we suggest contacting your IT department or someone able to help you with the installation process.

Prerequisites

CLIPSeqTools relies on a few external programs for things like the alignment and the plotting functionality. To successfully install and use CLIPSeqTools you will need to have the following tools installed and available in the user's PATH:

  • R

    Language for statistical computing. To download R and for installation instructions, refer to http://www.r-project.org/

  • cutadapt

    To remove the 3' end adaptor sequence from reads (only if you use clipseqtools-preprocess). To download cutadapt and for installation instructions, refer to https://code.google.com/p/cutadapt/

  • STAR

    For the alignment of reads on a reference genome (only if you use clipseqtools-preprocess). To download STAR and for installation instructions refer to https://code.google.com/p/rna-star/

  • Memory

    If you plan on using clipseqtools-preprocess to align reads on a reference genome, you will need a machine with at least 16 GB of RAM, because this is the amount of memory required by the STAR aligner. Smaller genomes might need less memory, but we have not verified this.
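
Before installing, it can help to confirm that the external programs listed above are reachable from the PATH. The snippet below is a minimal sketch, not part of CLIPSeqTools: check_tools is a hypothetical helper, and the executable names R, cutadapt and STAR are assumed to match your installation.

```shell
#!/bin/sh
# check_tools (hypothetical helper): print one line for every program
# that cannot be found on the PATH; return non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example (executable names are assumptions; adjust if yours differ):
#   check_tools R cutadapt STAR && echo "all prerequisites found"
```

If a program is reported as missing, install it or add its location to the PATH before continuing.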

Installing CLIPSeqTools

The simplest way to install CLIPSeqTools is to use CPAN, which is a package manager for Perl modules. If you are the system administrator and want to install the module system-wide, you need to switch to the root user.

To fire up the CPAN module, just get to your terminal (Command Line) and run the following command:

perl -MCPAN -e shell

If this is the first time you've run CPAN, it's going to ask you a series of questions - in most cases the default answer is fine.

Once you find yourself staring at the cpan> command prompt, type:

install CLIPSeqTools

CPAN should take it from there and install CLIPSeqTools.
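
To confirm that the installation worked, one option is to ask Perl to load the module. This is a small sketch under the assumption that the module is loadable under the name CLIPSeqTools (the name used in the install command above); perl_module_available is a hypothetical helper, not something CLIPSeqTools provides.

```shell
#!/bin/sh
# perl_module_available (hypothetical helper): succeed only if the perl
# found on the PATH can load the module named in the first argument.
perl_module_available() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Example (module name assumed from the CPAN install command):
#   perl_module_available CLIPSeqTools && echo "CLIPSeqTools is installed"
```

If the check fails right after a system-wide install, the perl on your PATH may differ from the one CPAN installed into.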

Getting Started

Download required files

CLIPSeqTools relies on certain data and annotation files to function properly. For the user's convenience, we provide the required files for 3 species - human (assembly: hg19), mouse (assembly: mm9) and fly (assembly: dm3) on our public server.

You may access these files at this link

Prepare your working directory

To keep things simple, in the following we assume that you are using a working directory named clip and that you are working with the human (hg19) species.

  1. Create a new directory named data inside clip/.

    This creates the path clip/data/

  2. Download the file hg19.tgz from our public server using the link given previously.

  3. Put the downloaded file into the new directory clip/data/ and extract it.

    This creates the path clip/data/hg19/. To save disk space you can now remove file hg19.tgz.

  4. Assuming your CLIP-Seq data are for proteinA, create a new directory named proteinA inside clip/.

    This creates the path clip/proteinA/

  5. Move/Copy the FastQ file with the CLIP-Seq reads into clip/proteinA/ and rename it to reads.fastq.

    Important: Unzip it, if it is zipped.

  6. Open a terminal and navigate to your working directory.

    cd /path/to/clip/
    
  7. List all directories and files with the following command.

    find .
    

    You should now have a working directory that looks like this:

    clip/
    clip/data/
    clip/data/hg19/
    clip/proteinA/
    clip/proteinA/reads.fastq
    

    Verify that everything is in place.
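
Steps 1 to 5 above can also be scripted. The following is a rough sketch, assuming hg19.tgz has already been downloaded and that a compressed FastQ file, if any, is gzip-compressed (.gz); prepare_clip_dir and its three arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# prepare_clip_dir (hypothetical helper)
#   $1  working directory to create, e.g. clip
#   $2  path to the downloaded hg19.tgz
#   $3  path to the FastQ file (plain, or gzip-compressed with a .gz suffix)
prepare_clip_dir() {
  workdir=$1
  genome_tgz=$2
  fastq=$3

  # Steps 1 and 4: create clip/data/ and clip/proteinA/
  mkdir -p "$workdir/data" "$workdir/proteinA" || return 1

  # Step 3: extract the archive inside clip/data/ (creates clip/data/hg19/)
  tar -xzf "$genome_tgz" -C "$workdir/data" || return 1

  # Step 5: store the reads as reads.fastq, decompressing them if needed
  case $fastq in
    *.gz) gunzip -c "$fastq" > "$workdir/proteinA/reads.fastq" ;;
    *)    cp "$fastq" "$workdir/proteinA/reads.fastq" ;;
  esac
}

# Example (all paths are placeholders):
#   prepare_clip_dir clip ~/Downloads/hg19.tgz ~/Downloads/my_reads.fastq.gz
```

The sketch leaves the original archive and FastQ file untouched, so you can delete them by hand once the layout has been verified.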

Align and process FastQ files with clipseqtools-preprocess

To process the FastQ file (align the reads on the reference genome and annotate the alignments with genic, repeat masker and phastCons conservation information), run the following command, substituting <PLACEHOLDER> with the appropriate information.

  • If you are running on a machine with more than 32 GB of RAM:

    clipseqtools-preprocess all \
      --adaptor <3_END_ADAPTOR> \
      --fastq proteinA/reads.fastq \
      --gtf data/hg19/annotation/UCSC_gene_parts_genename.gtf \
      --rmsk data/hg19/annotation/rmsk.bed \
      --star_genome data/hg19/STAR/index/ \
      --cons_dir data/hg19/phastCons/ \
      --rname_sizes data/hg19/chrom.sizes \
      --o_prefix proteinA/ \
      -v
    
  • If you are running on a machine with more than 16 GB (but not more than 32 GB) of RAM:

    clipseqtools-preprocess all \
      --adaptor <3_END_ADAPTOR> \
      --fastq proteinA/reads.fastq \
      --gtf data/hg19/annotation/UCSC_gene_parts_genename.gtf \
      --rmsk data/hg19/annotation/rmsk.bed \
      --star_genome data/hg19/STAR/index-sparsed2/ \
      --cons_dir data/hg19/phastCons/ \
      --rname_sizes data/hg19/chrom.sizes \
      --o_prefix proteinA/ \
      -v
    

The command above performs many steps and will most likely take at least a few hours, so be patient and do NOT close the terminal. When it finishes, you will find all the files required to run clipseqtools in the next step under clip/proteinA/.
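
The two variants above differ only in the STAR index passed to --star_genome, so the choice can be made from the amount of RAM. The sketch below encodes the two thresholds stated above; pick_star_index is a hypothetical helper, and reading /proc/meminfo is an assumption that holds on Linux only. The reported total is usually slightly below the installed amount, so treat the result as a guide.

```shell
#!/bin/sh
# pick_star_index (hypothetical helper): print the STAR index directory
# to use for a machine with the given amount of RAM (whole GB).
pick_star_index() {
  mem_gb=$1
  if [ "$mem_gb" -gt 32 ]; then
    echo "data/hg19/STAR/index/"
  elif [ "$mem_gb" -gt 16 ]; then
    echo "data/hg19/STAR/index-sparsed2/"
  else
    echo "not enough memory for the STAR alignment" >&2
    return 1
  fi
}

# Example on Linux, where /proc/meminfo reports MemTotal in kB:
#   mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
#   star_genome=$(pick_star_index "$mem_gb")
```

The printed directory can then be passed as the value of --star_genome.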

Analyse a library with clipseqtools

Run clipseqtools with the following command:

clipseqtools all \
  --database proteinA/reads.adtrim.star_Aligned.out.single.sorted.db \
  --gtf data/hg19/annotation/UCSC_gene_parts_genename.gtf \
  --rname_sizes data/hg19/chrom.sizes \
  --o_prefix proteinA/ \
  --plot \
  -v

The command above performs many analyses and will probably take a few hours, so be patient and do NOT close the terminal. When it finishes, you will find the result files (tables and figures) in clip/proteinA/.
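
If keeping a terminal open for hours is not practical, one common alternative is to detach the job with the standard nohup utility and send its output to a log file. This is only a sketch: run_detached is a hypothetical helper and the log file name is a placeholder; neither is part of CLIPSeqTools.

```shell
#!/bin/sh
# run_detached (hypothetical helper)
#   $1      log file that receives both stdout and stderr
#   $2...   the command to run, with all of its arguments
# Starts the command in the background, immune to hangups, and prints its PID.
run_detached() {
  logfile=$1
  shift
  nohup "$@" > "$logfile" 2>&1 &
  echo "$!"
}

# Example (log file name is a placeholder):
#   run_detached proteinA/clipseqtools.log clipseqtools all <same options as above>
#   tail -f proteinA/clipseqtools.log    # follow the progress
```

The same wrapper would apply to the clipseqtools-preprocess step, which also runs for a long time.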

To view the table files (those with the .tab extension), open them with a spreadsheet program like MS Excel or copy and paste their content directly into a spreadsheet.
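
For a quick look without leaving the terminal, the tables can also be inspected with standard tools. The sketch below assumes the .tab files are tab-delimited text with one record per line; tab_shape is a hypothetical helper and the file name in the example is a placeholder.

```shell
#!/bin/sh
# tab_shape (hypothetical helper): report how many lines a tab-delimited
# file has (header included) and how many columns its first line contains.
tab_shape() {
  awk -F '\t' 'NR == 1 { cols = NF }
               END { printf "%d lines, %d columns\n", NR, cols }' "$1"
}

# Example (file name is a placeholder):
#   tab_shape proteinA/some_table.tab
#   head -n 5 proteinA/some_table.tab    # show the first few records
```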

Compare two libraries with clipseqtools-compare

Assuming you have two libraries on which you have previously run clipseqtools, you can now use clipseqtools-compare to compare their results. For simplicity, we assume the two directories containing the clipseqtools results for these two libraries are clip/proteinA/ and clip/proteinB/. To compare the results for the two libraries, run the following command from inside the working directory clip/.

clipseqtools-compare all \
  --database proteinA/reads.adtrim.star_Aligned.out.single.sorted.db \
  --res_prefix proteinA/ \
  --r_database proteinB/reads.adtrim.star_Aligned.out.single.sorted.db \
  --r_res_prefix proteinB/ \
  --rname_sizes data/hg19/chrom.sizes \
  --o_prefix proteinA_vs_B/ \
  --plot \
  -v

Note that with the above command we are comparing library proteinA against the reference library proteinB.

The command is going to take some time, so be patient. When it finishes, you will find the result files for the analyses in clip/proteinA_vs_B/.