Protocols:HeliScopeCAGE read alignment: Difference between revisions
From FANTOM5_SSTAR
No edit summary |
No edit summary |
||
(One intermediate revision by the same user not shown) | |||
Line 1: | Line 1: | ||
The data sequenced on Heliscope sequencers are | The data sequenced on Heliscope sequencers are processed, in order to: | ||
*Discard the CAGE tags derived from the ribosomal DNA repeating unit, which is not contained in the genome assembly, by rRNAdust (developed by Timo Lassmann). See [[Protocols:rRNAdust]] | *Discard the CAGE tags derived from the ribosomal DNA repeating unit, which is not contained in the genome assembly, by rRNAdust (developed by Timo Lassmann). See [[Protocols:rRNAdust]] | ||
*Align the remained CAGE tags with the genome sequences with DELVE (developed by Timo Lassmann), which generates BAM files containing a single mapped position per read with mapping quality and alignments. [[File:Delve_User_Manual.pdf]] | *Align the remained CAGE tags with the genome sequences with DELVE (developed by Timo Lassmann), which generates BAM files containing a single mapped position per read with mapping quality and alignments. [[File:Delve_User_Manual.pdf]] | ||
*Filter step in order to flag bad alignments and to assign read percent identity. We consider reads with at least 85% identity. Bam alignment files are fed to a script aln_filter (developed by Timo Lassmann), which takes care of both cases above. Scripts available at this URL http://fantom.gsc.riken.jp/5/suppl/aln_filter/ | |||
Post-mapping processing involves the following steps. | |||
*Retain only those reads with mapping quality corresponding to a 99% accuracy. | *Retain only those reads with mapping quality corresponding to a 99% accuracy. | ||
this is obtained by specifying the following samtools options | this is obtained by specifying the following samtools options | ||
Line 8: | Line 12: | ||
*Aggregate the 5'-end of those mapped CAGE tags as CAGE transcription starting site (CTSS). | *Aggregate the 5'-end of those mapped CAGE tags as CAGE transcription starting site (CTSS). | ||
command line instructions are combined with bedtools to read the bed file output obtained by the conversion of mapping file .BAM into .bed; then the bedtools function groupBy (or equivalent for the latest versions of bedtools) is used to aggregate those tags with same starting position into CTSS. | command line instructions are combined with bedtools to read the bed file output | ||
obtained by the conversion of mapping file .BAM into .bed; then the bedtools function | |||
groupBy (or equivalent for the latest versions of bedtools) is used to aggregate those | |||
tags with same starting position into CTSS. | |||
Full instructions from BAM to CTSS. Commands are executed separately for plus and minus strand. | Full instructions from BAM to CTSS. Commands are executed separately for plus and minus strand. | ||
*read bam file considering quality score and keeping binary format (-u) | *read bam file considering quality score and keeping binary format (-u) | ||
*convert into bed | *convert into bed |
Latest revision as of 10:18, 16 February 2015
The data sequenced on Heliscope sequencers are processed, in order to:
- Discard the CAGE tags derived from the ribosomal DNA repeating unit, which is not contained in the genome assembly, by rRNAdust (developed by Timo Lassmann). See Protocols:rRNAdust
- Align the remained CAGE tags with the genome sequences with DELVE (developed by Timo Lassmann), which generates BAM files containing a single mapped position per read with mapping quality and alignments. File:Delve User Manual.pdf
- Filter step in order to flag bad alignments and to assign read percent identity. We consider reads with at least 85% identity. Bam alignment files are fed to a script aln_filter (developed by Timo Lassmann), which takes care of both cases above. Scripts available at this URL http://fantom.gsc.riken.jp/5/suppl/aln_filter/
Post-mapping processing involves the following steps.
- Retain only those reads with mapping quality corresponding to a 99% accuracy.
this is obtained by specifying the following samtools options samtools view -q 20 [mapping_file.bam]
- Aggregate the 5'-end of those mapped CAGE tags as CAGE transcription starting site (CTSS).
command line instructions are combined with bedtools to read the bed file output obtained by the conversion of mapping file .BAM into .bed; then the bedtools function groupBy (or equivalent for the latest versions of bedtools) is used to aggregate those tags with same starting position into CTSS.
Full instructions from BAM to CTSS. Commands are executed separately for plus and minus strand.
- read bam file considering quality score and keeping binary format (-u)
- convert into bed
- select the strand
- sort
- aggregate by start position
- print results
samtools view -uq 20 BAMfile \ | bamToBed -i stdin \ | awk 'BEGIN{OFS="\t"}{if($6=="+"){print $1,$2,$5}}' \ | sort -k1,1 -k2,2n \ | groupBy -i stdin -grp 1,2 -opCols 3 -ops count \ | awk 'BEGIN{OFS="\t"}{print $1,$2,$2+1, $1":"$2".."$2+1",+" ,$3,"+"}' samtools view -uq 20 BAMfile \ | bamToBed -i stdin \ | awk 'BEGIN{OFS="\t"}{if($6=="-"){print $1,$3,$5}}' \ | sort -k1,1 -k2,2n \ | groupBy -i stdin -grp 1,2 -opCols 3 -ops count \ | awk 'BEGIN{OFS="\t"}{print $1,$2-1,$2, $1":"$2-1".."$2",-" ,$3,"-"}'