Changes between Initial Version and Version 1 of BIOS_Pipeline/Methodology


Ignore:
Timestamp:
Sep 19, 2016 4:58:23 PM (8 years ago)
Author:
jamverlouw
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • BIOS_Pipeline/Methodology

    v1 v1  
     1= Methodology =
     2
     3''' Pipeline design '''
     4
     5See the attachment (at the pipeline page) for a graphical representation of the pipeline. The LUMC gitlab repository which holds the framework for the pipeline can be found via:
     6
     7{{{
     8https://git.lumc.nl/groups/rp3
     9}}}
     10
     11''' A note on !Exon/Gene Quantification using LUMC method '''
     12
     13The first step is creating counts for each (meta-)exon (meta exons are identified as overlapping exons) in the (meta-exon) annotation file using bedtools coveragebed (coverageBed) with the BAM file as input. The resulting histogram is converted using the hist2count script which simply counts all the bases for the intervals specified in the hist file (which is the same as the BED file). In the hist file, for a given interval (one single BED record), coverageBed (aka bedtools coverage) outputs the counts for each number of bases. For example, how many bases have 1 coverage, how many have 2, etc. The script simply aggregates this, so that the resulting count file contains the total base counts for each interval.
     14
     15To quantify genes, the region data is joined based on exons associated to the same individual gene only (Rscript). Some exons are associated to multiple genes. These are separately annotated as meta-exons since (with this method) it's not ascertainable which gene has contributed the data. These meta-features are similarly translated to meta-genes. All the results are written to a file which reports the base-count per (meta)gene.
     16
     17This situation:
     18||Gene||`-----BBBBBBBBB-`||
     19||Gene||`-AAAAA---------`||
     20||Exon||`-1-2-3-4-5---6-`||
     21||Coverage||`-8-8-9-9-7---7-`||
     22
     23Results in:
     24||= Gene =||= Exon =||= Base count (total coverage) =||
     25||A||1,2||(8+8)||
     26||AB||3||9||
     27||B||4,5,6||(9+7+7)||
     28
     29''' Example '''
     30
     31[[Image(base_count_example.png)]] 
     32
     33Graphical example of how base counts are calculated. (Courtesy of V. Takhaveev)
     34
     35''' Hist2count output (*.exon.count) '''
     36
     371. chromosome
     382. start position of the feature (exon)
     393. end position of the feature (exon)
     404. total base counts mapped to the feature
     415. normalized base counts for the region (column 4 divided by (column 3 - (column 2 + 1) ) )
     426. gene name
     43
     44
     45''' Tools and versions '''
     46
     47All the required tools for the pipeline are available in a tarball on the grid called toolbox.tar.gz. This tarball is also directly used when the pipeline is running on the grid. The versions of the tools can be found when running, the options we supply are to be found inside the pipeline (align.py). The tarball can be found via:
     48
     49 toolbox.tar.gz::
     50  !srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/bbmri.nl/RP3/mgalen/toolbox.tar.gz
     51
     52''' Combining count files '''
     53
     54Do this is steps if there is too many files. First for all Aflowcells then Bflowcells and merge.[[BR]]
     55For transcript counts:
     56
     57
     58{{{
     59(echo -n $'\t'; for i in B*.count; do echo ${i%.transcript.count}; done | paste -s; bash -c "paste <(cut -d'\"' -f 2  $(ls B*.count | head -1)) $(for i in B*.count; do echo -n " <(cut -d' ' -f 8 $i | sed 's/;//' )"; done)") > matrixB
     60}}}
     61
     62For gene counts:
     63
     64{{{
     65(echo -n $'\t'; for i in B*.count; do echo ${i%.gene.count}; done | paste -s; bash -c "paste <(tail -n +2 $(ls B*.count | head -1) | cut -f 1 ) $(for i in B*.count; do echo -n " <(tail -n +2 $i | cut -f 2)"; done)") > matrixB
     66}}}
     67
     68For exon counts:
     69
     70{{{
     71(echo -n $'\t'; for i in B*.count; do echo ${i%.exon.base.count}; done | paste -s; bash -c "paste <(cut -f 1-3  $(ls B*.count | head -1) | tr '\t' _ | sed 's/^/chr_/') $(for i in B*.count; do echo -n " <(cut -f 4 $i)"; done)") > matrixB
     72}}}
     73
     74See [https://git.lumc.nl/rp3/misc/tree/master/exon_counts_to_ratios this script] for converting exon counts to exon ratios.