Changes between Version 3 and Version 4 of DeNovoVariationPipeline


Ignore:
Timestamp:
Sep 26, 2010 9:54:37 PM (14 years ago)
Author:
Yurii Aulchenko
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • DeNovoVariationPipeline

    v3 v4  
    1 [[TOC()]]
     1= De-novo variation pipeline =
    22
    3 = Is discovery of ''de-novo'' mutations feasible in GvNL data? =
     3This pipeline aims to discover and verify the 'de-novo' mutations
    44
    5 Following the discussion of Yurii Aulchenko and Kai Ye, 2010.09.08 – 2010.09.12
     5== Summary ==
    66
    7 == PROBLEM ==
     7'''Status''': idea
    88
    9 When a ‘de-novo’ mutation occurs, we may see the following picture:
     9'''Contributors''': Yurii, Kai, Morris
    1010
    11 Reads in one of the parents (r: reference. A: alternative variant)
    12 {{{
    13 rrrrrrrrrr
     11'''Timeline''': TBE
    1412
    15 rrrrrrrrrr
     13'''Resources''': TBE
    1614
    17 rrrrrrrrrr
     15'''Depends on''': availability of FASTQ (hard) and VCF (soft) data, ChipBasedQcPipeline, MendelianQcPipeline
    1816
    19 rrrrrrrrrr
     17'''Other projects depending on this''':  no, this is an end-project
    2018
    21 rrrrrrrrrr
     19== Aims and Deliverables ==
    2220
    23 rrrrrrrrrr
    24 }}}
     21 * Establish custom 'de-novo' discovery pipeline
     22 * Identify and verify a number of 'de-novo' mutations
     23 * Characterize ...
    2524
    26 Other parent reads
     25== Idea ==
    2726
    28 {{{
    29 rrrrrrrrrr
     27Because GvNL will do sequencing at 12x, identification of 'de-novo' variants based on simplistic Mendelian checks (see MendelianQcPipeline) is likely to lead to hundreds of thousands of variant, only few of which are truly 'de-novo'. A couple of ideas which may help solving the problem is listed in DeNovoVariationPipelineIdea.
    3028
    31 rrrrrrrrrr
     29BURNING: need to decide what line to follow and come up with realistic plan and estimate for resources needed!
    3230
    33 rrrrrrrrrr
     31== Workflow ==
    3432
    35 rrrrrrrrrr
     33Automated workflow (will be) provided in DeNovoVariationPipelineWorkflow page.
    3634
    37 rrrrrrrrrr
    38 
    39 rrrrrrrrrr
    40 
    41 rrrrrrrrrr
    42 
    43 }}}
    44 
    45 Reads in offspring
    46 
    47 {{{
    48 rrrrrArrrr
    49 
    50 rrrrrArrrr
    51 
    52 rrrrrArrrr
    53 
    54 rrrrrArrrr
    55 
    56 rrrrrArrrr
    57 
    58 rrrrrrrrrr
    59 
    60 rrrrrrrrrr
    61 
    62 rrrrrrrrrr
    63 
    64 rrrrrrrrrr
    65 
    66 }}}
    67 
    68 The problem is that actually either of the parents can be a heterozygous carrier of “A”, and we have missed this allele just by chance (chance to miss it is estimated to be approximately ~1%). This means we will see above described scenario in tens of thousands of locations. Given expected number of ‘de novo’ mutations in a person is ~50, verification step will present a major problem.
    69 
    70 We see there may be two (potentially complementary) ways out with this problem.
    71 
    72 == (1) COVERAGE ==
    73 
    74 For few trios, increase coverage (potentially only in parents, or even one parent); this will decrease the chance that we miss a heterozygote. We did calculations of what coverage should be so we get chance of het missing becoming comparable to the mutation rate; e.g. we aim chance het missing = 1e-8 or so (see Box 1 below for computations). It appears that at ~32x only half of the situations described above will be attributable to inadequate coverage, while other half will be true ‘de novo’ mutations.
    75 
    76 
    77 {{{
    78 Box 1
    79 
    80 Assume the heterozygote call is made when at least two reads show the variant.
    81 Let us also assume for the moment that coverage is always Nx. Denote reference
    82 sequence as “R” and alternative as “A”, so in fact the person is R/A. Let us compute
    83 the probability that we miss this heterozygote (i.e. will call it A/A or R/R):
    84 
    85 P(call R/A as R/R or A/A) = P(all N read R) + P([N-1] reads are R, and 1 read is A)
    86                                                 + P(all N read A) + P([N-1] reads are A, and 1 read is R)
    87 
    88 Assuming that probability of reads follows binomial distribution, we get
    89 
    90 P(call R/A as R/R or A/A) = 2*(N+1)*(1/2)!^N
    91 
    92 P(call R/A as R/R or A/A) ~ 1e-8 at N ~ 32
    93 }}}
    94 
    95 == (2) Exploit tagging of same-window reads ==
    96 
    97 Basically to detect ‘de novo’ we need a situation when WITHIN THE SAME READ WINDOW (or paired-end read window) both parental chromosomes are tagged by a variant, and we see the third variant appearing in this context in child only
    98 
    99 Below is a naïve example of a situation when we could be able to detect ‘de novo’ mutation “C” (in red). Note that this is only one situation when we can clearly see that “C” is ‘de novo’. More situations can be worked out following the same logic.
    100 
    101 r: reference. A, B: alternative variants tagging the sequence.
    102 
    103 Reads in one of the parents
    104 
    105 {{{
    106 rrrrrArrrr
    107 
    108 rrrrrArrrr
    109 
    110 rrrrrArrrr
    111 
    112 rrrrrArrrr
    113 
    114 rrrrrArrrr
    115 
    116 rrrrrArrrr
    117 
    118 rrrrrrrrrr
    119 
    120 rrrrrrrrrr
    121 
    122 rrrrrrrrrr
    123 
    124 rrrrrrrrrr
    125 
    126 rrrrrrrrrr
    127 
    128 rrrrrrrrrr
    129 
    130 }}} 
    131 
    132 Other parent reads
    133 
    134 {{{
    135 
    136 rBrrrrrrrr
    137 
    138 rBrrrrrrrr
    139 
    140 rBrrrrrrrr
    141 
    142 rBrrrrrrrr
    143 
    144 rBrrrrrrrr
    145 
    146 rrrrrrrrrr
    147 
    148 rrrrrrrrrr
    149 
    150 rrrrrrrrrr
    151 
    152 rrrrrrrrrr
    153 
    154 rrrrrrrrrr
    155 
    156 rrrrrrrrrr
    157 
    158 rrrrrrrrrr
    159 
    160 }}}
    161 
    162 Reads in offspring
    163 
    164 {{{
    165 
    166 rrCrrArrrr
    167 
    168 rrCrrArrrr
    169 
    170 rrCrrArrrr
    171 
    172 rrCrrArrrr
    173 
    174 rrCrrArrrr
    175 
    176 rrCrrArrrr
    177 
    178 rrrrrrrrrr
    179 
    180 rrrrrrrrrr
    181 
    182 rrrrrrrrrr
    183 
    184 rrrrrrrrrr
    185 
    186 rrrrrrrrrr
    187 
    188 rrrrrrrrrr
    189 
    190 }}} 
    191 
    192 To address whether this is realistic scenario under which we can detect de novo mutations, we need to answer the question about probability that, given ‘de novo’ mutation occurs, what is the chance we will see that mutation in at least four reads (it is clear that for ‘de novo’ we must use more stringent calling criteria) and that in at least two of these 4 reads we will also see a heterozygote coming from a parent. Computations estimating this chance are provided below in the Box 2.
    193 
    194 From these computations, it appears that the chance to see ‘de novo’ in 4 or more reads, and see an existing (transmitted from a parent) variant in at least two of these reads is about 0.09. Thus, using outlined strategy we will be able to detect several de novo mutations per trio offspring, translating to hundreds (or thousands) de novo described from the whole data set. Note that in above we ignored the paired-end nature of our sequencing data, which, when properly accounted for, would probably double the numbers of detectable de novo mutations. Next, if cross-reads phasing works accurately and at longer distances this will allow to bring this proportion even higher.
    195 
    196 {{{
    197 Box 2
    198 
    199 The probability that we see a ‘de novo’ in at least 4 reads out of 12 is 0.93.
    200 The chance that an existing heterozygous site is covered in the same read
    201 can be computed assuming the read length of 100, uniform distribution of
    202 the read-start position across the genome, and heterozygote probability of
    203 1/300 per site (Kai). Assume the ‘worst’ scenario of exactly 4 reads with
    204 ‘de novo’, what is the chance that in at least two of them we will see an
    205 existing heterozygote?
    206 
    207 Denoting the ‘de novo’ position in the read as 0, the ‘coverable’ position
    208 of a heterozygote may vary from -99 to +99. The chance that a heterozygote
    209 at +99 is included in the read is 0.01; if heterozygote is at +1, the chance is
    210 0.99. Thus, for a heterozygote at position ‘j’ (j in -99 to -1 and 1 to 99) the
    211 chance to be included in the read is (1-abs(j/100)). We assume that a chance
    212 to have a ‘linked’ alternative variant at a position is ½ * 1/300 = 1/600. Thus
    213 the probability to detect a ‘linked’ variant in at least two reads out of 4 is:
    214 
    215 P(see variant in >=2 reads)
    216 
    217 = P(variant is at -99) * P(see variant in >=2 reads | variant is at -99)
    218    + P(variant is at -98) * P(see variant in >=2 reads | variant is at -98) +
    219    … + P(variant is at +99) * P(see variant in >=2 reads | variant is at +99)
    220 
    221 = 1/600 (P(see variant in >=2 reads | variant is at -99)
    222    + … P(see variant in >=2 reads | variant is at +99))
    223 
    224 = 1/600 [ 2 * SUM_{j=(1,99)} SUM_{k=2,4} (1-j/100)^k * (j/100)^(4-k) ]
    225 
    226 Evaluation of this expression gives 
    227 
    228 P(see variant in >=2 reads) = 0.09
    229 
    230 Thus the joint probability to see ‘de novo’ in >=4 reads and see an
    231 established variant transmitted from a parent in at least 2 of these reads
    232 is 0.93*0.09 = 0.086.
    233 }}}
    234 
    235 == Conclusions ==
    236 
    237 From above computations it looks like both increasing coverage in parents (and may be offspring) of selected trios to >32x, and exploitation of information from the same reads may make detection of ‘de novo’ variants feasible.
    238 
    239 Note that above computations cannot be considered as final; multiple assumptions and approximations were made, it will depend on goodness of these assumptions and approximations how far off the true answer the provided figures are. Still, the true answer should be the same order of magnitude – which was our initial aim – to see if ‘de novo’ project looks at all feasible or not.
    240 
    241 Before any of these lines can be followed up, thorough discussion and further computations / simulations should be done to relax the assumption that N is constant (use N ~ Poisson(Lambda) instead) and also taking into account error probability (not considered above).
    242 
    243  
    244 
    245