Changes between Version 3 and Version 4 of DeNovoVariationPipeline

Timestamp: Sep 26, 2010 9:54:37 PM (14 years ago)
Author: Yurii Aulchenko
Comment: --

DeNovoVariationPipeline

-                      v3
+                      v4
+[[TOC()]]
+= De-novo variation pipeline =
+This pipeline aims to discover and verify 'de-novo' mutations.
+== Summary ==
+'''Status''': idea
+'''Contributors''': Yurii, Kai, Morris
+'''Timeline''': TBE
+'''Resources''': TBE
+'''Depends on''': availability of FASTQ (hard) and VCF (soft) data, ChipBasedQcPipeline, MendelianQcPipeline
+'''Other projects depending on this''': no, this is an end-project
+== Aims and Deliverables ==
+ * Establish custom 'de-novo' discovery pipeline
+ * Identify and verify a number of 'de-novo' mutations
+ * Characterize ...
+== Idea ==
+Because GvNL will sequence at 12x, identification of 'de-novo' variants based on simplistic Mendelian checks (see MendelianQcPipeline) is likely to yield hundreds of thousands of candidate variants, only a few of which are truly 'de-novo'. A couple of ideas that may help solve the problem are listed in DeNovoVariationPipelineIdea.
+BURNING: need to decide what line to follow and come up with a realistic plan and an estimate of the resources needed!
+== Workflow ==
+Automated workflow (will be) provided in DeNovoVariationPipelineWorkflow page.
-= Is discovery of ''de-novo'' mutations feasible in GvNL data? =
-Following the discussion of Yurii Aulchenko and Kai Ye, 2010.09.08 – 2010.09.12
-== PROBLEM ==
-When a ‘de-novo’ mutation occurs, we may see the following picture:
-Reads in one of the parents (r: reference. A: alternative variant)
-{{{
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-}}}
-Other parent reads
-{{{
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-}}}
-Reads in offspring
-{{{
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-}}}
-The problem is that either parent may actually be a heterozygous carrier of “A” whose alternative allele we missed just by chance (the chance of missing it is estimated at roughly 1%). This means we will see the scenario described above at tens of thousands of locations. Given that the expected number of ‘de novo’ mutations per person is ~50, the verification step will present a major problem.
-We see two (potentially complementary) ways out of this problem.
-== (1) COVERAGE ==
-For a few trios, increase coverage (potentially only in the parents, or even in one parent); this will decrease the chance that we miss a heterozygote. We calculated the coverage needed for the chance of missing a heterozygote to become comparable to the mutation rate, i.e. about 1e-8 (see Box 1 below for the computations). It appears that at ~32x only about half of the situations described above would be attributable to inadequate coverage, while the other half would be true ‘de novo’ mutations.
-{{{
-Box 1
-Assume the heterozygote call is made when at least two reads show the variant.
-Let us also assume for the moment that coverage is always Nx. Denote reference
-sequence as “R” and alternative as “A”, so in fact the person is R/A. Let us compute
-the probability that we miss this heterozygote (i.e. will call it A/A or R/R):
-P(call R/A as R/R or A/A) = P(all N read R) + P([N-1] reads are R, and 1 read is A)
-                                                + P(all N read A) + P([N-1] reads are A, and 1 read is R)
-Assuming that probability of reads follows binomial distribution, we get
-P(call R/A as R/R or A/A) = 2*(N+1)*(1/2)^N
-P(call R/A as R/R or A/A) ~ 1e-8 at N ~ 32
-}}}
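The Box 1 expression can be checked numerically. The following is a minimal sketch, assuming a fixed coverage N and error-free reads, as Box 1 does; the function names `miss_probability` and `min_coverage` are illustrative and not part of any pipeline. The cap at 1 is an addition here, because the closed form double-counts outcomes at very low N.

```python
def miss_probability(n_reads: int) -> float:
    """Chance that a true R/A heterozygote is called R/R or A/A.

    Uses the Box 1 expression 2*(N+1)*(1/2)^N for fixed coverage N.
    The cap at 1.0 is an assumption added here: for N < 3 the closed
    form counts some read configurations twice and exceeds 1.
    """
    return min(1.0, 2 * (n_reads + 1) * 0.5 ** n_reads)


def min_coverage(target: float) -> int:
    """Smallest fixed coverage N whose miss probability is <= target."""
    n = 1
    while miss_probability(n) > target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (12, 20, 32):
        print(f"N = {n:2d}: P(miss het) = {miss_probability(n):.3g}")
    print("first N with P(miss het) <= 1e-8:", min_coverage(1e-8))
```

Under this expression the miss probability is about 0.6% at 12x and about 1.5e-8 at 32x, which is the "~1e-8 at N ~ 32" of Box 1; a strict threshold of 1e-8 is first met at N = 33.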
-== (2) Exploit tagging of same-window reads ==
-Basically, to detect a ‘de novo’ mutation we need a situation in which, WITHIN THE SAME READ WINDOW (or paired-end read window), both parental chromosomes are tagged by a variant and a third variant appears in this context in the child only.
-Below is a naïve example of a situation in which we would be able to detect the ‘de novo’ mutation “C”. Note that this is only one situation in which we can clearly see that “C” is ‘de novo’; more situations can be worked out following the same logic.
-r: reference. A, B: alternative variants tagging the sequence.
-Reads in one of the parents
-{{{
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrArrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-}}}
-Other parent reads
-{{{
-rBrrrrrrrr
-rBrrrrrrrr
-rBrrrrrrrr
-rBrrrrrrrr
-rBrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-}}}
-Reads in offspring
-{{{
-rrCrrArrrr
-rrCrrArrrr
-rrCrrArrrr
-rrCrrArrrr
-rrCrrArrrr
-rrCrrArrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-rrrrrrrrrr
-}}}
-To assess whether this is a realistic scenario for detecting de novo mutations, we need to answer the following question: given that a ‘de novo’ mutation has occurred, what is the chance that we will see it in at least four reads (clearly, for ‘de novo’ calls we must use more stringent calling criteria) and that in at least two of these four reads we will also see a heterozygous variant inherited from a parent? Computations estimating this chance are provided below in Box 2.
-From these computations, it appears that the chance of seeing a ‘de novo’ mutation in 4 or more reads, and of seeing an existing (transmitted from a parent) variant in at least two of these reads, is about 0.09. Thus, using the outlined strategy we will be able to detect several de novo mutations per trio offspring, translating to hundreds (or thousands) of de novo mutations described from the whole data set. Note that the above ignores the paired-end nature of our sequencing data, which, when properly accounted for, would probably double the number of detectable de novo mutations. Furthermore, if cross-read phasing works accurately and at longer distances, this proportion could be brought even higher.
-{{{
-Box 2
-The probability that we see a ‘de novo’ in at least 4 reads out of 12 is 0.93.
-The chance that an existing heterozygous site is covered in the same read
-can be computed assuming the read length of 100, uniform distribution of
-the read-start position across the genome, and heterozygote probability of
-1/300 per site (Kai). Assume the ‘worst’ scenario of exactly 4 reads with
-‘de novo’, what is the chance that in at least two of them we will see an
-existing heterozygote?
-Denoting the ‘de novo’ position in the read as 0, the ‘coverable’ position
-of a heterozygote may vary from -99 to +99. The chance that a heterozygote
-at +99 is included in the read is 0.01; if heterozygote is at +1, the chance is
-0.99. Thus, for a heterozygote at position ‘j’ (j in -99 to -1 and 1 to 99) the
-chance to be included in the read is (1-abs(j/100)). We assume that a chance
-to have a ‘linked’ alternative variant at a position is ½ * 1/300 = 1/600. Thus
-the probability to detect a ‘linked’ variant in at least two reads out of 4 is:
-P(see variant in >=2 reads)
-= P(variant is at -99) * P(see variant in >=2 reads | variant is at -99)
-   + P(variant is at -98) * P(see variant in >=2 reads | variant is at -98) +
-   … + P(variant is at +99) * P(see variant in >=2 reads | variant is at +99)
-= 1/600 (P(see variant in >=2 reads | variant is at -99)
-   + … P(see variant in >=2 reads | variant is at +99))
-= 1/600 [ 2 * SUM_{j=(1,99)} SUM_{k=2,4} (1-j/100)^k * (j/100)^(4-k) ]
-Evaluation of this expression gives
-P(see variant in >=2 reads) = 0.09
-Thus the joint probability to see ‘de novo’ in >=4 reads and see an
-established variant transmitted from a parent in at least 2 of these reads
-is 0.93*0.09 = 0.086.
-}}}
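The Box 2 figures can be reproduced with a short script. This is a sketch that evaluates the Box 2 expression exactly as written, under the same assumptions (read length 100, 12 reads, linked-variant rate 1/600 per site); all function and parameter names are illustrative. The optional `with_binomial_coefficient` flag is not part of Box 2: it adds the factor C(4,k), the number of ways to choose which k of the 4 reads cover the site, to show how sensitive the result is to that term.

```python
from math import comb


def p_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def linked_variant_probability(
    read_length: int = 100,
    n_reads: int = 4,
    k_min: int = 2,
    site_rate: float = 1 / 600,
    with_binomial_coefficient: bool = False,
) -> float:
    """Evaluate 1/600 * [2 * SUM_j SUM_k (1-j/L)^k * (j/L)^(n-k)] from Box 2.

    with_binomial_coefficient=True multiplies each term by C(n_reads, k);
    this variant is an assumption of this sketch, not the Box 2 formula.
    """
    total = 0.0
    for j in range(1, read_length):  # offsets 1..99; factor 2 covers -99..-1
        p_cover = 1 - j / read_length  # chance that one read includes the site
        for k in range(k_min, n_reads + 1):
            weight = comb(n_reads, k) if with_binomial_coefficient else 1
            total += weight * p_cover**k * (1 - p_cover) ** (n_reads - k)
    return site_rate * 2 * total


if __name__ == "__main__":
    p_reads = p_at_least(4, 12)
    p_linked = linked_variant_probability()
    print(f"P(de novo in >= 4 of 12 reads) = {p_reads:.3f}")
    print(f"P(linked variant in >= 2 of 4) = {p_linked:.3f}")
    print(f"joint probability              = {p_reads * p_linked:.3f}")
    print(f"with C(4,k) term               = "
          f"{linked_variant_probability(with_binomial_coefficient=True):.3f}")
```

Evaluated as written, the expression gives about 0.093, and the joint probability about 0.086, matching Box 2. With the binomial coefficient included the inner sum comes out near 0.2, so the 0.09 figure may be conservative; this should be checked against the intended derivation before it is relied upon.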
-== Conclusions ==
-From the above computations, it looks like both increasing coverage in the parents (and maybe the offspring) of selected trios to >32x and exploiting information from the same reads may make detection of ‘de novo’ variants feasible.
-Note that the above computations cannot be considered final: multiple assumptions and approximations were made, and how far the provided figures are from the true answer depends on how good these assumptions and approximations are. Still, the true answer should be of the same order of magnitude, which was our initial aim: to see whether the ‘de novo’ project looks feasible at all.
-Before either of these lines can be followed up, thorough discussion and further computations / simulations are needed to relax the assumption that N is constant (using N ~ Poisson(Lambda) instead) and to take into account the error probability (not considered above).
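As a first step towards relaxing the constant-N assumption, the Box 1 miss probability can be averaged over Poisson-distributed coverage. The sketch below assumes N ~ Poisson(Lambda), independent error-free reads, and the Box 1 calling rule; the function names and the truncation point `n_max` are illustrative choices, and sequencing error is still ignored.

```python
from math import exp


def miss_probability(n_reads: int) -> float:
    """Box 1 expression 2*(N+1)*(1/2)^N, capped at 1 (cap is an assumption)."""
    return min(1.0, 2 * (n_reads + 1) * 0.5 ** n_reads)


def miss_probability_poisson(mean_coverage: float, n_max: int = 500) -> float:
    """Average the fixed-N miss probability over N ~ Poisson(mean_coverage).

    The Poisson probabilities are built iteratively, P(N=n+1) = P(N=n)*lambda/(n+1),
    to avoid overflow from large powers and factorials. The sum is truncated
    at n_max, which is far into the tail for realistic coverages.
    """
    pmf = exp(-mean_coverage)  # P(N = 0)
    total = 0.0
    for n in range(n_max + 1):
        total += pmf * miss_probability(n)
        pmf *= mean_coverage / (n + 1)
    return total


if __name__ == "__main__":
    for lam in (12.0, 32.0, 60.0):
        print(f"mean coverage {lam:4.0f}x: "
              f"fixed N = {miss_probability(int(lam)):.3g}, "
              f"Poisson N = {miss_probability_poisson(lam):.3g}")
```

Under these assumptions the average is dominated by low-coverage sites: at a mean of 12x the miss probability is about 3.4% rather than 0.6%, and at a mean of 32x it is about 3.8e-6 rather than 1.5e-8. If this model is roughly right, the coverage needed to reach ~1e-8 would be considerably higher than 32x, which is the kind of effect the further simulations should quantify.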