Version 3 (modified by Morris Swertz, 14 years ago)


Analysis pipeline

BGI will provide us with files in the standard SAM/BAM format containing the raw, uncleaned reads. These files are the main input for our pipeline. We will construct a data processing and analysis pipeline based on the recent experience of the 1000 Genomes Project, using existing tools where available (such as GATK, SAMtools and Picard).

See the Genome Analysis Pipeline document drafted by Morris for the individual steps: alignment, quality score recalibration and indel cleaning. The aim is to compare the outputs (sites called, genotypes called) of this pipeline with those of BGI.
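The comparison of sites and genotypes called could be sketched roughly as follows. This is only an illustration, not part of the pipeline document: it assumes both call sets are single-sample VCF text with a GT field, and the function names (parse_vcf_lines, compare_call_sets) and the example records are hypothetical.

```python
def parse_vcf_lines(lines):
    """Map (chrom, pos, ref, alt) -> genotype of the first sample.

    Assumption: tab-separated VCF records with FORMAT in column 9
    and exactly one sample in column 10.
    """
    calls = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        gt = sample_values[format_keys.index("GT")]
        # Treat phased and unphased genotypes alike and ignore allele order.
        alleles = tuple(sorted(gt.replace("|", "/").split("/")))
        calls[(chrom, int(pos), ref, alt)] = alleles
    return calls


def compare_call_sets(ours, theirs):
    """Count shared and unique sites, and genotype agreement at shared sites."""
    shared = ours.keys() & theirs.keys()
    matching = sum(1 for site in shared if ours[site] == theirs[site])
    return {
        "only_ours": len(ours.keys() - theirs.keys()),
        "only_theirs": len(theirs.keys() - ours.keys()),
        "shared": len(shared),
        "genotype_concordance": matching / len(shared) if shared else None,
    }


# Made-up records for illustration only.
our_calls = parse_vcf_lines([
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1",
    "1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0/1",
])
bgi_calls = parse_vcf_lines([
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1|0",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/1",
    "1\t400\t.\tT\tC\t.\tPASS\t.\tGT\t0/1",
])
summary = compare_call_sets(our_calls, bgi_calls)
print(summary)
```

Keying sites on chromosome, position, ref and alt means that a different alternate allele at the same position counts as a different site. A real comparison would also need to handle multi-sample files, multi-allelic records and indel normalisation.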

Action items for pilot data (URGENT):

Input: QC'd read data (SAM/BAM format) with genotypes (in VCF format) from BGI

  • Genome-wide coverage (? 12X)
  • Variant sites called (all, novel, dbSNP, HapMap, 1KG pilot 2): number and Ti/Tv ratio (as a crude proxy for the false positive rate)
  • Concordance with immunochip data (have genotypes been deposited?). Special focus on low-frequency SNPs. Need to pay attention to the genotype calling of the immunochip data.
  • Check genotypes for Mendelian errors (trio problems?)
  • Visual inspection of variant sites using IGV
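
Two of the checks above, the Ti/Tv ratio and the Mendelian error check, can be illustrated with a minimal sketch. The function names and example values are hypothetical, and the Mendelian check assumes autosomal, diploid genotypes in a father/mother/child trio. For human whole-genome SNP calls a Ti/Tv ratio of around 2 is commonly expected, whereas random errors push the ratio toward 0.5, which is why it serves as a crude proxy for the false positive rate.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    transitions = transversions = 0
    for ref, alt in snps:
        if ref == alt:
            continue  # not a substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None


def is_mendelian_consistent(child, father, mother):
    """Check that one child allele can come from each parent.

    Genotypes are 2-tuples of allele labels, e.g. ("0", "1").
    Assumption: autosomal diploid site with no missing calls.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


# Made-up values for illustration only.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"),
                ("A", "C"), ("T", "C"), ("G", "T")]
ratio = titv_ratio(example_snps)  # 4 transitions, 2 transversions
ok_trio = is_mendelian_consistent(("0", "1"), ("0", "0"), ("1", "1"))
bad_trio = is_mendelian_consistent(("1", "1"), ("0", "0"), ("0", "1"))
print(ratio, ok_trio, bad_trio)
```

In practice the ratio would be computed separately for each category of variant site (all, novel, dbSNP and so on) so that the categories can be compared.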

Action items for main project:

Input: raw read data (SAM/BAM format)

  • This needs a fully working pipeline as detailed in Morris' analysis document
  • More details to come
