[[TOC()]]

= Analysis pipeline =

BGI will provide us with standard SAM/BAM files containing the raw, uncleaned reads. These files are the main input for our pipeline.

We will construct a data processing and analysis pipeline based on recent experience from the 1000 Genomes Project, using existing tools where available (such as GATK, SAMTOOLS, PICARD). See [http://gbic.target.rug.nl/trac/bbmri/raw-attachment/wiki/PipelinePlan/BBMRI-NL%20Genome%20Analysis%20Pipeline%20Manual%202010-08-23.doc Genome Analysis Pipeline document drafted by Morris and Freerk for individual steps for alignment, quality score recalibration, indel cleaning].

The aim is to compare the outputs (sites called, genotypes called) of this pipeline with those from BGI.

== Action items for pilot data: (URGENT) ==

Input: QC'd read data (SAM/BAM format) with genotypes (in VCF format) from BGI

 * Genome-wide coverage (? 12X)
 * Variant sites called (all, novel, dbSNP, HapMap, 1KG pilot 2): number and Ti/Tv ratio (as a crude proxy for the false positive rate)
 * Concordance with immunochip data (have genotypes been deposited?). Special focus on low-frequency SNPs. We need to pay attention to the genotype calling of the immunochip data.
 * Check genotypes for Mendel errors (trio problems?)
 * Visual inspection of variant sites using IGV

== Action items for main project: ==

Input: raw read data (SAM/BAM format)

 * This needs a fully working pipeline as detailed in Morris' analysis document
 * More details to come
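
Two of the pilot checks above, the Ti/Tv ratio and the Mendel error check, can be illustrated with a minimal Python sketch. This is not part of the planned pipeline and does not use GATK, SAMTOOLS or PICARD. The function names `ti_tv_ratio` and `is_mendel_consistent` are hypothetical, and the sketch assumes that SNP alleles and trio genotypes have already been extracted from the VCF files as plain tuples, and that only autosomal, diploid sites are checked.

```python
# Illustrative sketch only: assumes alleles and genotypes were already
# parsed out of the VCF files into plain Python tuples.

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snps):
    """Return the transition/transversion ratio for (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, multi-base alleles and non-variant records.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return ti / tv


def is_mendel_consistent(child, father, mother):
    """Check one trio genotype (autosomal, diploid sites only).

    Each genotype is a pair of alleles, e.g. ("A", "G"). The child is
    consistent if one allele can come from each parent.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


if __name__ == "__main__":
    # Hypothetical toy data, not real calls.
    calls = [("A", "G"), ("C", "T"), ("A", "C")]
    print("Ti/Tv:", ti_tv_ratio(calls))
    print("Consistent:", is_mendel_consistent(("A", "G"), ("A", "A"), ("G", "G")))
```

The reasoning behind using Ti/Tv as a crude proxy: each base has one possible transition and two possible transversions, so randomly distributed false calls would pull the ratio towards 0.5. A call set whose ratio is clearly lower than that of known sites (for example the dbSNP or HapMap subsets) therefore likely contains more false positives. The Mendel check flags individual sites; a high error rate across many sites in one trio would rather suggest a sample or pedigree problem.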