== Introduction ==

The purpose of this run is to test the efficiency of the existing imputation pipelines in the Grid.

== Datasets ==

=== Reference ===

The reference dataset was created from the raw VCF data of the 1000 Genomes Project.

 * Download the VCF files from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521
 * Export only the SNPs (filtering out the indels and SVs) from the VCF data using [http://vcftools.sourceforge.net/ vcftools] and convert them to impute2 format (hap and legend files):
{{{
vcftools \
  --gzvcf ALL.chr1.phase1_release_v2.20101123.snps_indels_svs.vcf.gz \
  --keep-INFO LCSNP --keep-INFO EXSNP --keep-INFO SNP \
  --IMPUTE \
  --out ALL.chr1.phase1_release_v2.20101123.snps_indels_svs.
}}}
 * Alternatively, we could have used the 1000 Genomes reference panel in impute2 format (legend and hap files) from the impute2 website: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download_reference_data

=== Study panel ===

The study panel is an artificial genotype dataset. It contains the SNP set of the Illumina Hap550 platform. To generate it we followed these steps:

 * Download the genetic map of the b37 release of the human genome from impute2: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download_reference_data
 * Download and install hapgen2: https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html
 * Download the list of SNPs in the Hap550 platform:
   * Go to: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
   * From the Table combo box select: snpArrayIllumina550
   * Click the "get output" button
 * For hapgen2 to run, we need to specify at least one causal SNP. We selected the first SNP of the dataset (in our case 16539175).
   So that this SNP does not add any genome-wide variation, we set its odds ratio to 1.0:
{{{
hapgen2 \
  -h ALL.chr6.merged_beagle_mach.20101123.snps_indels_svs.genotypes.exported.impute.hap \
  -l ALL.chr6.merged_beagle_mach.20101123.snps_indels_svs.genotypes.exported.impute.legend \
  -m genetic_map_chr6_combined_b37.txt \
  -o chr6 \
  -dl 16539175 1 1.0 1.0 \
  -n 5000 0
}}}
 * This command requires 80 GB of memory.
 * Filter the generated gen files down to the SNPs in the Hap550 file. According to the hapgen2 documentation, you can supply a list of SNPs and restrict the generated files to those SNPs (the -t option). I didn't use this option; instead I filtered the generated data afterwards to the Hap550 positions, in order to check whether the positions indeed matched.

Some stats:
{{{
SNPs in 1000Genomes and in Hap550: 546,233
SNPs in 1000Genomes but not in Hap550: 36,265,509
SNPs not in 1000Genomes but in Hap550: 935
}}}
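The post-hoc filtering and the overlap counts above could be reproduced roughly as follows. This is only a sketch, not the script that was actually used. The function names are hypothetical, and it assumes that the gen files are whitespace-separated with the base-pair position in the third column, and that each position list is a plain text file with one position per line.

```shell
#!/bin/sh
# Illustrative helpers only; the function names are hypothetical and are not
# part of hapgen2, impute2 or vcftools.

# Print the rows of a gen file whose position is listed in a positions file.
# Assumption: whitespace-separated gen rows with the base-pair position in
# column 3, and a positions file with one position per line (column 1).
# Usage: filter_gen_by_positions POSITIONS_FILE GEN_FILE
filter_gen_by_positions() {
    awk 'FILENAME == ARGV[1] { keep[$1] = 1; next }
         ($3 in keep)' "$1" "$2"
}

# Print three counts for two position lists A and B, in this order:
# "in both", "only in A", "only in B". Duplicate positions count once.
# Usage: count_overlap LIST_A LIST_B
count_overlap() {
    awk 'FILENAME == ARGV[1] { a[$1] = 1; next }
         { b[$1] = 1 }
         END {
             both = 0; only_a = 0; only_b = 0
             for (p in a) { if (p in b) { both++ } else { only_a++ } }
             for (p in b) { if (!(p in a)) { only_b++ } }
             print both, only_a, only_b
         }' "$1" "$2"
}
```

With the 1000 Genomes positions as list A and the Hap550 positions as list B, the three numbers printed by count_overlap correspond to the three counts listed in the stats. Both helpers match on position alone, so they should be run one chromosome at a time; matching across chromosomes would need the chromosome added to the key.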