= GoNL Immunochip Data Preparation for Concordance = [[TOC]] This page describes the necessary steps to get a VCF Hg19 file containing the GoNL Immunochip data from the raw/QC'ed Immunochip data in PED format. This is using tools as available in early 2011 and should get much simpler when PLINK/Seq is released. Here, the procedure is shown for a FORWARD strand PED file. If you have a TOP/TOP PED file, you will not be able to include strand-ambiguous SNPs (A/T, C/G) in your output. = PED to VCF = The following steps explain how to produce a VCF file from PLINK ped files. It is a rather cumbersome process at the moment and should be streamlined when PLINK/Seq is released. == Initial PED to VCF == The only tool to have an easy (yet not completely correct) conversion from PED to VCF is the beta version of PLINK/Seq available -along with instructions- here: http://www.broadinstitute.org/gsa/wiki/index.php/Converting_ped_to_vcf * plink108 --file in_plink_filename --recode-vcf --out out_filename This pre-compiled version can only be run on Linux64 machines and some dependency problems may occur. == Correct the initial VCF file == The initial VCF file produced by PLINK 1.08 does contain the right information, however it is not actually in VCF format. The problem here is that PLINK files specify the Ref/Alt alleles relative to the dataset where VCF specifies the Ref/Alt alleles relative to the Human Genome Reference it is aligned on. To correct this VCF file, it is necessary to modify the initial VCF file so that the alleles are relative to the Human Genome Reference and not the dataset anymore. === Create a tab-delimited file containing your loci using [http://code.google.com/p/bedtools/ BEDTools] === First it has to be clear that here BED refers to the [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC BED format] and NOT the PLINK binary file format. To be able to sort the alleles with the Human Genome Reference, we need to access it. As it is a big file, [http://code.google.com/p/bedtools/ BEDTools] function ''fastaFromBed'' can extract only the loci of interest (in this case those on the chip) and report them in tab-delimited file. ''fastaFromBed'' needs a [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC BED] file as input. This file is tab-delimited and contains 3 columns: Chrom Start_seq End_seq. As we are only interested in specific loci, Start_seq and End_seq will be 1 base appart so that only the locus of interest is reported in the output file. This file can very easily be generated either from the initial VCF file or the PLINK BIM file: * From VCF: grep -v '!^#' in.vcf | awk '{OFS="\t";print $1,$2,$2+1}' > out.bed * OR (depending on the ref seq) From VCF: grep -v '!^#' in.vcf | awk '{OFS="\t";print $1,$2-1,$2}' > out.bed Once you have the input file, simply run ''fastaFromBed'' on it giving the Human Reference corresponding to the chip data as the other input. For more information on ''fastaFromBed'', see the [http://code.google.com/p/bedtools/ BEDTools] Manual. * fastaFromBed -fi reference.fa -bed in.bed -fo out.fa.tab -tab === Re-arrange Ref/Alt alleles based on the Human Genome Reference === Now that we have the Human Reference Genome loci, it is trivial to re-arrange the alleles so that the Ref and Alt alleles correspond to the Human Genome Reference. I wrote a small script, ''align-vcf-to-ref.pl'' that does the work provided the correct input. Note that when flipping the order of alleles in the VCF Ref/Alt columns, one must also flipped the genotypes correctly. If using ''align-vcf-to-ref.pl'', one also has the option to flip the strand where relevant if the data was not all on forward strand. Note that by doing so, all strand-ambiguous (A/T,C/G) will be removed. === [Optional] Update SNP IDs === Depending on the chip platform, the SNP IDs might not correspond to dbSNP IDs (ex. Illumina Immonchip). Illumina provides a list of the corresponding Illumina-dbSNP names. A small script ''reannotate-vcf-snp-ids.pl'' can update the SNP IDs so that they correspond to dbSNP based on the Illumina list (not on SNP location). === [Optional] Flip Strand === A small script, ''flip-vcf-snp.pl'', is available in case you have the following: * A VCF file coming from a TOP/TOP PLINK file set * A BIM file corresponding to the same dataset but in forward strand The script can be used to flip the strand according to the BIM file. == Liftover file == The last step in preparing the immunochip data for comparison with the sequence data is to liftover the VCF file to the same Human Genome Reference as the Sequence data so that comparisons can be made. Here is how: 1. Get the appropriate chain files * From the [ftp://gsapubftp-anonymous@ftp.broadinstitute.org Broad GSA ftp] (password is blank) * From the [http://hgdownload.cse.ucsc.edu/downloads.html#human USCS Genome Browser] # 1. Get the appropriate fasta files (you'll need both from-and-to builds fasta files) * From [ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ NCBI] * From [http://hgdownload.cse.ucsc.edu/downloads.html#human USCS Genome Browser] 1. Index the fasta files appropriately to get .fai (samtools) and .dict files (Picard) as described on the [http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_the_essential_GATK_input_files:_the_reference_genome GATK wiki] 1. Get and run the [http://www.broadinstitute.org/gsa/wiki/index.php/LiftOverVCF.pl GATK LiftOver tool] 1. IMPORTANT: Check your VCF file header as some versions of the GATK liftover tool (e.g. v1.0.5083) might mix the individuals in the header (sort alphabetically rather than preserve original order). If the order is changed, then you should copy/paste the original order from the source VCF file. Note that once the liftover VCF has successfully been created, it can be used to liftover the PLINK files. To do so: 1. Remove all SNPs that are not present in the new reference VCF file (using plink --extract) 1. Use the liftover VCF as an input to the ''liftover-bim.pl'' tool . == Downloads == All tools developed inhouse at UMCG are available here: [[http://www.bbmriwiki.nl/svn/ped_to_vcf]]