Changes between Initial Version and Version 1 of GoNL_Immunochip_Data_Preparation


Timestamp:
Apr 21, 2011 4:34:21 PM (13 years ago)
Author:
laurent

     1[[TOC]]
     2
     3This page describes the steps needed to produce an Hg19 VCF file containing the GoNL Immunochip data from the raw or QC'ed Immunochip data in PED format. The procedure relies on tools available in early 2011 and should become much simpler once PLINK/Seq is released.
     4
     5Here, the procedure is shown for a FORWARD strand PED file. If you have a TOP/TOP PED file, you will still need to correct for strand.
     6
     7= PED to VCF =
     8The following steps explain how to produce a VCF file from PLINK PED files. The process is rather cumbersome at the moment and should be streamlined when PLINK/Seq is released.
     9
     10== Initial PED to VCF ==
     11The only tool that offers an easy (though not completely correct) conversion from PED to VCF is the beta version of PLINK/Seq, which is available, along with instructions, here: http://www.broadinstitute.org/gsa/wiki/index.php/Converting_ped_to_vcf
     12
     13This pre-compiled version runs only on 64-bit Linux machines, and some dependency problems may occur.
     14
     15The result here is a VCF file that contains all
     16
     17== Correct the initial VCF file ==
     18The initial VCF file produced by PLINK 1.08 does contain the right information, but it is not strictly in VCF format. The problem is that PLINK files specify the Ref/Alt alleles relative to the dataset, whereas VCF specifies them relative to the Human Genome Reference to which the data are aligned. To correct this, the initial VCF file must be modified so that the alleles are expressed relative to the Human Genome Reference rather than the dataset.
     19
     20=== Create a tab-delimited file containing your loci using [http://code.google.com/p/bedtools/ BEDTools] ===
     21Note first that BED here refers to the [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC BED format] and NOT the PLINK binary file format. To order the alleles according to the Human Genome Reference, we need to look up the reference base at each locus. As the reference is a big file, the [http://code.google.com/p/bedtools/ BEDTools] function ''fastaFromBed'' can be used to extract only the loci of interest (in this case those on the chip) and report them in a tab-delimited file.
     22
     23''fastaFromBed'' needs a [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC BED] file as input. This file is tab-delimited and contains 3 columns: Chrom Start_seq End_seq. As we are only interested in specific loci, Start_seq and End_seq will be 1 base apart so that only the locus of interest is reported in the output file. This file can easily be generated from either the initial VCF file or the PLINK BIM file.
     24
     25Once you have the input file, simply run ''fastaFromBed'' on it, supplying the Human Genome Reference build that matches the chip data as the other input. For more information on ''fastaFromBed'', see the [http://code.google.com/p/bedtools/ BEDTools] Manual.
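As an illustration of the BED input described above, here is a minimal Python sketch that derives single-base intervals from a PLINK BIM file. It assumes the standard six-column BIM layout (chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2). The function name `bim_to_bed` and the `chrom_prefix` parameter are hypothetical, and PLINK's numeric codes for the sex chromosomes are not translated here.

```python
def bim_to_bed(bim_lines, chrom_prefix=""):
    """Turn PLINK BIM rows into single-base UCSC BED intervals (sketch)."""
    bed_lines = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, position = fields[0], int(fields[3])
        # UCSC BED is 0-based and half-open: base N is the interval [N-1, N).
        bed_lines.append(f"{chrom_prefix}{chrom}\t{position - 1}\t{position}")
    return bed_lines


# Example rows; chrom_prefix is only needed if the reference FASTA
# names its sequences "chr1", "chr2", ... (an assumption to check).
bim = ["1\trs123\t0\t1000\tA\tG", "2\timm_2_500\t0\t500\tC\tT"]
print("\n".join(bim_to_bed(bim, chrom_prefix="chr")))
```

The resulting file can then be passed to ''fastaFromBed'' with its `-fi` (reference FASTA), `-bed` (intervals), `-fo` (output) and `-tab` (tab-delimited output) options; the actual file names depend on your setup.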
     26
     27=== Re-arrange Ref/Alt alleles based on the Human Genome Reference ===
     27Now that we have the Human Genome Reference bases at the loci of interest, it is straightforward to re-arrange the alleles so that the Ref and Alt alleles correspond to the Human Genome Reference. I wrote a small script, ''align-vcf-to-ref.pl'', that does the work provided it is given the correct input. Note that when swapping the alleles in the VCF Ref/Alt columns, one must also flip the genotypes accordingly.
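The swap described above can be sketched in Python as follows. This is not the ''align-vcf-to-ref.pl'' script itself, only an illustration of the idea under stated assumptions: the site is biallelic, the GT value is the first sub-field of each sample column, and the hypothetical function `align_record_to_ref` receives the reference base for the locus separately. INFO fields that depend on allele order are not updated.

```python
def align_record_to_ref(vcf_line, ref_base):
    """Make REF match the genome reference base, flipping genotypes if needed."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3].upper(), fields[4].upper()
    ref_base = ref_base.upper()
    if ref == ref_base:
        return "\t".join(fields)  # already expressed relative to the reference
    if alt != ref_base:
        return None  # neither allele matches: possibly a strand problem
    # Swap the REF and ALT columns ...
    fields[3], fields[4] = alt, ref
    # ... and swap allele indices 0 and 1 in every sample genotype.
    swap = {"0": "1", "1": "0"}
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        parts[0] = "".join(swap.get(ch, ch) for ch in parts[0])
        fields[i] = ":".join(parts)
    return "\t".join(fields)


record = "1\t1000\trs1\tG\tA\t.\t.\t.\tGT\t0/0\t0/1\t1/1\t./."
print(align_record_to_ref(record, "A"))
```

Records for which the function returns `None` are worth inspecting by hand, since neither allele agrees with the reference base.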
     29
     30=== [Optional] Update SNP IDs ===
     31Depending on the chip platform, the SNP IDs might not correspond to dbSNP IDs (e.g. Illumina Immunochip). Illumina provides a list that maps Illumina SNP names to dbSNP names. A small script, ''reannotate-vcf-snp-ids.pl'', can update the SNP IDs so that they correspond to dbSNP, based on the Illumina list (not on SNP location).
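The renaming step amounts to a lookup on the VCF ID column. The Python sketch below illustrates it; it is not the ''reannotate-vcf-snp-ids.pl'' script, and the function name `reannotate_ids` as well as the example IDs are made up. The mapping is assumed to have been loaded already into a dictionary from Illumina name to dbSNP name.

```python
def reannotate_ids(vcf_lines, id_map):
    """Replace the ID column of each VCF record using a name-to-name mapping."""
    updated = []
    for line in vcf_lines:
        if line.startswith("#"):
            updated.append(line)  # header lines are copied unchanged
            continue
        fields = line.split("\t")
        # Keep the original ID when the mapping has no entry for it.
        fields[2] = id_map.get(fields[2], fields[2])
        updated.append("\t".join(fields))
    return updated


# Hypothetical mapping and records, for illustration only.
id_map = {"imm_1_1000": "rs111"}
vcf = ["##fileformat=VCFv4.0",
       "1\t1000\timm_1_1000\tA\tG\t.\t.\t.",
       "1\t2000\timm_1_2000\tC\tT\t.\t.\t."]
print("\n".join(reannotate_ids(vcf, id_map)))
```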
     32
     33=== [Optional] Flip Strand ===
     34A small script, ''flip-vcf-snp.pl'', is available in case you have the following:
     35* A VCF file coming from a TOP/TOP PLINK file set
     36* A BIM file corresponding to the same dataset but in forward strand
     37The script can be used to flip the strand according to the BIM file.
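A rough Python sketch of the idea behind such a strand flip is given below. It is not the ''flip-vcf-snp.pl'' script; it simply assumes that the forward-strand alleles for a SNP have been read from the BIM file and compares them, as a set, with the alleles in the VCF record. The function name `flip_to_forward` is hypothetical. Note that A/T and C/G SNPs look identical on both strands, so a comparison of this kind cannot resolve them.

```python
# Translation table that complements each base.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def flip_to_forward(vcf_line, forward_alleles):
    """Complement REF/ALT when only the complemented alleles match the BIM."""
    fields = vcf_line.split("\t")
    observed = {fields[3].upper(), fields[4].upper()}
    forward = {allele.upper() for allele in forward_alleles}
    if observed == forward:
        return vcf_line  # alleles already agree with the forward-strand BIM
    flipped = [fields[3].translate(COMPLEMENT), fields[4].translate(COMPLEMENT)]
    if {allele.upper() for allele in flipped} == forward:
        fields[3], fields[4] = flipped
        return "\t".join(fields)
    return None  # alleles disagree on both strands: inspect manually


record = "1\t1000\trs1\tA\tG\t.\t.\t.\tGT\t0/1"
print(flip_to_forward(record, ("T", "C")))
```

Genotypes are left untouched because complementing the alleles does not change which allele is Ref and which is Alt.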
     38
     39= Liftover file =
     40The last step in preparing the Immunochip data for comparison with the sequence data is to lift over the VCF file to the same Human Genome Reference as the sequence data, so that comparisons can be made. The procedure is explained here: [[LiftOver_Genome_Assemblies]]
     41
     42Note that once the lifted-over VCF file has been created successfully, it can be used to lift over the PLINK files. To do so:
     43# Remove all SNPs that are not present in the new reference VCF file (using plink --extract)
     44# Use the lifted-over VCF file as input to the ''liftover-bim.pl'' tool.
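The two steps above can be sketched in Python as follows. This is an illustration only, not the ''liftover-bim.pl'' tool: the function names `snp_ids_for_extract` and `liftover_bim` are hypothetical, SNPs are matched by ID, and chromosome names are copied from the VCF as they are (so a "chr" prefix would need separate handling).

```python
def snp_ids_for_extract(vcf_lines):
    """List the SNP IDs in the VCF, one per entry, for use with plink --extract."""
    return [line.split("\t")[2] for line in vcf_lines if not line.startswith("#")]


def liftover_bim(bim_lines, vcf_lines):
    """Copy chromosome and position from the lifted-over VCF into BIM rows."""
    new_loci = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        chrom, pos, snp_id = line.split("\t")[:3]
        new_loci[snp_id] = (chrom, pos)
    lifted = []
    for line in bim_lines:
        fields = line.split()
        if fields[1] not in new_loci:
            # Dropping rows here would desynchronise the BIM from its genotype
            # file, so the SNP must be removed with plink --extract first.
            raise ValueError(f"{fields[1]} is missing from the VCF")
        fields[0], fields[3] = new_loci[fields[1]]
        lifted.append("\t".join(fields))
    return lifted


vcf = ["#CHROM\tPOS\tID\tREF\tALT", "1\t1500\trs1\tA\tG"]
bim = ["1\trs1\t0\t1000\tA\tG"]
print(snp_ids_for_extract(vcf))
print(liftover_bim(bim, vcf))
```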