Changes between Version 2 and Version 3 of ImputationPipeline

Sep 13, 2010 4:37:29 PM (14 years ago)



TODO: describe the protocols here;

== Description from Harm-Jan ==
The imputation pipeline has been reduced to only a few steps. To facilitate the QC and conversion steps, I've bundled our conversion tools into a single program called ImputationTool.jar.

Below I briefly describe the steps of the new pipeline. Using placeholders, I also show what the commands could look like if you implemented this in a shell script (or Java program). Together, these examples can serve as the complete execution steps of the pipeline.
'' Commands to run locally: ''
 1. If the dataset is in binary PLINK format, use plink --recode to convert it back to ped+map.
 2. Convert the dataset to TriTyper format, if it is in ped+map format.
java -Xmx4g -jar ImputationTool.jar pmtt $plinkLocation $trityperOutputLocation
 3. Compare the dataset to be imputed to the reference dataset (for example HapMap2 release 24, also in TriTyper format), and remove any SNPs whose haplotypes differ from, or do not correlate with, the reference dataset. Also remove any SNP that is not in the reference. Save the output as ped+map.
java -Xmx4g -jar ImputationTool.jar ttpmh $trityperOutputLocation $referenceLocation $pedAndMapOutputLocation [$famFile] # the fam file is optional; supply one if you have it
 4. Split the ped files into batches of 300 samples.
  * mkdir -p "$datasetLocation"/batches/
  * split -a2 -l$batchSize $pedAndMapOutputLocation $batchOutputLocation
 5. Run linkage2beagle to convert the ped and map files to Beagle format.
for each batch
    java -Xmx7g -jar linkage2beagle.jar data=$batchOutputLocation/chr$chromosome.dat pedigree=$batchOutputLocation/chr$chromosome.ped.$batch beagle=$beagleLocation/chr$chromosome.bgl.$batch
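The batching in step 4 can be tried out on a small scale before running it on real data. The following is a minimal sketch, not part of the original pipeline: it assumes one sample per line in the ped file, and the temporary directory, the fake 650-sample `chr1.ped` file, and the `chr1.ped.` output prefix are hypothetical choices made only for illustration.

```shell
#!/bin/sh
# Sketch: split a ped file into batches of $batchSize samples and list the batches.
# All file names and paths below are hypothetical placeholders.
set -eu

batchSize=300
workDir=$(mktemp -d)
pedFile="$workDir/chr1.ped"
batchDir="$workDir/batches"

# Fake ped file with 650 samples (one line per sample), for demonstration only.
awk 'BEGIN { for (i = 1; i <= 650; i++) print "FAM" i, "IND" i, 0, 0, 1, -9, "A", "A" }' > "$pedFile"

mkdir -p "$batchDir"
# -a2: two-letter suffixes (aa, ab, ...); -l: number of lines (samples) per batch.
split -a2 -l"$batchSize" "$pedFile" "$batchDir/chr1.ped."

# The suffix after the last dot is the batch name used by the later steps.
for batchFile in "$batchDir"/chr1.ped.*; do
  batch=${batchFile##*.}
  echo "batch $batch: $(wc -l < "$batchFile" | tr -d ' ') samples"
done
```

With the 650 fake samples this produces batches aa and ab of 300 lines each and a final batch ac of 50 lines. The two-letter suffix is what ends up as $batch in the linkage2beagle command.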
'' Commands to run on the server: ''
 6. Run the actual imputation on the batches on the cluster (this needs HapMap to be recoded to Beagle format as well, but I have these files for you).
for each batch
    java -Xmx11g\$TMPDIR -jar beagle.jar unphased=$beagleLocation/chr$chromosome.bgl.$batch phased=$referenceLocation/HM2_Chr$chromosome-BEAGLE markers=$referenceLocation/markers_Chr$chromosome.txt missing=0 out=$outputLocation/Chr$chromosome/chr$chromosome-$batch
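The "for each batch" loop could be written out as a dry run that only prints the commands, which is a convenient way to check the paths before submitting anything to the cluster. This is a sketch under assumptions: the three location paths, the chromosome list, and the batch suffixes are hypothetical, the outer loop over chromosomes is my own addition, and only the plain -Xmx11g memory setting is used.

```shell
#!/bin/sh
# Dry run: print (do not execute) one beagle.jar command per chromosome/batch pair.
set -eu

beagleLocation=/data/beagle          # hypothetical path
referenceLocation=/data/reference    # hypothetical path
outputLocation=/data/imputed         # hypothetical path

# $1: space-separated chromosomes, $2: space-separated batch suffixes
print_beagle_commands() {
  for chromosome in $1; do
    for batch in $2; do
      echo "java -Xmx11g -jar beagle.jar" \
        "unphased=${beagleLocation}/chr${chromosome}.bgl.${batch}" \
        "phased=${referenceLocation}/HM2_Chr${chromosome}-BEAGLE" \
        "markers=${referenceLocation}/markers_Chr${chromosome}.txt" \
        "missing=0" \
        "out=${outputLocation}/Chr${chromosome}/chr${chromosome}-${batch}"
    done
  done
}

print_beagle_commands "21 22" "aa ab ac"
```

Each printed line is one job, so the output can be piped into a file and handed to whatever job scheduler the cluster uses.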
'' Commands to run locally: ''
 7. Convert the Beagle imputed files into TriTyper format.
java -Xmx4g -jar ImputationTool.jar bttb $outputLocation Chr/ChrCHROMOSOME-BATCH $imputedTriTyperLocation $numSamples
 8. Correlate the imputed SNPs with the SNPs in the original dataset.
java -Xmx4g -jar ImputationTool.jar corr $trityperOutputLocation $datasetName $imputedTriTyperLocation $imputedDatasetName
 9. (If needed) convert to other formats (PLINK dosage / ped+map).
That's basically it. A lot simpler than the previous version, don't you think? The required tool is attached to this e-mail, but it might still be a bit buggy. Any recommendations are therefore more than welcome.
== IMPUTE pipeline ==