Changes between Version 11 and Version 12 of BGIDatasets


Timestamp:
Aug 18, 2011 12:10:45 AM (13 years ago)
Author:
laurent
Comment:

--

  • BGIDatasets

    v11 v12  
    1 = '''Overview BGI datasets''' =
     1= '''BGI Data''' =
     2Below is the description of the pipeline used by BGI for processing the GoNL data on hg19 and a short description of the format of the files delivered by BGI.
    23
    3 TODO: More info on statistics and how to access the dataset will be added here.
    4 BGI will share with us: SNP calling results (VCF), SOAP format for indels.
     4== BGI hg19 Pipeline ==
     5 * Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2
     6   * Parameters based on other projects for indel detection
     7 * Remove duplicates: samtools, merge first then remove duplicates
     8 * SNP Detection: [http://soap.genomics.org.cn/soapsnp.html SOAPsnp]
     9   * Params: -r 0.0005 -3 0.001 -u -2
     10   * Depth > 4x
     11   * Q > 20
     12   * CN < 2
     13   * >5bp between each SNP
     14 * Indel Detection: [http://samtools.sourceforge.net/ samtools] pileup -ivcf
     15   * RMS Qual > 20
     16   * Variation freq >= 0.1
     17   * Consensus quality > Q20
     18   * Supporting reads >= 2
     19   * Results: ~700k per individual
     20 * CNV: [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV_Detector]
     21   * Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90
     22   * Depth-based: the GC% and average sequencing depth of every sliding window of the sequenced genome are counted, and the depth distribution is modeled as a normal distribution with an estimated mean depth and SD for each level of GC content. Regions meeting the criteria below are identified as potential CNVs:
     23     * Depth was significantly different from WG average with same GC content
     24     * Flanking region...
     25   * Results: ~150 CNV > 10k, ~100 > 100k per individual
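The SNP thresholds listed above (depth > 4x, Q > 20, CN < 2, more than 5 bp between SNPs) can be sketched as a small post-filter. This is only an illustration, not BGI's code: the `SnpCall` record and its field names are hypothetical, and dropping both members of a pair that sits too close together is an assumption, since the page does not say which SNP is discarded.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    # Hypothetical record; field names are not taken from the SOAPsnp output.
    chrom: str
    pos: int
    depth: int
    quality: int
    copy_number: float


def passes_thresholds(call, min_depth=4, min_quality=20, max_copy_number=2.0):
    """Apply the per-site criteria: depth > 4x, Q > 20, CN < 2."""
    return (
        call.depth > min_depth
        and call.quality > min_quality
        and call.copy_number < max_copy_number
    )


def filter_snps(calls, min_gap=5):
    """Keep calls that pass the thresholds and lie > min_gap bp from any neighbour.

    Assumption: when two surviving SNPs are too close, both are removed.
    """
    kept = sorted(
        (c for c in calls if passes_thresholds(c)),
        key=lambda c: (c.chrom, c.pos),
    )
    result = []
    for i, call in enumerate(kept):
        prev_close = (
            i > 0
            and kept[i - 1].chrom == call.chrom
            and call.pos - kept[i - 1].pos <= min_gap
        )
        next_close = (
            i + 1 < len(kept)
            and kept[i + 1].chrom == call.chrom
            and kept[i + 1].pos - call.pos <= min_gap
        )
        if not (prev_close or next_close):
            result.append(call)
    return result
```

Whether the spacing rule is meant to apply before or after the per-site thresholds is not stated on this page; the sketch applies it afterwards.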
     26
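The depth-based CNV idea described above (a normal model of window depth per GC level, flagging windows whose depth differs significantly from windows with the same GC content) can be illustrated roughly as follows. This is a minimal sketch and not the CNV_Detector implementation: the input layout, the rounding of GC% to whole percent, and the `z_cutoff` value are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_windows(windows, z_cutoff=3.0):
    """Return indices of windows whose depth is an outlier for their GC level.

    windows: list of (gc_percent, mean_depth) tuples, one per sliding window.
    z_cutoff: hypothetical significance threshold, not a value from BGI.
    """
    # Group depths by GC level (rounded to a whole percent, an assumption).
    depths_by_gc = defaultdict(list)
    for gc, depth in windows:
        depths_by_gc[round(gc)].append(depth)

    # Estimate mean and standard deviation of depth for each GC level.
    stats = {gc: (mean(d), pstdev(d)) for gc, d in depths_by_gc.items()}

    flagged = []
    for index, (gc, depth) in enumerate(windows):
        mu, sd = stats[round(gc)]
        # Skip GC levels with no spread; a z-score is undefined there.
        if sd > 0 and abs(depth - mu) / sd > z_cutoff:
            flagged.append(index)
    return flagged
```

The flanking-region criterion is elided on this page, so it is left out of the sketch.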
     27== BGI file format ==
     28BGI delivered all its data both in the raw format of the tools they use and as a VCF version. Below is a summary of the data and formats along with possible notes and/or known caveats.
     29
     30 * Alignment data
     31   * BAM format
     32 * CNV
     33   * [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV Detector] output format
     34     * According to the documentation, CNV Detector uses microarray intensity log2 ratio values as input files. It is currently unclear which array/files were used for our samples.
     35   * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format]
     36     * Not assessed yet.
     37 * Indels
     38   * Samtools pileup format
     39     * Cannot find a matching description on the samtools website. To be completed when BGI provides the complete format specs.
     40   * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format]
     41     * Not assessed yet.
     42 * SNP calls
     43   * [http://soap.genomics.org.cn/soapsnp.html#output2 SOAPsnp output format]
     44     *
     45   * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format]
     46     * Some of the files are in VCF 3.0 format, others in VCF 4.0 format.
     47     * Some files are not sorted correctly; these need to be sorted again in order for most programs to work correctly
     48     * Some files do not have a proper sample name set (only SAMPLE). This is usually problematic for any program that works on multiple VCF files at the same time.
     49     * The filter column is either always 0 or always PASS; in both cases it only means that every listed SNP passed the filtering steps, so the column carries no information about SNPs that were filtered out.
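For the unsorted VCF files mentioned above, a re-sort could look like the sketch below. It keeps header lines first and orders records by chromosome and position. The chromosome ordering used here (numeric names first, then the rest alphabetically) is an assumption; some tools expect the contig order of the reference instead, so check what the downstream program requires.

```python
def chrom_key(chrom):
    """Sort key for a chromosome name: numeric names first, then alphabetical."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def sort_vcf_lines(lines):
    """Return header lines (unchanged order) followed by sorted records."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line and not line.startswith("#")]

    def record_key(line):
        # The first two tab-separated VCF columns are CHROM and POS.
        chrom, pos = line.split("\t", 2)[:2]
        return (chrom_key(chrom), int(pos))

    return header + sorted(records, key=record_key)
```

This loads the whole file into memory, which is fine for a quick check but may not suit very large call sets.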
     50
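The placeholder sample name and the mixed VCF versions noted above can be checked and patched with something like the following sketch. The helper names and the replacement sample name are hypothetical. The version check only looks for a `##fileformat=` header line, as used by VCF 4.0, and returns `None` when that line is absent, which may be the case for some older files.

```python
def vcf_version(lines):
    """Return the value of the ##fileformat header line, or None if missing."""
    for line in lines:
        if line.startswith("##fileformat="):
            return line.strip().split("=", 1)[1]
    return None


def rename_placeholder_sample(lines, new_name, placeholder="SAMPLE"):
    """Replace a placeholder sample column name in the #CHROM header line."""
    out = []
    for line in lines:
        if line.startswith("#CHROM"):
            newline = "\n" if line.endswith("\n") else ""
            fields = line.rstrip("\n").split("\t")
            # Columns 1-9 are fixed (CHROM..FORMAT); sample names follow.
            samples = [new_name if f == placeholder else f for f in fields[9:]]
            line = "\t".join(fields[:9] + samples) + newline
        out.append(line)
    return out
```

Giving each file a distinct sample name before merging avoids several files all presenting a column called SAMPLE.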
     51= '''Overview BGI datasets''' =
     52TODO: More info on statistics and how to access the dataset will be added here. BGI will share with us: SNP calling results (VCF), SOAP format for indels.
    553
    654|| '''Batch''' || '''Samples''' || '''Lanes''' || '''Size''' || '''Groningen storage''' || '''Grid storage (VO)''' || '''Storage additional sites''' || '''Analysis site''' || '''Status''' ||
     
    1361
    1462== '''Directories''' ==
     63'''1st batch'''
    1564
    16 '''1st batch'''
    17 * Groningen
     65 * Groningen
    1866   * Raw data (fastq)
    1967   * Results
    20 * Grid (vlemed)
     68 * Grid (vlemed)
    2169   * Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
    2270   * Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
     
    2573
    2674'''2nd batch'''
    27 * Groningen
     75
     76 * Groningen
    2877   * Raw data (fastq)
    2978   * Results (bam)
    30 * Grid (vlemed)
     79 * Grid (vlemed)
    3180   * Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
    3281   * Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
     
    3584
    3685'''3rd batch'''
    37 * Groningen
     86
     87 * Groningen
    3888   * Raw data (fastq)
    3989   * Results
    4090
     91'''4th batch'''
    4192
    42 '''4th batch'''
    43 * Groningen
     93 * Groningen
    4494   * Raw data (fastq)
    4595   * Results
    46 * Grid (bbmri.nl)
     96 * Grid (bbmri.nl)
    4797   * Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/bbmri.nl/fourth_batch (access to members of bbmri.nl VO)
    4898   * Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/input (access to members of bbmri.nl VO)