Version 13 (modified by laurent, 13 years ago)

--

BGI Data

Below is the description of the pipeline used by BGI for processing the GoNL data on hg19 and a short description of the format of the files delivered by BGI.

BGI hg19 Pipeline

  • Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2
    • Parameters based on other projects for indel detection
  • Remove duplicates: samtools, merge first then remove duplicates
  • SNP Detection: SOAPsnp
    • Params: -r 0.0005 -3 0.001 -u -2
    • Depth > 4x
    • Q > 20
    • CN < 2
    • >5bp between each SNP
  • Indel Detection: samtools pileup -ivcf
    • RMS Qual > 20
    • Variation freq >= 0.1
    • Consensus quality > Q20
    • Supporting reads >= 2
    • Results: ~700k per individual
  • CNV: CNV_Detector
    • Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90
    • Depth-based: the GC content and average sequencing depth of every sliding window of the sequenced genome are counted, and for each level of GC content the depth distribution is modeled as a normal distribution with an estimated mean depth and SD. Regions meeting the criteria below are identified as potential CNVs:
      • Depth was significantly different from WG average with same GC content
      • Flanking region...
    • Results: ~150 CNV > 10k, ~100 > 100k per individual
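
The SNP filtering criteria listed in the pipeline above (depth, quality, copy number, spacing) could be applied as a post-filter along the following lines. This is only a minimal sketch, not BGI's code: the `SnpCall` record and its field names are hypothetical and do not correspond to SOAPsnp column names. Two points the source does not spell out are assumed here: the spacing rule is read as "drop every SNP that has a neighbouring SNP within 5 bp on the same chromosome", and neighbours are only counted among SNPs that already passed the per-site thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnpCall:
    """Hypothetical SNP record; field names are illustrative only."""
    chrom: str
    pos: int
    depth: int
    quality: int
    copy_number: float


def filter_snps(calls: List[SnpCall], min_depth: int = 4, min_qual: int = 20,
                max_cn: float = 2.0, min_gap: int = 5) -> List[SnpCall]:
    # Per-site thresholds from the list above: depth > 4x, Q > 20, CN < 2.
    kept = [c for c in calls
            if c.depth > min_depth and c.quality > min_qual
            and c.copy_number < max_cn]
    kept.sort(key=lambda c: (c.chrom, c.pos))

    # Spacing rule (assumed interpretation): a SNP is dropped when the
    # previous or next surviving SNP on the same chromosome is <= 5 bp away.
    result = []
    for i, call in enumerate(kept):
        prev_close = (i > 0 and kept[i - 1].chrom == call.chrom
                      and call.pos - kept[i - 1].pos <= min_gap)
        next_close = (i + 1 < len(kept) and kept[i + 1].chrom == call.chrom
                      and kept[i + 1].pos - call.pos <= min_gap)
        if not (prev_close or next_close):
            result.append(call)
    return result
```

With this reading, two SNPs at positions 100 and 103 on the same chromosome are both discarded, while an isolated SNP at position 200 is kept.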

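The depth-based CNV criterion described in the pipeline above (depth significantly different from the whole-genome average at the same GC content) can be illustrated with a small sketch. It is not CNV_Detector itself: the function name, the binning of GC content to whole percentages, and the z-score cutoff of 3 are all assumptions made for illustration, and the flanking-region criterion is left out because the source does not describe it.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import List, Tuple


def flag_depth_outliers(windows: List[Tuple[float, float]],
                        z_cutoff: float = 3.0) -> List[int]:
    """Return indices of windows whose depth deviates from the GC-matched mean.

    windows: one (gc_percent, average_depth) pair per sliding window.
    z_cutoff: assumed significance threshold (not taken from the source).
    """
    # Group window depths by GC level (rounded to whole percent, an assumption).
    depths_by_gc = defaultdict(list)
    for gc, depth in windows:
        depths_by_gc[round(gc)].append(depth)

    # Estimate mean and standard deviation of depth for each GC level.
    stats = {gc: (mean(d), pstdev(d)) for gc, d in depths_by_gc.items()}

    flagged = []
    for index, (gc, depth) in enumerate(windows):
        mu, sd = stats[round(gc)]
        if sd == 0:
            continue  # no spread at this GC level, nothing can be flagged
        if abs(depth - mu) / sd > z_cutoff:
            flagged.append(index)
    return flagged
```

Because each window is compared only with windows of the same GC level, a depth that is normal for GC-rich regions is not flagged merely for differing from the genome-wide mean.
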
BGI file format

BGI delivered all its data both in the raw output format of the tools they use and as a VCF version. Below is a summary of the data and formats, along with notes and known caveats. Unless specified otherwise, all data is aligned to hg19.

  • Alignment data
    • BAM format
  • CNV
    • CNV Detector output format
      • According to the documentation, CNV Detector uses microarray intensity log2 ratio values as input files. It is currently not clear which arrays/files were used for our samples.
    • VCF format
      • Not assessed yet.
  • Indels
    • Samtools pileup format
      • Cannot find a matching description on the samtools website. To be completed when BGI provides the complete format specs.
    • VCF format
      • Not assessed yet.
  • SNP calls
    • SOAPsnp output format
    • VCF format
      • Some of the files are in VCF 3.0 format, others in VCF 4.0 format
      • Some files are not sorted correctly; these need to be sorted again in order for most programs to work correctly
      • Some files do not have a proper sample name set (only SAMPLE). This is usually problematic for any program that works on multiple VCF files at the same time.
      • The filter column is either always 0 or always PASS; in both cases it just means that all the SNPs passed the filtering steps and hence does not give any information about unfiltered SNPs.
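
Several of the VCF caveats above (mixed versions, unsorted records, the placeholder sample name) can be handled in a single clean-up pass. The sketch below is one possible approach under stated assumptions, not a tool used by the project: `clean_vcf` and `chrom_key` are hypothetical helpers, files are assumed to be single-sample, and chromosomes are ordered numerically with non-numeric names (X, Y, M) sorted alphabetically afterwards, which may differ from the reference order some programs expect. The version is read from the `##fileformat=` line; older files are assumed to possibly use `##format=` instead.

```python
import re
from typing import Iterable, List, Optional, Tuple


def chrom_key(chrom: str) -> Tuple[int, int, str]:
    # Numeric chromosomes first (1, 2, ..., 22), then the rest alphabetically.
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def clean_vcf(lines: Iterable[str],
              sample_name: str) -> Tuple[Optional[str], List[str]]:
    """Return (detected VCF version, cleaned lines) for a single-sample VCF."""
    version = None
    header = []
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta line; pick up the declared format version if present.
            match = re.match(r"##(?:file)?format=VCFv(\S+)", line)
            if match:
                version = match.group(1)
            header.append(line)
        elif line.startswith("#"):
            # Column header: replace the placeholder sample name "SAMPLE".
            cols = line.split("\t")
            if len(cols) == 10 and cols[9] == "SAMPLE":
                cols[9] = sample_name
            header.append("\t".join(cols))
        else:
            records.append(line.split("\t"))

    # Sort records by chromosome, then by position.
    records.sort(key=lambda fields: (chrom_key(fields[0]), int(fields[1])))
    return version, header + ["\t".join(fields) for fields in records]
```

The returned version string makes it possible to separate VCF 3.x files from VCF 4.0 files before handing them to tools that accept only one of the two.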

Overview BGI datasets

TODO: More info on statistics and how to access the datasets will be added here. BGI will share with us: SNP calling results (VCF), SOAP format for indels.

Batch | Samples | Lanes | Size | Groningen storage | Grid storage (VO) | Storage additional sites | Analysis site | Status
1st batch aka Pilot phase | 60 | 183 | | yes | yes (vlemed) | | UMGC, AMC (for comparison) | aligned, in progress
2nd batch | 90 | 295 | | yes | yes (vlemed) | | AMC | in progress
3rd batch | 222 | 683 | 10TB | yes | no | | UMGC | in progress
4th batch | 235 | 630 | 10TB | yes | yes (bbmri.nl) | Hubrecht, EMC | LUMC/TUdelft, UMGC, Hubrecht, EMC | in progress
5th batch | 153 | | | no | no | | | not on storage yet
6th batch | 10 | | | no | | | | not arrived yet

Directories

1st batch

  • Groningen
    • Raw data (fastq)
    • Results
  • Grid (vlemed)
    • Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
    • Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
    • Results LFC lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data/bam/markduplicates/realignment/fixmates/recalibrate/sorted (access closed)
    • Note: results will be copied to bbmri.nl VO when analysis is done

2nd batch

  • Groningen
    • Raw data (fastq)
    • Results (bam)
  • Grid (vlemed)
    • Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
    • Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
    • Results LFC lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data/bam/markduplicates/realignment/fixmates/recalibrate/sorted (access closed)
    • Note: results will be copied to bbmri.nl VO when analysis is done

3rd batch

  • Groningen
    • Raw data (fastq)
    • Results

4th batch

  • Groningen
    • Raw data (fastq)
    • Results
  • Grid (bbmri.nl)
    • Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/bbmri.nl/fourth_batch (access to members of bbmri.nl VO)
    • Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/input (access to members of bbmri.nl VO)
    • Results LFC lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/output (access to members of bbmri.nl VO)