DataManagement – bbmri

Version 1 (modified by laurent, 13 years ago)

This page is a work in progress describing data management for the GoNL project.

- All data should be writable only by its owner.
- All tools and resources should be readable and executable by the whole gcc group.
- All project-specific data and results should be readable and executable by the gvnl group.
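
The permission rules above can be checked against a Unix mode value. The following is a minimal sketch, not an agreed policy: the concrete modes (0o750 for directories, 0o640 for files) and the function names are assumptions chosen for illustration; the rules themselves only say who may write and who may read/execute.

```python
import stat

# Assumed modes (not stated on this page): owner rwx, group r-x, others none.
DIR_MODE = 0o750
# Assumed mode for plain data files: owner rw, group r, others none.
FILE_MODE = 0o640


def only_owner_writable(mode: int) -> bool:
    """Return True if neither the group nor others have write permission."""
    return not (mode & (stat.S_IWGRP | stat.S_IWOTH))


def group_can_read_and_execute(mode: int) -> bool:
    """Return True if the group has both the read and the execute bit set."""
    wanted = stat.S_IRGRP | stat.S_IXGRP
    return (mode & wanted) == wanted
```

For a real path, the mode can be obtained with `stat.S_IMODE(os.stat(path).st_mode)` and passed to these helpers. Group ownership (gcc or gvnl) has to be checked separately.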

The root for all subsequent directories is /data/gcc/

/tools
- Contains all GCC tools, including the GoNL tools
- All tools should be put in a folder using the naming convention: toolname-version
  - Ex: Picard v1.32 should be found in /data/gcc/tools/picard-tools-1.32/
/resources
- Contains all GCC resources, including the GoNL resources
- All resources should be put in a folder whose name specifies their version. Normally, the name should follow the convention: resource-version
  - Ex: Human Genome build 19 should be found in /data/gcc/resources/hg-19/
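
The toolname-version and resource-version conventions can be sketched as two small path helpers. This is an illustration only; the names `GCC_ROOT`, `tool_dir` and `resource_dir` are hypothetical and not part of any existing GCC tooling.

```python
from pathlib import PurePosixPath

# Root given on this page for the tools and resources directories.
GCC_ROOT = PurePosixPath("/data/gcc")


def tool_dir(name: str, version: str) -> PurePosixPath:
    """Folder for a tool, following the toolname-version convention."""
    return GCC_ROOT / "tools" / f"{name}-{version}"


def resource_dir(name: str, version: str) -> PurePosixPath:
    """Folder for a resource, following the resource-version convention."""
    return GCC_ROOT / "resources" / f"{name}-{version}"


print(tool_dir("picard-tools", "1.32"))  # /data/gcc/tools/picard-tools-1.32
print(resource_dir("hg", "19"))          # /data/gcc/resources/hg-19
```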

The root for all subsequent directories is /data/gcc/projects/gonl/

/rawdata
- Contains all the raw, unprocessed data, organized by batch
  - Ex: All raw data for the 1st batch is located in /data/gcc/projects/gonl/rawdata/first_batch/
/results
- Contains all the results after processing the data
/results/BGI
- Contains all the results from the BGI pipeline (snps, indels, metrics, etc.)
/results/immunochip
- Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.)
/results/pipeline
- Contains all the results of running the sequence data through the GoNL pipeline, organized by batch
  - Ex: Results on the first batch are in /data/gcc/projects/gonl/results/pipeline/first_batch/
- The subdirectory structure for each of the batches should be the following:
  - All results related to a sample should go in /sample_name
    - Ex: All results related to sample A2a (first batch) should go in /data/gcc/projects/gonl/results/pipeline/first_batch/A2a
  - All results related to a lane of a sample should go in /sample_name/lane_name
    - Ex: All results related to sample A2a (first batch), Lane FC20005_L1 should go in /data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/
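
The batch/sample/lane layout for pipeline results can be sketched as a single path helper. This is a minimal illustration; `GONL_ROOT` and `pipeline_results_dir` are hypothetical names introduced here.

```python
from pathlib import PurePosixPath
from typing import Optional

# Root given on this page for the GoNL project directories.
GONL_ROOT = PurePosixPath("/data/gcc/projects/gonl")


def pipeline_results_dir(
    batch: str, sample: str, lane: Optional[str] = None
) -> PurePosixPath:
    """Results folder for a sample, or for one lane of that sample."""
    path = GONL_ROOT / "results" / "pipeline" / batch / sample
    return path / lane if lane else path


print(pipeline_results_dir("first_batch", "A2a"))
# /data/gcc/projects/gonl/results/pipeline/first_batch/A2a
print(pipeline_results_dir("first_batch", "A2a", "FC20005_L1"))
# /data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1
```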

The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.

General convention
- Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
- File names should be all lowercase, except where they reference specific names that use another convention (ex: sample name).
Sample-level files should be named using: step_id.step_name.sample_name.genome_build.time_stamp.extension
- Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: vc02.unified_genotyper.A2a.human_g1k_v37.2011_02_01_12_00.snp
Lane-level files should be named using: step_id.step_name.sample_name.lane_name.genome_build.time_stamp.extension
- Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: pe03.bwa_sampe.A2a.FC20005_L1.human_g1k_v37.2011_02_01_12_00.bam

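
The naming convention can be sketched as a helper that joins the tokens with '.' and inserts the lane name only for lane-level files. This is an illustration under assumptions: the function names are hypothetical, the time stamp is assumed to be year_month_day_hour_minute of the run start, and rejecting tokens that contain '.' is an added safeguard not stated in the convention.

```python
from datetime import datetime
from typing import Optional


def time_stamp(run_start: datetime) -> str:
    """Format the run start as year_month_day_hour_minute (assumed layout)."""
    return run_start.strftime("%Y_%m_%d_%H_%M")


def pipeline_filename(
    step_id: str,
    step_name: str,
    sample_name: str,
    genome_build: str,
    run_start: datetime,
    extension: str,
    lane_name: Optional[str] = None,
) -> str:
    """Build a sample-level name, or a lane-level name if lane_name is given."""
    # Step tokens and the extension are lowercased; sample, lane and genome
    # build keep their own convention, as the general rule allows.
    tokens = [step_id.lower(), step_name.lower(), sample_name]
    if lane_name is not None:
        tokens.append(lane_name)
    tokens += [genome_build, time_stamp(run_start), extension.lower()]
    for token in tokens:
        # '.' is the token separator, so it must not appear inside a token.
        if not token or "." in token:
            raise ValueError(f"invalid token: {token!r}")
    return ".".join(tokens)


print(pipeline_filename("vc02", "unified_genotyper", "A2a", "human_g1k_v37",
                        datetime(2011, 2, 1, 12, 0), "snp"))
# vc02.unified_genotyper.A2a.human_g1k_v37.2011_02_01_12_00.snp
```

Passing `lane_name="FC20005_L1"` produces the lane-level form, with the lane token placed between the sample name and the genome build.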