Changes between Initial Version and Version 1 of BIOS_Metadatabase


Timestamp:
Sep 19, 2016 12:57:39 PM (8 years ago)
Author:
rick
  • BIOS_Metadatabase

    v1 v1  
     1= Metadatabase =
     2
     3The BIOS project has generated RNA-sequencing and DNA methylation data for over 4000 individuals. Apart from these data, GoNL-imputed genotypes were generated from existing genotypes, and several phenotype/demographic variables were collected for the same set of samples.
     4A highly flexible, sample-oriented metadatabase (MDb) was created to manage the dynamic generation of this large-scale multi-omics data set.
     5
     6The MDb is a non-relational database (CouchDB, http://couchdb.apache.org/) that uses JSON to store records and !JavaScript for querying. Furthermore, it has an HTTP API that makes the database programmatically accessible from the GRID, e.g. by the alignment pipeline.
     7
     8Each record, or document, is a sample (individual) within the BIOS project and has a unique identifier. Each document has a predefined structure according to our database schema (https://git.lumc.nl/rp3/bios-schema). Custom Python scripts are used to update or modify the database (https://git.lumc.nl/rp3/bios-mdb).
     9
     10Access to the metadatabase (MDb) is restricted; please contact Leon Mei or Maarten van Iterson.
     11
     12== Description of MDb content ==
     13
     14The MDb contains as much meta-information as possible for all samples and data types: location of (raw) data on srm, md5 checksum verification, quality control information, links between the different identifiers used (person_id, dna_id, etc.) and phenotype information.
     15
     16Every sample's meta-information is encoded in a CouchDB document. Each document has a unique identifier (the bios_id), which is the biobank name (CODAM, LL, LLS, NTR, RS or PAN) concatenated with the person_id, separated by a "-", e.g. CODAM-2001. This bios_id is not suitable for use in the public domain, e.g. for EGA upload; therefore a unique, non-identifying identifier, the uuid, has been created for each individual.
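
The bios_id construction described above can be sketched in Python as follows. This is a minimal illustration only: the helper names `make_bios_id` and `split_bios_id` are hypothetical and are not part of the MDb scripts. Only the biobank names and the "-" separator are taken from the text.

```python
# Biobank names as listed in the text above.
BIOBANKS = ("CODAM", "LL", "LLS", "NTR", "RS", "PAN")


def make_bios_id(biobank, person_id):
    """Join biobank name and person_id with a hyphen (hypothetical helper)."""
    if biobank not in BIOBANKS:
        raise ValueError("unknown biobank: %r" % biobank)
    return "%s-%s" % (biobank, person_id)


def split_bios_id(bios_id):
    """Split a bios_id at the first hyphen only.

    Assumption: anything after the first hyphen belongs to the person_id.
    """
    biobank, person_id = bios_id.split("-", 1)
    return biobank, person_id


print(make_bios_id("CODAM", "2001"))
print(split_bios_id("CODAM-2037"))
```

Splitting at the first hyphen only is a design choice made here so that a person_id containing a hyphen would survive a round trip; whether such identifiers occur in the MDb is not stated in this page.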
     17
     18Every update of a sample in the database is recorded by increasing a revision number. Therefore it is always possible to undo wrong updates.
     19The attachment of this page contains a JSON file representing a sample's information in the metadatabase (the content of the file can be pasted into a JSON viewer, e.g. http://jsonviewer.stack.hu).
     20
     21== Description of available views ==
     22
     23Views are the way to extract information from CouchDB. Views are organized into designs; each design contains a number of views related to a particular kind of information that can be extracted from the MDb. For example, the design “EGA” contains, among others, the views 1) “freeze1RNASeq”, which extracts the samples for which RNAseq data has been uploaded to EGA, and 2) “freeze1Methylation”, which does the same for the DNA methylation data.
     24
     25Other relevant views are:
     26
     27|| design: || view: ||
     28|| EGA  || freeze1RNASeq, freeze1Methylation, freeze2RNASeq, freeze2Methylation ||
     29|| Files  || getFastq, getIdat ||
     30|| Identifiers || getIds ||
     31|| Phenotypes || allPhenotypes, cellCounts, minimalPhenotypes ||
     32|| Runs || getGenotypes, getMethylationRuns, getRNASeqRuns ||
     33|| Samplesheets || rnaseqSamplesheet,  methylationSamplesheet ||
     34|| Verification || md5 ||
     35
     36
     37Note: We can always add views if necessary; please contact Maarten van Iterson.
     38
     39== Accessing the MDb ==
     40
     41Views can be downloaded as JSON documents by making a GET request. Most programming languages have utilities for making GET requests and for transforming JSON documents. Some programming languages, e.g. Java and Python, have an API for CouchDB. There are also several online tools for transforming JSON documents into csv files.
     42
     43Please note that it is usually better to download the view separately and work on the downloaded file. This way you only have to enter your password once and you're resilient to network connectivity problems.
     44
     45=== Access the metadatabase using R ===
     46
     47We have developed the R package BIOSRutils (https://git.lumc.nl/rp3/biosrutils) for easy access to the MDb and processed datasets. BIOSRutils is available on the VM for R version 3.2.0 (start R using the command R-3.2.0 from the command line). The current version, 0.0.1, is still a development version; several of the intended features are not yet fully implemented.
     48
     49BIOSRutils uses a configuration file to read in your MDb username and password, so that you do not have to type it every time you use the MDb.
     50
     51Create a file called .biosrutils, store it in your home directory on the VM (/home/username) and add as the first line:
     52
     53usrpwd: 'username:password'
     54
     55Start R-3.2.0 and load the library:
     56
     57{{{
     58> library(BIOSRutils)
     59}}}
     60
     61Several predefined variables are available, such as the URLs of the current MDb and RDb, as well as the username and password you provided (USRPWD). All the variables are capitalized to minimize interference with your own code.
     62
     63{{{
     64> ls()
     65[1] "BIOBANKS"     "DATASETS"     "MDb"          "PROXY"        "RDb"         
     66[6] "RP3DATADIR" "SRMBASE"      "USRPWD"       "VIEWS"   
     67}}}
     68
     69The BIOSRutils package provides the function getView to extract a particular view from the MDb. All available views are stored in the global variable VIEWS. Use the regular way to get help in R, e.g.:
     70
     71{{{
     72> ?getView
     73}}}
     74
     75For example, to extract all phenotype information for all samples, use the allPhenotypes view from the design Phenotypes.
     76
     77{{{
     78phenotypes <- getView(view="allPhenotypes", design="Phenotypes")
     79}}}
     80
     81Basic R manipulations can be used to select particular information, e.g.:
     82
     83{{{
     84LLSMalesAbove70 <- subset(phenotypes, grepl("LLS", ids) & Sex == 0 & DNA_BloodSampling_Age > 70)
     85}}}
     86
     87=== Access the MDb using curl ===
     88
     89The content of a view can be obtained with a curl GET request, for example for the view `getIds` (substitute your username):
     90{{{
     91$ curl -X GET 'https://metadatabase.bbmrirp3-lumc.vm.surfsara.nl:6984/bios/_design/Identifiers/_view/getIds?reduce=false' -u 'username' -k -g > getIds.json
     92$ cat getIds.json
     93{"total_rows":6379,"offset":0,"rows":[
     94{"id":"CODAM-2037","key":[false,"CODAM"],"value":{"bios_id":"CODAM-2037","uuid":"BIOS71A89511","biobank_id":"CODAM","person_id":"2037","pheno_id":"2037","gwas_id":"2037","dna_id":"2037","rna_id":"2037","rna_note":"library-prep: succeeded","gonl_id":null,"cg_id":null,"in_rp3":false}},
     95...
     96]}
     97}}}
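
The URL in the curl example follows the CouchDB pattern `<database>/_design/<design>/_view/<view>`. As a hedged sketch, the same URL can be assembled in Python for any design/view pair from the table of views. The helper `view_url` is hypothetical and not part of BIOSRutils; the base URL is copied from the curl example, and the sketch only builds the string without contacting the server.

```python
from urllib.parse import quote, urlencode

# Base URL of the database, taken from the curl example above.
MDB_URL = "https://metadatabase.bbmrirp3-lumc.vm.surfsara.nl:6984/bios"


def view_url(design, view, base=MDB_URL, **params):
    """Build the URL of a CouchDB view (hypothetical helper)."""
    url = "%s/_design/%s/_view/%s" % (
        base,
        quote(design, safe=""),
        quote(view, safe=""),
    )
    if params:
        # Query parameters such as reduce=false are appended as-is.
        url += "?" + urlencode(params)
    return url


print(view_url("Identifiers", "getIds", reduce="false"))
```

The resulting string can be passed to curl or to any HTTP client together with your username and password.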
     98
     99The [http://stedolan.github.io/jq/ jq] tool (installed on the cloud VM) can be used for quick processing of the JSON formatted result on the command line. For example, to get just the `uuid` values from that view:
     100{{{
     101$ curl -X GET 'https://metadatabase.bbmrirp3-lumc.vm.surfsara.nl:6984/bios/_design/Identifiers/_view/getIds?reduce=false' -k -u username | jq -r '.rows[].value.uuid // empty'
     103BIOS71A89511
     104BIOS78A709E9
     105BIOS700411C4
     106BIOS75EAD30E
     107...
     108}}}
     109
     110The JSON file can be parsed into a Python datastructure as follows:
     111{{{
     112>>> import json
     113>>> document = json.load(open('getIds.json'))
     114> document['rows'][0]
     115{u'value': {u'rna_note': u'library-prep: succeeded', u'biobank_id': u'CODAM', u'cg_id': None, u'in_rp3': False, u'uuid': u'BIOS71A89511', u'dna_id': u'2037', u'gwas_id': u'2037', u'gonl_id': None, u'pheno_id': u'2037', u'rna_id': u'2037', u'person_id': u'2037', u'bios_id': u'CODAM-2037'}, u'id': u'CODAM-2037', u'key': [False, u'CODAM']}
     116}}}
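
Once parsed, the rows of a view can be written to a csv file with the Python standard library alone. The following is a minimal sketch, assuming that every row has a `value` object with the same fields, as in the `getIds` output above. The function name `rows_to_csv` is hypothetical, and the inline sample is an abbreviated copy of the row shown earlier rather than real additional data.

```python
import csv
import io
import json


def rows_to_csv(document, out):
    """Write the 'value' object of every row as one csv line.

    Returns the number of rows written; None values become empty cells.
    """
    values = [row["value"] for row in document["rows"]]
    if not values:
        return 0
    # Column order follows the field order of the first row.
    writer = csv.DictWriter(out, fieldnames=list(values[0].keys()))
    writer.writeheader()
    writer.writerows(values)
    return len(values)


# Abbreviated sample in the shape of the getIds view result.
sample = json.loads(
    '{"total_rows":1,"offset":0,"rows":['
    '{"id":"CODAM-2037","key":[false,"CODAM"],'
    '"value":{"bios_id":"CODAM-2037","uuid":"BIOS71A89511","gonl_id":null}}]}'
)

buffer = io.StringIO()
rows_to_csv(sample, buffer)
print(buffer.getvalue())
```

To convert a downloaded file, replace the sample with `json.load(open('getIds.json'))` and the buffer with a file opened using `open('getIds.csv', 'w', newline='')`.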
     117
     118=== Access the MDb using Firefox via BIOS VM ===
     119You can access the MDb by running Firefox on the BIOS VM with X forwarding enabled in your ssh session: "ssh -X bios-vm.bbmrirp3-lumc.vm.surfsara.nl".
     120
     121== Updates ==
     122
     1232014-05-09: For NTR, set in_rp3 = TRUE for a set of unrelated samples that pass methylation QC and have GoNLv5 imputed genotypes.
     124
     1252014-06-13: Check cg_id of LL all NA's.
     126
     1272014-06-13: Remove rnaseq info for four samples that had duplicated rnaseq_run_id's.
     128
     1292014-06-13: Add 80 LL rnaseq samples (1 sample could not be added because its rna_id did not occur in the sample sheet).
     130
     1312014-08-14: Added 193 PAN samples to the metadatabase.
     132
     1332014-08-14: Added 971 samples (LL=37, LLS=23, NTR=816, RS=95) to the metadatabase.
     134
     1352014-08-18: Added 185 NTR samples to the metadatabase and fixed some issues with merged samples.
     136
     1372014-09-18: Modified the location and name of BIOS database. (Now cloudcouchdb.bbmrirp3-lumc.cloudlet.sara.nl:6984/bios)
     138
     1392014-10-02: Some more RNAseq and methylation data has been added to the metadatabase. It currently contains 6070 samples, of which 4096 have RNAseq data and 6031 have methylation data.
     140
     1412014-11-05: Added rnaseqFreeze0 view.
     142
     1432014-12-03: Methylation data freeze flag is set.
     144
     1452014-12-03: Three LLS methylation data technical replicates are added _key=BIOS-ID-Rep.
     146
     1472014-12-03: Add uuid (universally unique identifier) generated with uuidgen -r: the first 8 characters are converted to upper case and prefixed with BIOS, e.g. BIOS2AF124EB.
     148
     1492016-02-10: Add freeze 2 flag for RNAseq
     150
     1512016-03-01: Add RNAseq quality control field to all freeze 2 runs
     152
     1532016-03-01: Set RNAseq quality control field of the first 10 bad quality runs
     154
     1552016-05-12: Fixed flipped RNAseq plate
     156
     1572016-05-12: Fixed 13 detected reciprocal swaps of RNA runs
     158
     1592016-05-12: Added DNA methylation freeze 2 flag
     160
     1612016-05-12: Added monozygotic twin pair indicator. If last character of the NTR pheno_id is lower-case this indicates that this individual is a monozygotic twin.
     162
     1632016-05-13: Linked genotype information from monozygotic twin pairs
     164
     1652016-06-01: Added new gonl identifiers
     166
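
The uuid scheme mentioned in the 2014-12-03 update (a random UUID as produced by `uuidgen -r`, of which the first 8 characters are upper-cased and given a BIOS prefix) can be sketched in Python. This is an illustration under assumptions, not the script that was actually used: `uuid.uuid4()` stands in for `uuidgen -r`, and the function name `make_bios_uuid` is hypothetical.

```python
import uuid


def make_bios_uuid():
    """Return 'BIOS' plus the first 8 hex characters of a random UUID, upper-cased."""
    # uuid.uuid4() generates a random (version 4) UUID, like `uuidgen -r`.
    return "BIOS" + str(uuid.uuid4())[:8].upper()


new_id = make_bios_uuid()
print(new_id)  # same shape as BIOS2AF124EB; the value differs on every call
```

Because only 8 hexadecimal characters are kept, two individuals could in principle receive the same identifier, so a real implementation would presumably check new values against the existing ones.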