BACK to main page

Loading localization data to sherlock

STEP 1: store raw files

copy raw tissue database files to S3 for each database. E.g. for Human Protein Atlas v18, upload it to: s3://sherlock/raw_zone/hpa_18

STEP 2: convert to json

Generate and copy json files to landing zone, like: s3://sherlock/landing_zone/hpa_18/tax_id=9606/tissue.json (the name of the file is arbitrary, presto will read all the files from the directory when it queries the partition)

And in the json we have single json records per line. The lines are separated by only a new-line character, and no comma). Note: you don’t need to add tax_id attribute here, as it is already coded to the folder name where the json is placed. Where we don’t have score value, we simply skip the attribute.

{ "molecule_id": "ensg11867234247", "molecule_id_type": "Ensembl", "tissue_bto_id": 142, "tissue_bto_name": "brain", "source_db": "HPA", "score": 0.576 }
{ "molecule_id": "ensg11867234247", "molecule_id_type": "Ensembl", "tissue_bto_id": 476, "tissue_bto_name": "foot",  "source_db": "HPA" }
{ "molecule_id": "ensg23829324243", "molecule_id_type": "Ensembl", "tissue_bto_id": 142, "tissue_bto_name": "brain", "source_db": "HPA", "score": 0.983 }

Use the following syntax when crating the values for each attribute:

molecule_id: lowercase trimmed string, mandatory
molecule_id_type: UniProtKB DBref, mandatory
tissue_bto_id: BRENDA tissue ID, mandatory
tissue_bto_name: BRENDA tissue term name, mandatory
source_db: UniProtKB DBref - if the source db is present in UniProtKB DBref, otherwise just put the database name, mandatory
score: float value between 0 and 1, optional (just skip the attribute if it is not present)

This is our script that makes the database conversion.

STEP 3: register landing tables in Presto

For each molecular interaction database we need to register a table in the landing zone:

CREATE TABLE landing.hpa_18 (
   molecule_id VARCHAR(64) NOT NULL,
   molecule_id_type VARCHAR(25) NOT NULL,
   tax_id INT NOT NULL,
   tissue_bto_id INT NOT NULL,
   tissue_bto_name VARCHAR(64) NOT NULL,
   source_db VARCHAR(25) NOT NULL,
   score DECIMAL(18, 8)
) WITH (
   format            = 'JSON',
   partitioned_by    = ARRAY['tax_id'],
   external_location = 'S3://sherlock/landing_zone/hpa_18' );

STEP 4:  use hive CLI to refresh the partition list  

msck repair table landing.hpa_18;

STEP 5: convert to ORC in the master zone (+ finer partitioning & total ordering)

CREATE TABLE master.hpa_18 WITH (
   format = 'ORC',
   partitioned_by = ARRAY['tax_id']
) AS SELECT * FROM landing.hpa_18 ORDER BY molecule_id, tissue_bto_id;

In the end we will have the master interaction files in the data lake, like: s3://sherlock/master_zone/hpa_18/tax_id=9606/something.orc

sherlock

an open source data platform, developed in the Korcsmaros Group to store, analyze and integrate bioinformatics data

Loading localization data to sherlock