BACK to main page

Loading Interaction data to sherlock

STEP 1: store raw files copy raw database files to S3 for each database, like:

s3://sherlock/raw_zone/omnipath_0_7_111
s3://sherlock/raw_zone/intact_2018_10_04
s3://sherlock/raw_zone/string_10_5

STEP 2: convert to json

Generate and copy json files to landing zone, partitioned by interactor_a_tax_id, like: s3://sherlock/landing_zone/intact_2018_10_04/interactor_a_tax_id=9606/intact.json (the name of the file is arbitrary, presto will read all the files from the directory when it queries the partition)

And in the json we have single json records per line. Here is an example for one json record (where we put each attribute in separate line for readability, but in the real json this is a single line, and the lines are separated by only a new-line character, and no comma).

{
   "interactor_a_id": "ensg11867234247",
   "interactor_b_id": "ensg98236453842",
   "interactor_a_id_type": "ensembl",
   "interactor_b_id_type": "ensembl",
   "interactor_b_tax_id": 9606,
   "interactor_a_molecula_type_mi_id": 250,
   "interactor_b_molecula_type_mi_id": 250,
   "interactor_a_molecula_type_name": gene,
   "interactor_b_molecula_type_name": gene,
   "interaction_detection_methods_mi_id": [401, 58],
   "interaction_types_mi_id": [208],
   "source_databases_mi_id": [469],
   "pmids": [14847410, 9368760]
}

Note: you don’t need to add interactor_a_tax_id attribute here, as it is already coded to the folder name where the json is placed.

Use the following syntax when crating the values for each attribute:

interactor_a_id: lowercase trimmed string, mandatory
interactor_b_id: lowercase trimmed string, mandatory
interactor_a_id_type: UniProtKB DBref, mandatory
interactor_b_id_type: UniProtKB DBref, mandatory
interactor_a_tax_id: NCBI TaxID int, mandatory
interactor_b_tax_id: NCBI TaxID int, mandatory
interactor_a_molecula_type_mi_id: mi_id int, mandatory
interactor_b_molecula_type_mi_id: mi_id int, mandatory
interactor_a_molecula_type_name: lowercase trimmed string (mi term name), mandatory
interactor_b_molecula_type_name: lowercase trimmed string (mi term name), mandatory
interaction_detection_methods_mi_id: [mi_id int, …], optional, default: empty list
interaction_types_mi_id: [mi_id int, …], optional, default: empty list
source_databases_mi_id: [mi_id int, …], optional, default: empty list
pmids: [ pmid int, …], optional, default: empty list

There are many scripts we made that makes the json conversion for the interaction database files.

STEP 3: register landing tables in Presto

For each molecular interaction database we need to register a table in the landing zone:

CREATE TABLE landing.intact_2018_10_04 (
   interactor_a_id VARCHAR(64) NOT NULL,
   interactor_b_id VARCHAR(64) NOT NULL,
   interactor_a_id_type VARCHAR(25) NOT NULL,
   interactor_b_id_type VARCHAR(25) NOT NULL,
   interactor_a_tax_id INT NOT NULL,
   interactor_b_tax_id INT NOT NULL,
   interactor_a_molecula_type_mi_id INT NOT NULL,
   interactor_b_molecula_type_mi_id INT NOT NULL,
   interactor_a_molecula_type_name VARCHAR(25) NOT NULL,
   interactor_b_molecula_type_name VARCHAR(25) NOT NULL,
   interaction_detection_methods_mi_id ARRAY<INT> NOT NULL,
   interaction_types_mi_id ARRAY<INT> NOT NULL,
   source_databases_mi_id ARRAY<INT> NOT NULL,
   pmids ARRAY<INT>  NOT NULL
) WITH (
   format            = 'JSON',
   partitioned_by    = ARRAY['interactor_a_tax_id'],
   external_location = 'S3://sherlock/landing_zone/intact_2018_10_04' );

STEP 4:  use hive CLI to refresh the partition list  

msck repair table landing.intact_2018_10_04;

STEP 5: convert to ORC in the master zone (+ finer partitioning & total ordering)

CREATE TABLE master.intact_2018_10_04 WITH (
   format = 'ORC',
   partitioned_by = ARRAY['interactor_a_tax_id']
) AS SELECT * FROM landing.intact_2018_10_04 ORDER BY interactor_a_id, interactor_b_id;

In the end we will have the master interaction files in the data lake, like: s3://sherlock/master_zone/intact_2018_10_04/interactor_a_tax_id=9606/something.orc

sherlock

an open source data platform, developed in the Korcsmaros Group to store, analyze and integrate bioinformatics data

Loading Interaction data to sherlock