
Loading gene/protein sequence data to sherlock

STEP 1: store raw files

Copy the raw human genome (hg38p12) files to: s3://sherlock/raw_zone/hg_38_p12

STEP 2: convert to json

Generate the JSON files and copy them to the landing zone, for example: s3://sherlock/landing_zone/hg_38_p12/chr=chr11/chromosome.json

Each file contains one JSON record per line, like:

{"length": 1000, "start": 123001, "stop": 124000, "sequence": "actg.....ccta"}
{"length": 1000, "start": 124001, "stop": 125000, "sequence": "actg.....ccta"}
{"length": 1000, "start": 125001, "stop": 126000, "sequence": "actg.....ccta"}
{"length": 1000, "start": 126001, "stop": 127000, "sequence": "actg.....ccta"}
{"length": 1000, "start": 127001, "stop": 128000, "sequence": "actg.....ccta"}

Note: you don’t need to add a chr attribute to the records, as the chromosome is already encoded in the name of the folder where the JSON file is placed.

Note: you can split the output into several files if they would otherwise be too large. The only requirements are to write complete JSON objects (never split a file in the middle of a line) and to place all the files for a given chromosome in the same folder. (The file names don’t matter, only the folder name is important.)
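The splitting rule above can be sketched as follows. This is only an illustration: the function names, the max_lines limit and the part-00000.json naming pattern are assumptions made for the example (the file names are free to choose); only the chr=<chromosome> folder layout comes from this page.

```python
from typing import Iterable, Iterator


def split_into_parts(json_lines: Iterable[str], max_lines: int) -> Iterator[list[str]]:
    """Group complete JSON lines into parts of at most max_lines lines each.

    Because whole lines are moved around, no JSON object is ever cut in half.
    """
    part: list[str] = []
    for line in json_lines:
        part.append(line)
        if len(part) == max_lines:
            yield part
            part = []
    if part:  # last, possibly shorter, part
        yield part


def part_key(chromosome: str, index: int) -> str:
    """Build an object key inside the chromosome's folder.

    The 'part-NNNNN.json' file name is a hypothetical choice; only the
    'chr=<chromosome>' folder name matters for the partitioning.
    """
    return f"landing_zone/hg_38_p12/chr={chromosome}/part-{index:05d}.json"
```

Every part produced for one chromosome is then uploaded under the same chr=... folder, for example with the key returned by part_key("chr11", 0).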

Note: the attributes must follow these rules:

length: integer, larger than zero
start: integer, larger than zero, smaller than or equal to stop
stop: integer, larger than zero, larger than or equal to start
sequence: lowercase string, exactly as long as the length attribute
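To make these rules concrete, here is a minimal sketch of a checker for a single JSON line. The function name validate_record and the wording of the error messages are assumptions made for this example; it is not part of the sherlock code base.

```python
import json


def validate_record(line: str) -> list[str]:
    """Return the list of rule violations for one JSON line (empty if valid)."""
    rec = json.loads(line)
    errors = []
    for key in ("length", "start", "stop"):
        value = rec.get(key)
        # bool is a subclass of int in Python, so it is excluded explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            errors.append(f"{key} must be an integer larger than zero")
    seq = rec.get("sequence")
    if not isinstance(seq, str) or seq != seq.lower():
        errors.append("sequence must be a lowercase string")
    # The cross-field checks only make sense once the basic types are right
    if not errors:
        if rec["start"] > rec["stop"]:
            errors.append("start must be smaller than or equal to stop")
        if len(seq) != rec["length"]:
            errors.append("sequence must be exactly as long as length")
    return errors
```

Running such a check over every line before uploading to the landing zone is one way to catch malformed records early.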

This is our script that performs the FASTA -> JSON conversion.
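That script is not reproduced here. The following is an independent minimal sketch of such a conversion, not the group's implementation. It assumes 1-based inclusive coordinates that restart at 1 for every FASTA header, a chunk size of 1000 bases (matching the sample records), and that the first word of each header line is the chromosome name; the function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def make_record(start: int, seq: str) -> dict:
    """Build one record with 1-based, inclusive start/stop coordinates."""
    return {"length": len(seq), "start": start,
            "stop": start + len(seq) - 1, "sequence": seq}


def fasta_to_records(lines: Iterable[str],
                     chunk_size: int = 1000) -> Iterator[tuple[str, dict]]:
    """Yield (chromosome_name, record) pairs from the lines of a FASTA file."""
    name, buffer, start = None, "", 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if buffer:  # flush the short tail of the previous sequence
                yield name, make_record(start, buffer)
            name = line[1:].split()[0]  # assumes the header is not empty
            buffer, start = "", 1
            continue
        buffer += line.lower()  # the sequence must be lowercase
        while len(buffer) >= chunk_size:
            seq, buffer = buffer[:chunk_size], buffer[chunk_size:]
            yield name, make_record(start, seq)
            start += chunk_size
    if buffer:  # tail of the last sequence in the file
        yield name, make_record(start, buffer)


def to_json_line(record: dict) -> str:
    """Serialize one record as a single line of JSON."""
    return json.dumps(record)
```

Each yielded record can be written with to_json_line into a file under the folder of the chromosome it was yielded with, which keeps the chr value out of the record itself.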

STEP 3: register landing table in Presto

CREATE TABLE landing.hg_38_p12 (
   length INT NOT NULL,
   start INT NOT NULL,
   stop INT NOT NULL,
   sequence VARCHAR(1000) NOT NULL,
   chr VARCHAR(64) NOT NULL   -- partition columns must come last in the column list
) WITH (
   format            = 'JSON',
   partitioned_by    = ARRAY['chr'],
   external_location = 's3://sherlock/landing_zone/hg_38_p12' );

STEP 4: use Hive CLI to refresh the partition list

msck repair table landing.hg_38_p12;

STEP 5:  convert to ORC in the master zone (+ finer partitioning & total ordering)

CREATE TABLE master.hg_38_p12 WITH (
   format = 'ORC',
   partitioned_by = ARRAY['chr']
) AS
SELECT length, start, stop, sequence, chr   -- partition column listed last
FROM landing.hg_38_p12
ORDER BY start;

In the end we will have the genome files in the data lake, like: s3://sherlock/master_zone/hg_38_p12/chr=chr11/chromosome.orc

sherlock

an open source data platform, developed in the Korcsmaros Group to store, analyze and integrate bioinformatics data
