BACK to main page

Loading gene annotation data to sherlock

STEP 1: store raw files copy raw bed files to S3 for each database, like:

STEP2: convert to json

Generate and copy json files to landing zone, partitioned by chromosome, like: s3://sherlock/landing_zone/annotations_hg38_gene_intron/chromosome=chr13/intron.json (the name of the file is arbitrary, presto will read all the files from the directory when it queries the partition)

And in the json we have single json records per line. Here is an example for one json record (where we put each attribute in separate line for readability, but in the real json this is a single line, and the lines are separated by only a new-line character, and no comma).

Note: you don’t need to add chromosome attribute here, as it is already coded to the folder name where the json is placed.

   “entity_type”: "gene",
   “entity_id_type”: “ensembl",
   “entity_id": "ensg11867234247",
   “start”: 12764,
   “end”: 13264,
   “source_db”: “ucsc“,
   “genome”: “hg38”,

Use the following syntax when crating the values for each attribute:

This is our script that makes the annotation file -> json conversion.

STEP 3: 
register landing tables in Presto:

For each annotation file we need to register a table in the landing zone:

CREATE TABLE IF NOT EXISTS landing.annotations_hg38_gene_intron (
   entity_type VARCHAR,
   entity_id_type VARCHAR,
   entity_id VARCHAR,
   genome VARCHAR,
   start INT,
   end INT,
   strand VARCHAR,
   source_db VARCHAR,
   chromosome VARCHAR
) WITH (
   format            = ''JSON'',
   partitioned_by    = ARRAY[''chromosome''],
   external_location = ''s3a://sherlock/landing_zone/annotations_hg38_gene_intron'' );

STEP 4: 
use hive CLI to refresh the partition list

msck repair table landing.annotations_hg38_gene_intron;

STEP 5: 
convert to ORC in the master zone (+ total ordering)

CREATE TABLE master.annotations_hg38_gene_intron WITH (
   format = 'ORC',
   partitioned_by = ARRAY['chromosome']
) AS SELECT * FROM landing.annotations_hg38_gene_intron ORDER BY start;

In the end we will have the master annotation files in the data lake, like: s3://sherlock/master_zone/annotations_hg38_gene_intron/chromosome=chr13/something.orc

© 2018, 2019 Earlham Institute (License)