Fetching a sequence region around an SNP

In this example, we have a single point mutation on chromosome 11 at position 58723, and we want to fetch the wild-type sequence around this SNP, say 700 bases on either side.

The following query assumes that the sequence is stored in the data lake in chunks at least 700 bases long (which is in fact how we store it). The inner query fetches 3 chunks from the genome and joins them into one long sequence; the outer query cuts our region of interest out of this “long sequence”. The query also handles the case where fewer than 700 bases are available on one side of the SNP (e.g. if the mutation is located at the very beginning or very end of the chromosome). SNP_POS stands for the position of the SNP, 58723 in this example.

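SELECT
   SUBSTR(long_sequence,
          -- start index of the window, clamped to the start of the fetched sequence
          GREATEST(1, SNP_POS - 700 - long_sequence_start),
          -- window length: clamped end index minus clamped start index, plus one
          LEAST(LENGTH(long_sequence), SNP_POS + 700 - long_sequence_start)
            - GREATEST(1, SNP_POS - 700 - long_sequence_start) + 1)
   AS result
FROM (
        -- concatenate the neighbouring chunks in genomic order
        SELECT
            ARRAY_JOIN(ARRAY_AGG(genome.sequence ORDER BY genome.start), '') AS long_sequence,
            MIN(genome.start) AS long_sequence_start
        FROM master.hg38 genome
        WHERE genome.chr = 'chr11'
          AND genome.start BETWEEN SNP_POS - 2000 AND SNP_POS + 1000
      );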
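
The windowing logic of the query can be checked outside the data lake with a small sketch. The Python function below is only an illustration, not part of the data lake tooling: fetch_window and its parameters are hypothetical names, and it assumes that chunks are contiguous, that each chunk's start is the 1-based position of its first base, and that SNP_POS is 1-based as well. If the start column uses a different convention, the index offsets need adjusting accordingly.

```python
def fetch_window(chunks, snp_pos, flank=700):
    """Return the bases within `flank` of `snp_pos`.

    chunks: list of (start, sequence) tuples, where `start` is assumed to be
    the 1-based genomic position of the first base of the chunk.
    """
    # Same as the inner query: order the chunks and join them.
    ordered = sorted(chunks)
    long_sequence = "".join(seq for _, seq in ordered)
    long_start = ordered[0][0]

    # Same as the outer query: clamp the window to the fetched sequence.
    # The 0-based index of genomic position p is p - long_start.
    lo = max(0, snp_pos - flank - long_start)
    hi = min(len(long_sequence), snp_pos + flank - long_start + 1)
    return long_sequence[lo:hi]


# Toy chromosome of 5000 bases, stored in chunks of 1000 bases.
toy = "ACGT" * 1250
chunks = [(i + 1, toy[i:i + 1000]) for i in range(0, len(toy), 1000)]

window = fetch_window(chunks, snp_pos=2500)
print(len(window))  # 1401: 700 bases on each side plus the SNP itself

edge = fetch_window(chunks, snp_pos=300)
print(len(edge))    # 1000: the left flank is cut off at the chromosome start
```

A full window is 1401 bases long; near either end of the chromosome the result is shorter, which mirrors the clamping that the query performs.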

© 2018, 2019 Earlham Institute (License)