Help with extracting genome sequences

Hi

I wish to extract alignments between sequences for human and macaque, for which I have the genome coordinates of the human (hg19) in bed format. If I upload a file containing the entry below:

chr1 840486 841186 XLOC_000010 0 + 840486 841186 0,0,0 2 573,5, 0,695,

However, just sampling the outputs, I get these below:

Extract genomic DNA:

chr1:840486-841186(+) CTGTTGGGTCCCCTCCCCGCCCCACCGCGTCCCAGGGAACCCCGGCAGGG CACCCAGTGAGGGGGGCCCGGGCGTCCGCCCATTCCTCACTGCTGTCCCC GCCTGTGCCCGAAACCCCCGTTCACGTTCACCGAGAAAACAGACATAAAC CCAGCCAGGcacatccactagaatggctgtgatttcagaaaagcggacgt aagtgctgccgaggagatggaggcgttggaccccctcgcgcattgtcggg gcgggtgcagccgcggtgcaaaaggaccttcctcagaaagttgagcacga agttcccacaggcccggaagttcccctcccgggcgctccccagagagctg aagacTGGGCGCGTGCCGGCGGCAAATGTTCACAGCAAAGGGCGGCCCAG TGCGCAGCTGGAACATCAGCCCCAGGCGGTCGCTGCGACAGGGACGAGCC TGGGAAACGTGAaaatgtccagaacagggaatccacagataaagaaaaga catcgtggttgccagagctcgcgggagggggcaacagggaccgactgctt aacgtgtatggttccccttcagggtgaggacgtgttctgggacgaggtcg aggtcagggttgcaggacaagatgaaagcgctaaatgccactgaattgtt tgctttaacgtcattaattttgttatgtgaatttcatctcaatAGAATAA

Stitch gene blocks:

hg19.XLOC_000010 CTGTTGGGTCCCCTCCCCGCCCCACCGCGTCCCAGGGAACCCCGGCAGGGCACCCAGTGAGGGGGGCCCGGGCGTCCGCCCATTCCTCACTGCTGTCCCCGCCTGTGCCCGAAACCCCCGTTCACGTTCACCGAGAAAACAGACATAAACCCAGCCAGGcacatccactagaatggctgtgatttcagaaaagcggacgtaagtgctgccgaggagatggaggcgttggaccccctcgcgcattgtcggggcgggtgcagccgcggtgcaaaaggaccttcctcagaaagttgagcacgaagttcccacaggcccggaagttcccctcccgggcgctccccagagagctgaagacTGGGCGCGTGCCGGCGGCAAATGTTCACAGCAAAGGGCGGCCCAGTGCGCAGCTGGAACATCAGCCCCAGGCGGTCGCTGCGACAGGGACGAGCCTGGGAAACGTGAaaatgtccagaacagggaatccacagataaagaaaagacatcgtggttgccagagctcgcgggagggggcaacagggaccgactgcttaacgtgtatggttccccttcaggAATAA

Am I interpreting this wrong, or are the exons wrongly annotated as both uppercase and lowercase? The motif ‘TTCAGG’ should be the end of the first exon, but in the first example it is lowercase (I’ve marked as bold in the sequence) and in the second case is lowercase.

It seems like the sequence is as I would expect, but the uppercase/lowercase of exons/introns are not as I would expect. I need both exon and intron sequence for further analysis so this is an issue, unless I am missing something obvious here.

Can anybody help?

Thanks

Liam

1 Like

Hi - The upper/lower cases bases are with respect the repeat masking (do not represent exons/introns).

There are a few variations of masking applied (or not applied) the UCSC Feb 2009 GRCh37/hg19 genome build – represented in the different data versions published here: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/

The Extract Genomic DNA tool extracts sequence from a hg19 fasta version of the genome with lower case base masking applied to RepeatMasker and Tandem Repeats Finder regions. If you want to visualize this, view your bed dataset in the UCSC genome browser here: https://genome.ucsc.edu or expand the dataset within Galaxy and click on the “view at UCSC main” link and a browser window will open up. The default view has the repeat tracks displayed.

The "Stitch Gene Blocks` tool extracts sequence from the MAF alignments associated with the UCSC’s Conservation track. I’m not exactly sure which hg19 genome masking version this track used, but it appears to be slightly different. You could ask UCSC which masking version was used to generate the MAF for clarification. This is the source MAF used in Galaxy: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/

For reference, the tool Bedtools GetFastaBed also extracts fasta from the sequence from a hg19 fasta version of the genome with lower case base masking applied to RepeatMasker and Tandem Repeats Finder regions (same as Extract Genomic DNA).

Sorry, forgot to include a solution that I think will achieve extracting the data you want.

The output from the tools above does not include introns when given a bed12 input. You could restrict the current bed12 dataset down to six columns (bed6) to get the full genomic sequence for the entire region, but a tool like Gene BED To Exon/Intron/Codon BED allows for more specific region extractions by unpacking a bed12 into bed6 datasets representing UTR, coding, exons, introns – alone or in various combinations. Then sequence can be extracted from those specific bed6 datasets.

Please give that a try and see if it produces what you need.