Help with extracting genome sequences

la466 · February 27, 2019, 12:27pm

Hi

I wish to extract alignments between sequences for human and macaque, for which I have the genome coordinates of the human (hg19) in bed format. If I upload a file containing the entry below:

chr1 840486 841186 XLOC_000010 0 + 840486 841186 0,0,0 2 573,5, 0,695,

However, just sampling the outputs, I get these below:

Extract genomic DNA:

chr1:840486-841186(+) CTGTTGGGTCCCCTCCCCGCCCCACCGCGTCCCAGGGAACCCCGGCAGGG CACCCAGTGAGGGGGGCCCGGGCGTCCGCCCATTCCTCACTGCTGTCCCC GCCTGTGCCCGAAACCCCCGTTCACGTTCACCGAGAAAACAGACATAAAC CCAGCCAGGcacatccactagaatggctgtgatttcagaaaagcggacgt aagtgctgccgaggagatggaggcgttggaccccctcgcgcattgtcggg gcgggtgcagccgcggtgcaaaaggaccttcctcagaaagttgagcacga agttcccacaggcccggaagttcccctcccgggcgctccccagagagctg aagacTGGGCGCGTGCCGGCGGCAAATGTTCACAGCAAAGGGCGGCCCAG TGCGCAGCTGGAACATCAGCCCCAGGCGGTCGCTGCGACAGGGACGAGCC TGGGAAACGTGAaaatgtccagaacagggaatccacagataaagaaaaga catcgtggttgccagagctcgcgggagggggcaacagggaccgactgctt aacgtgtatggttccccttcagggtgaggacgtgttctgggacgaggtcg aggtcagggttgcaggacaagatgaaagcgctaaatgccactgaattgtt tgctttaacgtcattaattttgttatgtgaatttcatctcaatAGAATAA

Stitch gene blocks:

hg19.XLOC_000010 CTGTTGGGTCCCCTCCCCGCCCCACCGCGTCCCAGGGAACCCCGGCAGGGCACCCAGTGAGGGGGGCCCGGGCGTCCGCCCATTCCTCACTGCTGTCCCCGCCTGTGCCCGAAACCCCCGTTCACGTTCACCGAGAAAACAGACATAAACCCAGCCAGGcacatccactagaatggctgtgatttcagaaaagcggacgtaagtgctgccgaggagatggaggcgttggaccccctcgcgcattgtcggggcgggtgcagccgcggtgcaaaaggaccttcctcagaaagttgagcacgaagttcccacaggcccggaagttcccctcccgggcgctccccagagagctgaagacTGGGCGCGTGCCGGCGGCAAATGTTCACAGCAAAGGGCGGCCCAGTGCGCAGCTGGAACATCAGCCCCAGGCGGTCGCTGCGACAGGGACGAGCCTGGGAAACGTGAaaatgtccagaacagggaatccacagataaagaaaagacatcgtggttgccagagctcgcgggagggggcaacagggaccgactgcttaacgtgtatggttccccttcaggAATAA

Am I interpreting this wrong, or are the exons wrongly annotated as both uppercase and lowercase? The motif ‘TTCAGG’ should be the end of the first exon, but in the first example it is lowercase (I’ve marked as bold in the sequence) and in the second case is lowercase.

It seems like the sequence is as I would expect, but the uppercase/lowercase of exons/introns are not as I would expect. I need both exon and intron sequence for further analysis so this is an issue, unless I am missing something obvious here.

Can anybody help?

Thanks

Liam

jennaj · February 27, 2019, 7:36pm

Hi - The upper/lower cases bases are with respect the repeat masking (do not represent exons/introns).

There are a few variations of masking applied (or not applied) the UCSC Feb 2009 GRCh37/hg19 genome build – represented in the different data versions published here: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/

The Extract Genomic DNA tool extracts sequence from a hg19 fasta version of the genome with lower case base masking applied to RepeatMasker and Tandem Repeats Finder regions. If you want to visualize this, view your bed dataset in the UCSC genome browser here: https://genome.ucsc.edu or expand the dataset within Galaxy and click on the “view at UCSC main” link and a browser window will open up. The default view has the repeat tracks displayed.

The "Stitch Gene Blocks` tool extracts sequence from the MAF alignments associated with the UCSC’s Conservation track. I’m not exactly sure which hg19 genome masking version this track used, but it appears to be slightly different. You could ask UCSC which masking version was used to generate the MAF for clarification. This is the source MAF used in Galaxy: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/

For reference, the tool Bedtools GetFastaBed also extracts fasta from the sequence from a hg19 fasta version of the genome with lower case base masking applied to RepeatMasker and Tandem Repeats Finder regions (same as Extract Genomic DNA).

jennaj · February 27, 2019, 7:49pm

Sorry, forgot to include a solution that I think will achieve extracting the data you want.

The output from the tools above does not include introns when given a bed12 input. You could restrict the current bed12 dataset down to six columns (bed6) to get the full genomic sequence for the entire region, but a tool like Gene BED To Exon/Intron/Codon BED allows for more specific region extractions by unpacking a bed12 into bed6 datasets representing UTR, coding, exons, introns – alone or in various combinations. Then sequence can be extracted from those specific bed6 datasets.

Please give that a try and see if it produces what you need.

Topic		Replies	Views
Extract Genomic DNA: index not found for hg19 usegalaxy.eu support bed , reference-index , chip-seq , server-side-error , epigenetics	4	1000	December 2, 2019
Extracting sequences from bed file using tools extract Genomic DNA tool and bed to Fasta tool usegalaxy.org support metadata , custom-genome , bedtools , custom-build	3	2126	June 30, 2020
Issue with Extract Genomic DNA usegalaxy.eu support macs2	7	1404	May 18, 2019
Extract Genomic DNA- Chromosome was not found for hg19 usegalaxy.org support bed	2	796	November 20, 2020
Extract Genomic DNA won't work usegalaxy.org support fasta-manipulation , bed , custom-build	3	1673	December 11, 2019

Help with extracting genome sequences

Related topics