Extract Genomic DNA won't work

Ilse · December 9, 2019, 10:29am

Hey,

I am trying to extract genomic DNA from a FASTA file and a flanked VCF file.
I have a VCF file and a FASTA file and I want to extract the SNPs with 200 bp flanks.
First I uploaded my FASTA (this one is TAB delimited) and my VCF file. I transformed my VCF file to a pgSnp file and then got the flanks with “Get flanks returns flanking region/s for every gene”. Then I used Extract Genomic DNA to get the sequences. But the output I get is completely empty and says “Unable to fetch the sequence”.

I hope someone is able to help me.

Kind regards,
Ilse

jennaj · December 10, 2019, 3:19am

Hi @Ilse,

Extract Genomic DNA expects two inputs. If the formatting or content is off for either, you will not get valid results.

Query: Genomic coordinates in bed or gtf format.
- The tool lists interval format as a valid input, which is a less strict version of the bed datatype. Even so, the data must have bed column format/ordering for at least the first 3-6 columns (six columns if including strand).
- The tool also lists gff format as a valid input, which is a less strict version of the gtf datatype. Even so, the 9th column (attributes) does need to include the stricter minimum values (gene_id and transcript_id). Do not use a gff3 input, or expect errors.
- Using VCF to pgSnp is fine. The first four columns of data are in interval format, and if you use Cut to restrict to the first four columns, the data will then be in bed format. Using Get flanks will also restrict the output to interval format with bed column ordering.
Target: A locally-cached index on the server or a Custom genome in fasta format.
- To use a “locally-cached” genome, that genome must be assigned as the “database” metadata to the query input.
- To use a Custom genome, you might need to run NormalizeFasta on your fasta to remove the description line content (data on the “>” title line after the first whitespace) and wrap the bases to a consistent length (80 is good). The tool will only be able to interpret data that is actually in fasta format, not tabular. If you need to, transform with Tabular-to-Fasta first.
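
If you want to see outside of Galaxy what those two fasta steps amount to, here is a minimal Python sketch. It is not the code behind Tabular-to-Fasta or NormalizeFasta. The function name `tabular_to_fasta` is hypothetical, and it assumes the identifier is in column 1 and the sequence in column 2 of the tabular file.

```python
def tabular_to_fasta(lines, width=80):
    """Turn 'name<TAB>sequence' rows into wrapped FASTA lines.

    Assumption: column 1 holds the identifier, column 2 the sequence.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank rows
        name, seq = line.split("\t")[:2]
        # Keep only the text before the first whitespace, dropping any
        # description content from the title line.
        identifier = name.split()[0]
        out.append(">" + identifier)
        # Wrap the bases to a consistent line length.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


if __name__ == "__main__":
    rows = ["chr1 some description\tACGTACGTAC"]
    print("\n".join(tabular_to_fasta(rows, width=4)))
```

Within Galaxy, the tools named above remain the simpler route; the sketch only illustrates the target layout (short title line, evenly wrapped bases).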

Also, check your data for chromosome/identifier naming mismatches. Between the two inputs, the “chromosome” names must be formatted exactly the same (this includes upper/lower case), and the overall content must be based on the same reference genome/transcriptome version/build. This means that the identifiers in the first column of your interval/bed dataset must exactly match what is on the “>” title line of the fasta dataset.
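
A quick way to spot such mismatches on downloaded copies of both datasets is to compare the two sets of names directly. The following is a rough Python sketch, not a Galaxy tool. The function names are hypothetical, and it assumes a tab-delimited interval/bed file and that the fasta identifier is the text on the “>” line before the first whitespace.

```python
def fasta_ids(fasta_lines):
    """Collect identifiers from '>' title lines (text before first whitespace)."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def interval_chroms(interval_lines):
    """Collect the names found in column 1 of a tab-delimited interval/bed file."""
    chroms = set()
    for line in interval_lines:
        # Assumption: lines starting with '#', 'track' or 'browser' are headers.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.split("\t")[0].strip())
    return chroms


def report_mismatches(interval_lines, fasta_lines):
    """Map each unmatched interval name to a fasta name that differs only by
    letter case, or to None when there is no such candidate."""
    ids = fasta_ids(fasta_lines)
    by_lower = {name.lower(): name for name in ids}
    missing = interval_chroms(interval_lines) - ids
    return {chrom: by_lower.get(chrom.lower()) for chrom in sorted(missing)}


if __name__ == "__main__":
    bed = ["Chr1\t100\t500", "chr2\t5\t10"]
    fasta = [">chr1 example description", "ACGT", ">chr2", "ACGT"]
    print(report_mismatches(bed, fasta))
```

An empty result means every name in the interval file has an exact match in the fasta. A name mapped to a suggestion points at a capitalization difference; a name mapped to `None` is missing from the fasta altogether.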

The FAQs below have more details:

FAQs: https://galaxyproject.org/support/

Hope that helps, but if not, share some more details. You could copy/paste the first few lines of both inputs and/or post back screenshots in a reply. Make sure to expand the datasets to show the currently assigned datatypes or state exactly what those are for each.

Ilse · December 11, 2019, 9:47am

Hey @jennaj,

Really helpful! Thank you very much.
The only problem was a single capital letter, which resulted in a chromosome/identifier naming mismatch.

jennaj · December 11, 2019, 8:03pm

Yep, sometimes the smallest differences cause problems! Tools are picky and only do exactly what you tell them to do…

Very glad you found and fixed it!