Extract Genomic DNA Invalid Lines Error

aroebuck · December 14, 2020, 7:29pm

I am trying to run Extract Genomic DNA on a bed file in which I have an ID, the start coordinate and the end coordinate. The sequences I am targeting represent products for potential primers (~180 of them total).
The coordinate file is set up as follows:
Chrom Start End
chr1000 184143 184871

The file I am matching it to is a set of ORFs that I extracted from a BLAST search, in the following format:

chr1 GTGCTCACCGCCGCCGCGNNNNNNN…
chr10 CTGGCTGCGGCGCACCTACGGCCTCACCGTCGC…
chr100 ATGGTGGCCACCGCTCCCCCGGCGAACCGGGCCAGGATGGCACAG
etc…

I have both files assigned to my custom build for the organism. When I try to run this, however, I get the error:
181 warnings, 1st is: Unable to fetch the sequence from ‘184143’ to ‘171’ for chrom 'chr1000 '.
Skipped 181 invalid lines, 1st is #1, “chr1000 184143 184314”

I think I have made the formats of the chromosome names as identical as they get, but why will this still not function? I was wondering if it came to a matter of tabs vs spaces?

Any ideas are helpful. Thanks!

EDIT: I tried converting my bed file to tab-delimited (just in case, even though I had already saved it as tab-delimited) and ran the tool again. It still reports all of my lines as invalid.
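For anyone checking the tabs-versus-spaces question, here is a minimal Python sketch that reports whitespace problems in a single bed line and rewrites it with single tabs. The function names `describe_bed_line` and `clean_bed_line` are made up for illustration and are not part of Galaxy or bedtools. Note that the chromosome name quoted in the warning above, 'chr1000 ', appears to end in a space, which is the kind of thing this would catch.

```python
def describe_bed_line(line):
    """Return a list of whitespace/format problems found in one bed line."""
    problems = []
    body = line.rstrip("\n")
    if body != body.rstrip():
        problems.append("trailing whitespace")
    if " " in body.strip():
        problems.append("contains space characters")
    fields = body.split("\t")
    if len(fields) < 3:
        problems.append("fewer than 3 tab-separated columns")
    elif not (fields[1].strip().isdigit() and fields[2].strip().isdigit()):
        problems.append("start/end are not integers")
    return problems


def clean_bed_line(line):
    """Split on any run of whitespace and rejoin the fields with single tabs."""
    return "\t".join(line.split())


if __name__ == "__main__":
    raw = "chr1000 184143 184314\n"
    print(describe_bed_line(raw))
    print(repr(clean_bed_line(raw)))
```

The cleaning step assumes no bed field legitimately contains a space, which holds for a plain 3-column file but not necessarily for files with a free-text name column.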

nick · December 14, 2020, 11:53pm

I’m sorry you’re having this issue. Before going any further I’d suggest seeing if the tool “bedtools GetFastaBed” would work instead. In most cases it can perform the same job as “Extract Genomic DNA”.

aroebuck · December 21, 2020, 8:27pm

Thank you for the suggestion - I tried this and got the following (it showed up green at the end but still failed to extract any of my sequences):
WARNING. chromosome (chr1000) was not found in the FASTA file. Skipping.

I double checked the fasta file and it definitely is still there:
“>chr1000 ATAGCTCGCCCCGCCGGGCGGCCCGGGCGAGGGCTCCGGGGCCTCGTCGGTGAACGGCGTCGCCCCACCGGCCATCCGCTCCAGGGCCGGCCC…”

Does this program require different formatting?

Thanks

jennaj · December 22, 2020, 8:55pm

Hi @aroebuck

A few questions:

Did you run the tool NormalizeFasta on your custom genome before using it, and before promoting it to a custom build? If not, try that first as a possible solution. You’ll need to name the new custom build with a different “database” name (dbkey), or delete the old one first.
How many sequences (“chromosomes”) are in your custom genome? If the fasta has 1000s of chromosomes, that might be problematic.
Check to make sure the bed file is formatted correctly.
Please confirm that you are working at usegalaxy.org. If somewhere else, please describe.
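
To illustrate the kind of cleanup the NormalizeFasta suggestion is aiming at, here is a rough Python sketch. It is not the tool's implementation. It assumes the goal is title lines that hold only the identifier (everything after the first whitespace is dropped) and sequence lines wrapped to a fixed width. The function name `normalize_fasta` and the default width of 60 are assumptions for the example.

```python
def normalize_fasta(lines, width=60):
    """Return FASTA lines with titles cut at the first whitespace
    and each sequence re-wrapped to `width` characters per line."""
    records = []  # list of [identifier, list of sequence pieces]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            identifier = parts[0] if parts else ""
            records.append([identifier, []])
        elif records:
            records[-1][1].append(line)

    out = []
    for identifier, pieces in records:
        out.append(">" + identifier)
        seq = "".join(pieces)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


if __name__ == "__main__":
    example = [">chr1 some description\n", "ACGTAC\n", "GT\n", ">chr10\n", "GGGG\n"]
    print("\n".join(normalize_fasta(example, width=4)))
```

One caveat: anything after the identifier on a title line is treated as description and discarded, so a file where the sequence sits on the same line as the “>” header would lose that sequence here and needs fixing first.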

FAQs: Galaxy Support

I also added some tags to your post that link to prior Q&A for custom genome/build troubleshooting.

Let’s start troubleshooting more from there.

aroebuck · December 22, 2020, 10:23pm

Hi @jennaj, I did try normalizing, but I will try again. I am actually using an instance hosted on my own computer. I am trying to extract sequences for primer design. I started with a single WGS, then obtained a series of unique ORFs from which I am trying to extract the sequences that will be putative targets for strain-specific primers.

Originally I tried using the initial genome (one long sequence) as my “custom build” but that didn’t work because (I am guessing) it isn’t broken into chromosomes. I have since been trying to use the fasta file containing the unique ORFs as my “custom build” since it has multiple sequences that can be treated as chromosomes. However, there are over 47 000 sequences in that file so could that be the issue?

Could it also be an issue that the file I used to make the custom build is the same file as the reference genome? I’m afraid I don’t understand why there needs to be a build to begin with since I thought it would just match the headers in the reference genome and the coordinate file. I have checked the formatting of both files and I think they’re both good (ran a replace spaces with tabs and condense multiple tabs to one).

Any more advice you can give based on this would be amazing.

Thank you very much,

Andrea

jennaj · January 11, 2021, 10:14pm

Apologies for the delay, @aroebuck. Did you solve the problem?

If not, some more feedback:

On using the initial genome (one long sequence) as the custom build: that would be a fasta containing one sequence. As long as the format is fasta, that is fine (and actually quite common: bacteria, etc.). My guess is that the bed content was not based on that sequence, which resulted in an error.

If the “reference genome” fasta has that many “chromosomes/sequences”, you may run into problems (memory) with some tools, but I wouldn’t expect this tool to have a problem.

When using a custom genome with this tool, the expected inputs are:

A bed input with at least 3 columns. If you include more columns, they must follow the bed format.
A fasta input that has been run through NormalizeFasta.
The “chromosome” identifiers in the first column of the bed file must be an exact match for the identifiers in the fasta file (without the “>” at the start, and with no description content on the title line, just the identifier).
Both are selected on the tool form from the working history’s set of datasets.
See the Datatype’s FAQ in the original reply if you are not sure how to format the bed input.
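
The expectations above can be checked before running either tool. Below is a minimal Python sketch, not anything Galaxy itself runs. It reads the fasta titles exactly as written (no trimming, so a stray space or description shows up as a mismatch), then checks each bed line for an exact identifier match and for coordinates that fit inside the named sequence. The names `fasta_lengths` and `check_bed` are hypothetical, and the sketch assumes 0-based, half-open bed coordinates.

```python
def fasta_lengths(lines):
    """Map each full title line (minus '>') to its sequence length."""
    lengths = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            current = line[1:]  # deliberately not trimmed
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line.strip())
    return lengths


def check_bed(bed_lines, lengths):
    """Return (line_number, message) pairs for bed lines that cannot match."""
    problems = []
    for number, raw in enumerate(bed_lines, start=1):
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 3:
            problems.append((number, "fewer than 3 tab-separated columns"))
            continue
        chrom, start, end = fields[0], fields[1], fields[2]
        if chrom not in lengths:
            problems.append((number, "identifier %r not in FASTA" % chrom))
            continue
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start/end are not integers"))
            continue
        if not (0 <= int(start) < int(end) <= lengths[chrom]):
            problems.append(
                (number, "interval outside sequence of length %d" % lengths[chrom])
            )
    return problems


if __name__ == "__main__":
    fasta = [">chr1\n", "ACGTACGT\n", ">chr1000\n", "ACGT\n"]
    bed = ["chr1\t0\t4\n", "chr1000 \t0\t4\n", "chr1000\t184143\t184314\n"]
    for number, message in check_bed(bed, fasta_lengths(fasta)):
        print(number, message)
```

The second check matters when the coordinates were computed against a different sequence than the one in the fasta (for example, positions on a whole genome applied to short ORF records): the identifier matches, but the interval falls outside the sequence.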

The alternative tool is bedtools GetFastaBed, but if your bed and fasta are not a match, or the bed input has format problems, this tool will fail as well.

Maybe try at a public Galaxy server, then compare to your own server to rule out technical issues on your local instance? That way you could also share the history (privately if you want) for more feedback. Or, capture some screenshots of the bed and fasta (some example data that you think should be matching up) and post those back.

Thanks!

aroebuck · January 22, 2021, 5:49pm

Thank you @jennaj for the suggestions. I was able to resolve the issue by the following:

1. Split my large file containing the ORFs of interest (~47 000 sequences) into separate files of 500 sequences each.
2. Used one of the files from Step 1 to find putative primers and generate a bed file with the coordinates (using Primer3).
3. Created a custom build on Galaxy using the same file that I used to find the primers in Step 2. This same file was also used as the reference genome for the Extract Genomic DNA run.
4. Assigned the bed file and the reference genome to the custom build.
5. Ran Extract Genomic DNA successfully.
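
For reference, the splitting step could be done outside Galaxy with something like the Python sketch below. It is only an illustration of the idea, not the method actually used in this thread. The function name `split_fasta` is made up, and the chunk size of 500 comes from the steps above.

```python
def split_fasta(lines, per_chunk=500):
    """Group FASTA lines into chunks of at most `per_chunk` records each.

    A record is a '>' title line plus the sequence lines that follow it.
    """
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith(">"):
            if count == per_chunk:
                # current chunk is full: store it and start a new one
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = []
    for i in range(1, 6):
        demo += [">orf%d\n" % i, "ACGT\n"]
    for index, chunk in enumerate(split_fasta(demo, per_chunk=2), start=1):
        print("chunk", index, "holds", sum(l.startswith(">") for l in chunk), "records")
```

Each chunk can then be written to its own file and uploaded as a separate dataset. With roughly 47 000 sequences and 500 per file, that comes to about 94 files.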

I actually didn’t need to normalize the data for this procedure, but that may have been a happy coincidence.

Thanks again for the suggestions!

Andrea
