I have a local installation of Galaxy that I set up some time ago - I am back at it, and just tried to run a fastq to fasta on a dataset in my history. I ran fastq quality converter, and that worked fine, but I got an error running fastq to fasta. The bug report says:
fastq_to_fasta: Invalid input: expecting FASTQ prefix character ‘@’ on line 36279761. Is this a valid FASTQ file?
But I downloaded the resulting fasta file and it looks fine to me and the last sequence in the fasta file is the last sequence in the fastq file.
I have two questions - can I find out what went wrong, and even if not, can I change the “state” of the dataset from ERROR back to good? Since it is my installation, I have access to all the files, of course.
With an older Galaxy release, older tool versions, or some mix of updated and not-updated Galaxy and tools, many things can go wrong.
The error indicates that the datatype assigned to the fastq data does not match the actual data format. This used to happen when fastq data was labeled as compressed when it was not, or the reverse. It can also occur if a dataset is truncated or has some other internal formatting problem.
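One quick way to test the compressed/uncompressed mismatch described above is to look at the first two bytes of the file on disk. This is only a sketch: the function name `looks_gzipped` is made up for illustration, and it only detects gzip (magic bytes 0x1f 0x8b), not other compression formats.

```python
def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"
```

If this disagrees with the datatype Galaxy shows for the dataset (for example, the file is gzipped but the datatype is a plain fastq type), that mismatch is a likely cause of the error.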
You could try to figure out what the actual format is and assign the correct datatype and try the tool again (rerun). Or fix what is wrong (truncated end of file?), reupload, run again, etc.
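To find out where the file goes wrong, a small script can walk through it four lines at a time and report the first record that does not look like FASTQ. The sketch below is not how fastq_to_fasta itself validates input; the function names are hypothetical, and it assumes strict 4-line records (no wrapped sequences). Note that the line number in the error above, 36279761, is exactly one past a multiple of four, which is where a header line would be expected, so a trailing blank line or stray text after the last record is one plausible culprit.

```python
import gzip
import itertools


def open_maybe_gzip(path):
    """Open a text file, transparently handling gzip compression."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def first_bad_record(path):
    """Return (line_number, reason) for the first malformed record, or None."""
    with open_maybe_gzip(path) as handle:
        line_no = 1  # 1-based line number of the current record's header
        while True:
            record = [line.rstrip("\r\n") for line in itertools.islice(handle, 4)]
            if not record:
                return None  # reached end of file without finding a problem
            if not record[0].startswith("@"):
                return (line_no, "expected '@' at start of header, got {!r}".format(record[0][:20]))
            if len(record) < 4:
                return (line_no, "truncated record: only {} of 4 lines".format(len(record)))
            header, seq, plus, qual = record
            if not plus.startswith("+"):
                return (line_no + 2, "separator line does not start with '+'")
            if len(seq) != len(qual):
                return (line_no + 3, "sequence and quality lengths differ")
            line_no += 4
```

Calling `first_bad_record("reads.fastq")` (the file name is a placeholder) returns `None` for a clean file. If it reports a problem on the very last lines, removing the trailing junk and re-uploading may be enough.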
The best way forward, to avoid other problems, would be to update your Galaxy instance to the most current release and to also update your tools to the most current version.
Be aware that Solexa data is a much older format and is not really supported anymore – so do your updates with that in mind. Or don’t update and do your best to work around/solve some of the usage problems that might come up when working with older releases/tools.
The Galaxy FAQs address common troubleshooting solutions. These are based on the newer releases, but the basics apply to bioinformatics in general (in Galaxy or not).
In short: Tools are picky about inputs! Correct file/dataset format matters. Assigned datatype matters. More has been added to Galaxy to help people avoid usage issues, but not everything odd can be captured, and the earlier releases/tool versions don’t have all of the latest upgrades/UI-assists. Unless you update, you’ll need to work through any problems encountered. Sorting out old bugs (now resolved) from actual new bugs (rare) and from usage issues (common) will be a challenge, yet it is certainly possible, and it will still be easier than setting up the same experiment correctly on the command line. Plus, working on the command line you’ll lose the other parts of Galaxy’s reproducibility bonuses: histories, workflows, sharing.
I got the same error recently with a newly compiled fastx-toolkit while parsing a fastq file: it stopped halfway. The software is no longer maintained, so it may no longer be suitable for more modern systems (it needed patching just to compile). Instead, I wrote this Python snippet, which got the job done as well as fastq_to_fasta. Hth.
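import itertools

def fastq2fasta(filein, savename, noN=False):
    # Convert 4-line FASTQ records to FASTA; with noN=True, skip reads containing "N".
    with open(filein, 'r') as infile, open(savename, 'w') as savefile:
        while True:
            next_4_lines = list(itertools.islice(infile, 4))
            if len(next_4_lines) < 2:
                break  # end of file, or a trailing fragment with no sequence line
            # Drop the leading '@' so the FASTA header reads ">name", not ">@name"
            fasta_header = ">{}".format(next_4_lines[0].strip().lstrip("@"))
            seq_line = next_4_lines[1].strip()
            if noN and "N" in seq_line:
                continue
            savefile.write("{}\n{}\n".format(fasta_header, seq_line))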