fastq unavailable -- Tool does not recognize inputs? How to check why

I am trying to use BBMap on an uploaded FASTQ dataset, but it shows as unavailable. The same happens if I upload FASTQ.GZ and unzip it in Galaxy. How can I get around this? The data is originally PacBio.

Thanks.

Hi @TTP

Most tools in Galaxy expect reads in fastqsanger or fastqsanger.gz format and with one of those datatypes assigned. Those datatypes represent a specific type of quality score scaling: Sanger Phred +33.

This is produced by Illumina 1.8+ pipelines. PacBio used Phred+33 scaling to start with and now the scaling is neutral. You can double check with a tool like FastQC or FASTQ info to see what those report.
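For a quick look outside of those tools, the scaling can also be roughly guessed from the range of characters on the quality lines. Below is a minimal Python sketch, assuming an uncompressed FASTQ with four lines per record. The function names are hypothetical, and the threshold is the commonly used heuristic (a character below ';' can only come from Phred+33), not what FastQC or Galaxy run internally.

```python
def quality_range(lines):
    """Return (lowest, highest) ASCII code seen on the quality lines.

    Assumes strict 4-line FASTQ records: title, sequence, '+', quality.
    """
    lo, hi = None, None
    for i, line in enumerate(lines):
        if i % 4 != 3:  # only every fourth line holds quality characters
            continue
        codes = [ord(c) for c in line.rstrip("\r\n")]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    if lo is None:
        raise ValueError("no quality lines found")
    return lo, hi


def guess_scaling(lo, hi):
    """Heuristic guess of the quality offset from the observed range."""
    if lo < 33 or hi > 126:
        return "invalid"    # outside the printable range used by FASTQ
    if lo < 59:
        return "phred33"    # characters below ';' only occur with offset 33
    return "ambiguous"      # could be offset 64, or high-quality offset 33
```

Usage would look like `with open("reads.fastq") as fh: print(guess_scaling(*quality_range(fh)))`, where `reads.fastq` is a placeholder file name. A result of "ambiguous" only means the observed range cannot rule out either offset.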

Some tools expect just fastq or fastq.gz (scaling unspecified). If you are ever unsure which datatype(s) a tool accepts as input (any tool, not just those that accept reads), try this:

  1. Create a new empty history
  2. Load up the tool form
  3. Review the input datatypes listed in the tool form’s input area
  4. Note: A tool could have more than one input area and this works for all user-supplied inputs from the history.


Thanks for the quick reply. Visual inspection of the file contents shows that it is OK, and it can be loaded into other tools such as FASTP. So I am not sure what needs to be fixed with this file to run it on BBMap. It doesn’t give me any other information besides “unavailable”.

FASTQ info gives this:

fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from /corral4/main/objects/2/b/2/dataset_2b240dcd-29ca-4631-be15-688dad04558c.dat
Read name provided with no suffix
Scanning complete.

Reads processed: 34797
Memory used in indexing: ~2 MB

Hi @TTP

You probably just need to change the datatype of your datasets.

Tools will only accept datasets whose assigned datatype matches one of the datatypes listed in the input field prompt. Data with any other datatype is filtered out as a potential input. All tools work this way.

FAQ changing-the-datatype

FASTQ info does basic validation/format checks. What you posted is the output when it doesn’t find a problem (“pass”).

FastQC generates more metrics. The quality score scaling is included in the top section. This help from yesterday explains how to review/read a FastQC report: FastQC Overrepresented sequences percentage - #2 by jennaj

The tool requires fastq and I am giving it fastq. Other fastq files are recognized fine. I uploaded the file again and got the same result, and I also get the same result when I upload fastq.gz and unzip it in Galaxy.

Hi @TTP

It requires fastqsanger or fastqsanger.gz. You can assign that directly to datasets that have that format even if Galaxy does not detect that format.

If you want to share the history containing that dataset for a second opinion on the format, please generate a share link to the history and post it back here publicly. Or, ask a moderator to start a direct message chat so you can share the link privately instead. How to generate the link: sharing-your-history

Thanks; here is the link

https://usegalaxy.org/u/tpaull/h/unnamed-history

It is 1123: 2.28-sequences.fastq

Hi @TTP

Review the parameters and error messages on the Dataset Information page for any job: troubleshooting-errors

The current error will resolve if this option is set to Yes: Quality and trimming options > Keep going, rather than crashing, if a read has out-of-range quality values?

Once adjusted, you’ll get different informative errors. For this case, the parameters for BBMap need to be tuned to map the long reads. Suggested parameters are available in many places online. Try searching forums like Seqanswers and Biostar, or any others you usually use, then tune to fit your data.

An alternative mapper with pre-set options for PacBio reads is Minimap2. We have a tutorial that covers usage here: Antibiotic resistance detection

Side note: The fasta target has a technical problem: the custom genome fasta has a title line that doesn’t include an identifier and includes many spaces. The fasta identifier is the first “word” after the > symbol and everything else is considered description content. datatypes/#fasta and working-with-fasta-datasets

  • A tool like Sed can be used with an expression like s/ //g to remove all spaces. data-manipulation-olympics.
  • This wasn’t failing mapping tools but could certainly cause problems with other tools.
  • Search this forum with keywords like “custom” or “fasta” to find much Q&A about how custom genome fasta target datasets should be formatted.
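To make the identifier rule concrete, here is a minimal Python sketch of how a title line splits into identifier and description, plus a rough counterpart to the s/ //g expression. The function names are hypothetical, and unlike the Sed expression (which acts on every line), this version only touches title lines.

```python
def split_title(line):
    """Split a FASTA title line into (identifier, description).

    The identifier is the first "word" directly after '>'; if whitespace
    follows '>' immediately, the identifier is treated as missing ("").
    """
    body = line.rstrip("\r\n")[1:]
    if not body or body[0].isspace():
        return "", body.strip()
    parts = body.split(None, 1)
    description = parts[1] if len(parts) > 1 else ""
    return parts[0], description


def remove_title_spaces(line):
    """Remove all spaces from title lines, leaving sequence lines as-is."""
    if line.startswith(">"):
        return line.replace(" ", "")
    return line
```

For example, `split_title(">chr1 my custom genome")` gives `("chr1", "my custom genome")`, while a title such as `"> my custom genome"` has no identifier until the spaces are removed.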

Thanks. I think I can get the mapping to work now, and the correction about the fasta file is useful. Even with the fasta file fixed, though, I am not able to load the indexed bam files into IGV; I get an error: “Invalid BAM file header: missing sequence name in file”.
There is an invariant string at the beginning of each line in the bam file (m64277e_220913_154253), but putting this as the header in the fasta file also doesn’t work. I assume that the fasta file is still the problem here, not the BAM file?
The current header is >m64277e_220913_154253

Hi @TTP

IGV will not have your “custom genome” fasta already indexed. But you can create a custom “database” key from your fasta both in Galaxy and in IGV, assign that to datasets, and view all just as if the database was natively indexed.

The error you are getting seems like you assigned some other database key that was already in Galaxy and in IGV – and your data is not actually based on that other database, so a mismatch error resulted.

This prior Q&A has many more details. The context is about an assembly result, but the basics are the same: how to use IGV with a custom fasta target/genome plus any other data that is associated with it (like a mapping result, e.g., BAM). Opening Unicycler assemblies with IGV local - #4 by jennaj
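One way to check for this kind of mismatch yourself is to compare the sequence names in the BAM header with the identifiers in the fasta. The Python sketch below assumes the header has already been exported as text (for example with `samtools view -H`) and that the fasta is available as text; the function names are hypothetical, and this only compares names, not sequence lengths.

```python
def sam_sequence_names(header_text):
    """Collect the SN: values from @SQ lines of a SAM/BAM text header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def fasta_identifiers(fasta_text):
    """Collect the first word after '>' from each FASTA title line."""
    ids = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        body = line[1:]
        parts = body.split()
        # An empty string marks a title line with no usable identifier.
        ids.append(parts[0] if parts and not body[0].isspace() else "")
    return ids


def compare_names(header_text, fasta_text):
    """Return (names only in the header, identifiers only in the fasta)."""
    sam = set(sam_sequence_names(header_text))
    fasta = set(fasta_identifiers(fasta_text))
    return sorted(sam - fasta), sorted(fasta - sam)
```

If both returned lists are empty, the names agree. An empty string in either list points to a missing sequence name or identifier.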