Download and Extract reads in FASTQ format from NCBI SRA in workflow issue

Hi! I am teaching a course where we run through an RNA-seq analysis: we take one sample through multiple steps (download using “Download and Extract Reads in FASTQ format from NCBI SRA”, run FastQC, Trimmomatic, run FastQC again, HISAT2, featureCounts), extract a workflow, and then apply that workflow to 5 more samples. However, I’m having an issue building a workflow because of how the initial Download step is working.

It seems that the Download step is creating two outputs in my history that are lists: one with single-end data (a list with 1 fastqsanger.gz dataset) and one with paired-end data (a list with 0 pairs). There’s also a hidden file in my history that’s saved as a fastqsanger.gz. If I just leave the lists in my history, those lists aren’t detected as a relevant input file type by FastQC. If I manually un-hide the fastqsanger.gz, I can complete the FastQC and all other steps fine. However, if I extract a workflow and try to run it on additional samples, those invocations break, because the workflow can’t automatically go from the output of “Download and Extract Reads in FASTQ format from NCBI SRA” to FastQC.

I’m probably missing something really obvious, but how can I get these steps to play nicely together in a workflow? Is there something I should be doing to those lists after data is downloaded to get them as detectable automatically by FastQC? Here’s my current workflow, showing that Downloading, FastQC, and the rest of the workflow are unconnected. If I try to manually connect them, I get an error box that pops up saying “Can’t map over this input with output collection type”

I ran through this in class two years ago and didn’t have this issue. It used to be that the output of “Download and Extract Reads in FASTQ format from NCBI SRA” was the fastq file itself, so it seems that something has changed about how the output of this tool is saved.

Screenshot 2025-10-27 at 12.02.59 PM

I’m getting the lists using the current version of “Download…” (Galaxy Version 3.1.1+galaxy1). The last time I taught the course, I used Galaxy Version 2.10.8+galaxy0.

Any guidance you could offer would be greatly appreciated!

Thanks,

Emily

Welcome @erdavenport

Thanks for sharing all these details, and I think we’ll be able to help!

There are two similar tools for NCBI SRA fastq data download. The underlying scripts are based on the same NCBI toolkit, but the options are configured to work slightly differently.

I’m going to define the tools first, then address your question. Feel free to jump to the bottom then use the extra help as a reference instead. :slight_smile:

Tool Help

What it does?

This tool extracts data (in fastq format) from the Short Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). It is based on the fasterq-dump utility of the SRA Toolkit. The following applies:

  • if data is paired-ended (or mate-pair) the tool will generate a collection of file pairs, in which each element will be a pair of fastq files containing forward and reverse mates.
  • if data is single ended, each element of the collection will be a single fastq dataset.

Both tools have the same initial Help section but the data is organized a bit differently between the two types of output! The collection folder shape is written on the collection folder in your history. (updated)

These are the main differences:

Download and Extract Reads in FASTQ format from NCBI SRA (current versions) and Faster Download and Extract Reads in FASTQ format from NCBI SRA

Collection folder shape: list of pairs

List of pairs collections are a nested listing of files.

  • Each sample will be written into a grouped element representing that sample. The dataset files will be inside of it.

  • If the data is single end, each sample will have one fastq file.

    • One file of R1 reads per sample.
    • The sample group is at the first level, and the fastq file is nested inside of it.
    • Later on, if the data is sent to a tool that produces multiple outputs, each will still retain that top level sample element identifier (and any group or name tags added).
  • If the data is paired end, each sample will have two fastq files.

    • One fastq file for the forward (R1) reads and one for the reverse (R2) reads.
    • The sample group is at the first level, and both fastq files are nested inside of it.
    • This structure is understood by tools that can accept a paired end collection shape and can considerably speed up analysis.
    • The sample element identifier (plus tags) stay with that pair throughout the analysis.
    • With one sample pair or hundreds (!), and maybe complicated sample/condition tagging, this organization can be very powerful.
    • Blog from last week with a video explaining how this works! → Why collections?

:warning: Download and Extract Reads in FASTQ format from NCBI SRA (legacy versions)

Collection folder shape: list

:hammer_and_wrench: *This is an update! Current versions of both tools will work the same at the UseGalaxy servers. Other servers may still host the legacy version.*

List collections are a flat listing of files.

  • Each sample will be written into a single element of a list collection.
  • If the data is single end, each sample will have one fastq file of the single reads.
    • One file of R1 reads per sample.
  • If the data is paired end, each sample will be a fastq file containing both reads together.
    • One file with both the R1 and R2 reads per sample.
    • This is the “interleaved” type of fastq file – many tools will expect that these will be split out into the forward and reverse reads. The tools from the Seqtk tool group are one way to do this.
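To make the "interleaved" layout concrete, here is a minimal Python sketch of splitting such a file back into forward and reverse reads. It is not how the Seqtk tools are implemented; it assumes every record is exactly four lines (no wrapped sequences, no blank lines) and that mates strictly alternate R1, R2, R1, R2. The function names and the tiny example reads are made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # (header, sequence, separator, qualities)


def read_fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield FASTQ records, assuming exactly four lines per record."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(line.rstrip("\n") for line in chunk)


def deinterleave(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split alternating R1/R2 records into two lists (hypothetical helper)."""
    forward: List[Record] = []
    reverse: List[Record] = []
    for index, record in enumerate(read_fastq_records(lines)):
        # Even positions are assumed to be R1, odd positions R2.
        (forward if index % 2 == 0 else reverse).append(record)
    if len(forward) != len(reverse):
        raise ValueError("odd number of records; input is not cleanly interleaved")
    return forward, reverse


# Made-up example data: two read pairs in one interleaved "file".
interleaved = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
    "@read2/1", "GGCC", "+", "IIII",
    "@read2/2", "CCGG", "+", "IIII",
]
forward, reverse = deinterleave(interleaved)
```

In this sketch `forward` ends up holding the two `/1` records and `reverse` the two `/2` records, which is the layout most paired-end tools expect as separate inputs.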

So, that is a lot of information! But maybe helps to explain why data in “a collection” might not be enough when choosing how to send data to tools. The datatype matters, but also the shape of the data. You are never stuck in any particular shape! See Collection Operation tools for more.

I see you are using the first. The collection folder shape list is accepted by FastQC, so you should be able to connect this output to the tool. However, the downstream tools were connected first, and this can lead to some workflow metadata conflicts. If you hover over the connection, you will get a message about the “shape” of the collection not being a match for what is expected.

:robot: If there is more than one sample in your list, remember to include Collection Operations → Flatten Collection to combine the sample name with the read end label into a unique label (element identifiers) → sample_forward (single end) or sample_forward and sample_reverse (paired end). This will give each fastq dataset’s report a unique label when these are later combined with a tool like MultiQC.
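As a rough illustration of the naming described above (not the Flatten Collection tool's actual code), the sketch below models a list of pairs as a nested Python dictionary and joins the sample name with the read end label. The accession names, file names, and the function name are hypothetical.

```python
from typing import Dict


def flatten_identifiers(nested: Dict[str, Dict[str, str]], sep: str = "_") -> Dict[str, str]:
    """Turn {sample: {end_label: dataset}} into {sample_endlabel: dataset}."""
    flat: Dict[str, str] = {}
    for sample, reads in nested.items():
        for end_label, dataset in reads.items():
            # The joined identifier is unique even when every sample
            # uses the same inner labels ("forward", "reverse").
            flat[f"{sample}{sep}{end_label}"] = dataset
    return flat


# Hypothetical list of pairs: outer level is the sample, inner level the read end.
list_of_pairs = {
    "SRR0000001": {"forward": "SRR0000001_1.fastq.gz", "reverse": "SRR0000001_2.fastq.gz"},
    "SRR0000002": {"forward": "SRR0000002_1.fastq.gz", "reverse": "SRR0000002_2.fastq.gz"},
}
flat = flatten_identifiers(list_of_pairs)
```

Without the join, every inner element would just be called `forward` or `reverse`, and a report that combines them would have no way to tell the samples apart.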

Try this:

  • disconnect all the workflow connection noodles
  • then, connect the tools again in the order of execution
  • for your use case, you will be connecting just one of the outputs from the Download and Extract Reads – the single end list connection
  • once connected, this informs each downstream tool that a list collection of data is passing through
  • you might need to adjust some of the input parameters on the other tools, too
  • this is an excellent tip to remind students about! If the noodles don’t connect, reset the workflow metadata by disconnecting all, then connecting from the start. I usually say that getting the inputs on the canvas as the first step is a good idea!

Please give that a try and let us know if it helps or not! You are also welcome to share back your workflow, or a failed workflow invocation, and we can troubleshoot this more if simply reconnecting is not enough. Maybe one of the downstream tools needs an input choice adjusted. Thanks! :rocket:

BONUS 1

For an example of a similar QA workflow that can handle a paired collection without running into duplicated sample identifiers with FastQC/MultiQC, please see this post. The example workflow could be adjusted to work with single end reads too! This one also has some simple subworkflows that might be good for followup for intermediate students.

Then, this is one discussion from last week where we went into more detail about why this organization works! → Issues with receiving results from FastQC in MultiQC from several collections of samples - #5 by jennaj

BONUS 2

One tip I might suggest is to add in another input! This could isolate the accession input to a single field on the workflow runtime form (at the top!), or a separate text file with a list of accessions (selected from the history). This could be annotated with a usage warning “Single end samples only!” but stated a bit nicer! :scientist:



Thanks for your quick reply @jennaj!

So I understand that I have a flat list in the Single-end data, however, it doesn’t look like FastQC accepts that as an input data type:

How can I get FastQC to work off of that list? Sorry if this is really obvious and I’m just missing it!

Thanks,

Emily


Glad that helped!

For the input area on a tool form, there is an icon for a single dataset, multiple datasets, or a collection.

The default is a single dataset. Click on the collection folder to inform the tool to look for a collection in your history.

If you hover over those icons you see some tool tips! Then, if you still can’t find a dataset, toggle open the accepted formats. The datatype formats the tool is scanning for are listed.

Then, when the shape of the data and the format of the data between your active history and what the tool is expecting are a “match”, your dataset will show up in the listing of potential inputs. This is a bit of automatic QC built into Galaxy to prevent the wrong kind of data from being chosen but can be tricky the first few times!
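If it helps to picture that matching step, here is a toy Python model of it. This is only an illustration of the idea, not Galaxy's implementation: the class, the function, the `kind` values, and the set of accepted formats are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class HistoryItem:
    name: str
    kind: str  # "dataset" or "collection" (simplified for this sketch)
    fmt: str   # datatype of the dataset, or of the datasets inside a collection


def selectable_inputs(history: List[HistoryItem], wanted_kind: str,
                      accepted_formats: Set[str]) -> List[str]:
    """List items whose kind AND format both match what the input expects."""
    return [
        item.name
        for item in history
        if item.kind == wanted_kind and item.fmt in accepted_formats
    ]


# Hypothetical history contents.
history = [
    HistoryItem("Single-end data", "collection", "fastqsanger.gz"),
    HistoryItem("annotation.gtf", "dataset", "gtf"),
]
accepted = {"fastqsanger.gz", "fastq.gz"}  # assumed list, for illustration only

single_mode = selectable_inputs(history, "dataset", accepted)
collection_mode = selectable_inputs(history, "collection", accepted)
```

In this toy history, the single-dataset mode finds nothing to offer, while switching to the collection mode makes the fastq list show up, which mirrors what clicking the collection folder icon does on the tool form.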

Please give that a try! Based on your screenshot, I would expect the data in collection 2 to show up.

More questions are welcome – I’d like to get this working for you! :slight_smile:

Hi @erdavenport

I had to make an update to my original post but it shouldn’t impact what you are doing!

Why? We are in the middle of a release cycle, and the way collections are processed has been streamlined. This will not impact what you are doing with single end reads! But I wanted to make sure I put the correct information into this topic for anyone reading it later on.

Ah, ok! Yes, if I select that and reprocess samples, the workflow works as expected.

I was also able to go back and adjust the old workflow so that it will work with the lists. Since we already did a bunch of the processing for one sample in class last week, this is what we’ll go with moving forward.

Thanks so much for all your rapid help! It’s very much appreciated as I prepped for class tomorrow!
