FastQC reports from list of pairs PE Illumina reads all have sample name "forward" or "reverse"

Hi,

I noticed that when I run FastQC on my list of pairs of PE Illumina reads, every generated report has the input filename set to either ‘forward’ or ‘reverse’, depending on which set of reads it corresponds to. This isn’t a big issue in itself, since the FastQC output is also a list of pairs, so I can easily work out which ‘forward’ and ‘reverse’ each report refers to. However, when I pass the FastQC raw data outputs to MultiQC, the tool cannot summarize all my data, because the FastQC files all refer to the same ‘filename’ even though they come from different SRAs.

I used the fasterq-dump tool from sra-tools to import many SRAs, and that’s how my list of pairs was created.

Would appreciate any advice on how I can work around this, ideally so that FastQC actually picks up the correct filename for each read!

Cheers.

Hi,
Can you use variables to auto-rename the outputs (like this)?

Hi @David,

Thanks for your reply. I don’t think this is quite what I’m looking for, since the FastQC file itself references its input file as ‘forward.gz’ or ‘reverse.gz’. Ideally, I would like it to refer to the proper filename, for example ‘SRR000000-forward.fastq.gz’. That is because downstream tools parse the contents of the FastQC file, not the name of the FastQC output file. I’ve attached a screenshot of the FastQC output for reference.

The FastQC command that Galaxy runs looks like this. Notice how the first step names the symlink reverse.gz:

ln -s '/REDACTED/galaxy-21.01/database/files/030/dataset_30560.dat' 'reverse.gz' &&
mkdir -p '/REDACTED/galaxy-21.01/database/jobs_directory/007/7581/working/dataset_30618_files' &&
fastqc --outdir '/REDACTED/galaxy-21.01/database/jobs_directory/007/7581/working/dataset_30618_files'  --adapters '/REDACTED/galaxy-21.01/database/files/030/dataset_30591.dat'   --quiet --extract  --kmers 7 -f 'fastq' 'reverse.gz'  &&
cp '/REDACTED/galaxy-21.01/database/jobs_directory/007/7581/working/dataset_30618_files'/*/fastqc_data.txt output.txt &&
cp '/REDACTED/galaxy-21.01/database/jobs_directory/007/7581/working/dataset_30618_files'/*\.html output.html

I looked into the FastQC toolshed XML file some more and this is how the code starts off:

set input_name = re.sub('[^\w\-\s]', '_', str($input_file.element_identifier))
set input_file_sl = $input_name + '.gz'
ln -s '${input_file}' '${input_file_sl}' && 
.....

input_file is the corresponding read file I select from my history. So essentially it seems like for my original sequencing read files, the input_file.element_identifier is returning either ‘forward’ or ‘reverse’ rather than a more useful name.
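
To make the effect of those two `set` lines concrete, here is a small Python sketch of the same substitution (the wrapper's Cheetah template uses Python's `re` module). The function name `fastqc_symlink_name` is made up for illustration and is not part of the wrapper:

```python
import re

def fastqc_symlink_name(element_identifier: str) -> str:
    """Mimic the two 'set' lines quoted above (hypothetical helper name)."""
    # Replace anything that is not a word character, hyphen, or whitespace with '_'
    input_name = re.sub(r'[^\w\-\s]', '_', str(element_identifier))
    # The wrapper then appends '.gz' to build the symlink name
    return input_name + '.gz'

# What a paired collection element yields
print(fastqc_symlink_name('reverse'))                      # reverse.gz
# What a single dataset with a full name yields
print(fastqc_symlink_name('SRR000000-forward.fastq.gz'))   # SRR000000-forward_fastq_gz.gz
```

So the name FastQC reports depends entirely on what `element_identifier` returns, and the dots in a full filename are turned into underscores before `.gz` is appended.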

Thanks for the detailed reply.
My guess would be to play with the set input_name line, which, as you can see, is what modifies the file name string.

Or even rename the inputs to something like ‘forward-SRR000000.fastq.gz’.

I did some more digging, and the issue only arises when using list of pairs PE reads. If I just select one individual read file, the value for $input_file.element_identifier is indeed its proper name SRR0000-forward.fastq.gz, while if I have a collection, the element_identifier for all reads is either just forward or reverse. This seems to be an issue (feature?) inherent to Galaxy itself.

Not quite sure how to move on from here, other than maybe running FastQC individually on each read file in a list, rather than on the list of pairs itself.

EDIT: Flattening the list seems to solve the issue! Easy enough workaround!
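
For anyone landing here later, a rough Python sketch of why flattening helps. This is only a conceptual model of a list of pairs, not Galaxy's actual code. The `_` separator, the function name `flatten_list_of_pairs`, and the second accession `SRR000001` are assumptions for illustration:

```python
def flatten_list_of_pairs(list_of_pairs, sep="_"):
    """Turn {sample: {forward, reverse}} into a flat {sample_forward: ...} mapping.

    Conceptual model only; the separator is an assumption.
    """
    flat = {}
    for outer_id, pair in list_of_pairs.items():
        for inner_id, dataset in pair.items():
            # In the nested collection the element identifier is just inner_id
            # ('forward' / 'reverse'); after flattening it also carries the sample name.
            flat[f"{outer_id}{sep}{inner_id}"] = dataset
    return flat

nested = {
    "SRR000000": {"forward": "dataset_A", "reverse": "dataset_B"},
    "SRR000001": {"forward": "dataset_C", "reverse": "dataset_D"},  # hypothetical second run
}

print(list(flatten_list_of_pairs(nested)))
# ['SRR000000_forward', 'SRR000000_reverse', 'SRR000001_forward', 'SRR000001_reverse']
```

In the nested form, every innermost element is identified only as forward or reverse, which is what FastQC ends up seeing. Once flattened, each element has a unique identifier, so the reports no longer collide in MultiQC.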

Nice :+1:.
I’d also suggest taking a look at this issue, and maybe building dataset lists rather than pairs to see the difference.