Faster Download and Extract Reads in FASTQ and ENA reads are slightly different

Hi all,

I just noticed there is a slight difference between the FASTQ files downloaded using the ‘Faster Download and Extract Reads in FASTQ’ tool and the ones published on ENA for the same accession. Does anybody know why this is happening?

Cheers,
Rand

Welcome, @rand.zoabi

Are you referring to the sequence identifiers, or the bases?

Both can be a bit different, and this is why:

  1. Raw reads are submitted to the SRA
  2. These are then “processed” by SRA
  3. The processed reads are what the fastq fetching tools retrieve

Those processed reads are what most people will want to work with. The exceptions are very limited use cases, for example a scientist who wants to apply their own custom “processing”.

So – what is that processing?

  • Details are here: NLM GenBank and SRA Data Processing
  • The practical parts are the standardization of the read names (to accessions) and adjustment of the original quality score scaling to Sanger Phred +33 scores, which we label as fastqsanger in Galaxy (this is mostly legacy and not needed now).
    • Why “legacy”? The current Illumina pipelines generate the “standard” fastqsanger (aka Sanger Phred +33) scores by default. No re-scaling of the scores is needed as a preprocessing step (anywhere: by SRA, by scientists working in Galaxy with their own data, by scientists working elsewhere…).
    • Most current tools expect fastqsanger scaling, and while some older tools might accept others … you’ll probably have another downstream tool that doesn’t, so it is much easier to begin with the standard format.
    • So, fetching data from the SRA retrieves the “processed” reads with fastqsanger quality score scaling.
      • That means even if an accession represents older reads that had their base calls performed using a different “quality score scaling scheme”, the data has been pre-normalized (or “groomed”) to be the standard.
    • The sequence names are also standardized to the accession label.
      • Why? The primary reason I can think of is that it is easier to work with the reads and keep track of them when they are labeled this way.
      • But there is another reason for some older read types: the data was inconsistently labeled on the + line, leading to tool usage problems, so the label on that line was dropped.
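To make the quality score part concrete, here is a minimal Python sketch of what “re-scaling to Sanger Phred +33” means at the character level. This is not what SRA or any Galaxy tool actually runs; the function names are made up for illustration, and the offset guess is only a rough heuristic based on which ASCII characters can appear under each scheme.

```python
def decode(quality, offset=33):
    """Turn a FASTQ quality string into integer scores for a given ASCII offset."""
    return [ord(char) - offset for char in quality]


def regroom_to_sanger(quality, offset=64):
    """Re-encode a quality string from the given offset to Phred +33 (fastqsanger)."""
    return "".join(chr(ord(char) - offset + 33) for char in quality)


def guess_offset(quality_strings):
    """Heuristic guess of the ASCII offset (hypothetical helper, not a standard API).

    Characters below ASCII 59 cannot occur in the older +64 schemes, and
    characters above ASCII 74 are not expected in typical Phred +33 Illumina
    data. Returns None when the observed range fits both.
    """
    codes = [ord(char) for quality in quality_strings for char in quality]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33
    if highest > 74:
        return 64
    return None


if __name__ == "__main__":
    old_style = "hhT@"  # made-up Phred +64 quality string
    print(guess_offset([old_style]), decode(old_style, 64))
    print(regroom_to_sanger(old_style))
```

The same four scores come out either way; only the characters used to write them change, which is why a raw file and a processed file can differ on the quality line while the bases are identical.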

How to explore this more?

These are some legacy Galaxy FAQs addressing the different processing people used to do to fix up fastq files. Any of these about “grooming” quality scores, adjusting read names in the @ or + lines, or adjusting the datatype are the ones I am talking about.

  • If interested, find them in the GTN FAQs here, and some others are still at the Hub (in a hidden area that hasn’t been perfectly reformatted!).
  • The methods are still valid, just not usually needed. Compare the statistics between the “raw” versus “processed” fastq files, using those FAQs as a guide for what to look for, and you’ll likely be able to explain what you are noticing.
  • I don’t think you can get the raw reads using the SRA fetching tools, or at least not with the options that Galaxy supports. And beware of using the raw sequences directly with tools, unless this is for a specific purpose and you are willing to troubleshoot strange errors and confirm results. Why? The wrong quality score scaling might pass through as a putatively successful “green” job that still has a spurious result. Meaning, a tool didn’t technically fail, but it didn’t interpret the data in a scientifically correct way either.
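If you want to do that comparison outside of Galaxy, a small sketch like the one below can summarize two FASTQ files side by side. It assumes plain four-line FASTQ records, and the two records in the demo are invented for illustration (the accession-style name and the instrument-style name are placeholders, not real data or an exact copy of any tool's output format).

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, plus_label, quality) from four-line FASTQ records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        name, sequence, plus, quality = record
        yield name[1:], sequence, plus[1:], quality


def summarize(handle):
    """Collect simple statistics that help tell raw and processed reads apart."""
    stats = {"reads": 0, "bases": 0, "plus_labeled": 0,
             "min_qual_code": None, "max_qual_code": None, "first_name": None}
    for name, sequence, plus_label, quality in read_fastq(handle):
        if stats["first_name"] is None:
            stats["first_name"] = name
        stats["reads"] += 1
        stats["bases"] += len(sequence)
        stats["plus_labeled"] += 1 if plus_label else 0
        codes = [ord(char) for char in quality]
        if stats["min_qual_code"] is None or min(codes) < stats["min_qual_code"]:
            stats["min_qual_code"] = min(codes)
        if stats["max_qual_code"] is None or max(codes) > stats["max_qual_code"]:
            stats["max_qual_code"] = max(codes)
    return stats


# Invented example records: same bases, different names and quality scaling.
RAW_TEXT = ("@HWUSI-EAS100R:6:73:941:1973#0/1\nACGTACGT\n"
            "+HWUSI-EAS100R:6:73:941:1973#0/1\nhhhhTTTT\n")
PROCESSED_TEXT = "@SRR0000000.1 1 length=8\nACGTACGT\n+\nIIII5555\n"

raw_stats = summarize(io.StringIO(RAW_TEXT))
processed_stats = summarize(io.StringIO(PROCESSED_TEXT))

if __name__ == "__main__":
    print("raw:      ", raw_stats)
    print("processed:", processed_stats)
```

With real data, replace the `io.StringIO(...)` handles with `open("your_file.fastq")`. Matching read and base counts together with different names, different `plus_labeled` counts, or shifted quality code ranges would point to the kind of processing described above rather than to different underlying reads.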

With that background, your observation about SRA and ENA then has a few dimensions.

The Faster tool retrieves data from the NCBI hosted SRA, specifically the processed reads. See the Help directly on the Galaxy tool form for which exact utility that is, how it works, and the link to more details. Browsing the NCBI SRA website directly will find the raw reads there, too.

And you are getting reads from the ENA hosted SRA directly, through some link. You can look for the raw and processed reads there too, but I’m not certain whether they host both for all accessions, although I’ve seen it before for some. So, if it is not clear where to look for the two read versions of the accession you are interested in, maybe ask them, now that you understand these exist and what to look for. If you find a link to a guide or FAQ at their site that explains this sort of navigation, please link it back so we can all learn better how it works.

Hope this helps!
