NCBI SRA Fastq (convert SRA files from GEO into fastq files)

Hello,

I want to convert SRA files from GEO into fastq files in order to map them to the genome. I have uploaded SRA files directly to Galaxy, but that does not seem to be the right approach.

Now, I’m going through the instructions on the support page of Galaxy: NCBI SRA Fastq

It asks me to manipulate the file before importing it into Galaxy, but I cannot see the mentioned tools on the NCBI SRA Run Selector page (highlighted above). Is there a more detailed explanation or tutorial to follow? The link I shared seems to be exactly what I need, but it is not clear to me how to proceed.

Thank you.

@ysrbrs,

First you isolate the IDs:

Organizing metadata

The “RunInfo Table” provides the experimental condition and replicate structure of all of the samples. Prior to importing the data, we need to parse this file into individual files that contain the sample IDs of the replicates in each condition. This can be achieved by using a combination of the ‘group’, ‘compare two datasets’, ‘filter’, and ‘cut’ tools to end up with single column lists of sample IDs (SRRxxxxx) corresponding to each condition.
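
If it helps to see the same idea outside of Galaxy, here is a minimal Python sketch that splits a RunInfo-style table into one single-column list of accessions per condition. The file name SraRunTable.txt, the field separator, and the column names Run and condition are assumptions; check them against your actual RunInfo Table before using anything like this.

```python
# Minimal sketch (not the Galaxy workflow itself): split a RunInfo-style table
# into one single-column list of SRR accessions per condition.
import pandas as pd

# Assumption: the Run Selector export is named SraRunTable.txt and is tab-separated;
# adjust the file name and separator to match your download.
runinfo = pd.read_csv("SraRunTable.txt", sep="\t")

# Assumption: your table has columns named "Run" and "condition".
for condition, rows in runinfo.groupby("condition"):
    # One file per condition, containing only the SRRxxxxxx identifiers.
    rows["Run"].to_csv(f"{condition}_ids.txt", index=False, header=False)
```

Inside Galaxy, the Group, Compare two Datasets, Filter, and Cut tools achieve the same single-column-per-condition result without leaving the browser.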

Then you import the IDs with Download and Extract Reads in FASTA/Q, which corresponds to NCBI SRA Tools (fastq-dump):

What it does:

This tool extracts data (in fastq format) from the Short Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). It is based on the fastq-dump utility of the SRA Toolkit.
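
For reference, the rough local equivalent of that tool is a loop over fastq-dump calls, sketched below in Python. It assumes the SRA Toolkit is installed and on your PATH, and that control_ids.txt is a hypothetical single-column accession list like the ones described above; in Galaxy itself you only need to point the tool at your ID list.

```python
# Rough local equivalent of the Galaxy tool, for reference only.
# Assumes the SRA Toolkit (fastq-dump) is installed and on PATH.
import subprocess
from pathlib import Path

# Assumption: control_ids.txt is a single-column accession list (one SRRxxxxxx per line).
accessions = Path("control_ids.txt").read_text().split()

for acc in accessions:
    # --gzip compresses the output; --split-files writes separate _1/_2 files for paired-end runs.
    subprocess.run(["fastq-dump", "--gzip", "--split-files", acc], check=True)
```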

Thank you for your reply!

I have done what was recommended, and I now have 18 individual fastqsanger.gz files, but I want to group them by condition and replicate. I have 6 different conditions with 3 replicates each.

Any tutorial or advice on how to do this? It would be more organized this way before moving on to trimming, alignment, etc. In addition, I have to do the same analysis on a much bigger dataset, so I would be happy to learn how.

https://galaxyproject.org/tutorials/collections/

This tutorial has not been helpful because my fastqsanger.gz files are not directly in the history pane but inside a folder called “Single-end data (fasterq-dump)”. That is why I cannot see the checkbox at the upper right as shown in the link.

@ysrbrs, do you understand that the datasets are organized into collections?
Can you please paste a screenshot so we can see what your history looks like?

Hi @ysrbrs,
in that case, the fastest way would probably be to upload the datasets in groups, according to your replicates/conditions.

Regards

Hi @ysrbrs

That FAQ is intended to help resolve issues around format variations after the data is already in Galaxy. The collections GTN tutorial is an overview of the ways to manipulate data in collections.

@David is correct about the collection output from this tool, and the advice from @gallardoalba is a good, simple way to get your data organized from the start.

Should you want to explore advanced methods for downloading data in a structured format from the very start, see the GTN tutorials under the section Uploading Data. The two about the “Rule Based Uploader” are what you will be doing; the one above them provides context, and the one below them demonstrates usage in a practical example. Many other tutorials cover the Upload/Collection functions; use the search at that site to find them. Even if one is not exactly your use case, it can still be helpful.
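
As a rough illustration of what the Rule Based Uploader works from, the sketch below builds a small tab-separated table of accession, condition, and replicate from the RunInfo Table; a table like this can then be pasted into the rule builder to fetch the runs and organize them into collections in one pass. The file and column names are assumptions again, and the exact rules you define will depend on your data.

```python
# Hypothetical sketch: prepare a small accession/condition/replicate table that
# could be pasted into Galaxy's Rule Based Uploader. Column names "Run" and
# "condition" are assumptions; match them to your RunInfo Table.
import pandas as pd

runinfo = pd.read_csv("SraRunTable.txt", sep="\t")

table = runinfo[["Run", "condition"]].copy()
# Number the replicates 1..n within each condition.
table["replicate"] = table.groupby("condition").cumcount() + 1

table.to_csv("rule_builder_input.tsv", sep="\t", index=False, header=False)
```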

Dear all, @David @gallardoalba

Thank you for your responses. I somehow did not see these messages. Sorry for the late reply.

I figured out that, in Galaxy, it is easier to merge multiple files into a data collection than to split a collection into multiple files.

So I uploaded multiple Acc_List.txt files, grouped by condition. Each contained only the three SRA Run accessions (one per replicate) for that condition.
