Separate/split replicates in single SRA file

Hello,

I’m new to Galaxy and working on an RNA-seq experiment with 2 conditions and 3 replicates for each one. The NCBI SRA accession number is SRP154796.

To perform DGE analysis with DESeq2 or edgeR, I need a separate count file for each replicate (that is what I understood after reading a few posts here). However, I don’t have access to the original fastq files on NCBI and can only download a single SRA file where all the replicates are merged.

I’ve used a tool on Galaxy to convert my SRA file to fastq format, but I obtain a single fastq file with interleaved reads. When I download the files from ENA, the reads are separated into forward/reverse, but the replicates are still merged:
https://www.ebi.ac.uk/ena/data/view/SRX4415824
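
As an aside on the interleaved file: splitting it into forward and reverse reads can be sketched in a few lines of Python. This is only a rough illustration, assuming strict 4-line FASTQ records with mates stored in alternating order; the function names `read_fastq` and `deinterleave` are hypothetical and not part of Galaxy or any toolkit. Note that it separates mates only and cannot recover replicates.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, quality)
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from an iterable of lines (assumes 4 lines per record)."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return  # clean end of input
        if len(chunk) != 4 or not chunk[0].startswith("@"):
            raise ValueError("truncated or malformed FASTQ record")
        yield (chunk[0], chunk[1], chunk[2], chunk[3])


def deinterleave(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split interleaved records into (forward, reverse) lists.

    Assumes records alternate: forward mate first, then its reverse mate.
    """
    records = list(read_fastq(lines))
    if len(records) % 2 != 0:
        raise ValueError("odd number of records; input is not interleaved pairs")
    forward = records[0::2]  # 1st, 3rd, 5th ... record
    reverse = records[1::2]  # 2nd, 4th, 6th ... record
    return forward, reverse
```

With a real file, one would pass an open file handle, for example `deinterleave(open("reads.fastq"))`, where `reads.fastq` is a placeholder name.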

I couldn’t find a way to separate the replicates, hence I only have 2 count files for DESeq/edgeR and constantly get an error.

Does anyone know how I can get the original fastq files for all the replicates? Or if the problem comes from the way I’m using DESeq2/edgeR?

Thank you in advance.


https://www.ebi.ac.uk/ena/data/view/PRJNA482166

https://www.ebi.ac.uk/ena/data/view/PRJNA482166&portal=read_run

From a review, it appears that either only the biological reads were published or the formatting of the EBI SRA file is problematic. There are no “original submission” fastq data for either paired-end sample. That usually indicates that the original data was published somewhere else, but I didn’t find the data at NCBI’s SRA. Parsing the EBI-sourced SRA file with NCBI’s SRA Toolkit failed, but you could also explore the data that way on the command line (it won’t work in Galaxy due to format issues in the SRA file itself).

BUT – none of that will help with your analysis. Technical replicates are not appropriate for differential expression analysis – they are used to evaluate the quality of different sequencing runs based on the same biological sample. These tools require at least two conditions with at least two biological replicates each for valid expression analysis. Biological replicates are published (or rather, should be published) as distinct runs – and this data appears to only have one paired-end run (one biological sample) per condition.
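
The replicate requirement above can be checked before running DESeq2 or edgeR. Here is a minimal sketch in Python, assuming a hypothetical mapping from run name to condition; `check_design`, `run_A`, and `run_B` are illustrative names, not part of either tool or of this dataset's metadata.

```python
from collections import Counter
from typing import Dict, List


def check_design(samples: Dict[str, str], min_replicates: int = 2) -> List[str]:
    """Return a list of problems with an experimental design.

    `samples` maps each run (one biological sample) to its condition.
    An empty list means the design meets the minimum stated above:
    at least two conditions with at least two biological replicates each.
    """
    counts = Counter(samples.values())  # replicates per condition
    problems = []
    if len(counts) < 2:
        problems.append(f"only {len(counts)} condition(s); need at least 2")
    for condition, n in sorted(counts.items()):
        if n < min_replicates:
            problems.append(
                f"condition '{condition}' has {n} biological replicate(s); "
                f"need at least {min_replicates}"
            )
    return problems


# One run per condition, as this dataset appears to have (hypothetical names).
for problem in check_design({"run_A": "control", "run_B": "treated"}):
    print(problem)
```

A design with two distinct runs per condition would return an empty list, which is the minimum these tools need for a valid comparison.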

FAQs related to fastq data are near the top and DE tools are covered in the last one in this Support FAQ group: https://galaxyproject.org/support/#getting-inputs-right

Thanks!

Thank you very much for your reply and this detailed explanation!! I was really going crazy because I couldn’t understand. Have a wonderful day and thank you again!
