Comparison of RNA-seq data with a published paper.

Hi Everyone in the community.

Once again want to highlight Galaxy’s community efforts. I am very new to this field, almost useless I would say. I want to learn RNA-seq, and in this respect the GalaxyHelp site come in super handy. A big thank you.

My problem:
I downloaded raw RNA-seq data from the NCBI. 5 treated and 5 control samples from a mouse study, already published. After reading the web site and following the tutorial on Reference-based RNA-Seq data analysis my results did not match with the results of the paper.

Papers methods: NovaSeq6000, Bcl to FASTQ with CASAVA software suite, RNA sequencing results with Kallisto with mouse genome release-96, counts and TPM mapped to gene using mapping file retrieved from Ensembl BioMart website filtering genes and transcripts. Then DESeq2, normalised to counts and keep FDR<5%.

My steps for the analysis were the following:

(1) Loaded a SSR list of control and treated samples as separate collections through the Upload.

(2) Got the data for each collection through the Faster Download and Extract Reads in FASTQ. Each collection contained 5 samples as a set of 5 paired-end data.

(3) Checked the quality of data through FastQC on each collection of data. The data indicated fragments of 75.
Some stats: Sequences flagged as poor quality, 0; Sequence length, 20-76; %GC, 46.

(4) Performed Cutadapt on both collections of data following the aforementioned tutorial.
“Paired Collection”: 2 PE fastqs
“Filter Options” - “Minimum length (R1)”: 20
“Read Modification Options” - “Quality cutoff”: 20
“Outputs selector” - Select: Report: Cutadapt’s per-adapter statistics. You can use this file with MultiQC.

(5) Performed MultiQC again on the reports from the results from (4).

(6) Download the mouse genome file from the gencode, the GTF, main annotation file. Then, I removed the headers (lines that start with a “#”) by just opening the file as TXT. Saved and uploaded to Galaxy as GTF file. File gencode.vM30.annotation.gtf Description of file: mouse genome (GRCm39), version M30 (Ensembl 107).

(7) Performed mapping with RNA STAR according to the tutorial.
“Single-end or paired-end reads”: Paired-end (as collection).
“RNA-Seq FASTQ/FASTA paired reads”: the Cutadapt on collection N: Reads (output of Cutadapt tool).
“Custom or built-in reference genome”: Use a built-in index.
“Reference genome with or without an annotation”: Use genome reference without builtin gene-model.
“Select reference genome”: Mouse (Mus Musculus) mm39 Full.
“Gene model (gff3,gtf) file for splice junctions”: The imported mouse genome from step (6) in GTF.
“Length of the genomic sequence around annotated junctions”: 74 (from step 3).
“Per gene/transcript output”: Per gene read counts (GeneCounts).

(8) Perfoming MultiQC again on the above from step (7) I noticed that my data might be stranded. Therefore I performed an Infer Experiment. After converting the GTF to BED12 the results I got where as follows:

This is PairEnd Data
Fraction of reads failed to determine: 0.0752
Fraction of reads explained by “1++,1–,2±,2-+”: 0.0273
Fraction of reads explained by “1±,1-+,2++,2–”: 0.8975

The above indicated that my data were reverse-stranded.

(9) Performed FeatureCounts with the following parameters to count the number of reads per gene.
“Alignment file”: RNA STAR on collection N: mapped.bam (output of RNA STAR tool).
“Specify strand information”: Reverse Stranded.
“Gene annotation file”: in your history.
“Gene annotation file”*: mouse genome (GRCm39) from step (6).
“Output format”: Gene-ID “\t” read-count (MultiQC/DESeq2/edgeR/limma-voom compatible).
“Create gene-length file”: Yes.
“Count fragments instead of reads”: Enabled; fragments (or templates) will be counted instead of reads.
“Minimum mapping quality per read”: 10.
GFF feature type filter”: exon.
“GFF gene identifier”: gene_id.
“Allow reads to map to multiple features”: Disabled; reads that align to multiple features or overlapping features are excluded.

(10) performed MultiQC on the results from step (9) which showed that >53-55% were assigned, while ~35-38% were unassigned multi mapping. All the other percentages were very small. I noticed the unassigned multi mapping, but I continued as normal on the next steps.

(11) Then I performed a DESeq2 on treated collection vs untreated collection and Annotated DESeq2/DEXSeq adj-p<0.05 on the results.

Have I done something wrong in my analysis or it is normal that using different packages you get different results? The most abundant gene detected from the paper was found in my RNA-seq analysis, but with a very big difference; papers Log2FC >18, mine 0.39. the majority of the genes were there, but with a difference in the fold change.

Could anyone help me please?
Many thanks.

1 Like

Methods/data to consider adjusting to get a closer match to the original:

  • Kalisto is available in Galaxy.
  • “Ensembl mouse genome release-96” is based on the mm10/GRCm38 mouse reference assembly/annotation.
  • The mm10 genome is indexed for Galaxy tools, and paired annotation based on Ensembl is hosted at UCSC along with the associated transcripts. Get the GTF from the downloads area and use their Table browser to extract transcripts as only refMrna.fa.gz is already output in a prepared file (one level up from that GTF directory).

Another variation on what you already have: The reads could be mapped as stranded during your step 7 (RNA-Star). The tutorial didn’t break this down (the toy data was unstranded for simplicity) – but now you know. That might reduce read/count loss by getting more primary/unique read alignments preserved and mapped to the same strand as the annotation.

Hope that helps! Others reading, feel free to offer more suggestions