DGE counts of zero on Limma voom

I am getting unregulated and down regulated counts of 0 in DGE analyses in LimmaVoom.
I have tried adjusting several settings in Limma (CPM, TREAT parameters) but still only get Flat gene count numbers back and zero for UP and DOWN regulated genes.

Any insights would be greatly appreciated. I can make my history available.

Regards,
Swati

Hi @seq2022
How many replicates do you have? Have you checked the MDS (PCA) plot? Do you see separation of the samples on 1st component?
If you share the history, I can have a look.
Kind regards,
Igor

Hello,
Here is the link to the history:
Galaxy | Europe | Published History | COVID Paper RNA-SEQ
I have 21 samples - 6 pediatric 15 adult

I do see separation and the volcano plot shows the genes with FC in expression and the DE tables.
But the adjusted p value in the table for all genes is >0.05, which I am assuming is why they are not being counted as up or down regulated?

Any insights will be much appreciated.
Swati

Hi @seq2022
You’ve done very good work with data analysis, with many QC steps. This is very useful.
By default, the volcano plot shows top 10 genes, and, as you said, the adjusted P-values are greater than 0.05. Maybe try more aggressive filtering on low counts: Density Plot: filtered counts has a significant peak at start. You used 2 samples as a cut-off. Maybe increase this number. You’ll get less genes, hence smaller correction. Based on featureCounts reports you have many off-target reads, and several samples have many unmapped reads. Check how these samples behave on MDS plots. There is no strong separation on MDS plots. For example, one ‘black’ sample cluster with red samples, and one red sample cluster with the black ones. Check the factor file to make sure it has correct. Check alignments using IGV or UCSC Genome Browser. labels. I looked at one, and it seems it has a lot of alignments started on the same position. Some of these ‘read blocks’ have identical differences from the reference genome that are not present among common variants from dbSNP135. This is an indication of PCR duplicates abundance. Check nucleotide composition in FastQC report. You’ll see fluctuation, which is unusual for sets containing 10s M reads. It might be the data…
You can unshare the history unless you have other questions on the matter.

Kind regards,
Igor

interesting information

Hi Igor,

Thanks so much for the ideas. I did try to filter with higher CPM cutoffs and even with changing the sample number to 6 and 21 and still got counts of zero for upend down regulated. If I understand correctly this would be more aggressive filtering? Are there any other parameters that would be good to try?

I did check the factor file and the labeling is correct. It’s not too surprising that there is overlap between the black and red groups since they are both COVID infected groups. The part that I can not explain is why I have not been able to reproduce the up and down regulated genes the authors reported in their paper. Could you take a look at their methods and see if there is any step that could help improve my results. They do mention they maped to GENCODE ref file and used Salmon.

I had used the hg38 refseq.bed from UCSC and mapped with HISAT2. But the mapping itself didn’t appear to be a problem with >95% mapping on QC. The feature counts showed low assignment. Would using a different reference file help improve this?

This is the paragraph from their methods:

RNA-Seq.

RNA was isolated from cryopreserved cells using the miRNeasy Micro kit (QIAGEN, 217084). Samples with sufficient RNA quantity and quality were used for analysis. Libraries were prepared at the Yale Center for Genome Analysis using the NEBNext rRNA Depletion Kit (New England Biolabs E6310L). Individual samples or pools from 2 patients of similar age and outcome (n = 4 pools, adults) were normalized to 1.2 nM and loaded on an Illumina NovaSeq S4 flow cell to generate 30 M read-pairs per sample. Samples were checked for read quality and adapter contamination using FastQC and aligned to transcripts using the GENCODE transcript sequences (version 33) as the reference file with Salmon (30). All analyses in R were performed using R version 4.0.3. Transcripts were mapped to genes using tximport. Differential gene expression analysis was performed with DESeq2 (31). Heatmaps were generated using the pheatmap package. PCA was performed with the prcomp function using all genes with a nonzero total read count. Prior to PCA, data were transformed with the vst function in DESeq2. PCA results were visualized with the factoextra package. For GSEA, Hallmark (h) and GO (c5.go) data sets were downloaded from MSigDB (Broad Institute) and analysis performed in R with the fgsea package using 1000 permutations.

This is what they reported for results

** Overall, we identified 538 differentially expressed genes (P < 0.05 after FDR correction), including 267 that were upregulated and 271 that were downregulated when comparing pediatric and adult patient**

Thank you again for taking a look at this. I would appreciate any other suggestions you may have.

Swati

Just an update. I tried filtering all 21 samples and with LogFC of zero. And with CPM of 0.5 or CPM of 2. Still no change in the output, all genes come up as Flat.

Hi Swati,

Increase in number of samples and increase of CPM in low count filtering should remove more genes. Yes, this is more aggressive filtering.

Not all samples have >95% mapping rate. One sample has 17% mapped reads, another 39%.

Different gene annotations give different results including different number of reads assigned to genes. The option you used, built-in featureCounts gene model, is a safe choice. Examine alignments on UCSC Genome Browser and check off-target reads. Just in case: click on collection of BAM alignments > click on any BAM file in the collection > click on bar plot (Visuallize) icon > click at Disply at UCSC (main) link in the middle window. Remove tracks you don’t need for faster performance. Add tracks like SNPs - see my previous reply. Make sure you have gene annotations. You can add several alignments to UCSC session. Check the reads. usually we expect more or less random distribution along the gene (exons). Usually the coverage is not uniform, but the reads usually do not map in blocks,

Maybe check DE genes from the paper on interactive limma plots. Look at expression level. Many top genes highlighted in Volcano plot in the fie I looked at have low expression, plus the group with big number of samples have wide distribution for genes: no expression in some samples, while other samples have gene expression in range similar to the second group.

Are you sure the libraries are not stranded? Check nucleotide distributioni in FastQC outputs for sample ending at 63. The proportion of complementary nucleotides is not identical (C and T are over-represented in F reads, while under represented in R). This is an indication of a stranded protocol.

Galaxy Europe has Salmon and tximport, so you can import GENCODE transcripts and reproduce the published protocol. Alternatively, download GENCODE gene annotations in gtf format and use it instead of the built-in featureCounts gene model.

As I said previously, I have some doubts about the data quality. Not sure if this can be compensated by changing limma settings. Maybe do more QC, like gene coverage.

The individual steps are described in corresponding sections. The tutorial also describes how to check the strandness.

The method section mentions similar age and outcome, but does not mention sex. Do you know if both male and female samples were used? It might be mentioned somewhere in the paper.

Kind regards,

Igor

Hi Igor,

Thanks so much for the suggestions.

I had looked at strandedness - this is an antisense library as inferred from the Infer Seq steps in QC.

I used unstranded as the setting for both HISAT and Feature Count steps before performing the InferSeq. I repeated it with changing the setting to antisense stranded for both HISAT and featureCounts but that did not work at all for some reason (got FeatureCounts of zero when I changed these to antisense) so all the current analysis have been done using unstranded settings for both HISAT and FeatureCounts.

Let me try to use Salmon and tximport and GENCODE and see if that will work to reproduce their results.

Many thanks,
Swati