I have a question about an output file I have created using the programme edgeR. I am somewhat familiar with genomics processing, but am trying to work with some tools I have never used before. My analysis is on an RNA sequencing dataset with three caategories (day 1, day 3, day 6), with 3 replicates each, so 9 samples in total. I have run every step until the differentially expressed genes analysis, which I have set up both in edgeR and in DEseq2 to compare the two.
I have questions about the output of both, but for now I would like to ask just one thing.
In edgeR, my output, (edgeR report in the history tab) shows me an MDS plot, shown below. I have a vague idea of what an MDS plot is, but mostly because I have had PCA covered in a statistics course, whereas MDS is actually PCoA.
I can see that the day 1 samples (in green) group nicely. However, for the day 3 and day 6 samples, only 2 of of 3 group separately whereas for both groups 1 sample (day3 R3 and day6 R3 in figure) do not group as well.
I am not 100% certain if I performed my analysis correctly, but assuming I did, I would like to ask for some advice on how to continue with my data. Would it invalidate my data if I removed the R3 replicates of day 3 and day 6 that group separate from the pairs of other replicates? Or would it be fine to do this?
Not really just a Galaxy question of course: but, please do not filter your data based on an MDS plot - you actually have pretty good separation in 2 axes and you would introduce some bias - probably away from the Null which is the very worst kind.
Hey Fubar, thanks for answering. I know it’s more of a statistics question, but I am also not sure what other tools in Galaxy I could use to make sure my samples are really independent. Would you have a suggestion for a different tool that could help me assure this?
I already did the analysis in DEseq2, as mentioned in the post, but this gave a roughly similar view of the data, with replicate 3 for day 3 and 6 not matching with the other replicates.
Those clusters are separated enough in 2 axes to suggest something’s different and real.
How extreme is the treatment? Really disrupting the biology and homeostasis produces more extreme variation in the treated class than a more mild perturbation will - biological systems become more unstable in more extremely disturbed situations.
FWIW which may be little, here’s my personal and rather hard line view:
Failed technical QC is the only reason to drop a sample IMHO - it weakens your statistical power to detect a true class difference but doesn’t introduce any bias since it’s not based on the measurements. Dropping a sample based on some extreme observed characteristic will break the fundamental statistical assumption of random sampling - and that’s a big deal. The replicates provide an estimate of the variance of the true biological variability within each class. Downstream statistical methods assume random samples from each class. If you deliberately reduce the observed variance within a class’s replicates, you introduce a bias - meaning unreliable statistical inference.
Hello again Fubar, thanks for your in-depth answer!
The treatment is nothing more than the labels say: a difference in age between the animals (nematodes). We are interested in age-dependent differences in gene expression. The nematodes pass through their entire reproductive life in those first 6 days, so we are expecting differences for sure but nothing I would call extreme.
I appreciate the honest take on data exclusion, because I am of a similar mindset. Often enough do I hear colleagues talking about removing points from their data to make it fit better, and I’m always skeptical about if that wouldn’t just be cherrypicking.
I don’t want to do that. What I do want to do is to make sure I am not interpreting an artifact introduced by mistake as valid data. What if I switched two tubes during sample prep? I suppose in that case, if there would be very strong grouping it would be more obvious than what I am seeing here.
It would be more likely that I just have some replicates that happen to be close to eachother on the Eigenfield. Perhaps, had I sequenced a hundred replicates for each set (hyperbole of course) I would see clouds of points, barely touching but separate enough. Who knows? But if they would follow the same distribution as the current points, they would still be differentiable different groups.
I think I am satisfied with my datapoints and will just keep them. We don’t know what more datapoints could potentially reveal, but for now I’ll trust the data I have.