How to arrange samples for DiffBind analysis

Hello all, I am new to ChIP-seq and I’m having problems with DiffBind analysis. My control (WT) and treated (3xTG) samples were not assigned properly for differential analysis. Under the replicate column, DiffBind labels the WT and 3xTG samples as replicates 1, 2, 3, 4 instead of replicates 1, 2 for each condition (as shown in the attached picture 1). In addition, the caller column shows raw instead of bed. I have included the narrowPeak BED files called by MACS2 as well as the BAM files (shown in picture 2).


I would really appreciate it if anyone could show me how to arrange my samples so that the output would be WT (replicates 1 and 2) and 3xTG (replicates 1 and 2). Thank you!

Welcome, @Zhen_Kai_Ngian

The column set for the score column is not appropriate for files from MACS2. Expand your dataset to find the correct column, probably c5, unless you manipulated the files?

The Galaxy version of the tool requires sample replicates.

Quote from the tool form down in the help section, sometimes missed.

Note this DiffBind tool requires a minimum of four samples (two groups with two replicates each).
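
The requirement quoted above can be checked before submitting a job. Here is a minimal Python sketch, assuming a hypothetical mapping of group name to sample labels. The function name, labels, and thresholds are illustrative and not part of the Galaxy tool or DiffBind itself.

```python
def check_design(groups, min_groups=2, min_replicates=2):
    """Return a list of problems; an empty list means the design meets the minimum."""
    problems = []
    if len(groups) < min_groups:
        problems.append(f"need at least {min_groups} groups, found {len(groups)}")
    for name, samples in groups.items():
        if len(samples) < min_replicates:
            problems.append(
                f"group {name!r} has {len(samples)} sample(s), "
                f"need at least {min_replicates}"
            )
    return problems


# Hypothetical sample labels, for illustration only.
design = {
    "WT": ["WT_rep1", "WT_rep2"],
    "3xTG": ["3xTG_rep1", "3xTG_rep2"],
}
print(check_design(design))
```

An empty list means the design has at least two groups with at least two replicates each. Anything else lists what is missing.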

If you want to do “data exploration” with only two samples, that might be possible with the command-line version of this tool. But check before you invest time in that: some of the other tools by the same authors had changes at the source that now require replicates, so the command-line versions are affected too. All Bioconductor tools wrapped for Galaxy have always required replicates. Three replicates per condition is the ideal minimum for scientific reasons, but two is enough to satisfy the technical requirements. Bioconductor Forum

All of this is due to how the underlying tool is creating data structures in R. Your screenshot of the input design from the Dataset Details view explains how the data is logically input to this tool, and is correct. The issue is with missing replicates – in this view that would show up as two datasets in each section, and is the same as how you would input the data on the tool form.

The order per group matters for the peak-bam pairings but don’t worry about that yet. If you get a job failure after inputting replicates, you can read about this in the help section and reorder the datasets in the history (copy dataset > new history, done in the order that you want the tool to use them). Or, bundle and process the samples within collections, and this “order” will work out automatically and not be as tedious. The good news: this is the only tool with that “dataset order in the history” wrinkle, and it is because of the pairing of the inputs. There wasn’t another good way to model the data back when the wrapper was written. Your job didn’t have replicates yet, so the pairings were fine.
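
To illustrate why order matters, here is a minimal Python sketch of positional pairing. It assumes the i-th peak file in a group is matched with the i-th BAM file. The function and file names are hypothetical, and this is not the wrapper's actual code.

```python
def pair_inputs(peak_files, bam_files):
    """Pair peak and BAM files by position: the i-th peak file goes with the i-th BAM."""
    if len(peak_files) != len(bam_files):
        raise ValueError("each peak file needs exactly one BAM file")
    return list(zip(peak_files, bam_files))


# Hypothetical file names for one group, listed in the intended order.
peaks = ["WT_rep1.narrowPeak", "WT_rep2.narrowPeak"]
bams = ["WT_rep1.bam", "WT_rep2.bam"]

for peak, bam in pair_inputs(peaks, bams):
    print(peak, "<->", bam)
```

If one of the two lists were in a different order, replicate 1 peaks would be paired with replicate 2 reads, which is the situation the reordering advice is meant to avoid.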


Hi, thank you so much for your time and assistance! I have followed the instructions from DiffBind and directly used the narrowPeak BED file generated by MACS2 as my input file. I have checked the BED file and used column 8, which has the score (-log10 p-value), instead of column 5, which is the integer score.
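
For reference, the two candidate score columns can be inspected with a short Python sketch. The column names follow the standard ten-column narrowPeak layout; the example line is made up and not taken from the poster's data.

```python
# Standard narrowPeak column order (1-based positions 1..10).
NARROWPEAK_COLUMNS = [
    "chrom", "start", "end", "name", "score",
    "strand", "signalValue", "pValue", "qValue", "peak",
]


def parse_narrowpeak_line(line):
    """Split one tab-separated narrowPeak line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(NARROWPEAK_COLUMNS):
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return dict(zip(NARROWPEAK_COLUMNS, fields))


# Made-up example line, for illustration only.
line = "chr1\t100\t300\tpeak_1\t55\t.\t4.2\t5.5\t3.1\t120\n"
record = parse_narrowpeak_line(line)

print("column 5 (integer score):", record["score"])
print("column 8 (-log10 p-value):", record["pValue"])
```

Running this on the first few lines of a real file is a quick way to confirm which column holds the value intended for the score setting.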

I actually have 2 replicates of ChIP-seq data each for WT and 3xTG (2 peak and 2 BAM files per condition), which I have grouped as shown in picture 2 (2 samples per group). I am just wondering why the DiffBind output labels the WT and 3xTG samples as replicates 1, 2, 3, 4 instead of 1, 2 and 1, 2. However, the intervals column correctly reflects the number of peaks in each replicate (picture 1). Hence I am not sure if it is just a bug in the replicate (1, 2, 3, 4) and caller (raw) columns. I have attached a screenshot of a DiffBind tutorial output, where the replicates are separated by group instead of all samples being numbered as replicates 1-11 (example picture).

As such, I’m worried I may have done something wrong that might affect downstream analysis. I have also tried arranging the individual files in order instead of grouping them but still got the same results. Any suggestions to solve this issue would be greatly appreciated and thank you for your help once again.
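
The layout described above (replicates numbered 1, 2 within each condition) can be sketched as a sample sheet built in Python. The column names `SampleID`, `Condition`, `Replicate`, `bamReads`, and `Peaks` are assumed to follow the sample sheet used by the command-line DiffBind package and should be checked against its documentation. The file names are hypothetical, and this does not show how the Galaxy wrapper assigns replicate numbers.

```python
import csv
import io
from collections import defaultdict


def build_sample_rows(samples):
    """Number replicates within each condition (1, 2, ...) rather than across all samples."""
    counters = defaultdict(int)
    rows = []
    for condition, bam, peaks in samples:
        counters[condition] += 1
        replicate = counters[condition]
        rows.append({
            "SampleID": f"{condition}_{replicate}",
            "Condition": condition,
            "Replicate": replicate,
            "bamReads": bam,
            "Peaks": peaks,
        })
    return rows


# Hypothetical file names; one (condition, BAM, peak file) tuple per sample.
samples = [
    ("WT", "WT_rep1.bam", "WT_rep1.narrowPeak"),
    ("WT", "WT_rep2.bam", "WT_rep2.narrowPeak"),
    ("3xTG", "3xTG_rep1.bam", "3xTG_rep1.narrowPeak"),
    ("3xTG", "3xTG_rep2.bam", "3xTG_rep2.narrowPeak"),
]
rows = build_sample_rows(samples)

# Write the sheet as CSV text (to a buffer here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A sheet like this makes the grouping explicit, which can help when comparing a command-line run against the Galaxy output.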

Hi @Zhen_Kai_Ngian

Would you please post back the Tool Parameters view for the new job? That shows how Galaxy consumed the data.

Hi @jennaj

So sorry for the late response. I have attached the tool parameters for the job I have described above. Please let me know if you need further information, thank you.


Thanks @Zhen_Kai_Ngian

All of that looks correct. Is there something odd about the outputs?

Hi @jennaj

I think the output should be correct, as I’ve used tutorial sample data and got similar results. The only discrepancy is in the ‘replicate’ and ‘caller’ columns, where the Galaxy output still does not index the “responsive” and “resistant” groups properly. I will try to run the same analysis in RStudio to check if I get the same results as in Galaxy. Thank you very much for your time and assistance once again!
