MACS2 scientific result troubleshooting

Hi, I have a similar problem. When I add a control BAM file to my run, I get 0 narrow peaks and an empty BED output. Without a control, I do get results. Could you please help me?


Hi, @Manolis1
Did you solve your problem? We have the same problem in our CUT&RUN analysis. We have four histone modification antibodies and one IgG negative control. Three of the histone datasets gave very good results, but one produced no peaks after MACS2, even though we used the same parameters for all of them.
If you solve your problem, could you please let us know?
Thank you very much for your help!
Best wishes!


Hi @gallardoalba
Did you solve the problem that @Manolis1 reported?
We have the same problem now. Could you please show us how to solve it?
Thank you very much for your help!
Best wishes!

Hi @nourmahfel and @Heystone

I reviewed a shared history from @nourmahfel yesterday.

The immediate problem seemed to be that the mapped reads in the upstream BAM datasets were filtered with a very low mapQ threshold (10). That can leave many multi-mapped reads in place, which can lead to few or no significant peaks being found in treatments when compared against the background. I suggested refiltering with a mapQ of 20, then 30, and comparing all of the results afterward to see what happens.
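As a rough illustration of what changing the threshold does, here is a minimal Python sketch that counts how many alignments survive at each mapQ cutoff. It assumes SAM-format text lines, where mapQ is the fifth column. The example records and the function name `count_passing` are made up for illustration and are not taken from the shared history. In Galaxy, the real filtering would be done with a BAM filtering tool, not a script.

```python
def count_passing(sam_lines, min_mapq):
    """Count alignment records whose mapQ (SAM column 5) is >= min_mapq."""
    kept = 0
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            kept += 1
    return kept


# Hypothetical records, only to show the effect of the cutoff.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t12\t50M\t*\t0\t0\t*\t*",
    "r3\t16\tchr2\t300\t25\t50M\t*\t0\t0\t*\t*",
    "r4\t0\tchr2\t400\t0\t50M\t*\t0\t0\t*\t*",
]

for cutoff in (10, 20, 30):
    print(cutoff, count_passing(example, cutoff))
```

Running the same count on the treatment and the control separately shows whether a stricter cutoff removes reads unevenly between them, which is worth knowing before comparing peak calls.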

The GTN tutorials cover BAM filtering by mapQ values, see: Search Tutorials. Start with the first tutorial for a technical overview of where this value is located in BAM/SAM datasets. The others cover practical items with context, including other steps you might want to incorporate to clean up ChIP-seq data (e.g. “Mark duplicates”). That site can be searched with keywords or navigated by analysis domain.

This FAQ covers the most common reasons for technical issues in Galaxy (and bioinformatics analysis generally): Understanding input error messages (or odd, putatively successful “green” results, since not all tools fail when there is a technical problem).

This wasn’t the problem @nourmahfel was having (technically, the inputs were fine), but you should review it quickly, @Heystone, to eliminate anything simple to fix. Examples: What does FastQC report about the read quality, and does that data need more QA, or was the applied QA too aggressive? Is all the data mapped against the exact same reference genome (mismatches happen!)? Were all samples mapped using the same tool? Is the alignment rate different between the samples, and if so, was that expected or explained by read quality differences?

If your data looks fine technically, then you are having a scientific content problem that might be remedied with changes to your methods. Or the data content itself might be the problem (no actual significant difference in the treatment versus the background). Consider examining regions in the BAM files in a genome browser (IGV, UCSC) where you expected peaks to be called.

Your own search here (directly, or use the tags I added) will find more advice at this forum. In short: eliminate technical issues first, then explore your data content and tune analysis methods if needed.

More ways to get help with this type of protocol

GTN Tutorials (query = “ChIP-seq”): Search Tutorials

MACS2 has a Google forum; a search should find it. Other bioinformatics help sites will have more advice. MACS2 has been around for some time.

And this is a good blog post that covers more about what mapQ values represent, plus a bit about how that varies across common alignment tools.

If you are still stuck after doing those checks, share back a link to your history and we can review.

Thanks!

Hi @Heystone

Unfortunately, my problem remains unsolved. I haven't yet found a way to solve it.

If I find something, I will post it. Would you please do the same if you find a solution?

Greetings and wishes for a great day.

Hi Jennifer,

I used a mapQ of 30 before but changed it on the recommendation of the bioinformatician at uni. I didn’t get any results with mapQ 30 either. I tried mapping with BWA and with Bowtie, and neither worked. I have also tried different parameters and different q-value thresholds. I obtained the data from SRA and I am not sure what the problem with it could be. Could you please help to identify the underlying issue with the data?

Thank you,
Kind Regards,
Nour

This comment makes me think that the person who reviewed the data thought it was too sparse to call peaks, so they suggested less aggressive filtering. That is certainly something to try, but all of this is a balance. If you tried something like filtering only the control at 30 (strict), and treatments at 10 (permissive), that may even produce results, but the scientific quality of the result would be questionable, and in a different way than not including a control at all (which already did produce questionable results for one of you).

If there is a suspected data content problem, these are the items to examine:

  1. Run FastQC on the reads before and after other QA is applied (trimming, etc) and review << if the data doesn’t appear to have sufficient quality it may not be usable
  2. Review the BAM mapping rates << this is likely where the problem first starts to impact the analysis, I’m guessing low coverage
  3. Review the BAM alignments in a genome browser (details in original post + consider adding in an annotation GTF – UCSC has several for each mouse build in Galaxy)
  4. Try different filtering methods: mapQ is not the only item; getting rid of optical duplicates is also important (also in original post, plus see protocol advice at other forums, for example: Did you remove ChIP-seq duplicates)
  5. Use the most current version of the genome build for the species (mm10, not mm9)
  6. And, sometimes using a mapper like BLASTN+, with a subset of reads as a query against a target database that contains reads from many species (WGS) can give some information: are the reads really from the species you think they were originally from? Is there contamination of some type?

Checking that the control and treatment samples are not mixed up is also part of this.
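For items 2 and 4, a quick way to see mapping and duplicate rates is to tally the SAM flag bits. The sketch below is only an illustration under stated assumptions: it reads SAM-format text lines, the function name `flag_summary` and the example records are hypothetical, and the duplicate bit (0x400) is only set if a duplicate-marking tool was already run on the data. The flag values themselves come from the SAM format specification.

```python
UNMAPPED = 0x4         # read is unmapped
SECONDARY = 0x100      # secondary alignment
DUPLICATE = 0x400      # marked as PCR or optical duplicate
SUPPLEMENTARY = 0x800  # supplementary alignment


def flag_summary(sam_lines):
    """Tally mapped and duplicate-marked reads from SAM flags (column 2)."""
    total = mapped = duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
            if flag & DUPLICATE:
                duplicates += 1
    return {
        "reads": total,
        "mapped": mapped,
        "duplicates": duplicates,
        "mapping_rate": mapped / total if total else 0.0,
        "duplicate_rate": duplicates / mapped if mapped else 0.0,
    }


# Hypothetical records: mapped, unmapped, duplicate, reverse-strand, secondary.
example = [
    "r1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t1024\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "r4\t16\tchr2\t300\t30\t50M\t*\t0\t0\t*\t*",
    "r1\t256\tchr3\t500\t0\t50M\t*\t0\t0\t*\t*",
]

print(flag_summary(example))
```

Comparing these numbers between the sample that gives no peaks and the samples that work is a quick first check for low coverage or heavy duplication.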

Get the processed reads from archives like SRA, and not the submitted reads, unless you understand how to manipulate the submitted reads into the current Illumina format that most tools know how to interpret. The processed fastq reads from SRA are already in the right format (fastqsanger). You can test this by allowing Galaxy to guess the format and also by running FastQC after the data is loaded (the format is reported).
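To show what the format check is looking at, here is a small heuristic sketch in Python. It assumes that fastqsanger means Phred+33 quality encoding and that older Illumina submissions may use a +64 offset. The function name `guess_quality_offset` and the example quality strings are hypothetical, and this is a simplified guess, not how Galaxy or FastQC actually detect the format.

```python
def guess_quality_offset(quality_strings):
    """Guess the FASTQ quality offset from the lowest quality character seen.

    Returns 33, 64, or None when the data are ambiguous or empty.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        return 33  # characters this low cannot occur in +64 encodings
    if lowest >= 64:
        return 64  # likely +64, but also possible for very high quality +33 reads
    return None    # 59..63 is ambiguous (Solexa+64 or Phred+33)


# Hypothetical quality lines (the 4th line of each FASTQ record).
print(guess_quality_offset(["IIIIHHHH##", "IIIIIIII5I"]))
print(guess_quality_offset(["hhhhggffB", "hhhhhhhhh"]))
```

If a guess like this returns 64 or is ambiguous for reads pulled from an archive, that is a hint the reads were the submitted files and not the processed ones.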

Not all public datasets are high quality, and that can impact the ability to use reads for analysis. Reads that you created yourself or obtained from a collaborator could also have quality/usability problems. This is a judgement call on your part, and the help above and in the original post are the ways to explore data for quality/performance with tools leading up to the problems with MACS2.

Small note: The usegalaxy.org server is undergoing maintenance. I wouldn’t expect the problem in the history I reviewed to be impacted, but this can be double-checked by running the same analysis at a different usegalaxy.* server: usegalaxy.eu or usegalaxy.org.au.
