Why the TSS reference point vs Scale region pattern very different?

Hi I would like to ask when i do compute matrix of my chipseq data in c elegans, I use bamcoverage to normalise to input and get the signal vs input (my bin size was set to 50).
Then I use the bigwig file generated from this bamcoverage to plot against the all genes gtf file with computematrix with the bin size as 10 for TSS reference point ±2000bp, I saw a very sharp peak at TSS with plotprofile.
Then I also use computematrix to plot scale region (–regionBodyLength =2000) -a 250 -b 250 then I tried bin size of 1, 5, 50 all gives me a plot a massive board peak behind TSS and the TSS peak is much less obvious.
It is so confusing whether my protein bind TSS sharply now, anyone could tell me what cause such issue?

Hi @emilyhokning

The scaling is probably distorting the data. The computeMatrix tool has a many fine-tuning options. Right under where you set the region, there are more options to fine tune the scaling/resolution details.

After toggling to “scale-regions”, and setting a value, there are two primary scaling dimensions under Set distance up- and downstream of the given regions (toggle to “yes” to reveal the defaults). The first set describes the region size up/down to consider for scaling, and the other sets another up/down region inside that to not scale.

Experimenting with those options is probably what you are looking for to better capture the TSS peak resolution without as much distortion. So, a hybrid instead of “all or nothing” for the scaling. Shrinking (scaling) all to 2000 bases, then 2500 bases with the extensions, then focusing on a condensed 500 bases up/down is likely just too dense to capture what is revealed with the reference-point plots (4000 bases).

If I misunderstood something, maybe share your example history or at least some screenshots? These graphs are highly dependent on the parameters, so for each plot, also include a screenshot of the parameters used (click on the “i” icon to find these and copy/paste or screenshot the content under the “input/parameters” section, for each). We are looking for the kind of information that the tool’s Gallery exposes in their examples (plot + parameters): https://deeptools.readthedocs.io/en/latest/content/example_gallery.html#

I hope this helps!

Dear Jenn
Thanks for your reply!
So I now have some hope that my protein still peaks at TSS after knowing scale region mode is distorting the regions.
attached is the image on right is the TSS reference point plot and on left is the scale region mode, pls look at the blue curve and ignore the green one.
I would like to ask does setting the scale as higher and the up/down regions as lower will shrink and distored the image less? I’ve tried multiple values for that but still a board peak from TSS to TES.

Also does it mean the upstream and downstream regions are not shrunk but the regions from TSS to TES is shrunk?

And for hybrid scaling, does it mean exclude 2000 bp from 3’ scaling so as to see the TSS signal better.

Hi @emilyhokning

The extra graphs help. If you are getting near exact results for TSS and TES regions, that can indicate a strand problem during data processing.

My guess is that strand was not included in one or more files, or it was not interpreted by a tool correctly (some input file format problem or similar). Address this first.

Dear Jenn,
This is the first row of my input gtf, it does have the strand information, it should be a gtf format but galaxy auto detect it as gff.

Seqname Source Feature Start End Score Strand Frame Group
chrV WormBase gene 8049703 8050911 . - . gene_id WBGene00000620; gene_name col-43; gene_source WormBase; gene_biotype protein_coding;

But when I change the format to gtf manually in galaxy, computematrix will fail the run due to an error “computematrix ValueError: invalid literal for int() with base 10: ‘WormBase’”
This “WormBase” should be the column 2 of the gtf which is the source and I dont know what cause that error. But replacing the file type as gff again on galaxy will eliminate this error.

And sorry for not understanding, do you mean the scale region mode may fail to detect the strands but the reference point option could detect the strands?

Thanks very much again!

Hi @emilyhokning

What you have right now is a sort of hybrid file – part GFF3, part GTF. Some tools could have trouble understanding it.

Would you like to share your history with this data? You can unshare after we are done.

Hi Jenn,
this is the link to my history:

Btw how could I tell GFF3 from GTF?
Thanks very much!

Hi @emilyhokning

What happens if you filter the annotation for just the feature (3rd column) that is labeled as “transcript”? You can manipulate which regions to plot – and TSS/TES are based on the transcript genomic footprints.

For the warning messages from ComputeMatrix, that probably means the annotation (GFF file) had genome coordinates that were not also represented in the bigWig file. How that file was created is not in this history. I would suggest that you investigate what is going on. Are you certain that all data was based on the same exact reference genome? https://training.galaxyproject.org/training-material/faqs/galaxy/datasets_chromosome_identifiers.html

Start by checking the GFF coordinates for one of the messages, and determine if that same region is in your bigWig (by reviewing the input files that created it).

Skipping transcript_r249, due to being absent in the computeMatrix output.

Dear Jenn,
I generated the file by aligning to cel11(ce235) C elegans genome(is locally cached) in Map with Bwa with default settings, then used samtools view to filter away regions that is unmapped and then samtools sort default settings, then I merge my chip-seq replicates and also merge my input replicates, then I mark duplicates (remove dup = true) for both the merged Chipseq and merged input. Then I used bamcompare: normalized using CPM, extend reads costume length=200 Actually i used a encode blacklist region here in bamcompare.

But then I tried to computeMatrix adding the blacklist region again and it is still skipping similar transcripts.
I would like to know what does it mean by transcript_r249? as I couldnt find it from the gtf, does it means row number 249.

I would like to ask if the ce11 in mapwithBWA locally cached could be the problem as the interested gene gtf im using for computematrix is also the version Ce11(cel235) downloaded from ensmbl and filtered for genes im interested in.
Thanks again for your help!

Sorry to add some information
Actually in another history I used the same file mapped with BWA with that locally cached ce11 genome samtools filter and sorted, used MACS2 to call peak, the error is empty, I further used ChIPseeker and this ce11 “gtf” or “gff3” im not sure now :sweat_smile: to do the peak annotation it didnt show others error except for chrMtdna which is not significant for this chipseq study. I wonder does my merge and remove duplicate, and bamcompare causes this problem. I will include the file input for bamCompare and do bamCompare in this history.

Thanks for your patient and kind help in troubleshooting my problem, I’m struggling for my thesis and this is one of the results I intended to display, if multiple regions are skipped in the output I dont know if I can draw any conclusion base on the plot :frowning:

Hi @emilyhokning

Did you confirm that your reference annotation is a “match” for the ce11 reference genome hosted in Galaxy? If not, do that next.

Ensembl uses a different chromosome naming scheme than UCSC.

The ce11 reference genome indexed at public Galaxy servers was sourced from UCSC.

UCSC also hosts reference annotation if you want to try theirs to see what happens.

How to get a GTF from UCSC is in here: https://training.galaxyproject.org/training-material/faqs/galaxy/datasets_working_with_reference_annotation.html

And yes, mismatched reference data produces scientifically “bad” results. Others will definitely notice. Please give the above a try.

Dear Jenn
I tried with a UCSC downloaded GTF and still multiple regions are skipped. would it be possible I dont have any signal in this regions from my ChIPseq.
[Parent Directory]
(https://hgdownload.soe.ucsc.edu/goldenPath/ce11/bigZips/) - e1**1.ensGene.gtf.gz](https://hgdownload.soe.ucsc.edu/goldenPath/ce11/bigZips/genes/ce11.ensGene.gtf.gz) 2020-09-01 12:56 7.1M ce11.ncbiRefSeq.gtf.gz 2020-01-10 09:41 5.7M ce11.refGene.gtf.gz 2020-01-10 09:41 6.7M


Dear Jenn
Thanks very much for your patience, I just realised when I turn off the skip zero, in computeMatrix these error disappeared.

1 Like