Seeking advice on amount of ambiguity in featureCounts

Using the tutorials (https://galaxyproject.org/tutorials/nt_rnaseq/), I am analyzing a set of paired-end, unstranded RNA-seq data (mouse).

The QC along the way looks good until I get to featureCounts:

I’ve not used the HISAT2-StringTie-featureCounts pipeline before, and I know this is unstranded, but the amount of ambiguity seems off to me, so I was looking for input from others.

Thanks!
NitDawg


Welcome @NitDawg

Your read data appears to have high duplication. If you run FastQC you’ll find more details/confirmation about read duplication in those reports.

QA trimming (eg: with Trimmomatic) won’t help if the source data itself is highly redundant. It could just be low-quality sequencing, or very deep sequencing was done. Contamination could be a factor, but removing those reads won’t help get more data assigned to a known gene from your reference annotation; it will just reduce the final number of “unassigned-ambiguity” reads later in the pipeline.
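As a rough illustration of what FastQC’s duplication metric is getting at, exact-sequence duplicates can be counted directly. This is a toy Python sketch on made-up reads, not FastQC’s actual algorithm (which bins reads by duplication level):

```python
from collections import Counter

def duplication_rate(reads):
    """Fraction of reads that are exact-sequence duplicates of an
    earlier read -- a rough, FastQC-like duplication estimate."""
    counts = Counter(reads)
    total = len(reads)
    duplicates = total - len(counts)  # every copy beyond the first
    return duplicates / total if total else 0.0

# Toy reads: the first sequence appears 3 times, so 2 of 5 reads
# are duplicates.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTTTAAAA", "GGGGCCCC"]
print(duplication_rate(reads))  # 0.4
```

High values here would line up with the “Sequence Duplication Levels” module flagging in the FastQC report.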

One note: it is important to run HISAT2 with the option to output results that are formatted for StringTie. That is covered in the tutorial but is sometimes missed, so it is worth double-checking. Use the “rerun” button (double-circle icon) on the mapping jobs to review which options you used.
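For reference, on the command line this Galaxy option corresponds to HISAT2’s `--dta` (`--downstream-transcriptome-assembly`) flag. A sketch only; the index name and file paths below are placeholders, not from this thread:

```shell
# --dta reports alignments tailored for transcript assemblers
# such as StringTie (index/file names here are placeholders).
hisat2 --dta -x mm10_index \
  -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  -S sample.sam
```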

More DE analysis tutorials can be found here under the group “Transcriptomics” if you want to compare methods/tool choices:

Hope that helps!


Yup! Most helpful @jennaj! The DE analysis was consistent with the hypothesis, so I’ll keep poking around to get some workflows established.

I did run HISAT2, and under “Spliced alignment options” I used:

Transcriptome assembly reporting: Report alignments tailored for transcript assemblers including StringTie

Just wanted to confirm that was the only spot for it, since the location seemed odd.

Thanks for the quick reply – your input is very much appreciated.


Agree with this. It is a very important option and should probably be moved up into the main form options. Let me think about that a bit more – if I create a ticket, I will post it back here. A few tutorials will need an update if/once the option is moved, but it will improve usage and is worth the effort, imho.

Glad the rest helped. I was thinking about your results and almost wrote back again earlier.

My thoughts: your reads are not mapping uniquely for some reason. This could be due to 1) very short read lengths, 2) repetitive content, or 3) the settings used during mapping (overly permissive).

The first cannot be addressed – unless your QA was too strict and trimmed bases that could have been retained; keeping those bases can sometimes increase mapping specificity. The second cannot be addressed at all once library prep is done and the reads are sequenced (unless you remove those reads during QA, which is not really necessary – eg: they cannot be salvaged and won’t improve “assigned” levels) and could be due to the native repetitive content of the particular genome you are sequencing or mapping against. The last is an additional area to check by tuning the mapping options. Experiment if interested.
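One quick way to quantify the multi-mapping without rerunning anything: HISAT2 writes an `NH:i:<n>` tag (number of reported hits) on each alignment, so unique vs multi-mapped reads can be tallied straight from the SAM lines. A minimal sketch on two toy records (the field values are invented):

```python
def classify_by_nh(sam_lines):
    """Split SAM alignment lines into (unique, multi-mapped) counts
    using the NH tag (number of reported hits per read)."""
    unique, multi = 0, 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        tags = line.rstrip("\n").split("\t")[11:]  # optional tags start at field 12
        nh = next((int(t.split(":")[2]) for t in tags if t.startswith("NH:i:")), 1)
        if nh > 1:
            multi += 1
        else:
            unique += 1
    return unique, multi

# Two toy alignments: one unique (NH:i:1), one multi-mapped (NH:i:4).
sam = [
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF\tNH:i:1",
    "r2\t0\tchr1\t200\t1\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF\tNH:i:4",
]
print(classify_by_nh(sam))  # (1, 1)
```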

For any expression data (not just RNA-seq), the less QA the better in most cases. The mappers can usually handle poorer base quality at the start/end of reads, and at that step they will simply discard any reads that are not “good” enough to map at all, even partially. But definitely remove adaptor and other sequencing artifacts, as those can impact mapping rates.

Why less QA? The goal with expression data is always to preserve the original abundance levels as much as possible for accurate counts/differential expression results downstream.

Best! Jen


Sweet. I’m going back and troubleshooting my QA; however, I’ve now hit a brick wall with HISAT2. I keep getting:

Could not display BAM file, error was:
file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

Even when I go back to my original input files in my previous history and try to run them again through HISAT2, I get the same error. I also re-uploaded the originals and get the same error…extra confused, but probably doing something wrong on my end…going to try a few more things and I’ll report back. Thanks!

https://usegalaxy.org/u/nitdawg/h/hisat2

UPDATE:
When I run the same exact files through HISAT2 on https://usegalaxy.eu, it runs fine. Now I’m even more confused :).

UPDATE 2:
Well, now the same jobs are working again at https://usegalaxy.org/. I looked at all the documentation regarding errors, but all the failed runs came back as: “Remote job server indicated a problem running or monitoring this job.” I’m going to go back and digest the https://galaxyproject.org/support/tool-error/ to better navigate what to do when this happens, but I guess there was something going on? Not sure, but I’ll keep updates coming with this dataset.

Ok, I’m still plugging away, @jennaj…messing around with QA didn’t change much of anything. However, here is what I found. I used HISAT2 as before:

I then ran StringTie, then merged with either mm10_RefSeq.gtf from Galaxy (how I ran things the first time) or UCSC genes.gtf from iGenomes, and got quite different results:

mm10_RefSeq.gtf:

UCSC:

So I’m going through the workflow again to see what’s up. Turtle pace, I know, but getting there.


Hi @NitDawg

Avoid the UCSC GTF output from the Table Browser (which is what that first graph appears to represent). It is fine for some uses, but not DE analysis. Use Gencode or iGenomes for mm10. Exactly why is explained in this newish FAQ, along with more summarized help for DE analysis/tools in general:

Maybe the FAQ will help with the other problems, maybe not, but it is worth reviewing. It mostly addresses usage problems, but also common content problems (though only in generalized terms). Content issues in your read data spawn a range of ways to find out if, or how, you can address them.

For example, if you ended up with ribosomal contamination from the library prep, or the reads are very short, and you later end up with a large number of multi-mapping/not-properly-paired reads, that cannot be solved later. The reads can only be ignored, or you can go back and address the library construction issues – IF you did the prep yourself. IF it is public data, then you are stuck with how they did it, and not all public datasets are of high quality/content, as you almost certainly know :)

You could do some detective work on those multi-mapped reads to find out what content they represent (repetitive content? contamination of some sort? too short to capture a unique hit with the settings used in HISAT2?). Blastn or Megablast could help with some of that, or, for less set-up/work, try tossing a few of those sequences into UCSC’s BLAT web tool and review what tracks overlap the results (I do this ALL the time). BLAT is wrapped for Galaxy, but that means setting up your own server, getting the license (free for academic use), indexing … much more work, and only required if you are not using a UCSC genome (luckily you are … mm10 … so try the web tool; it will be faster/less investment). NCBI also hosts BLAST in a web form, but you won’t get the same kind of results (no pre-processed comparison tracks, all those overlapping HSPs, weird stats – I’ve never been a huge fan of BLAST, but that is my own personal opinion from fighting with it and writing parsers to make more sense of the results … many, many times over the years – others like it fine). Make your own choices about what tools to use if you decide to dig deeper.

Be aware that BLAT is not intended to map NGS reads, but it is very robust. Just paste in the sequences, not the quality scores. The tool will even reformat the input (properly wrap the lines), add in “>” title lines if you paste in a sequence fragment, remove extraneous spaces/content, and more. It is almost impossible to paste in sequence data it will not “fix up” format-wise, and map. If there is any possible hit, the tool will usually find it, and it reports back both strong and weak hits. Then review the results in the UCSC Browser (there are links in the BLAT results to make this quick), scroll down, set tracks, and refresh the view. Next, review what else is overlapping the same genomic regions you have hits for. I’ll even BLAT against a few different genomes – it sort of depends on what the syntenic (“Conservation”) and repeat tracks reveal and/or how fast I just want to guess and get some info back.
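To get a paste-ready BLAT input from a handful of FASTQ records, only the names and sequences are needed – no quality lines. A small sketch (the records and the helper name are invented for illustration):

```python
def fastq_to_fasta(fastq_text, limit=5):
    """Convert up to `limit` FASTQ records into FASTA (sequence only,
    no quality scores) -- a paste-ready snippet for BLAT's web form."""
    lines = fastq_text.strip().splitlines()
    out = []
    # FASTQ records are 4 lines each: @name, sequence, +, qualities.
    for i in range(0, min(len(lines), limit * 4), 4):
        name = lines[i][1:]   # drop the leading '@'
        seq = lines[i + 1]
        out.append(f">{name}\n{seq}")
    return "\n".join(out)

fastq = "@read1\nACGTACGTACGT\n+\nFFFFFFFFFFFF\n@read2\nTTTTAAAACCCC\n+\nFFFFFFFFFFFF\n"
print(fastq_to_fasta(fastq))
```

BLAT will accept even bare sequences, as noted above, but sending clean FASTA keeps the read names attached to the hits.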

FastQC will certainly give some clues, but in a full informatics analysis nothing is truly “automatic” – the QA tools are aids. Then you need to go deeper based on those results or try different methods.

Your reads are not mapping uniquely, and you probably want to find out why, or you can get rid of those reads and move on. Filtering the BAM result to retain only properly paired reads (and optionally removing unmapped reads to make the BAM smaller, eg: easier/faster to process) is an option. Or reverse it and retain those that are not properly paired, to do some detective work on them.

Tool search with “filter” – there are a few tool choices for BAM inputs. The BAMtools version is a reliable choice, but others work fine; the primary difference is how the form is set up/organized. If you understand bitwise flags, then any of them are OK. If not, and/or you don’t want to bother with decoding, pick a tool that has those flags already translated into human-readable options (as the BAMtools “Filter BAM” version does).
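For readers who do want to decode the bitwise flags: the SAM spec defines them as fixed bit positions, so the “properly paired, mapped, primary” filter described above can be expressed directly on the FLAG field. A sketch in Python (flag values 99 and 77 are standard paired-end examples):

```python
# Standard SAM FLAG bits (from the SAM specification); tools like
# BAMtools' "Filter BAM" expose these as human-readable options.
PROPER_PAIR = 0x2
UNMAPPED    = 0x4
SECONDARY   = 0x100

def keep_properly_paired(flag):
    """True for reads that are properly paired, mapped, and primary --
    the filter described above, expressed on the raw FLAG field."""
    return bool(flag & PROPER_PAIR) and not (flag & UNMAPPED) and not (flag & SECONDARY)

# 99 = paired + proper pair + mate reverse + first in pair -> keep
# 77 = paired + read unmapped + mate unmapped + first in pair -> drop
print(keep_properly_paired(99), keep_properly_paired(77))  # True False
```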

Hope that helps a bit more!

Always helpful, @jennaj! Yeah, the libraries were not prepped in my lab, but my revamped workflow is a definite improvement over the first pass (all >15M). The reads were not short, but I will still take a look at the multi’s just to satisfy my curiosity.

In the past, I’ve filtered the multi’s out of the BAMs, but as long as “count multi-mapping” is disabled, the featureCounts output only counts “assigned” reads – is that correct?
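To make the question concrete, featureCounts’ summary categories can be mimicked with a toy overlap model. This is only an illustration of the category logic, not the real implementation; the gene coordinates are invented:

```python
def classify_read(read_start, read_end, features, nh=1):
    """Toy model of featureCounts' default behavior: with multi-mapping
    counting disabled, multi-mapped reads (NH > 1) are never Assigned;
    reads overlapping exactly one feature are Assigned, and reads
    overlapping several are Unassigned_Ambiguity."""
    if nh > 1:
        return "Unassigned_MultiMapping"
    hits = [name for name, (start, end) in features.items()
            if read_start < end and read_end > start]  # half-open overlap test
    if len(hits) == 1:
        return "Assigned"
    if len(hits) > 1:
        return "Unassigned_Ambiguity"
    return "Unassigned_NoFeatures"

genes = {"GeneA": (100, 500), "GeneB": (450, 900)}
print(classify_read(150, 250, genes))        # Assigned (GeneA only)
print(classify_read(460, 480, genes))        # Unassigned_Ambiguity (both genes)
print(classify_read(150, 250, genes, nh=4))  # Unassigned_MultiMapping
```

So pre-filtering the multi’s out of the BAM mostly changes the summary table, not the “Assigned” counts themselves.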

I ended up getting it down to this:

and

Which is much better than the 4-5 M I was at before. I used iGenomes mm10 per your suggestion. I’ll be back.

-NitDawg