Hi @NitDawg
Avoid the UCSC GTF output from the table browser (what that first graph looks like it represents). It is fine for some uses, but not DE analysis. Use Gencode or iGenomes for mm10. Exactly why is explained in this new-ish FAQ along with more summarized help for DE analysis/tools in general:
Maybe the FAQ will help with other problems, maybe not, but it is worth reviewing. It addresses mostly usage problems but also common content problems (only in generalized terms). Content issues in your read data branch out into a range of ways to find out whether, or how, you can address them.
For example, if you ended up with ribosomal contamination from the library prep, or the reads are very short, and then later end up with a large number of multi-mapping/non-properly paired reads, that cannot be solved later. The reads can only be ignored, or you can go back and address library construction issues, if you did this yourself. If this is public data, then you are stuck with how they did it – and not all public datasets are of high quality/content, as you almost certainly know.
You could do some detective work on those multi-mapped reads to find out what content they represent (repetitive content? contamination of some sort? too short to capture a unique hit with the setting used with HISAT2?). Blastn or Megablast could help with some of that, or for less set-up/work, try tossing a few of those sequences into UCSC’s BLAT web tool and review what tracks overlap the results (I do this ALL the time). BLAT is wrapped for Galaxy, but that means setting up your own server, getting the license (free for academic use), indexing … much more work, but required if you are not using a UCSC genome (but luckily you are … mm10 … so try the web tool, will be faster/less investment). NCBI also hosts BLAST in a web form but you won’t get the same kind of results (no comparison tracks pre-processed, all those overlapping HSPs, weird stats – I’ve never been a huge fan of BLAST but that is my own personal opinion from fighting and writing parsers to make more sense of the results … many, many times over the years – but others like it fine). Make your own choices about what tools to use, if you decide to dig deeper.
Be aware that BLAT is not intended to map NGS reads, but is very robust. Just paste in the sequences, not the quality scores. The tool will even reformat (properly wrap the lines), add in “>” title lines if you paste in a sequence fragment, remove extraneous spaces/content, and more. It is almost impossible to paste in sequence data that it will not “fix up” format-wise, and map. If there is any possible hit, the tool will usually find it, and report back both strong and weak hits. Then review the results in the UCSC Browser (there are links in the BLAT results to make this quick), scroll down, set tracks and refresh the view. Next, review what else is overlapping those same genomic regions you have hits for. I’ll even BLAT against a few different genomes – it sort of depends on what the syntenic (“Conservation”) and repeat tracks reveal and/or how fast I just want to guess and get some info back.
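If you want to pull a handful of reads out of a FASTQ file to paste into the BLAT web form, something like the following could work. This is only a minimal sketch: the function name `fastq_to_fasta` and the `max_reads` parameter are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
def fastq_to_fasta(fastq_lines, max_reads=5):
    """Return FASTA text for the first `max_reads` records of 4-line FASTQ.

    Hypothetical helper: keeps only the read name and sequence,
    and drops the quality scores (BLAT does not need them).
    """
    records = []
    it = iter(fastq_lines)
    for header in it:
        if len(records) >= max_reads:
            break
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).strip()
            plus = next(it).strip()
            next(it)  # quality line, intentionally discarded
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("input does not look like 4-line FASTQ")
        name = header[1:].split()[0]  # read ID without the description
        records.append(f">{name}\n{seq}")
    return "\n".join(records)


if __name__ == "__main__":
    demo = [
        "@read1 extra\n", "ACGT\n", "+\n", "IIII\n",
        "@read2\n", "GGCC\n", "+\n", "IIII\n",
    ]
    print(fastq_to_fasta(demo, max_reads=2))
```

With a real file you would pass an open file handle instead of the `demo` list, then copy the printed FASTA into the BLAT form.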
FastQC will certainly give some clues, but in full informatics analysis, nothing is truly “automatic” – the QA tools are aids. Then you need to go deeper based on those results or try different methods.
Your reads are not mapping uniquely and you probably want to find out why, or you can get rid of those and move on. Filtering the BAM result to only retain properly paired reads (and optionally remove unmapped reads, to make the BAM smaller, i.e. easier/faster to process) is an option. Or reverse it and retain those that are not properly pairing, to do some detective work on them.
Do a tool search with “filter” – there are a few tool choices for BAM inputs. The BAMtools version is a reliable choice but others work fine. The primary difference is how the form is set up/organized. If you understand bitwise flags, then any are OK. If not, and/or you don’t want to bother with decoding, then pick a tool that has those flags already translated into human-readable options (as the BAMtools “Filter BAM” version does).
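If you are curious what the bitwise flags mean, here is a rough sketch of decoding the FLAG column of a SAM/BAM record. The bit values come from the SAM specification; the short names and the helpers `decode_flag` and `is_properly_paired` are just illustrative choices, not part of any Galaxy tool.

```python
# Bit values as defined in the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels for every bit set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_properly_paired(flag):
    """True when the proper-pair bit is set and the read itself is mapped."""
    return bool(flag & 0x2) and not (flag & 0x4)


if __name__ == "__main__":
    for flag in (99, 77):
        print(flag, decode_flag(flag), is_properly_paired(flag))
```

A flag of 99 decodes to a mapped, properly paired first mate, while 77 is an unmapped read whose mate is also unmapped. That is the kind of distinction the human-readable filter options are making for you.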
Hope that helps a bit more!