How can I improve very low assigned rate in featureCounts?

featurecounts

#1

I"m a new to Galaxy.
I’m trying to analyze my RNA seq data from mice using HISAT2 (or STAR) and featureCount.
My data are stranded paired end.

I could get a good mapping rate in HISAT2, but only 30% of reads were assigned in featureCounts.

HISAT2 (with default settings except for strandedness, R)
56

featureCounts
28

I’m using built-in file of mm10 as a reference genome, because I have no idea which FASTQ/GFF3 files I should chose in UCSC/ensemble.

I have also tried STAR without a GFF3 annotation file and faced with the same problem in the counting step by featureCounts. With a GFF3 file I downloaded from ensemble, I couldn’t get results due to an error.

Could you suggest any possible solutions?
I have been in stuck in the same place for a few weeks.

Thank you very much in advance.


#2

Hi,

The strand for HISAT2 paired-end inputs should be FR, RF, or Unstranded, so this might be a typo and you meant RF?

Featurecounts also requires strandedness to match what was used for mapping. That is an F, R, or Unstranded toggle.

The data is failing for three primary reasons:

  • Unmapped:

    • Did you run the fastq data through QA/QC tools before mapping?
    • Some level of unmapped is expected, it depends on the quality of your sequence data. Trimming cannot eliminate all data problems.
  • Mapping quality:

    • The default is “12” but that can be modified under advanced settings.
    • Was this changed? If yes, try using the default.
  • NoFeatures:

    • If the strandedness is incorrect, the number of reads discarded for this reason can be very high.
    • It could also be high because your reads do not correspond (content-wise) to known transcripts.
    • Was there something special about how the library was constructed? If true, you might need to provide your own reference annotation that matches your sequencing target (ncRNA, etc).

I would suggest comparing your methods to those in this Galaxy Training Network (GTN) tutorial. QA/QC, strandedness assessment, and usage for these tools are all covered.

Using the built-in annotation for mm10 is usually a very good choice for RNA-seq data. There are other sources but I don’t think the result will be much different if using a basic transcript reference annotation dataset from any source.

But if you want to try, Gencode and iGenomes are good alternative sources, with Gencode a bit simpler to get into Galaxy. This prior Q&A is about human (hg38), but both sources also have data for mouse (mm10): RNA-STAR and hg38 GTF reference annotation


#3

I really appreciate your help.

The strand for HISAT2 paired-end inputs should be FR , RF , or Unstranded , so this might be a typo and you meant RF ?

Yes, it is RF. Sorry for the typo.
I selected RF for HISAT2 and R for featureCounts.
I judged the strandedness by Infer Experiment.
25

Did you run the fastq data through QA/QC tools before mapping?
Yes, I did. The result of FASTQC was not bad. Per base sequence quality scores were beyond 35 in all positions.

The default is “12” but that can be modified under advanced settings.
Was this changed? If yes, try using the default.
I didn’t change the value at all…
My screen was like these.


Was there something special about how the library was constructed? If true, you might need to provide your own reference annotation that matches your sequencing target (ncRNA, etc).

I think my data which was given by a predecessor has nothing special, because it was his first time to do RNA seq. The library was constructed by a specialized facility in our university.
My aim is to conduct differential expression analysis by using the data.

If I made any mistakes above, please tell me.

I will read GTN tutorial and the link as to annotation.


#4

This looks correct if the bed dataset used with Infer was from the correct genome/build. You might want to double check that.

This tutorial covers more details about strandedness, but I don’t think that is the problem given the Infer results, even though your Featurecounts output does suggest the data could be unstranded, e.g. about half the reads are counting up correctly to known genes, half are not – a bit suspicious.

From here, I would suggest examining the full HISAT2 BAM data in a genome browser to compare where the reads are mapping with respect to reference annotation. IGV has mm10 pre-loaded along with RefSeq annotation. UCSC also supports this genome and has even more annotation natively available. Both allow the loading of additional tracks, including the bed dataset used with the Infer tool.

If you cannot figure this out, we can help review the tool settings to make sure more is not going on (including a potential tool bug). Message me direct a share link to your history if you want to do that, or let me know if your email registered here is the same as used at Galaxy Main https://usegalaxy.org and I can find the data that way. You don’t need to post anything public (email, share link) – and never share your account password with anyone :slight_smile:


#5

Thanks a lot.

I have organized my history in Galaxy Main. Now it may be a little bit easier to see.
I couldn’t send you mail although I tired. Could you find my history using my e-mail address? I’m using the same address both in Galaxy Help and Main.

I double checked the result of Infer using a UCSC mm10 BED file. The result was the same.
HISAT2>featureCounts still resulted in the bad assigned rate, so I have tried to run STAR>featureCounts using an Ensemble GFF3 annotation file for mice. STAR has finished with an error massage below.

Fatal error: Matched on FATAL ERROR

Fatal INPUT FILE error, no valid exon lines in the GTF file: /galaxy-repl/main/files/030/235/dataset_30235858.dat
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

Mar 08 12:55:52 … FATAL ERROR, exiting

I hope I could solve these problems…


#6

Hi,

Ensembl chromosome identifiers are different than those in the UCSC mm10 genome. The Gencode and iGenomes versions will work. FAQ: Mismatched Chromosome identifiers (and how to avoid them)

Did you visualize the HISAT2 bam and reference annotation (UCSC’s bed, plus whatever gft you decide to use) in a browser?

I’ll take a look at your history but from what you have reported, it sounds like the reads are simply not mapping to known gene regions. There could be many reasons for this, including problems with the original RNA-seq library construction. FastQC should give some clues as well as other RSeqQ tools.


#7

I see. I will run STAR again with the proper identifiers. I hope it works. Thank you.

As for HISAT2 > feature counts, I only used the built-in data.
If you can find out something wrong as to the HISAT2>featureCounts process in the history, please tell me how to solve it.


#8

Hi,

I ran STAR with a GFF3 file from Gencode.
It worked, but the results by featureCounts using the data was similar to those by HISAT2.
HTSeq generated the lots of no_features as well.

STAR>featureCounts
52

STAR> HTSeq
06

Now, I’m trying to visualize the bam file by IGV. I have followed the instruction provided in Galaxy training but I still cannot obtain a proper image. I will continue working on this.

The screenshot og IGV.


#9

Hi - I took a look at your history. The problem is with the sequence content.

Examine the FastQC results and you’ll see a very large number of sequences in the “Overrepresented” and “Sequence Duplication” reports, even for the version of the data that had adaptor removed. Some of these are a single, repeated, nucleotide strings. Others are some other overrepresented sequence (but not adaptor). De-duplicated data runs < 30%, which roughly matches the fraction of reads that end up getting counted by Featurecounts/HTseq-count. That is going to be your “best possible result” when using this data.

There was almost certainly a problem with the library construction. Whether to use the data as-is or to go back and rebuild/sequence the RNA libraries is your choice.

Help for understanding FastQC results: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/


Update: I took a few of the overrepresented sequences and ran a quick BLAT at UCSC through the online tool: https://genome.ucsc.edu. Try that too if you want. The reads are mapping to RNA repetitive elements, also pointing to a potential problem with the library construction (RNA contamination). This tool is very easy to use and has click-through display to the browser where annotation tracks can be visualized along with your data. The “Repeat Masker” track is displayed by default but that can be tuned (add in gene annotation tracks, etc). Your bam data from HISAT2/STAR can also be viewed here using the link inside of the expanded datasets named " display at UCSC main". FastQC help for the “Overrepresented” analysis module.


#10

Thank you very much indeed. Now, I’m relieved a little bit.
This library was built by a specialized facility in our university.
Do you think the problem is due to the quality of samples we submitted or the way they constructed the library?


#11

It could be either one, you’ll need to do some more detective work. Is there someone at the sequencing facility you can talk to?

Examples: gDNA contamination could be in the original sample (and would presumably be checked for by whoever is doing the library prep). rRNA contamination could be introduced during library prep. Since your data mapped so well, cross-species contamination doesn’t seem to be an issue.

There could be more going on, these are just guesses from QC done so far.

There is much discussion about RNA-seq contamination re: how to characterize it and how to correct for it (when possible) at these forums: https://www.biostars.org/ https://bioinformatics.stackexchange.com/. You could also do a literature search.

The Galaxy tutorials review basic QA/QC but do not cover every potential problem/resolution in detail. That said, many of the tools referenced at other sources are available in Galaxy or will have some similar tool/method/workflow possible.