What is the meaning of Reference Gene Model in FPKM Tools in Galaxy?

Hi everyone, I have 3 questions:

  1. In the “FPKM Counts” tool (which gives us the FPKM for each gene), there is a parameter called “Reference Gene Model in BED format”. What does that mean? Is it the annotation file (GTF)?
  2. If the answer is the annotation file (GTF) in BED format, why can't the annotation file downloaded from NCBI be converted to BED format in Galaxy, while the annotation file (GTF) obtained from Ensembl Plants converts fine?
  3. I did all of my analysis using the GTF file from NCBI, but now Galaxy cannot convert this GTF file to BED format. Can I use the GTF from Ensembl for this step? Are GTF files from different sources the same?

Hi @Dr.Lida

These are good questions! :slight_smile: Reference annotation can vary quite a bit between data providers. Some tools expect the content in a variation of BED format and other tools expect a variation of GFF/GTF/GFF3 formats. The good news is that converting between these formats is usually possible.

Let’s break this down using the tool form and resources linked from it for context, then I’ll explain the steps to transform your data.


Tool Form → RSeQC

The top section where you input the gene model specifies that it is expecting a BED12 input (toggle accepted formats). That means a BED dataset with 12 columns. UCSC originally designed the specification and has a good description here.
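To make "BED12" concrete, here is a minimal Python sketch that checks whether one line looks like a 12-column BED record. The column names follow the UCSC BED description; the function name `check_bed12_line` and the example record are made up for illustration, and the checks are only a rough sanity test, not what RSeQC or Galaxy run internally.

```python
# Column names as listed in the UCSC BED format description.
BED12_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def check_bed12_line(line):
    """Return a list of problems for one BED line (empty list = looks like BED12).

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        return [f"expected 12 tab-separated columns, found {len(fields)}"]
    rec = dict(zip(BED12_COLUMNS, fields))
    try:
        start, end = int(rec["chromStart"]), int(rec["chromEnd"])
        count = int(rec["blockCount"])
        # Block lists are comma-separated and may end with a trailing comma.
        sizes = [int(x) for x in rec["blockSizes"].rstrip(",").split(",")]
        starts = [int(x) for x in rec["blockStarts"].rstrip(",").split(",")]
    except ValueError:
        return ["non-integer value in a coordinate or block column"]

    problems = []
    if start >= end:
        problems.append("chromStart must be smaller than chromEnd")
    if rec["strand"] not in {"+", "-", "."}:
        problems.append("strand must be '+', '-' or '.'")
    if len(sizes) != count or len(starts) != count:
        problems.append("blockCount does not match blockSizes/blockStarts")
    elif starts[0] != 0 or start + starts[-1] + sizes[-1] != end:
        problems.append("blocks do not span chromStart to chromEnd")
    return problems


# Made-up example record: one transcript with two exons (blocks).
example = "chr1\t1000\t5000\tgene1\t0\t+\t1200\t4800\t0\t2\t500,1000,\t0,3000,"
print(check_bed12_line(example))
```

An empty list means the line passed these basic checks. A GTF line fed to the same function fails immediately on the column count, which is a quick way to see why the tool form will not accept GTF directly.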

Screenshot of the Input area


Next, further down on the tool form is the Help section. (scroll down)

Quote from tool form

About RSeQC

The RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data. “Basic modules” quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while “RNA-seq specific modules” investigate sequencing saturation status of both splicing junction detection and expression estimation, mapped reads clipping profile, mapped reads distribution, coverage uniformity over gene body, reproducibility, strand specificity and splice junction annotation.

Following the link to the authors’ website (RSeQC) leads to this:

Screenshot of RSeQC homepage


And at that page (RSeQC: An RNA-seq Quality Control Package — RSeQC documentation) more help is provided.

Screenshot of RSeQC Gene Model page


What to do

UCSC’s reference genomes are the versions that the public Galaxy servers pre-index for tools. That means if you used a native index for other steps (like mapping), you can also source the BED12 from UCSC.

If you are using a reference genome that is not supported by UCSC, you can convert to BED12 from a standardized GTF format. UCSC has another useful guide here for GFF/GTF/GFF3 formats.

If you have GTF now, first make sure the format is standardized. This usually means removing any extra headers that may have been added by the data provider. If your file is actually GFF3, you may need to transform it to GTF first, then convert to BED12.
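As a rough illustration of what "removing extra headers" can look like outside of Galaxy, here is a small Python sketch. It assumes that header lines start with `#`, `track ` or `browser `, and that every data line should have exactly 9 tab-separated columns; the function name `standardize_gtf_lines` is made up for this example, and your provider's file may need different rules.

```python
# Assumed header styles; adjust for what your data provider actually adds.
HEADER_PREFIXES = ("#", "track ", "browser ")


def standardize_gtf_lines(lines):
    """Drop header/blank lines and separate out lines that are not 9-column GTF.

    Returns (kept, rejected): the cleaned data lines, and the 1-based line
    numbers of lines that did not have 9 tab-separated columns.
    Hypothetical helper for illustration only.
    """
    kept, rejected = [], []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text.strip() or text.startswith(HEADER_PREFIXES):
            continue  # header, comment, or empty line
        if len(text.split("\t")) == 9:
            kept.append(text)
        else:
            rejected.append(number)  # e.g. space-delimited or truncated lines
    return kept, rejected


# Made-up input: two provider headers, one blank line, one valid record.
sample = [
    "#!genome-build example\n",
    "##provider example\n",
    "\n",
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g"; transcript_id "t";\n',
]
kept, rejected = standardize_gtf_lines(sample)
print(len(kept), rejected)
```

Any line numbers reported in `rejected` are worth inspecting by hand, since they usually point at the formatting problem that blocks the conversion.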

Screenshot of a search in the tool panel with the keyword bed12


I ran a test using a GTF input with the tool Convert GTF to BED12 and it produced what I expected (match to the UCSC specification). RSeQC should be able to consume it fine. You can do the same for your reference annotation once in a standardized GTF format.
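If it helps to see what the conversion does conceptually, here is a simplified Python sketch that groups GTF `exon` records by `transcript_id` and writes one BED12 line per transcript. This is an illustration under stated assumptions, not the code behind the Galaxy tool: the function name `gtf_exons_to_bed12` is made up, CDS information is ignored (thickStart and thickEnd are both set to chromStart), and score and itemRgb are filled with `0`.

```python
import re
from collections import defaultdict


def gtf_exons_to_bed12(gtf_lines):
    """Build one BED12 line per transcript from GTF exon records.

    Hypothetical, simplified helper: no CDS handling, no attribute validation.
    """
    transcripts = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        chrom, start, end, strand, attrs = fields[0], fields[3], fields[4], fields[6], fields[8]
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match is None:
            continue
        # GTF coordinates are 1-based inclusive; BED is 0-based, half-open.
        transcripts[(chrom, strand, match.group(1))].append((int(start) - 1, int(end)))

    bed_lines = []
    for (chrom, strand, name), exons in transcripts.items():
        exons.sort()
        tx_start, tx_end = exons[0][0], exons[-1][1]
        block_sizes = ",".join(str(e - s) for s, e in exons) + ","
        block_starts = ",".join(str(s - tx_start) for s, _ in exons) + ","
        bed_lines.append("\t".join([
            chrom, str(tx_start), str(tx_end), name, "0", strand,
            str(tx_start), str(tx_start),  # thickStart/thickEnd: no CDS carried over
            "0", str(len(exons)), block_sizes, block_starts,
        ]))
    return bed_lines


# Made-up two-exon transcript.
gtf = [
    'chr1\ttest\texon\t1001\t1500\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttest\texon\t4001\t5000\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
for bed in gtf_exons_to_bed12(gtf):
    print(bed)
```

The main things to notice are the coordinate shift (GTF start minus 1) and that each exon becomes one "block" relative to the transcript start. For real data, the Galaxy tool is the safer route, since it handles details this sketch skips.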

Shared history → https://usegalaxy.eu/u/jenj/h/help-gtf-to-bed12

A tool might be able to do the conversion automatically if the data is in a standardized GTF format. If a tool cannot do the transformation, double check the format, or consider doing the transformation yourself. We have a guide here about standardizing formats → FAQ: Working with GFF GFT GTF2 GFF3 reference annotation

And these tutorials include examples of using RSeQC → Galaxy Training!


So, that’s a lot of information! Please review and let us know if you need more help. You can share a history back with what you did, or at least the starting annotation file, and we can troubleshoot what to do next. :scientist: