RNA-Seq analysis in Galaxy

Hello
I have a problem to use the featurecount part in Galaxy database. My RNA-Seq data about arabidopsis thaliana. I use the external genome file from my computer I downloaded from ensemble genome browser. Then I upload my genome file for HISAT2 part. Now for run the featurecount part, I can not see the arabidosis genome lfor execute. What am I doing now?

1 Like

Hi @Ahmad_B

You’ll need two reference files:

  1. Reference genome in fasta format
  2. Reference annotation in GTF or GFF3 format.

Some tools will interpret GFF3 just fine, and some will require that the tool form parameters are adjusted, and some will need to be converted to GTF format (tool → gffread).

Featurecounts will accept both, but you’ll need to tune two parameters to match the attributes: GFF feature type filter and GFF gene identifier. See the tool form for more details.

Tips:

  1. Get both data from the same source
  2. Confirm the format of each to avoid odd errors/results
  3. Also confirm the content – both need to be based on the same exact assembly and use the same exact chromosome identifiers. This should be true if sourced from the same data provider, but always double check yourself.

Most of the Q&A for this tool at this forum is about getting the format and content to be a correct “match”, but also see the FAQs below.

FAQs

And, scroll down into the help section for any tools to find more tips. Some have example data and link out to author resources, too. If a tool is included in a tutorial, that will also be linked, or you can navigate the tutorials directly by analysis domain. Example: Transcriptomics.

This sounds like you need to still obtain the reference annotation.

Hope that helps!

I use the genome data and gfft or gff file to annotation. but after I use a featurecount, I had a result data as a zero point for my annotation.
you can see the below to my final results in this part. I do not know I have a problem in HISAT2 part and BAM file or my samples data is not good? because my FASTQC also good results. I don’t know what am i doing now?
please help me.

Geneid HISAT2 on data 50 data 49 and data 80: aligned reads (BAM)
Geneid HISAT2 on data 50, data 49, and data 80: aligned reads (BAM)
AT1G01010 0
AT1G01020 0
AT1G01030 0
AT1G01040 0
AT1G01046 0
AT1G01050 0
AT1G01060 0
AT1G01070 0
AT1G01073 0
AT1G01080 0
AT1G01090 0
AT1G01100 0
AT1G01110 0
AT1G01115 0
AT1G01120 0
AT1G01130 0

Hello @Ahmad_B

Please post back the first few lines of your reference genome and of your reference annotation.

You can also post back the area of the job details view (“i” icon) that lists out the inputs and parameters. Keep this content structured please: copy/paste or screenshot that captures all of it.

Both together will provide more context to help sort out what exactly is going on.

thank you this is my data

the below datais the gff format file for annotation:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10
#!genome-build-accession NCBI_Assembly:GCF_000001735.3
#!annotation-source TAIR and Araport Araport11
##sequence-region NC_003070.9 1 30427671
##species Taxonomy browser (Arabidopsis thaliana)
NC_003070.9 RefSeq region 1 30427671 . + . ID=id0;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003070.9 RefSeq gene 3631 5899 . + . ID=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
NC_003070.9 RefSeq mRNA 3631 5899 . + . ID=rna0;Parent=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;Name=NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2
NC_003070.9 RefSeq exon 3631 3913 . + . ID=id1;Parent=rna0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2
NC_003070.9 RefSeq exon 3996 4276 . + . ID=id2;Parent=rna0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2
NC_003070.9 RefSeq exon 4486 4605 . + . ID=id3;Parent=rna0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2

the below data is the reference genome I downloaded from ensembel genome browser:

1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
CTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTT
CTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA
TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT
GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAA
GCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTT
ATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTTGGACATTTATTGTCATTCTT
ACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTG
TAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGG
GATGGTCCTTTAGCATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAA
AGGATTGGTGGTTTGAAGACACATCATATCAAAAAAGCTATCGCCTCGACGATGCTCTAT

is the chromosome naming in the GFF3 data. Is it really for just one chromosome? You will need annotation for all chromosomes instead.

is the chromosome naming in the fasta, with

as the part of the fasta that is not description content

These don’t match up, so the tool cannot actually use the annotation.

How to use a custom reference genome. Be sure to follow the fasta formatting guidelines. Custom genome + custom build: How to use a genome that is not natively indexed at the server you are working at - #2 by jennaj

Then, you need reference annotation that is based on that exact genome assembly, including the same chromosome identifiers. The headers in your current annotation indicate that it is just annotation for one contig, not all in the assembly. Plus the identifier that is included is a mismatch.

What to do:

  1. Clean up the format of your genome fasta assembly.
  2. Find the reference annotation for that full genome assembly, and load it to Galaxy.
  3. Double check that both use the same chromosome identifiers, and that none are missing from either file. Then double check that your BAM data uses the same format, or consider remapping against your updated fasta.
  4. Once your BAM is ready, and you have “matching” annotation available, that is when you can adjust downstream tool form settings to match the annotation’s attributes (9th column).

I know this seems complicated, but you only need to get the assembly and annotation into Galaxy once, then formatted once. Consider putting the files in a dedicated history used just for storing the paired reference datasets. Copy those into histories for mapping and other downstream steps.

So far, you have all of the information you need except for the full matching reference annotation.

Hello Jennifer
I tried with same genome ref seq data and annotation file from NCBI and also ensemble genome browser. but I get the zero data same as my previous messages. I do not know what am I doing now?

Hi @Ahmad_B

Are the reference genome and reference annotation a “match” this time? And, you remapped against that reference genome (it it was updated at all)?

If you want to post back the first few lines of each again, we can try to help again. You could also post back the job details view if you think the problem is introduced at that step (capture the link from the “i” icon inside one of output datasets).

genome file

NC_003070.9 Arabidopsis thaliana chromosome 1 sequence
ccctaaaccctaaaccctaaaccctaaacctctGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG
AATCCCTAAATACCTAAttccctaaacccgaaaccggTTTCTCTGGTTGAAAATCATTGTGtatataatgataattttat
CGTTTTTATGTAATTGCTTATTGTTGTGtgtagattttttaaaaatatcatttgagGTCAATACAAATCCTATTTCTTGT
GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATT
TGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT
GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGgaaaattatttagttg
taGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAGCATTTAT
TCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAAAGGATTGGTGGTTTGAAGACACATCATATCAAAAAAGCTA
TCGCCTCGACGATGCTCTATTTCTATCCTTGTAGCACACATTTTGGCACtcaaaaaagtatttttagatgtttgttttgc
ttctttgaagTAGTTTCTCTTTGCAAAATTCCTCTTTTTTTAGAGTGATTTGGATGATTCAAGACTTCTCGGTACTGCAA

annotation file

Chr1 TAIR10 exon 3631 3913 . + . transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;
Chr1 TAIR10 exon 3996 4276 . + . transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;
Chr1 TAIR10 exon 4486 4605 . + . transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;
Chr1 TAIR10 exon 4706 5095 . + . transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;

also tried with this annotation file

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10
#!genome-build-accession NCBI_Assembly:GCF_000001735.3
#!annotation-source TAIR and Araport Araport11
##sequence-region NC_003070.9 1 30427671
##species Taxonomy browser (Arabidopsis thaliana)
NC_003070.9 RefSeq region 1 30427671 . + . ID=id0;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003070.9 RefSeq gene 3631 5899 . + . ID=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
NC_003070.9 RefSeq mRNA 3631 5899 . + . ID=rna0;Parent=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;Name=NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2

Those files are still a mismatch.

What to do:

  1. See if the data provider has a matched set of files
  2. Or, see if there is a mapping file available to use for a Replace tool and use it (scroll down on the tool form to find the available ‘mapping files’): Replace column by values which are defined in a convert file
  3. Get both from a different data provider

Hello Jennifer
I tried several data but I can not get the trure results. Cn you send me the suitablr ref genome and annotation file for Arabidopsis Thaliana to run it?