RNA-Seq analysis in Galaxy

Ahmad_B · October 12, 2023, 12:25pm

Hello
I have a problem to use the featurecount part in Galaxy database. My RNA-Seq data about arabidopsis thaliana. I use the external genome file from my computer I downloaded from ensemble genome browser. Then I upload my genome file for HISAT2 part. Now for run the featurecount part, I can not see the arabidosis genome lfor execute. What am I doing now?

jennaj · October 13, 2023, 8:03pm

Hi @Ahmad_B

You’ll need two reference files:

Reference genome in fasta format
Reference annotation in GTF or GFF3 format.

Some tools will interpret GFF3 just fine, and some will require that the tool form parameters are adjusted, and some will need to be converted to GTF format (tool → gffread).

Featurecounts will accept both, but you’ll need to tune two parameters to match the attributes: GFF feature type filter and GFF gene identifier. See the tool form for more details.

Tips:

Get both data from the same source
Confirm the format of each to avoid odd errors/results
Also confirm the content – both need to be based on the same exact assembly and use the same exact chromosome identifiers. This should be true if sourced from the same data provider, but always double check yourself.

Most of the Q&A for this tool at this forum is about getting the format and content to be a correct “match”, but also see the FAQs below.

FAQs

And, scroll down into the help section for any tools to find more tips. Some have example data and link out to author resources, too. If a tool is included in a tutorial, that will also be linked, or you can navigate the tutorials directly by analysis domain. Example: Transcriptomics.

This sounds like you need to still obtain the reference annotation.

Hope that helps!

Ahmad_B · October 16, 2023, 7:24pm

I use the genome data and gfft or gff file to annotation. but after I use a featurecount, I had a result data as a zero point for my annotation.
you can see the below to my final results in this part. I do not know I have a problem in HISAT2 part and BAM file or my samples data is not good? because my FASTQC also good results. I don’t know what am i doing now?
please help me.

Geneid	HISAT2 on data 50	data 49	and data 80: aligned reads (BAM)
Geneid	HISAT2 on data 50, data 49, and data 80: aligned reads (BAM)
AT1G01010	0
AT1G01020	0
AT1G01030	0
AT1G01040	0
AT1G01046	0
AT1G01050	0
AT1G01060	0
AT1G01070	0
AT1G01073	0
AT1G01080	0
AT1G01090	0
AT1G01100	0
AT1G01110	0
AT1G01115	0
AT1G01120	0
AT1G01130	0

jennaj · October 16, 2023, 8:16pm

Hello @Ahmad_B

Please post back the first few lines of your reference genome and of your reference annotation.

You can also post back the area of the job details view (“i” icon) that lists out the inputs and parameters. Keep this content structured please: copy/paste or screenshot that captures all of it.

Both together will provide more context to help sort out what exactly is going on.

Ahmad_B · October 16, 2023, 8:47pm

thank you this is my data

the below datais the gff format file for annotation:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10
#!genome-build-accession NCBI_Assembly:GCF_000001735.3
#!annotation-source TAIR and Araport Araport11
##sequence-region NC_003070.9 1 30427671
##species Taxonomy browser (Arabidopsis thaliana)
NC_003070.9 RefSeq region 1 30427671 . + . ID=id0;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003070.9 RefSeq gene 3631 5899 . + . ID=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
NC_003070.9 RefSeq mRNA 3631 5899 . + . ID=rna0;Parent=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;Name=NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2
NC_003070.9 RefSeq exon 3631 3913 . + . ID=id1;Parent=rna0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2
NC_003070.9 RefSeq exon 3996 4276 . + . ID=id2;Parent=rna0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2
NC_003070.9 RefSeq exon 4486 4605 . + . ID=id3;Parent=rna0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2

the below data is the reference genome I downloaded from ensembel genome browser:

1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
CTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTT
CTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA
TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT
GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAA
GCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTT
ATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTTGGACATTTATTGTCATTCTT
ACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTG
TAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGG
GATGGTCCTTTAGCATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAA
AGGATTGGTGGTTTGAAGACACATCATATCAAAAAAGCTATCGCCTCGACGATGCTCTAT

jennaj · October 16, 2023, 10:26pm

is the chromosome naming in the GFF3 data. Is it really for just one chromosome? You will need annotation for all chromosomes instead.

is the chromosome naming in the fasta, with

as the part of the fasta that is not description content

These don’t match up, so the tool cannot actually use the annotation.

How to use a custom reference genome. Be sure to follow the fasta formatting guidelines. Custom genome + custom build: How to use a genome that is not natively indexed at the server you are working at - #2 by jennaj

Then, you need reference annotation that is based on that exact genome assembly, including the same chromosome identifiers. The headers in your current annotation indicate that it is just annotation for one contig, not all in the assembly. Plus the identifier that is included is a mismatch.

What to do:

Clean up the format of your genome fasta assembly.
Find the reference annotation for that full genome assembly, and load it to Galaxy.
Double check that both use the same chromosome identifiers, and that none are missing from either file. Then double check that your BAM data uses the same format, or consider remapping against your updated fasta.
Once your BAM is ready, and you have “matching” annotation available, that is when you can adjust downstream tool form settings to match the annotation’s attributes (9th column).

I know this seems complicated, but you only need to get the assembly and annotation into Galaxy once, then formatted once. Consider putting the files in a dedicated history used just for storing the paired reference datasets. Copy those into histories for mapping and other downstream steps.

So far, you have all of the information you need except for the full matching reference annotation.

Ahmad_B · October 19, 2023, 9:51pm

Hello Jennifer
I tried with same genome ref seq data and annotation file from NCBI and also ensemble genome browser. but I get the zero data same as my previous messages. I do not know what am I doing now?

jennaj · October 20, 2023, 5:26pm

Hi @Ahmad_B

Are the reference genome and reference annotation a “match” this time? And, you remapped against that reference genome (it it was updated at all)?

If you want to post back the first few lines of each again, we can try to help again. You could also post back the job details view if you think the problem is introduced at that step (capture the link from the “i” icon inside one of output datasets).

Ahmad_B · October 20, 2023, 6:47pm

genome file

NC_003070.9 Arabidopsis thaliana chromosome 1 sequence
ccctaaaccctaaaccctaaaccctaaacctctGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG
AATCCCTAAATACCTAAttccctaaacccgaaaccggTTTCTCTGGTTGAAAATCATTGTGtatataatgataattttat
CGTTTTTATGTAATTGCTTATTGTTGTGtgtagattttttaaaaatatcatttgagGTCAATACAAATCCTATTTCTTGT
GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATT
TGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT
GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGgaaaattatttagttg
taGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAGCATTTAT
TCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAAAGGATTGGTGGTTTGAAGACACATCATATCAAAAAAGCTA
TCGCCTCGACGATGCTCTATTTCTATCCTTGTAGCACACATTTTGGCACtcaaaaaagtatttttagatgtttgttttgc
ttctttgaagTAGTTTCTCTTTGCAAAATTCCTCTTTTTTTAGAGTGATTTGGATGATTCAAGACTTCTCGGTACTGCAA

annotation file

Chr1	TAIR10	exon	3631	3913	.	+	.	transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;
Chr1	TAIR10	exon	3996	4276	.	+	.	transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;
Chr1	TAIR10	exon	4486	4605	.	+	.	transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;
Chr1	TAIR10	exon	4706	5095	.	+	.	transcript_id AT1G01010.1; gene_id AT1G01010; gene_name AT1G01010;

Ahmad_B · October 20, 2023, 6:50pm

also tried with this annotation file

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10
#!genome-build-accession NCBI_Assembly:GCF_000001735.3
#!annotation-source TAIR and Araport Araport11
##sequence-region NC_003070.9 1 30427671
##species Taxonomy browser (Arabidopsis thaliana)
NC_003070.9 RefSeq region 1 30427671 . + . ID=id0;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_003070.9 RefSeq gene 3631 5899 . + . ID=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
NC_003070.9 RefSeq mRNA 3631 5899 . + . ID=rna0;Parent=gene0;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580,Genbank:NM_099983.2;Name=NM_099983.2;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1;transcript_id=NM_099983.2

jennaj · October 20, 2023, 7:40pm

Those files are still a mismatch.

What to do:

See if the data provider has a matched set of files
Or, see if there is a mapping file available to use for a Replace tool and use it (scroll down on the tool form to find the available ‘mapping files’): Replace column by values which are defined in a convert file
Get both from a different data provider

Ahmad_B · October 25, 2023, 12:17am

Hello Jennifer
I tried several data but I can not get the trure results. Cn you send me the suitablr ref genome and annotation file for Arabidopsis Thaliana to run it?