I am using HISAT2, but my reference genome Rickettsia rickettsii is missing in the dropdown list.

Sonenshine · July 11, 2024, 7:14pm

Can someone in Galaxy add the genome for Rickettsia rickettsii to the dropdown list for HISAT2 for the reads to

jennaj · July 11, 2024, 7:55pm

You can map to any reference genome that you want to within Galaxy. You can also create a custom database key to assign to datasets!

FAQ covered in this prior topic.

If you are planning an analysis that includes reference annotation, it is a good idea to get that data at the same time and to make sure that all the reference files work together. This other guide can be useful even if you are not doing differential expression analysis, since it explains how tools use important data labels such as chromosome identifiers.

FAQ: Extended Help for Differential Expression Analysis Tools

Hope this helps!

Sonenshine · July 12, 2024, 8:03pm

Hi Jennifer:

I tried all of the tools you suggested. The custom genome build linked to UCSC. All I could find there were eukaryote genomes and Archaea, but NO prokaryotes.

I could not find a tool called Custom Reference Genome.

The Reference Genome I am trying to install is: Rickettsia rickettsii SSR24105541.

Any other suggestions?

Thanks

Daniel


jennaj · July 12, 2024, 8:50pm

Hi @Sonenshine

I’ll try explaining again, let’s start with this part:

UCSC will not host every genome. Other common sources can include NCBI, and sometimes smaller labs.

How I found this one:

I used just an internet browser search to find → KEGG GENOME: Rickettsia rickettsii R.
That genome card includes the GenBank accession, which points to → Rickettsia rickettsii str. R genome assembly ASM83152v1 - NCBI - NLM
The FTP link at GenBank is a directory with the files you will want (genome fasta + genome GTF reference annotation) → Index of /genomes/all/GCF/000/831/525/GCF_000831525.1_ASM83152v1
The README in that directory explains what each file contains. → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/831/525/GCF_000831525.1_ASM83152v1/README.txt

The exact files you will want are these. Copy the URLs, paste them into the Upload tool, and keep all of the default settings at this step. You can paste in both at the same time (one per line), or upload them separately. You can also explore the other files in that directory, but these two are the baseline for most analysis usually done in Galaxy, and they will work with the most tools.

Reference genome in fasta format → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/831/525/GCF_000831525.1_ASM83152v1/GCF_000831525.1_ASM83152v1_genomic.fna.gz
Reference annotation in GTF format. → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/831/525/GCF_000831525.1_ASM83152v1/GCF_000831525.1_ASM83152v1_genomic.gtf.gz
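For anyone who wants to script this step for another assembly, here is a minimal sketch of how the two URLs above are put together. It assumes the usual NCBI directory layout (the accession digits split into groups of three) and an assembly name without spaces; the function name ncbi_assembly_urls is made up for illustration and is not part of any NCBI or Galaxy tool.

```python
BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def ncbi_assembly_urls(accession: str, assembly_name: str) -> dict:
    """Build the genome FASTA and GTF URLs for one NCBI assembly.

    accession is e.g. "GCF_000831525.1"; assembly_name is e.g. "ASM83152v1".
    Assumption: the assembly name contains no spaces or special characters.
    """
    prefix, number = accession.split("_")      # "GCF", "000831525.1"
    digits = number.split(".")[0]              # "000831525"
    # The directory path splits the digits into groups of three.
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    stem = f"{accession}_{assembly_name}"
    directory = "/".join([BASE, prefix, *parts, stem])
    return {
        "fasta": f"{directory}/{stem}_genomic.fna.gz",
        "gtf": f"{directory}/{stem}_genomic.gtf.gz",
    }


urls = ncbi_assembly_urls("GCF_000831525.1", "ASM83152v1")
print(urls["fasta"])
print(urls["gtf"])
```

The two printed lines can be pasted into the Upload tool, one per line. It is still worth opening the directory in a browser first to confirm that both files exist for the assembly you picked.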

For this part:

Right, this is a function, not a specific tool, sorry if that wasn’t clear!

For the how-to: I just updated this a few months ago, so it should be current, and the process is general enough that it should work at any Galaxy server. But if you find a server where it doesn’t work, share back the server URL and we can try to help with it.

Guide with details → FAQ: How to use Custom Reference Genomes?

Click into this FAQ for how it works. If you get the data into your history, try to prepare it, and something goes wrong, we can help more. It is all just a few steps, and you only need to do it once; then you can use the data with any tool.

Thanks!

Sonenshine · July 13, 2024, 3:52pm

Hi Jennifer:

Thanks so much for taking the time and finding all these links for me. Most grateful!

I have begun uploading the files as per your recommendation. The next step will be to see if HISAT2 will recognize the files and accept them as the reference genome. If there are annotation problems, I will redirect to the annotation tutorial and try to ID all the necessary features so that both sample and reference genomes have the same annotations.

Will let you know.

Daniel


Sonenshine · July 13, 2024, 8:19pm

Hi Jennifer:

Success, in part. I got HISAT2 to work, followed by MultiQC. Results showed 64% mapping between the two genomes, which is pretty much what I expected since there is considerable biological difference between them (even though they are members of the same genus).

The next step, featherCounts, is blocked. According to the training description, it is optimized for the mouse genome. I checked the help section which offered a tool called “ggfreads” which is supposed to do the same tasks as featherCounts. However, my version of Galaxy doesn’t recognize it.

Any suggestions how to solve this block?

Thanks

Daniel


jennaj · July 15, 2024, 10:18pm

Hi @Sonenshine

Do you mean the tool featureCounts? As far as I know, you can use that tool when counting up any genome’s annotation data versus hits.

The other tool you mention, gffread, is for reformatting or parsing reference annotation. If you prepared your annotation already when loading up the data, you shouldn’t need to do that again.

Now I’m curious where you are running Galaxy! But seriously, if you want to compare to how the tools are set up and working at a public Galaxy server, that is usually helpful when working somewhere else. And, if you are following a GTN tutorial, check the “Available at these Galaxies” pull down menu at the top – those are the servers known to support that specific tutorial, and would be the best comparison choices. If you are following some other tutorial, try at one of the UseGalaxy servers to start with.

Sonenshine · July 16, 2024, 6:49pm

Hi Jennifer:

I got featureCounts to work and I was able to progress through the entire tutorial, ending up with mapped counts. Next, I wanted to do a volcano plot and heat map, but the second tutorial, counts to genes, is optimized for mouse and would not work with the bacteria I uploaded. The annodata tool in the tutorial Counts to Genes shows three columns, Entrez ID, symbol and gene annodata. That seems to be what limma needs to work with.

How can I get to limma-voom looking like the attached file, with EntrezID, gene ID, logFc2, pvalue,…etc.? Can you recommend a different tool for this purpose?

Thanks

Daniel

(attachments)

jennaj · July 17, 2024, 9:36pm

Hi @Sonenshine

Limma also works with any genome. XRef: Bioconductor Forum (query=limma)

However, AnnotateMyIDs is different. This tool maps from standardized gene identifiers over to public data.

You can map other gene identifiers over to ENSEMBL, or use an ENSEMBL based annotation from the start. People do both.
The identifiers supported by the public data repositories are under the listing for ID Type.
A more complicated way to use this tool is to create and use your own mapping.
All of this would be true no matter where you used this tool, and any reference genome or set of original gene identifiers.

More about that custom mapping function is on the tool form Help:

This tool uses the select function from the Bioconductor AnnotationDbi package. Note that if you request columns that have multiple matches for your IDs, select will return one row in the output for each possible match. This has the effect that if you request multiple columns and some of them have a many-to-one relationship to the IDs, things will continue to multiply accordingly. So it’s not a good idea to request a large number of columns unless you know what you are asking for should have a one-to-one relationship with the initial set of IDs. In general, if you need to retrieve a column like GO or KEGG, that has a many-to-one relationship to the original IDs, it is most useful to extract that separately.
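To make the row multiplication concrete, here is a small sketch that imitates the behavior described in that Help text. It does not call AnnotationDbi; the lookup tables, gene IDs, terms, and the function name select_rows are all invented for illustration.

```python
from itertools import product

# Hypothetical lookup tables: one gene can have several matches per column.
go_terms = {"geneA": ["GO:1", "GO:2", "GO:3"], "geneB": ["GO:4"]}
pathways = {"geneA": ["path1", "path2"], "geneB": ["path3"]}


def select_rows(gene_ids, *columns):
    """Return one row per combination of matches across the requested columns."""
    rows = []
    for gene in gene_ids:
        matches = [col.get(gene, ["NA"]) for col in columns]
        for combo in product(*matches):
            rows.append((gene, *combo))
    return rows


both = select_rows(["geneA", "geneB"], go_terms, pathways)
go_only = select_rows(["geneA", "geneB"], go_terms)
print(len(both))     # 7 rows: 3 x 2 for geneA, plus 1 for geneB
print(len(go_only))  # 4 rows: 3 for geneA, plus 1 for geneB
```

Requesting both columns at once gives geneA six rows instead of three, which is why the Help text suggests extracting a multi-match column on its own.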

In short, you don’t need to use AnnotateMyIDs to use Limma. You just need a three column file with the gene annotations. Plus, you can skip providing that file entirely (for those that don’t want the extra Glimma labeled plots). A regular volcano plot is output with either.

So, since you do want those plots, you will need to create the gene annotation input. It has a specific format. If you don’t have the content, use NA.

The Help on the tool form has more:

Gene Annotations: Optional input for gene annotations, this can contain more information about the genes than just an ID number. The annotations will be available in the differential expression results table and the optional normalised counts table. They will also be used to generate interactive Glimma Volcano, MD plots and tables of differential expression. The input annotation file must contain a header row and have the gene IDs in the first column. The second column will be used to label the genes in the Volcano plot and interactive Glimma plots, additional columns will be available in the Glimma interactive table. The number of rows should match that of the counts files, add NA for any gene IDs with no annotation. The Galaxy tool annotateMyIDs can be used to obtain annotations for human, mouse, fly and zebrafish.

Example:

GeneID	Symbol	GeneName
11287	Pzp	pregnancy zone protein
11298	Aanat	arylalkylamine N-acetyltransferase
11302	Aatk	apoptosis-associated tyrosine kinase
11303	Abca1	ATP-binding cassette, sub-family A (ABC1), member 1
11304	Abca4	ATP-binding cassette, sub-family A (ABC1), member 4
11305	Abca2	ATP-binding cassette, sub-family A (ABC1), member 2

A GTF won’t have the longer descriptions, and I can’t remember if yours has the attributes for gene_name (where the gene symbol is usually stored). But maybe you can find that data from the same place you sourced the GTF.
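If you would rather build that three-column file with a script than with Galaxy text tools, a rough sketch follows. It assumes the gene symbol sits in a gene_name or gene attribute and the description in a product attribute, which may not be true for your GTF, so check a few lines of your file first. The example lines and the function name gene_annotation_rows are made up.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def gene_annotation_rows(gtf_lines, symbol_keys=("gene_name", "gene"), desc_key="product"):
    """Collect GeneID, Symbol, GeneName per gene_id, using NA where nothing is found."""
    table = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if not gene_id:
            continue
        row = table.setdefault(gene_id, {"symbol": "NA", "name": "NA"})
        for key in symbol_keys:
            if row["symbol"] == "NA" and attrs.get(key):
                row["symbol"] = attrs[key]
        if row["name"] == "NA" and attrs.get(desc_key):
            row["name"] = attrs[desc_key]
    header = ["GeneID\tSymbol\tGeneName"]
    return header + [f"{g}\t{r['symbol']}\t{r['name']}" for g, r in table.items()]


# Made-up records, only to show the expected input shape.
example = [
    'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "g1"; gene "dnaA";',
    'chr1\tsrc\tCDS\t1\t900\t.\t+\t0\tgene_id "g1"; product "initiator protein";',
    'chr1\tsrc\tgene\t1000\t1500\t.\t-\t.\tgene_id "g2";',
]
rows = gene_annotation_rows(example)
print("\n".join(rows))
```

Genes with no symbol or description keep NA in those columns, which matches what the tool Help asks for.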

The idea with Limma is that it is only interested in the gene_id values, which it interprets from the count files.

Then, if you do include the extra annotation, Limma expects the same set of gene_ids across all inputs (first column of both the count files + the optional annotation file), then it will link everything together to label the extra plots.
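A quick way to check that requirement before running the tool is to line the annotation up against the gene IDs of one count file. This is only a sketch with toy data; the function name align_annotation and the dictionary layout are assumptions for the example, not something Limma or Galaxy provides.

```python
def align_annotation(count_ids, annotation):
    """Return annotation rows in count-file order, padding missing genes with NA.

    count_ids: gene IDs from the first column of a count file.
    annotation: dict mapping gene ID -> (symbol, name).
    Also returns the IDs that are annotated but absent from the counts.
    """
    rows = [(g, *annotation.get(g, ("NA", "NA"))) for g in count_ids]
    extra = sorted(set(annotation) - set(count_ids))
    return rows, extra


# Toy data: g2 has no annotation, g9 is annotated but was never counted.
counts = ["g1", "g2", "g3"]
annot = {"g1": ("dnaA", "initiator protein"), "g3": ("NA", "NA"), "g9": ("x", "y")}
aligned, unused = align_annotation(counts, annot)
for row in aligned:
    print("\t".join(row))
print("not in counts:", unused)
```

The aligned rows have exactly one line per counted gene, in the same order, so the row count matches the count files once a header is added.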

Some help for using tools to parse files into a desired format → Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Introduction to Galaxy Analyses

Public tools don’t pre-index every genome but there is usually a way to use the tool anyway, it just takes some extra data preparation to replace the missing index.

Hope this helps!

Sonenshine · July 17, 2024, 11:16pm

Hi Jennifer:
Thanks a bunch. You have provided many options for me to try.
Another option would be the last step in the tool reads to counts (under transcriptomics). As I recall, it had 3 columns with the last one the counts.
Your opinion?
Thanks again.
Daniel

jennaj · July 18, 2024, 12:40am

Hi @Sonenshine

I’m not sure I understand … could you explain a bit more about this? I know the step, just not what your question about it is. Thanks!

Sonenshine · July 18, 2024, 7:58pm

Hi Jennifer:

I got HISAT2 to work OK. My next question concerns whether I can use the output from the tutorial “reads to counts” to populate the next tutorial, “counts to genes with Limma voom”? The counts to genes suite of tools needs the input in the form of columns, including gene ID, the counts recorded for each match, …etc.

One of my colleagues here at NIH said I need to include two sample genomes with identical column format for limma to work?

Is this correct?

Thanks

Daniel

jennaj · July 19, 2024, 5:56pm

Hi @Sonenshine

Thanks for explaining!

You need to generate counts for both DESeq2 and Limma, and the same counts files can be used with both, so you only need to do that once.

The counts should be against the same reference genome and use the same reference annotation for all of your samples.

What to do from here: After mapping with HISAT2, use featureCounts or htseq-count for the counting step.

Then use the differential expression tools: DESeq2 or Limma.

Both of those tools require at least two factor groups, each with at least two count files. That means a minimum of four count files.
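That minimum is easy to check before you start. The sketch below counts the files per group and reports anything that falls short; the sample names, group labels, and the function name check_design are invented for the example.

```python
from collections import Counter


def check_design(sample_groups, min_groups=2, min_replicates=2):
    """Return a list of problems with a sample -> group assignment (empty if fine)."""
    sizes = Counter(sample_groups.values())
    problems = []
    if len(sizes) < min_groups:
        problems.append(f"need at least {min_groups} groups, found {len(sizes)}")
    for group, n in sorted(sizes.items()):
        if n < min_replicates:
            problems.append(f"group {group!r} has {n} count file(s), need {min_replicates}")
    return problems


# Hypothetical design: four count files, two per group.
ok_design = {"s1": "control", "s2": "control", "s3": "treated", "s4": "treated"}
# Hypothetical design with a single file in the treated group.
short_design = {"s1": "control", "s2": "control", "s3": "treated"}

print(check_design(ok_design))
print(check_design(short_design))
```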

In the tutorial’s Limma example, the counts are formatted into a matrix. That is just one way to use the tool. Use the Count Files or Matrix? toggle to input separate files instead. The result is the same either way.
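If you do want the matrix form, the idea is simply one row per gene and one column per sample. Here is a minimal sketch with toy numbers, assuming each count file has already been read into a gene_id to count mapping; the function name build_count_matrix and the sample names are made up.

```python
def build_count_matrix(per_sample):
    """Merge {sample: {gene_id: count}} into tab-separated matrix lines."""
    samples = list(per_sample)
    gene_sets = [set(counts) for counts in per_sample.values()]
    if any(s != gene_sets[0] for s in gene_sets):
        raise ValueError("count files do not share the same gene IDs")
    genes = list(per_sample[samples[0]])  # keep the gene order of the first file
    lines = ["GeneID\t" + "\t".join(samples)]
    for gene in genes:
        values = "\t".join(str(per_sample[s][gene]) for s in samples)
        lines.append(f"{gene}\t{values}")
    return lines


# Toy counts for two samples and two genes.
toy = {
    "control_1": {"g1": 10, "g2": 0},
    "treated_1": {"g1": 25, "g2": 3},
}
matrix = build_count_matrix(toy)
print("\n".join(matrix))
```

The check for identical gene IDs is the same condition mentioned earlier: every sample has to be counted against the same reference annotation.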

The tutorial does some more advanced data organization but that isn’t required.

Hope this helps!

Sonenshine · July 20, 2024, 11:59am

Great. That gives me new tools to test and move forward. Will try them ASAP

Daniel

jennaj · August 23, 2024, 11:38pm

2 posts were split to a new topic: FastQC Troubleshooting
