Reference genome not in database offered by galaxy - Solution: Use a custom genome

Hi there,

I need to upload the FASTA file of my reference genome to Galaxy as a dataset and change the settings in my history so that I can select that reference genome when I import gene data for the organism I performed RNA-seq analysis on. The species I need is not in the standard database that Galaxy lets you choose from when importing data. Any help will be greatly appreciated.

Kind regards

J


Hi @jasmus

For help with custom genome formatting as fasta, please see: Datatypes - Galaxy Community Hub


Hi there @jennaj

Thanks, I did try to upload the FASTA file of the genome I downloaded from NCBI to Galaxy as a database, but the FTP upload is not working. I used FileZilla and I cannot connect to the Galaxy server. I made a separate post about this issue, but it does not seem to have been resolved yet. Any help would be greatly appreciated.

Kind regards

J


Hi @jasmus

There are a few more details in this topic about using FTP. Maybe it will help? I also added another tag to your post that will find most prior Q&A about FTP.

Hi @jennaj

Sorry for the trouble, but I am still not able to upload files via FTP. My exact problem is that I want to do a PLS-DA on gene count data for an organism that is not in the database that Galaxy offers. This is a problem because you need to specify the organism when you import datasets. I understand this is so that the program can recognize the genes and annotate them correctly. I really just want to resolve this as soon as possible. If you could show me exactly how to solve this in clear, stepwise instructions, I would truly appreciate it.

P.S. FileZilla either gives me a timed-out error or tells me my login is not correct. As the other poster commented, there is nothing wrong with my login details, and it can’t be a waiting period issue either, since it has been quite some time since I last tried to upload with FTP.

Kind regards

J


Hi @jasmus

You don’t need to specify the “datatype” or “database” when importing data. Most of the time it is actually better to leave both at their defaults when using the Upload functions.

Why?

  • If Galaxy guesses the wrong datatype, that could indicate a format problem.
  • The exact genome build/version usually does not apply to starting data (reads – these are associated with an organism, but not a genome build/version aka database). Once mapped or other downstream tools are used, then the result would be associated with a database (coordinates differ by the original database assembly).
  • Most tools do not require that a database is assigned in order to use a custom genome. If one does, or you just want to label your data with specificity, then load the custom genome (in fasta format) and promote it to a custom build in Galaxy. This will create a new “database” specific to your account, and you can assign it to any dataset. I added tags to this post that point to other Q&A plus FAQs that cover the “how-to”.
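Before loading a custom genome, a quick local check of the fasta file can save a round trip. The following is a minimal sketch, assuming the usual custom genome advice applies (unique sequence identifiers, title lines reduced to the identifier alone, no blank lines). The function name and the sample records are made up for illustration and are not part of Galaxy.

```python
def clean_fasta_headers(lines):
    """Return FASTA lines with each title line cut down to its identifier.

    Blank lines are dropped. Raises ValueError if a title line has no
    identifier or if two sequences share the same identifier.
    """
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("title line with no identifier")
            seq_id = fields[0]  # keep only the text before the first space
            if seq_id in seen:
                raise ValueError(f"duplicate identifier: {seq_id}")
            seen.add(seq_id)
            cleaned.append(">" + seq_id)
        else:
            cleaned.append(line)
    return cleaned


# Hypothetical two-sequence genome, for illustration only
example = [
    ">scaffold_1 made-up description text\n",
    "ACGTACGT\n",
    "\n",
    ">scaffold_2 another made-up description\n",
    "GGCCTTAA\n",
]
print("\n".join(clean_fasta_headers(example)))
```

The cleaned lines can be written back to a new file and uploaded in place of the original download.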

But first, you need to get the data into Galaxy! Here are a few questions to troubleshoot FTP. Paste back the values you are using and we might be able to spot the problem.

  1. What is the URL of the Galaxy server you are working on?
  2. What are you entering in Filezilla for the “Host” field?
  3. Are you using your account’s email address and password (same as used on that exact same server) for the “username” and “password” fields? Don’t post the values back here – just confirm that is what you are using. Some people try to use their “public name” in Galaxy for the “username” in Filezilla – that won’t work – you need to use your full email address (this is a case-sensitive value – Me@school.edu is NOT the same as me@school.edu).

The size of the dataset(s) probably does not matter until you can connect at all. For reference, the maximum size is 50 GB for most data and ~35 GB for others (BAM in particular, possibly others).

If Filezilla itself is connected to the server, then times out during the data transfer, you can “resume” the transfer – log in with Filezilla to the target server, accept any security certificates, and find the resume function to start again. Note that this won’t work if you quit out of Filezilla – the app needs to stay open. If the data is large and your connection is slow, you might need to do this a few times. In those cases, sometimes this works better if you transfer only one file at a time – but much depends on your internet upload speed.

There is no waiting period or throttling for data transfer rates at the usegalaxy.* servers that I know of. Other public Galaxy sites might have that in place though. Or, your internet provider might throttle.

Let’s start from there to troubleshoot this more :rocket:

Hi there @jennaj

  1. The URL I am using is https://workflow4metabolomics.usegalaxy.fr/ (I basically just want to use the Workflow4Metabolomics steps to analyse my data by PLS-DA analysis).

  2. I enter this into the FileZilla host field ftp://usegalaxy.org

  3. I am using the exact details that I used to set up my account yes. I am using my educational institute email and password. It just tries to connect me and then has an “error timed out” output.

Here is the error event log:

Status: Selected port usually in use by a different protocol.
Status: Resolving address of usegalaxy.org
Status: Connecting to 129.114.60.56:22…
Error: Connection timed out after 20 seconds of inactivity
Error: Could not connect to server
Status: Waiting to retry…
Status: Resolving address of usegalaxy.org
Status: Connecting to 129.114.60.56:22…
Error: Connection timed out after 20 seconds of inactivity
Error: Could not connect to server

Regarding the upload of files: there is no issue when I upload a table of gene names and their associated counts (basically a feature counts table in CSV format). It uploads fine and runs fine in the PCA tool. This is because the organism (S. cerevisiae) is in your database. But when I upload a similar table in CSV format for a yeast species that is not in your database, the gene names are not recognised and an additional row appears above the gene names, reading “V1, V2, V3, V4…V5689” and so on. If I can stop that from happening, I can do the analysis for that yeast species too. I hope this makes sense and that we can figure out a solution. Thank you for your help and patience thus far, I appreciate it.

Kind regards

J

Hi @jennaj

I have imported the custom genome and allowed it in my list of genomes now as an option to select. I am busy trying to run PLS-DA analysis as mentioned previously as part of the Workflow4Metabolomics package that uses Galaxy servers and the ropls package. So, I am trying to do this for datasets that I have constructed exactly as the example shows. However, my datasets are in csv file format. I am getting the following error though when I try and run this analysis on Galaxy. If you could please help I would greatly appreciate it. The error reads as follows:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 19 did not have 5 elements
Calls: readAndCheckF -> read.table -> scan
Execution halted

Any help would really be greatly appreciated.

Kind regards

J

Hi @jasmus

The FTP connection didn’t work for two reasons (items 1 and 2 below); item 3 is general guidance:

  1. The FTP URL is always specific to the server you are working on, and your account there. I don’t see the current FTP URL on their site, but you could ask them. Usually, it is the last part of the server URL (for this case that would be “usegalaxy.fr”) – but NOT always. They have a forum specific for their community where the information may be posted, see: Workflow4Metabolomics - Galaxy Community Hub

  2. If you are ever working at UseGalaxy.org, the FTP URL is “usegalaxy.org” – without the “ftp://” at the start.

  3. This help applies to most public Galaxy servers that support FTP. The FTP host is usually the last part of the server URL; if not, the help on the server itself or on its directory page will specify what to use or how to ask about details. All known public Galaxy servers are in the directory here (some do not use this forum, or have a supplemental forum or contact): Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub
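The checks above can be sketched as a small script to run against whatever was typed into the FTP client. This is only an illustration: the function name is made up, the “last part of the server URL” rule is a rule of thumb and not a guarantee, and the port check relies only on the well-known defaults (21 for plain FTP, 22 for SSH/SFTP). The event log posted earlier shows a connection attempt on port 22, so that is worth confirming with the server admins as well.

```python
from urllib.parse import urlsplit

SFTP_PORT = 22  # well-known SSH/SFTP default; plain FTP defaults to 21


def check_ftp_settings(host_field, server_url, port=None):
    """Return (bare_host, notes) for the values entered in the FTP client.

    host_field: text typed into the client's Host field
    server_url: URL of the Galaxy server used in the browser
    port:       port entered in the client, if any
    """
    notes = []
    host = host_field.strip()
    if "://" in host:
        scheme, host = host.split("://", 1)
        notes.append(f"drop the '{scheme}://' prefix from the Host field")
    host = host.rstrip("/").lower()
    server_host = (urlsplit(server_url).hostname or "").lower()
    # Rule of thumb only: the FTP host is often the tail of the server's hostname
    if not host or not server_host.endswith(host):
        notes.append(
            f"'{host}' is not part of {server_host}; confirm the FTP host for that server"
        )
    if port == SFTP_PORT:
        notes.append("port 22 is the SSH/SFTP default; plain FTP normally uses 21")
    return host, notes


# Values taken from this thread
host, notes = check_ftp_settings(
    "ftp://usegalaxy.org",
    "https://workflow4metabolomics.usegalaxy.fr/",
    port=22,
)
print(host)
for note in notes:
    print("-", note)
```

An empty list of notes does not prove the settings are right. It only means none of these three common slips was detected.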

What happens if you convert csv to tabular format and input that instead? Click on the pencil icon for the dataset to reach the edit attributes forms, click into the second tab named “Convert”, and find the option in the drop-down menu. Only a few tools interpret csv (comma-separated values) and most expect tabular (tab-separated values).
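If you would rather convert the file on your own computer before uploading, here is a minimal sketch using only Python’s standard library. It also reports rows whose number of fields differs from the header row. A “line 19 did not have 5 elements” message from read.table generally points at that kind of ragged row, though I am only assuming that is what happened here. The function name and sample table are made up for illustration.

```python
import csv
import io


def csv_to_tabular(csv_text):
    """Convert comma-separated text to tab-separated text.

    Returns (tabular_text, ragged), where ragged lists the 1-based row
    numbers whose field count differs from the header row.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    expected = len(rows[0]) if rows else 0
    ragged = [i for i, row in enumerate(rows, start=1) if len(row) != expected]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue(), ragged


# Made-up count table; the last row is missing a value on purpose
sample = "gene,sample1,sample2\ngeneA,10,12\ngeneB,7\n"
tabular, ragged = csv_to_tabular(sample)
print(tabular, end="")
print("rows with an unexpected number of fields:", ragged)
```

Any row numbers reported are worth opening in a text editor: a stray comma inside a gene name or a missing value is a typical cause.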

The tool form will usually note what is expected/supported in the help section with examples. Another quick way to find out for any tool: create a new empty history, then load up a tool form. The options on the form where input datasets are selected will list out the expected input datatype(s).

That server is set up a bit differently than other public servers, with special tools/workflows etc. Their forum is probably the best place to ask questions if you run into problems. There could be a known problem or you might be the first to report something they could fix.

Hope that helps!