Setting a custom database input for tools -- in a workflow or on a tool form

wormball · June 15, 2021, 4:43pm

Hello!

My workflow requires me to upload and select “hg38.analysisSet.fa” every time I run it. This is a waste of time, because the file is already on the server and hardcoded into plenty of our tool XMLs anyway; only MergeBamAlignment wants it supplied explicitly. Can I set hg38.analysisSet.fa as the genome input file for MergeBamAlignment in the workflow editor?

Thanks in advance.

jennaj · June 15, 2021, 5:26pm

Hi @wormball

If you are working at a public Galaxy server, the custom genome can be promoted to a custom build. This will generate an index for the fasta specific to your account – along with a “database” (aka dbkey). It will be included in the list of built-in databases. That database can be set on tool forms (in a workflow or otherwise).

FAQs Galaxy Support
- Start here Preparing and using a Custom Reference Genome or Build
Q&A Search results for 'custom-build' - Galaxy Community Help
- Start here Bowtie 2 - in Workflow no custom reference genome available -- Solution: Create a Custom Build - #2 by jennaj
Tutorials Galaxy Training!
- Start here Adding a custom database/build (dbkey)

If you are working at your own Galaxy server, add/index the custom genome using Data Managers. This creates a new database, available to all users on the same server, and can be set on tool forms (in a workflow or otherwise).

Note: If the fasta has already been loaded and indexed by Data Managers but is not showing up in the list of databases for a particular tool, you may have missed one of the key indexes. Four primary Data Managers should be run for every genome to fully integrate it; tool-specific indexes (for mapping tools, etc.) are then layered on top. See the Q&A “Start here” topic for a clear listing of the recommended minimum DMs to run. If you are not using Data Managers, everything can be done manually; the other admin help links describe what is expected, plus troubleshooting.

FAQs Galaxy Administration
- Start here Admin/DataIntegration
Q&A Search results for 'data-manager' - Galaxy Community Help
- Start here Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting - #2 by jennaj
Tutorials Galaxy Training!
- Start here Reference Genomes in Galaxy

Hope that helps!

wormball · June 17, 2021, 11:19am

Thanks Jennifer! By the way, is there a way to scroll through slide pages like “Reference Genomes in Galaxy” other than with the keyboard? The format may be perfect for giving a lecture on this topic, but I do not think I ever will.

wormball · June 17, 2021, 4:04pm

I tried to follow this Galaxy Community Hub page, but without success. I edited tool-data/all_fasta.loc like this:

hg38.analysisSet.fa	hg38	Human (Homo sapiens): hg38 analysis set	$GALAXYROOT/tools/melanoma_tools/genome/hg38.analysisSet.fa

Then I looked at tool-data/shared/ucsc/builds.txt and saw that it already contains a line for hg38:

hg38	Human Dec. 2013 (GRCh38/hg38) (hg38)

The article gives a pretty obscure instruction on what to do with builds.txt:

To modify the builds.txt file, add a line to $GALAXYROOT/tool-data/shared/ucsc/builds.txt for your genome. The format of this line can vary, but should contain enough information to uniquely identify the genome, the source, any external build nomenclature, and the dbkey selected for use within your Galaxy instance. See the public Main server for examples: https://usegalaxy.org

So I simply restarted Galaxy, and it had no effect.

What am I doing wrong?
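
A minimal sketch for sanity-checking a line like the one added to all_fasta.loc above. It assumes the four tab-separated columns are value, dbkey, display name and path, as in the posted line; `check_loc_line` is a hypothetical helper, not part of Galaxy. Whether Galaxy expands `$GALAXYROOT` inside .loc files is not confirmed anywhere in this thread, so the sketch only flags the `$` as something to look at.

```python
import os

# Assumed column layout of an all_fasta.loc entry (tab-separated).
LOC_COLUMNS = ("value", "dbkey", "name", "path")


def check_loc_line(line):
    """Parse one all_fasta.loc-style line and list likely problems."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LOC_COLUMNS):
        return {}, [f"expected {len(LOC_COLUMNS)} tab-separated fields, got {len(fields)}"]
    entry = dict(zip(LOC_COLUMNS, fields))
    problems = []
    if "$" in entry["path"]:
        problems.append("path contains '$'; a shell variable may be left unexpanded")
    elif not os.path.isfile(entry["path"]):
        problems.append("path does not point to an existing file")
    return entry, problems


# The line from the post above, with explicit tab characters.
line = (
    "hg38.analysisSet.fa\thg38\tHuman (Homo sapiens): hg38 analysis set\t"
    "$GALAXYROOT/tools/melanoma_tools/genome/hg38.analysisSet.fa"
)
entry, problems = check_loc_line(line)
print(entry["dbkey"], problems)
```

Replacing the variable with an absolute path and re-running the check would show whether the file itself is reachable from the Galaxy host.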

jennaj · June 18, 2021, 5:41pm

Work is in progress to improve accessibility. For now, the slides are the only format in which the information is available, unless you want to review the code that generated them. See: Video/slide accessibility improvements · Issue #2494 · galaxyproject/training-material · GitHub

For the other issue (loc/build files), doing this manually is only recommended if you know what you are doing, and even then is not so easy. Try removing anything you have done manually and using the data managers instead.

If your fasta hg38.analysisSet.fa is exactly the same as the hg38 available from UCSC, then you don’t need to use the local fasta from the history for the first indexing step (fetch data) – hg38 can be directly retrieved from UCSC. Or, you can get data from the CVMFS option on that DM (the genome and all indexes that UseGalaxy.org hosts).
If your fasta hg38.analysisSet.fa is NOT exactly the same as the hg38 available from UCSC, then you can still use the “fetch data” DM, but be sure to assign a unique database name (dbkey) or expect other problems. Your unique dbkey will not be in the UCSC builds list because it is not a match, and that doesn’t impact anything else – the genome (or whatever content the fasta represents) will be loaded and indexed by DM and available in your instance. You won’t be able to view data via a link at UCSC – but that is a good thing since the two are not a match and the data would have scientific content problems – or more likely just fail to render at all. But you can use a local IGV – create a custom genome in that for data visualization. Make sure the dbkey you set in Galaxy is the same as the dbkey you set in IGV, that dbkey is assigned to datasets, and make sure the IGV application is open before clicking on the “local IGV” link.
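
One way to test “exactly the same” is to compare checksums of the two files. This is only a sketch: `sha256_of` and `same_bytes` are hypothetical helper names, and both files are assumed to be local and uncompressed. Byte identity is stricter than sequence identity, so files that differ only in line wrapping or letter case will also be reported as different.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True only if both files are byte-for-byte identical."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_bytes("hg38.analysisSet.fa", "hg38.fa")` (hypothetical local paths) returning False would argue for a unique dbkey.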

I added some tags to this topic that link to more help.

wormball · June 22, 2021, 10:46am

It feels like I have to earn a doctoral degree in Galaxy data management simply to feed Galaxy one bloody file. :’(

I installed the “picard index builder” data manager and found it under admin > local data (however, I could not find “fetch data”). In the “Source FASTA Sequence” field it shows only one file, namely the hg38.analysisSet.fa I mentioned in all_fasta.loc (but not the one I have in my history). I clicked “execute”, and it executed my dreams of easy data management. The command line was:

python '/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_picard_index_builder/e1a567161eab/data_manager_picard_index_builder/data_manager/picard_index_builder.py' '/home/transgen/galaxy/database/objects/3/b/c/dataset_3bce0b06-9cc2-4173-9afa-1b2e9a06aec6.dat' --fasta_filename '$GALAXYROOT/tools/melanoma_tools/genome/hg38.analysisSet.fa' --fasta_dbkey 'hg38' --fasta_description 'Human (Homo sapiens): hg38 analysis set' --data_table_name picard_indexes

The error was:

[Tue Jun 22 12:54:57 MSK 2021] picard.sam.CreateSequenceDictionary REFERENCE=/home/transgen/galaxy/database/jobs_directory/000/669/working/dataset_3bce0b06-9cc2-4173-9afa-1b2e9a06aec6_files/hg38.analysisSet.fa OUTPUT=hg38.analysisSet.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Jun 22 12:54:57 MSK 2021] Executing as transgen@transgen-4 on Linux 5.4.0-70-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Picard version: 2.7.1-SNAPSHOT
[Tue Jun 22 12:54:57 MSK 2021] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=514850816
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: /home/transgen/galaxy/database/jobs_directory/000/669/working/dataset_3bce0b06-9cc2-4173-9afa-1b2e9a06aec6_files/hg38.analysisSet.fa
	at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:523)
	at htsjdk.samtools.reference.FastaSequenceFile.<init>(FastaSequenceFile.java:59)
	at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:129)
	at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:82)
	at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:70)
	at picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:152)
	at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:137)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.nio.file.NoSuchFileException: /home/transgen/galaxy/database/jobs_directory/000/669/working/dataset_3bce0b06-9cc2-4173-9afa-1b2e9a06aec6_files/hg38.analysisSet.fa
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.newByteChannel(Files.java:407)
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
	at java.nio.file.Files.newInputStream(Files.java:152)
	at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:519)
	... 9 more
Error building index.
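
One possible reading of the NoSuchFileException, not confirmed in this thread: the `--fasta_filename` argument still contains a literal `$GALAXYROOT`, and if the data manager symlinks that path into its working directory (an assumption), the link is created without complaint but dangles. The sketch below reproduces that behaviour on a POSIX system with a temporary directory; it does not call Galaxy or Picard.

```python
import os
import tempfile

# Target copied from the --fasta_filename argument above; the variable is left unexpanded on purpose.
target = "$GALAXYROOT/tools/melanoma_tools/genome/hg38.analysisSet.fa"

with tempfile.TemporaryDirectory() as workdir:
    link = os.path.join(workdir, "hg38.analysisSet.fa")
    os.symlink(target, link)                 # creating a link never checks its target
    link_present = os.path.lexists(link)     # the link itself exists
    target_readable = os.path.exists(link)   # following the link fails
    try:
        open(link).close()
        error = None
    except FileNotFoundError as exc:
        error = type(exc).__name__

print(link_present, target_readable, error)
```

If this is what happened, an absolute path in the .loc entry (or registering the genome through the fetch data manager) would avoid the dangling link.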

jennaj · June 22, 2021, 4:41pm

Hi,

This DM has to run first.

https://toolshed.g2.bx.psu.edu/view/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421

Yes, the order of the DMs matters, and attempting to manually manipulate the loc files is not easy or recommended. Please try removing anything you have done manually, then run the Data Managers.

Run the first 4 core DMs (used by many tools in non-obvious ways), in this exact order, then layer in other DM as needed for specific tools.

If you do that, all should work out without troubles. You don’t need to use the fasta hg38.analysisSet.fa from the history if it is exactly like UCSC’s hg38 (instead use the option to pull the data from UCSC). And if it isn’t exactly like UCSC’s hg38, then you definitely should not assign that dbkey to the data, or expect problems.

wormball · June 23, 2021, 12:37pm

Quoting jennaj: “And if it isn’t exactly like UCSC’s hg38, then you definitely should not assign that dbkey to the data, or expect problems.”

I think my hg38.analysisSet.fa comes from the UCSC directory /goldenPath/hg38/bigZips/analysisSet and is slightly different from hg38.fa. What is a dbkey, and what should I type in this field? And how can I use files from my history as inputs?

I commented out my changes to all_fasta.loc and restarted Galaxy. However, my hg38.analysisSet.fa is still present in the genome selection list (and the hg38.analysisSet.fa from the history is still absent). When I choose this hg38.analysisSet.fa in “Create DBKey and Reference Genome”, it also gives an error:

python '/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py' '/home/transgen/galaxy/database/objects/c/c/5/dataset_cc5ffff9-35ab-4878-9080-05e9d2b53732.dat' --dbkey_description 'hg38.analysisSet.fa'

Traceback (most recent call last):
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 497, in <module>
    main()
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 478, in main
    tmp_dir=tmp_dir)
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 300, in download_from_ucsc
    url = _get_ucsc_download_address(params, dbkey)
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 260, in _get_ucsc_download_address
    path_contents = _get_files_in_ftp_path(ftp, ucsc_path)
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 65, in _get_files_in_ftp_path
    ftp.retrlines('MLSD %s' % (path), path_contents.append)
  File "/home/transgen/galaxy/database/dependencies/_conda/envs/__python@3.7/lib/python3.7/ftplib.py", line 468, in retrlines
    with self.transfercmd(cmd) as conn, \
  File "/home/transgen/galaxy/database/dependencies/_conda/envs/__python@3.7/lib/python3.7/ftplib.py", line 399, in transfercmd
    return self.ntransfercmd(cmd, rest)[0]
  File "/home/transgen/galaxy/database/dependencies/_conda/envs/__python@3.7/lib/python3.7/ftplib.py", line 365, in ntransfercmd
    resp = self.sendcmd(cmd)
  File "/home/transgen/galaxy/database/dependencies/_conda/envs/__python@3.7/lib/python3.7/ftplib.py", line 273, in sendcmd
    return self.getresp()
  File "/home/transgen/galaxy/database/dependencies/_conda/envs/__python@3.7/lib/python3.7/ftplib.py", line 246, in getresp
    raise error_perm(resp)
ftplib.error_perm: 550 /goldenPath/hg38.analysisSet.fa/bigZips/: No such file or directory
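
The path in the 550 error suggests that, with UCSC as the source, the value entered as the dbkey is inserted directly into the UCSC download path. The sketch below only mimics that path layout as inferred from the error message; `ucsc_bigzips_path` is a hypothetical name, not the data manager's own function.

```python
def ucsc_bigzips_path(dbkey):
    """Build the UCSC directory expected to hold the FASTA for `dbkey` (layout inferred from the error above)."""
    return f"/goldenPath/{dbkey}/bigZips/"


failing = ucsc_bigzips_path("hg38.analysisSet.fa")  # the file name was used as the dbkey
working = ucsc_bigzips_path("hg38")                 # a build name that UCSC actually hosts
print(failing)
print(working)
```

Under this reading, a file name used as the dbkey can only work with a non-UCSC source such as the history or a server directory.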

Quoting jennaj: “Run the first 4 core DMs (used by many tools in non-obvious ways), in this exact order”
- Fasta fetcher – has an option to pick UCSC as the data source.
- SAM indexer
- Picard indexer
- 2bit (twoBit) indexer

I could not find “SAM indexer” or “2bit (twoBit) indexer”, only “data_manager_sam_fasta_index_builder” and “data_manager_twobit_builder” (or “sam_fasta_index_builder” and “twobit_builder_data_manager”, as they are called in the version list). I ran “Create DBKey and Reference Genome” and then “Picard index” (I had not yet discovered the other data managers at that point), and they worked well on sacCer3 and hg38. Then I tried MergeBamAlignment with the newly acquired hg38, and it gave me the error “Do not use this function to merge dictionaries with different sequences in them. Sequences must be in the same order as well”. I think that is because the input files were generated using hg38.analysisSet.fa instead of hg38.fa.
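
To check that suspicion, the sequence names of the two FASTA files can be compared in order. This is a rough sketch with hypothetical helper names. It compares names only (truncated at the first whitespace, in line with TRUNCATE_NAMES_AT_WHITESPACE=true in the Picard log earlier), whereas Picard's own dictionary check may compare more than names.

```python
def sequence_names(fasta_path):
    """Return sequence names in file order (text after '>' up to the first whitespace)."""
    names = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].split()
                names.append(header[0] if header else "")
    return names


def first_difference(names_a, names_b):
    """Return the index of the first mismatch, or None if both lists are identical."""
    for index, (a, b) in enumerate(zip(names_a, names_b)):
        if a != b:
            return index
    if len(names_a) != len(names_b):
        return min(len(names_a), len(names_b))
    return None
```

A result other than None from `first_difference(sequence_names("hg38.analysisSet.fa"), sequence_names("hg38.fa"))` (hypothetical local paths) would be consistent with the MergeBamAlignment error.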

I also found “Generate GATK-sorted Picard indexes”, but I am not sure whether I need it (our workflow uses GATK extensively, but it is GATK4, which has no native Galaxy support as far as I know).

wormball · June 23, 2021, 2:55pm

I finally found where to set input files from history/disk. It is under “Choose the source for the reference genome” > “History” or “Directory from server”.

I then ran “Create DBKey and Reference Genome” for hg38.analysisSet.fa from “Directory from server”. However, it also ran into an error:

python '/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py' '/home/transgen/galaxy/database/objects/5/4/e/dataset_54ec8413-f237-4552-95e9-7eb14010ac69.dat' --dbkey_description 'hg38.analysisSet.fa'

Traceback (most recent call last):
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 497, in <module>
    main()
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 478, in main
    tmp_dir=tmp_dir)
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 348, in copy_from_directory
    fasta_readers = get_stream_reader(open(input_filename), tmp_dir)
  File "/home/transgen/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/4d3eff1bc421/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.py", line 231, in get_stream_reader
    if tarfile.open(fileobj=StringIO(start_of_file)):
TypeError: a bytes-like object is required, not 'str'
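
The TypeError comes from `tarfile.open(fileobj=StringIO(start_of_file))`: under Python 3 (3.7 in the earlier traceback), tarfile needs a bytes stream, while the script passes text. Presumably the script was written for Python 2. The following is a sketch of a bytes-safe version of such a probe, not the data manager's actual fix; `looks_like_tar` and `probe_size` are hypothetical names.

```python
import io
import tarfile


def looks_like_tar(path, probe_size=1 << 20):
    """Return True if the start of `path` can be opened as a (possibly compressed) tar archive."""
    with open(path, "rb") as handle:           # binary mode yields bytes, not str
        start_of_file = handle.read(probe_size)
    try:
        with tarfile.open(fileobj=io.BytesIO(start_of_file)):
            return True
    except (tarfile.TarError, EOFError):       # EOFError: a truncated compressed probe
        return False
```

A plain FASTA file makes this return False instead of raising, which is the behaviour the “Directory from server” branch appears to need.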

Also, all the data managers show me “Requested version unavailable.” in an orange box. I do not know what that means, but it has no other tangible effect.

Then I ran it again with “Create symlink to original data instead of copying”, and it surprisingly succeeded! I ran the three other data managers and then ran MergeBamAlignment, which also worked fine.

So I think the right steps are as follows:

- Install data_manager_fetch_genome_dbkeys_all_fasta, data_manager_picard_index_builder, data_manager_sam_fasta_index_builder, and data_manager_twobit_builder.
- Go to admin > server > local data.
- Run Create DBKey and Reference Genome with “Choose the source for the reference genome” set to “Directory from server” (with the path to the corresponding file) and “Create symlink to original data instead of copying” set to “yes”. I still do not know what “dbkey” and the other fields should be, so I set them equal to the file name. Maybe “Choose the source for the reference genome” > “History” works too, but I have not tried it yet.
- Run SAM FASTA index builder, Picard index builder, and TwoBit builder with your genome (however, I am not sure whether the latter two are needed in my particular case).
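
The run order can be written down as data and checked before launching anything. This is a sketch under assumptions: the repository names are the ones listed in the steps, the order is the one quoted from jennaj earlier in the thread, and `order_problems` is a hypothetical helper that does not talk to Galaxy.

```python
# Expected order: the fetch step first, then the three index builders.
EXPECTED_ORDER = (
    "data_manager_fetch_genome_dbkeys_all_fasta",
    "data_manager_sam_fasta_index_builder",
    "data_manager_picard_index_builder",
    "data_manager_twobit_builder",
)


def order_problems(planned):
    """Return human-readable problems with a planned list of data manager runs."""
    problems = [f"{name} is missing" for name in EXPECTED_ORDER if name not in planned]
    fetch = EXPECTED_ORDER[0]
    if fetch in planned and planned[0] != fetch:
        problems.append(f"{fetch} must run first")
    positions = [planned.index(name) for name in EXPECTED_ORDER[1:] if name in planned]
    if positions != sorted(positions):
        problems.append("index builders are not in the listed order")
    return problems
```

An empty list from `order_problems(list(EXPECTED_ORDER))` means the plan matches the recommended sequence.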
