Recursive download of multiple files with "wget" or "curl"

Hi all,

I’m trying to download a large collection of files, and the download seems to fail at arbitrary points. I’ve seen this issue come up in multiple posts, and the usual solution is to download the files individually. However, doing this manually for dozens of files is quite problematic - not to mention that I eventually also need to track all the downloads to ensure they complete. So I was wondering: is there a tool or command that can fetch the download links of all files in a collection, so I can feed them to “wget” one by one via a script instead of endless manual clicking?

Thank you in advance for your help!

Hello @Evgeni_Bolotin

Yes, you can use curl or wget. This requires that you capture all the links first, so it can also be a bit tedious, but the choice is yours. How to → FAQ: Downloading datasets using command line
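Once the links are captured (here assumed to be one per line in a `urls.txt` file - the file name and retry count are my assumptions, not a Galaxy convention), a small script can fetch each one and retry transient failures. A minimal sketch:

```python
# Minimal sketch: fetch every URL listed in urls.txt (one per line),
# retrying each a few times with a simple backoff. The urls.txt name,
# output directory, and retry count are assumptions.
import os
import time
import urllib.request

def fetch_all(url_file="urls.txt", out_dir="downloads", retries=3):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    for url in urls:
        dest = os.path.join(out_dir, os.path.basename(url))
        for attempt in range(1, retries + 1):
            try:
                urllib.request.urlretrieve(url, dest)
                break  # success: move on to the next URL
            except OSError:
                if attempt == retries:
                    raise  # give up on this URL after the last attempt
                time.sleep(2 * attempt)  # back off before retrying
```

The same loop structure works with `wget -c` or `curl -C -` if you prefer shelling out, since those tools can also resume partial transfers.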

Instead, a very fast way to download all of the data at once is to do this:

  1. Copy just the files you want to download into a new simple history. This can be a collection! The Copy Datasets function will copy over all of the dataset elements in a collection when just the top level collection folder is copied. Give your new history a meaningful name!

  2. Use the Export History to File function for your new history. This compressed archive file can be downloaded or moved to other Galaxy servers or other cloud storage locations (for example, to back up data or to temporarily free up disc space in your account).

  3. The downloaded history archive is simply a compressed .tar.gz directory. Uncompress it, and all of your datasets will be inside!
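Step 3 above can be done with `tar -xzf` on the command line, or in a short script. A sketch, assuming the archive was saved as `history_export.tar.gz` (the actual file name will vary):

```python
# Sketch of step 3: unpack an exported Galaxy history archive.
# The archive name is an assumption; use whatever file you downloaded.
import tarfile

def unpack_history(archive="history_export.tar.gz", dest="history_export"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=dest)  # datasets and metadata end up under dest/
        return sorted(tar.getnames())
```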

Please let us know if this actually helps or not. Downloading individual files, even as part of a batch, will always be less efficient than using a dedicated compression format designed for data streaming. Think of the “individual file” methods as convenience features: best suited for small, one-off purposes. :slight_smile:



XREF 1 → Export data as compressed file - #4 by jennaj

XREF 2 → For the reverse, the command-line method for moving data around in batch: GitHub - galaxyproject/galaxy-upload: Galaxy Upload Utility for the CLI

XREF 3 → Finally, the Galaxy API can be used for all Galaxy functions inside of simple or sophisticated scripts that could include custom error trapping, rerun logic, file size/integrity checks, etc. Galaxy API Documentation — Galaxy Project 25.0.2.dev0 documentation

Hi Jenna,

Thank you for your help. I agree that a batch download would be more efficient. Unfortunately, it still runs into the problems I outlined above: downloads of large collections seem to stop at some arbitrary point, whether I use the CLI or the browser. Exporting such collections to a history archive can take a particularly long time, and I’m not sure it won’t fail mid-download as well.

The only reliable solution I have seen on forums dedicated to Galaxy is to download the files in such collections one by one, since smaller downloads seem to work well. Eventually, I cobbled together a solution: retrieve the JSON data of the desired collection via ‘parsec’, parse the name and internal Galaxy ID of each file in the collection from the JSON, and feed this information in a loop to ‘curl’ to download each file of the collection.
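For anyone wanting to reproduce something like this, here is a hypothetical sketch of the parsing half. The JSON below is a canned stand-in for what parsec returns for a collection - the exact field names and the `/api/datasets/<id>/display` route are assumptions, so check your own parsec output and server before relying on them:

```python
# Hypothetical sketch: parse a collection's JSON for element names and
# dataset ids, then build per-dataset download URLs. The JSON shape and
# the /api/datasets/<id>/display route are assumptions -- verify them
# against your own parsec output and Galaxy server.
import json

def collection_downloads(collection_json, galaxy_url):
    data = json.loads(collection_json)
    jobs = []
    for element in data["elements"]:
        name = element["element_identifier"]   # assumed field name
        dataset_id = element["object"]["id"]   # assumed field name
        jobs.append((name, f"{galaxy_url}/api/datasets/{dataset_id}/display"))
    return jobs
```

Each `(name, url)` pair can then be fed to curl in a loop, authenticating with your API key (for example via an `x-api-key` header), with retries around each request.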

I would be glad to hear if there is a more efficient way to do this, since I’m not particularly adept at working with APIs in general, or Galaxy’s in particular. Still, I’m a bit surprised that a user-friendly platform such as Galaxy doesn’t have a tool for something like this, especially given that the problem of large downloads is far from new.

Best Regards,

Dr. Evgeni Bolotin

Yes, well said @Evgeni_Bolotin and I’m glad you found a solution for this time! :scientist:

The problem of transferring really large amounts of data over sometimes slower/personal internet connections is pretty common, and like you stated, not new. Even the large cloud providers like AWS avoid hosting “personal on demand data” like this (without charging a premium everyone wants to avoid!) and use specialized distributed (and sometimes complicated) protocols for hosting the large, static public datasets.

What we have decided to do is focus on avoiding the need to transfer data around at all! Inputs and outputs in Galaxy have always technically been able to live anywhere, and as of the last year or so that is actually the case. The “server side” data storage we offer is mostly for convenience. Large-throughput projects will instead benefit from a data management plan at the start!

I’m not sure if you have seen the “BYOS” (Bring Your Own Storage) options on the public servers yet, but this is the approach we are focusing on the most.

These options under User → Preferences fit together. (See note below).


Manage Your Repositories. Add, remove, or update your personally configured location to find files from and write files to.


Manage Your Preferred Galaxy Storage. Select a Preferred Galaxy storage for the outputs of new jobs.

*Note: Most of these options are available at the UseGalaxy servers and some, like UseGalaxy.eu, offer even more choices, sometimes regional, under User → Preferences → Manage Information. Here you’ll find not just more personalized data storage choices but more personalized computational infrastructure choices (“BYOC” Bring Your Own Compute). All this will continue to expand in scope across all UseGalaxy servers!

The idea is to set up your storage profiles, then decide how to sort your own data out to those locations. This can be at the account level, history level, workflow level (an entire workflow, some tools, or certain classes of data), and collection/dataset level.

Then, Galaxy becomes the tool kit, the methods log, and the data catalogue. Location-agnostic “storage” (and, optionally, compute) means it doesn’t matter where the data originally comes from or where the resulting outputs go. Priority projects can be launched from the same server where everyone else you may be working with is working, even if they use different resources. There may be no upload or download step at all.

Where data lives and where it is processed are the least interesting parts of an analysis, yet they are the big technical bottlenecks. We hope to streamline that so the actually interesting parts - how the data is processed and what resulted - work really well for everyone. Upload then becomes more of an indexing step, and download a “what happened” reporting/sharing step.

We’ll still offer local storage, but over time it will become more of a supplementary resource than the primary one. Even 1 TB of space is often not enough, but no one wants to move a 1 TB file around!

Hope this helps and please let us know your thoughts! :slight_smile:

Thank you very much! Since what I’m doing now is a bit of a test project, I did not look very closely at possible storage options. But it is definitely worth investigating, thank you!
