I’m running various tools on a group of datasets (QC, adapter trimming, mapping etc.) and noticed that I’m running out of storage space rather quickly. I was wondering whether generating a collection list of paired-end runs uses the same amount of storage as all the individual files. Shouldn’t a collection list just be a shell for the original datasets?
Yes, a collection is essentially a shell around the original datasets: its elements are clones of those datasets and do not consume extra quota space. New work resulting from running tools will create new datasets, and those do consume additional quota space.
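If you want to check this yourself, a history's total disk size should not grow when you build a collection from datasets already in that history. Here is a minimal sketch using BioBlend (the Python client for the Galaxy API); the URL, API key, and history ID are placeholders, not values from your account:

```python
# Minimal sketch (assumes BioBlend is installed: pip install bioblend).
# The URL, API key, and history ID below are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")
history_id = "YOUR_HISTORY_ID"

# 'size' is the history's total disk usage in bytes; check it before and
# after building a collection from existing datasets -- it should not change.
history = gi.histories.show_history(history_id)
print(history["name"], history["size"], "bytes")
```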
Note: some tools require uncompressed inputs. If the original data is compressed, Galaxy will create a new uncompressed version at runtime when needed. In some cases it is better to start with uncompressed data, or to permanently delete (purge) the compressed version after the uncompressed version has been created.
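To see whether that has happened in one of your histories, you can list each dataset with its datatype and size and look for compressed/uncompressed pairs (for example, fastqsanger.gz next to fastqsanger). A sketch with BioBlend, again using placeholder credentials and IDs:

```python
# Sketch: list datasets with datatype and size to spot compressed/uncompressed
# duplicates. Placeholders: URL, API key, history ID, dataset ID.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")
history_id = "YOUR_HISTORY_ID"

for item in gi.histories.show_history(history_id, contents=True):
    if item.get("history_content_type") != "dataset" or item.get("deleted"):
        continue
    details = gi.datasets.show_dataset(item["id"])
    print(details["name"], details["extension"], details["file_size"], "bytes")

# Once you have confirmed an uncompressed copy exists, the compressed
# original can be purged to reclaim its quota, for example:
# gi.histories.delete_dataset(history_id, "COMPRESSED_DATASET_ID", purge=True)
```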
This FAQ explains how to find and manage all of your data:
Was the compressed fastq data uncompressed during your specific processing/tool choices?
If not, there is only one copy of the data, and you can either keep the starting data or purge it once it is no longer needed to free up quota space. You can always download it first as a backup (see the sketch below). The FAQ I shared has many details about ways to do that.
If yes, then you could uncompress the starting data yourself within Galaxy and then purge the compressed version to avoid the duplication. Or, if neither is needed anymore, purge both.
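If you go the download-then-purge route, that step can also be scripted. A minimal sketch with BioBlend, with placeholder URL, key, and IDs, that saves a dataset locally as a backup and then purges it from the history:

```python
# Sketch: back up a dataset locally, then purge it to free quota.
# Placeholders: URL, API key, history ID, dataset ID, output directory.
import os

from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")
history_id = "YOUR_HISTORY_ID"
dataset_id = "YOUR_DATASET_ID"

# Download to ./backups/ using the dataset's own name as the filename.
os.makedirs("backups", exist_ok=True)
gi.datasets.download_dataset(dataset_id, file_path="backups", use_default_filename=True)

# Purge (permanently delete) the dataset so it no longer counts against quota.
gi.histories.delete_dataset(history_id, dataset_id, purge=True)
```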