Too slow with CVMFS reference data

Does a job that uses CVMFS normally take a long time? I set up CVMFS on my local Galaxy following https://galaxyproject.org/admin/reference-data-repo/. Then I ran a bwa test job against hg38, but it is still running after about 6 hours. A smaller genome, dm3, did finish, but it took about 50 minutes. That seems slow for the size of the data. In both cases the input was the same: paired-end FASTQ files of about 400 MB each. A second run with dm3 finished in less than a minute.

I don’t think it’s a problem with our network, because downloads from other sites, such as UCSC chromosome data, run at 2-3 MB/s. I would like to know whether this performance is normal for Galaxy with CVMFS, and whether there is a setting that would improve it.

Thank you,
Yukie

What have you set as your cache size for cvmfs? This can affect production loads.
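
For reference, a minimal sketch of where the cache size is set on a client, assuming local overrides live in `/etc/cvmfs/default.local`. `CVMFS_QUOTA_LIMIT` is given in megabytes; the value and the cache path below are only illustrative, not a recommendation.

```shell
# /etc/cvmfs/default.local (assumed location of local client overrides)
# Soft limit for the local cache, in megabytes (illustrative value, about 20 GB).
CVMFS_QUOTA_LIMIT=20000
# Directory holding the cache (illustrative; it needs at least that much free space).
CVMFS_CACHE_BASE=/var/lib/cvmfs
```

After editing the file, the client has to pick up the change, for example with `cvmfs_config reload` or a remount.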

Which Stratum 1 was selected, and the connection speed to it, would also make a major difference. If you’re going to make heavy use of the CVMFS repo, it’s recommended to run at least a local Squid cache, if not a full Stratum 1 (which can be private to you).
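
A rough sketch of the client side of that, assuming a Squid proxy is already running somewhere nearby. The host name `squid.example.org`, port 3128, and the Stratum 1 URLs are placeholders, not real servers; `CVMFS_HTTP_PROXY`, `CVMFS_SERVER_URL`, and `CVMFS_USE_GEOAPI` are standard client parameters.

```shell
# Proxy setting, typically in /etc/cvmfs/default.local (placeholder host and port).
# Requests go through the local Squid cache first, then fall back to a direct connection.
CVMFS_HTTP_PROXY="http://squid.example.org:3128;DIRECT"

# Server list, typically in the domain- or repository-level config (placeholder hosts).
# Stratum 1 servers are separated by semicolons; @fqrn@ expands to the repository name.
CVMFS_SERVER_URL="http://stratum1-a.example.org/cvmfs/@fqrn@;http://stratum1-b.example.org/cvmfs/@fqrn@"

# Let the client sort the server list by geographic proximity.
CVMFS_USE_GEOAPI=yes
```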

Hi @hexylena ,

I set CVMFS_QUOTA_LIMIT="100000". I would like to avoid re-downloading data that has already been downloaded, as much as possible. Is that too large?

Thanks,
Yukie

Hi @nate ,

Thank you for telling me about Squid. I don’t use it that much now, but I’ll try it when I need to.

Thanks,
Yukie

A follow-up: I tried changing the Stratum 1 server, and performance improved. A bwa job with the same input files and dm3 finished in less than 10 minutes. Thanks for the advice, both of you.
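
For anyone checking the same thing, here is a small sketch of how the active Stratum 1 can be inspected from the client. It assumes the repository is `data.galaxyproject.org` and that the CVMFS client tools are installed; the `cvmfs_talk` commands may need root.

```shell
# Repository name is an assumption; adjust it to the one you mounted.
REPO="data.galaxyproject.org"

if command -v cvmfs_talk >/dev/null 2>&1; then
    # Show the configured Stratum 1 hosts and which one is active.
    cvmfs_talk -i "$REPO" host info
    # Re-measure round-trip times and reorder the host list accordingly.
    cvmfs_talk -i "$REPO" host probe
    # Print client statistics, including the host and proxy in use.
    cvmfs_config stat -v "$REPO"
else
    echo "CVMFS client tools not found; nothing to check."
fi
```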

Best,
Yukie