Trying to use the Wig/BedGraph-to-bigWig tool for a file generated by mapping to CHM13 T2T, but the bedgraph file I am trying to convert does not allow me to designate the reference genome as CHM13 T2T (it does not come up on the list).
Wig/BedGraph-to-bigWig interprets the database metadata assigned to the input file. This is true for any input to this tool. If no database is assigned, the tool will not recognize the dataset as input (a built-in quality check).
If you mapped in Galaxy, the output should have the database assigned as CHM13_T2T_v2.0.
If you generated the data outside of Galaxy, there are two choices:
If you sourced the genome from UCSC, then assigning that database should be OK → Changing database/build (dbkey)
If you sourced from somewhere else, you can do a comparison to the UCSC version.
- If not an exact match (chromosome names, sizes, sequence), you can create a custom database → Adding a custom database/build (dbkey)
- That creates a custom database that can be assigned to datasets, and other tools can interpret the metadata. That database is specific to your account and works just like native database assignments.
- Be sure to use that fasta as a custom genome fasta from the history with other tools. → How to use Custom Reference Genomes?
- Note about genome mismatch problems: tools may error or just produce weird results. The latter isn’t always obvious. This would happen with any reference data mismatch, any tool, in Galaxy or otherwise. For most use cases, the important part is to use the same reference data (genome, annotation) with all tools that are part of the same analysis.
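For the comparison step above, a quick sanity check is to diff the chromosome names and sizes between the two sources. A minimal sketch, assuming you have UCSC-style two-column `.chrom.sizes` files for each version (the sample names and lengths below are illustrative, not copied from the real files):

```python
def parse_chrom_sizes(text):
    """Parse UCSC-style .chrom.sizes content: '<name>\t<length>' per line."""
    sizes = {}
    for line in text.strip().splitlines():
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes

def compare_builds(a, b):
    """Return (names only in a, names only in b, shared names whose sizes differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    size_diffs = sorted(n for n in set(a) & set(b) if a[n] != b[n])
    return only_a, only_b, size_diffs

# Illustrative example: UCSC hs1-style names vs. accession-style names.
ucsc = parse_chrom_sizes("chr1\t248387328\nchrM\t16569")
other = parse_chrom_sizes("NC_060925.1\t248387328\nchrM\t16569")
print(compare_builds(ucsc, other))  # -> (['chr1'], ['NC_060925.1'], [])
```

If all three result lists are empty, the builds are an exact match at the name/size level (sequence content would still need a checksum comparison); otherwise, create the custom database.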
I ran some tests to make sure this is all working correctly. This is the test history and it includes a few manipulations and external downloaded files (this genome has a few aliases, and can be a bit confusing). The last dataset is the result of this tool run on a BWA-MEM result. All use the index already available in Galaxy sourced from UCSC. https://usegalaxy.org/u/jen-galaxyproject/h/test-chm13-dbkey
Hope that helps!
Thanks. Yes, mapping works fine with BWA-MEM and the correct genome is shown. I ran RMDUP on this and the genome also shows fine. But when the files go into MACS2, the bedgraph files that are generated show an “unspecified” database/build. When I try to put in the genome to fix this, the T2T build is not there as a choice. From this point on, none of the files generated from the bedgraph files will allow me to specify T2T as a genome build. Perhaps this doesn’t matter and the data is there regardless?
Thank you for sharing more details. I also see that my last queued job finished and failed in my test history.
An index file is missing! That probably explains all the problems.
We will need to fix this. Thank you so much for following up. I was digging for the problem with my tests. It is a bit complex.
- UCSC uses hs1 to represent the assembly in their browser.
- They use a couple of “alias files” for other functions (examples are downloaded into the test history) but it all works together at their site, and what is exposed is hs1.
- We get both the assembly and these specific index files from UCSC.
- Galaxy uses one of the other aliases to name the assembly, which creates some mismatches and missing data. This includes having no direct link out to UCSC from results due to the alias mixups.
This will take some time to sort out and repair.
You could use the custom genome/build option instead. This would mean starting over from mapping. I would not reuse the same dbkey/database name as UCSC or Galaxy, to avoid clashes with whatever the final fix turns out to be. So, give it a unique name for the database metadata attribute.
The source data is here:
- Top directory Index of /goldenPath/hs1/bigZips
- Fasta https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz
- Reference annotation is in the /genes directory one level down. The best choice is probably https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/genes/hs1.110.20220412.ncbiRefSeq.gtf.gz
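If your existing files use accession-style chromosome names while the UCSC fasta uses hs1-style names, UCSC’s chromAlias-style mapping files (the kind referenced in the test history above) can translate between the schemes. A minimal parsing sketch; the sample alias lines are illustrative, not copied from the real file, and the real column order should be checked:

```python
def build_alias_map(chrom_alias_text):
    """Map every alias in a chromAlias-style file to the UCSC name.

    Assumed line format: '<alias>\t<ucsc_name>\t<source tag>'.
    The real file's column order varies by assembly -- verify before use.
    """
    alias_to_ucsc = {}
    for line in chrom_alias_text.strip().splitlines():
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        alias, ucsc_name = fields[0], fields[1]
        alias_to_ucsc[alias] = ucsc_name
        alias_to_ucsc[ucsc_name] = ucsc_name  # allow identity lookups too
    return alias_to_ucsc

# Illustrative (made-up) alias lines:
sample = "NC_060925.1\tchr1\trefseq\nCP068277.2\tchr1\tgenbank"
amap = build_alias_map(sample)
print(amap["NC_060925.1"])  # -> chr1
```

A renaming pass like this over a bedgraph or GTF keeps all inputs on one naming scheme before assigning the custom dbkey.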
This workaround gives you full control when using tools in Galaxy. It will not help with viewing data at UCSC, but you can use custom genomes with both IGB and IGV. Later, once Galaxy has this repaired, you could swap the database key on these datasets to whatever we end up assigning to this same data (probably hs1). I’ll mark this topic for updates as we decide what to do.
Thank you again for following up!! I tested for 2 days trying to figure it out, but your example solved the riddle! This genome is newer…
Great - thanks for helping with this. Just so I am clear though: if the bedgraphs were derived from CHM13 data but then specified as hg38, is the repeat information there or not?
Update! We were able to create the missing index for the CHM13_T2T_v2.0 assembly. My rerun job of the failed test was successful. You should also rerun, starting from the first step that lost the database assignment (MACS2?).
We will be creating a brand new set of indexes for the hs1 assembly. Likely soon.
So, no need for the custom genome/build choice.
- These two assemblies have the same sequence content
- But they have different database/dbkey assignments and different chromosome names, and should remain distinct assemblies.
- We’ll update the longer database names to note the differences, e.g. which one to use if the UCSC genome browser, or their annotations, will be part of the analysis.
I’m not sure what you mean.
On the MACS2 tool form, the effective genome sizes are for human (the specific assembly is not noted).
That genome size is simplified as the general “effective size” for all recent human assemblies, including hg38, CHM13_T2T_v2.0, and hs1. The Galaxy indexes are all based on unmasked sequence as single-strand genomic, because that is the type of data these tools work with.
Whoops! I forgot that the MACS2 effective genome size is a bit stale, for hg19.
I couldn’t find a published effective genome size for CHM13_T2T_v2.0 or hs1, but this issue in the MACS GitHub repository explains how to calculate it (and clarifies what value is the default on the tool form for human). Effective genome size for HG38? · Issue #270 · macs3-project/MACS · GitHub
That length might be the full assembly “ungapped” length, which for this telomere-to-telomere (gapless) assembly happens to equal the total assembly length. Compare:
- hg38 https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40/
- CHM13_T2T_v2.0/hs1 T2T-CHM13v2.0 - hs1 - Genome - Assembly - NCBI
The MACS2 tool form has another option under this same pull-down menu to enter a custom value. Maybe try using that? Apologies for misunderstanding what you were asking about! This assembly is new to me too, and while I understand the data conceptually, I haven’t worked with it directly enough yet to learn all of the technical bits that are useful for practical analysis.
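One crude approximation people use for an effective genome size (real estimates, as discussed in that MACS issue, are based on unique mappability or unique k-mer counts) is simply the count of non-N bases in the assembly fasta. A minimal sketch with a made-up toy fasta; this gives only an upper bound, not a mappability-aware value:

```python
def effective_size_non_n(fasta_text):
    """Count non-N bases in FASTA text: a crude upper bound on the
    effective genome size (proper estimates use unique k-mer counts)."""
    total = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip sequence headers
        total += sum(1 for base in line.strip().upper() if base in "ACGT")
    return total

# Made-up toy fasta, not real assembly data:
toy = ">chrT\nACGTNNNNACGT\nNNACGTACGTAC"
print(effective_size_non_n(toy))  # -> 18
```

For a gapless T2T assembly the non-N count equals the full assembly length, which is consistent with the “ungapped length equals total length” observation above.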
If you were interested in analysis that utilizes the full set of CHM13_T2T_v2.0 assembly assets, I would suggest exploring the original assembly authors’ publications and resources. The assembly scope for this genome is special, and the parts that work with existing tools were extracted and added to Galaxy, UCSC, IGB, IGV, etc. Keep in mind that most of that extended data will be referenced by the root accession, not the database/dbkeys these applications use to represent one slice of that data.
Thanks; I am re-running these. I have a question about a failed multibigwigsummary run. Is this also having to do with a CHM13 issue? The error shows this:
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.8/site-packages/deeptools/getScorePerBigWigBin.py", line 18, in countReadsInRegions_wrapper
    return countFragmentsInRegions_worker(*args)
  File "/usr/local/lib/python3.8/site-packages/deeptools/getScorePerBigWigBin.py", line 107, in countFragmentsInRegions_worker
    score = bwh.stats(chrom, exon, exon)
RuntimeError: Invalid interval bounds!
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/multiBigwigSummary", line 14, in <module>
    main(args)
  File "/usr/local/lib/python3.8/site-packages/deeptools/multiBigwigSummary.py", line 251, in main
    num_reads_per_bin = score_bw.getScorePerBin(
  File "/usr/local/lib/python3.8/site-packages/deeptools/getScorePerBigWigBin.py", line 257, in getScorePerBin
    imap_res = mapReduce.mapReduce((bigWigFiles, stepSize, binLength, save_file),
  File "/usr/local/lib/python3.8/site-packages/deeptools/mapReduce.py", line 142, in mapReduce
    res = pool.map_async(func, TASKS).get(9999999)
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: Invalid interval bounds!

/usr/local/lib/python3.8/multiprocessing/pool.py:138: ResourceWarning: unclosed file <_io.TextIOWrapper name='/corral4/main/jobs/049/800/49800431/_job_tmp/_deeptools_0vgs42ac.bed' mode='w+t' encoding='UTF-8'>
  task = job = result = func = args = kwds = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/local/lib/python3.8/multiprocessing/pool.py:138: ResourceWarning: unclosed file <_io.TextIOWrapper name='/corral4/main/jobs/049/800/49800431/_job_tmp/_deeptools_tba_5s4y.bed' mode='w+t' encoding='UTF-8'>
  task = job = result = func = args = kwds = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
```
Both the source data files and the bed file for this tool run fine in other combinations so it seems strange that this combination fails to run.
Nevermind, I worked this out.
This part of the error indicates a mismatch between the chromosomes + coordinates between the two inputs. Double check that all are based on the exact same reference genome assembly/build.
You have mentioned both hg38 and CHM13_T2T_v2.0 – so I’m not sure which this run was based on, but using data anchored to one assembly for all inputs will work best. Those two are close enough that you might not realize there is a mismatch problem at first. Content issues do not always trigger a tool failure; instead, tools can produce odd “putatively successful” results that have a scientific problem.
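A quick way to find which intervals would trip the “Invalid interval bounds!” error is to check each BED record against the chromosome lengths the bigWig knows about. A minimal sketch, with a plain dict standing in for the bigWig header and hypothetical names/lengths:

```python
def find_bad_intervals(bed_lines, chrom_lengths):
    """Return BED intervals whose chromosome is missing from, or whose
    coordinates fall outside, the given chromosome-length table."""
    bad = []
    for line in bed_lines:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if chrom not in chrom_lengths or start < 0 or end > chrom_lengths[chrom]:
            bad.append((chrom, start, end))
    return bad

# Hypothetical lengths, e.g. parsed from a .chrom.sizes file:
lengths = {"chr1": 248387328}
bed = ["chr1\t100\t200",             # fine
       "chr1\t248387000\t248390000", # end past chromosome length
       "chrUn_1\t0\t50"]             # chromosome unknown to the bigWig
print(find_bad_intervals(bed, lengths))
```

An empty result means the BED and bigWig agree on names and bounds; anything reported points at a reference mismatch between the inputs.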