Metilene tool fails on ENCODE bigWigToBedGraph converted data

Dear specialists,
Would you please provide guidance on why Metilene fails despite that the data is BamSort’ed.
The last column is “Percentage of reads that show methylation” from ENCODE bigWig data converterd to BedGraph by “bigWigToBedGraph”.

The link to the history is below;

I just generated a bug report from the Metilene tool.

Thanks for your kind help.

Try sorting all of the inputs against the same reference genome when running bedtools SortBED (instead of letting the tool “guess” the order).

To do this, you’ll need to get the hg38.fa assembly into the working history, then select that fasta from the history as the Genome file on the tool form.

How to: Importing data from a data library

  1. Go to: Shared Data → Data Libraries masthead menu.
  2. Navigate the resources: Libraries → Genomes + annotations → Genomes → hg38.fa
  3. Data imported from a library does not count towards quota usage.

@jennaj
Hi, I am trying to run metilene on the methyl deckle produced bed graphs, sorted using bedtools.

The job fails with the report:

[WARNING] Tue Jun 20, 12:40:47, 2023 Input files need to be SORTED, i.e., use “bedtools sort -i file >file.sorted”
[INFO] Tue Jun 20, 12:40:47, 2023 Checking flags
[INFO] Tue Jun 20, 12:40:47, 2023 Write metilene input to metilene_g1_g2.input


Please could you suggest. According to the training manual we can input the files straight after methyldackel outputs with column 1 - chr(number), column 2 - start, column 3-end, column 4 - fraction of methylation.

Thanks,
Akanksha

Hi @akanksha_bafna

The chromosome order of the reference genome and your file must still be different.

Did you try sorting using the instructions above? Using a reference genome from the history?

Is this the tutorial you are running through? DNA Methylation data analysis

The tutorial files were curated to not need an extra sort step. When using your own data, that can situation can differ.

If the files do not work directly from the upstream tools, and sorting without using a reference genome is not enough, then you’ll probably need to do that full sort again using one. That reference genome should be the same exact reference genome used for mapping.

This can get more complicated depending on where the data was sourced, if a custom genome is involved, etc. Technically, the tool is comparing the BED to the FASTA to do manipulations – and processes those in order. If it finds lines that are not in the same order, it will fail. Maybe check that the chromosome identifiers are actually in the same exact format between the two and in the same order?

This all assumes that the sort order is the actual problem and the content/format of your files is otherwise Ok. These tools can help when performing extra checks when manipulating data to make complicated tools happy: Data Manipulation Olympics

If you need more help, please share more details. Copy/paste the full job information (inputs, parameters) and stderr/stdout logs and consider sharing a history link plus the exact tutorial/step. This is the information we are looking for: Troubleshooting errors and What information should I include when reporting a problem?

Let’s start there :slight_smile:

Dear @jennaj

  1. Yes , I downloaded the mm10.fa following the instructions and sorted all my input files with that.
    I , then executed the metilene using the sorted files but got error report:
    Execution resulted in the following messages:

Fatal error: Exit code 139 ()
Tool generated the following standard error:

[WARNING] Tue Jun 20, 15:22:47, 2023 Input files need to be SORTED, i.e., use “bedtools sort -i file >file.sorted”
[INFO] Tue Jun 20, 15:22:47, 2023 Checking flags
[INFO] Tue Jun 20, 15:22:47, 2023 Write metilene input to metilene_g1_g2.input


[BASIC CALL:] metilene_executable -t 4 -a g1 -b g2 metilene_g1_g2.input >out.file
Please adjust executable of metilene, number of threads (-t) and names of groups (-a, -b). For further parameters: metilene_executable --help
the following ids belong to group A (n=6):
0: column: 0, name:g1
1: column: 1, name:g1
2: column: 2, name:g1
3: column: 3, name:g1
4: column: 4, name:g1
5: column: 5, name:g1
the following ids belong to group B (n=6):
0: column: 6, name:g2
1: column: 7, name:g2
2: column: 8, name:g2
3: column: 9, name:g2
4: column: 10, name:g2
5: column: 11, name:g2
Mode 2 – pre-defined regions
region testing chr1-[84934573,84935054]
region testing chr1-[63176548,63177427]
region testing chr1-[125435175,125435976]
region testing chr1-[3531625,3531843]
/data/jwd02f/main/060/889/60889739/tool_script.sh: line 61: 715480 Segmentation fault (core dumped) metilene -M 300 -m 10 -d 0.1 -t ${GALAXY_SLOTS:-4} -f 2 -B “/data/dnb09/galaxy_db/files/f/3/c/dataset_f3cd4d7f-b3a7-4e2c-9196-e71427b40ff1.dat” -v 0.7 metilene_g1_g2.input > ‘/data/jwd02f/main/060/889/60889739/outputs/galaxy_dataset_3e803c99-10a3-4599-878a-da21a2095333.dat’

  1. I then downloaded the files from the tutorial DNA Methylation data analysis and tested the step 6 with the given instructions. The tool ran OK but did give the detailed report message as accessed via Galaxy 298-294 from my history.
  1. This gives me the confidence that I am executing the tool correctly with the files used at step 6 in the tutorial. On checking the bed graph files I noted they are in Ensembl format so I used replace column tool and generated the replaced files for input and CpGislands.bed from mm10.

I got the below report:

Dataset Information

Number 338
Name metilene qval<0.05 out on data 333, data 331, and others
Created Tuesday Jun 20th 15:40:38 2023 UTC
Filesize -
Dbkey mm10
Format bedgraph
File contents contents
History Content API ID 4838ba20a6d867656c076424a99355b2
History API ID d12281b8962b0af8
UUID 012c2be8-6498-490c-a1ce-8645a8a7dd3c
Full Path /data/dnb09/galaxy_db/files/0/1/2/dataset_012c2be8-6498-490c-a1ce-8645a8a7dd3c.dat
Originally Created From a File Named metilene_qval.0.05.out

Tool Parameters

Input Parameter Value
Input group 1 321 : Replace column on data 319 and data 99

322 : Replace column on data 319 and data 100

323 : Replace column on data 319 and data 101

324 : Replace column on data 319 and data 102

325 : Replace column on data 319 and data 103

326 : Replace column on data 319 and data 104|
|Input group 2|320 : Replace column on data 319 and data 98

327 : Replace column on data 319 and data 105

328 : Replace column on data 319 and data 106

329 : Replace column on data 319 and data 107

330 : Replace column on data 319 and data 108

331 : Replace column on data 319 and data 109|
|BED file containing regions of interest|333 : Replace column on data 319 and data 235|
|Options||
|The allowed nt distance between two CpGs within a DMR|300|
|The minimum # of CpGs in a DMR|10|
|Specify how many replicates must contain data for a certain CpG position in group 1|Not available.|
|Specify how many replicates must contain data for a certain CpG position in group 2|Not available.|
|The minimum mean methylation difference for calling DMRs|0.1|
|Stringency of the valley filter (0.0 - 1.0)|0.7|

Job Outputs

Tool Outputs Dataset
metilene on 334 : metilene on data 333, data 331, and others
metilene bedgraph on 335 : metilene bedgraph on data 333, data 331, and others
metilene qval<0.05 bedgraph on 336 : metilene qval<0.05 bedgraph on data 333, data 331, and others
metilene qval<0.05 plots on 337 : metilene qval<0.05 plots on data 333, data 331, and others
metilene qval<0.05 out on 338 : metilene qval<0.05 out on data 333, data 331, and others

Job Information

Galaxy Tool ID: toolshed.g2.bx.psu.edu/repos/rnateam/metilene/metilene/0.2.6.1
Command Line metilene_input.pl --in1 ‘/data/dnb09/galaxy_db/files/8/e/2/dataset_8e2ad3c4-1080-4d13-8f40-39786a80c94f.dat,/data/dnb09/galaxy_db/files/f/e/b/dataset_feb8ea7d-3f17-478b-8dbe-5f01b983c0f1.dat,/data/dnb09/galaxy_db/files/2/d/f/dataset_2df32a8a-18a6-4d5a-a19d-357cfc0827fb.dat,/data/dnb09/galaxy_db/files/e/a/0/dataset_ea0fe099-2409-4ec6-9cca-2f05ba36c188.dat,/data/dnb09/galaxy_db/files/0/7/d/dataset_07d1c53d-42fc-4960-b946-90bfce39d808.dat,/data/dnb09/galaxy_db/files/b/c/e/dataset_bcee76ce-fc02-4e3b-b6e7-c58373b1092c.dat’ --in2 ‘/data/dnb09/galaxy_db/files/5/d/2/dataset_5d25cb54-e330-4b14-8395-78a46a0cbc64.dat,/data/dnb09/galaxy_db/files/7/9/e/dataset_79e3632f-2b0b-477a-a08b-b30e2be2be39.dat,/data/dnb09/galaxy_db/files/a/a/2/dataset_aa2d041f-fdb5-4be7-9bd4-8e2ae1abc3df.dat,/data/dnb09/galaxy_db/files/3/7/6/dataset_3767deb2-bf60-4364-a22c-7e13ed751186.dat,/data/dnb09/galaxy_db/files/e/a/b/dataset_eabdbce2-9bfb-499e-ba7b-242aca2ad277.dat,/data/dnb09/galaxy_db/files/2/1/0/dataset_210f3cfc-b425-4634-b5b1-76927ff419cf.dat’ >/dev/null && metilene -M 300 -m 10 -d 0.1 -t ${GALAXY_SLOTS:-4} -f 2 -B “/data/dnb09/galaxy_db/files/b/4/c/dataset_b4ca7b9d-cd1f-4c50-a81a-f5185365af43.dat” -v 0.7 metilene_g1_g2.input > ‘/data/jwd01/main/060/893/60893411/outputs/galaxy_dataset_eeb5bad2-7323-4b1b-afa2-6cb70acdbf04.dat’ && cut -f1-4 /data/jwd01/main/060/893/60893411/outputs/galaxy_dataset_eeb5bad2-7323-4b1b-afa2-6cb70acdbf04.dat > /data/jwd01/main/060/893/60893411/outputs/galaxy_dataset_ec3fb5e3-3cbe-417d-b9b9-045b900b0900.dat 2>/dev/null && metilene_output.pl -q ‘/data/jwd01/main/060/893/60893411/outputs/galaxy_dataset_eeb5bad2-7323-4b1b-afa2-6cb70acdbf04.dat’ -p 0.05 -o ./metilene
Tool Standard Output empty
Tool Standard Error [WARNING] Tue Jun 20, 17:40:58, 2023 Input files need to be SORTED, i.e., use “bedtools sort -i file >file.sorted” [INFO] Tue Jun 20, 17:40:58, 2023 Checking flags [INFO] Tue Jun 20, 17:40:58, 2023 Write metilene input to metilene_g1_g2.input ***** [BASIC CALL:] metilene_executable -t 4 -a g1 -b g2 metilene_g1_g2.input >out.file Please adjust executable of metilene, number of threads (-t) and names of groups (-a, -b). For further parameters: metilene_executable --help the following ids belong to group A (n=6): 0: column: 0, name:g1 1: column: 1, name:g1 2: column: 2, name:g1 3: column: 3, name:g1 4: column: 4, name:g1 5: column: 5, name:g1 the following ids belong to group B (n=6): 0: column: 6, name:g2 1: column: 7, name:g2 2: column: 8, name:g2 3: column: 9, name:g2 4: column: 10, name:g2 5: column: 11, name:g2 Mode 2 – pre-defined regions region testing 1-[84934573,84935054] region testing 1-[63176548,63177427] region testing 1-[125435175,125435976] region testing 1-[3531625,3531843] /data/jwd01/main/060/893/60893411/tool_script.sh: line 61: 666148 Segmentation fault (core dumped) metilene -M 300 -m 10 -d 0.1 -t ${GALAXY_SLOTS:-4} -f 2 -B “/data/dnb09/galaxy_db/files/b/4/c/dataset_b4ca7b9d-cd1f-4c50-a81a-f5185365af43.dat” -v 0.7 metilene_g1_g2.input > ‘/data/jwd01/main/060/893/60893411/outputs/galaxy_dataset_eeb5bad2-7323-4b1b-afa2-6cb70acdbf04.dat’
Tool Exit Code: 139
Job Messages * { “code_desc”: “”, “desc”: “Fatal error: Exit code 139 ()”, “error_level”: 3, “exit_code”: 139, “type”: “exit_code” }
Job API ID: 11ac94870d0bb33a2cfd9c39a49c0448

Dataset Storage

This dataset is stored in a Galaxy object store with id files25.

Inheritance Chain

metilene qval<0.05 out on data 333, data 331, and others

Job Metrics

cgroup

CPU Time 1 minute
Failed to allocate memory count 0E-7
Memory limit on cgroup (MEM) 20.0 GB
Max memory usage (MEM) 774.9 MB
Memory limit on cgroup (MEM+SWP) 8.0 EB
Max memory usage (MEM+SWP) 774.9 MB
OOM Control enabled No
Was OOM Killer active? No
Memory softlimit on cgroup 0 bytes

core

Cores Allocated 10
Memory Allocated (MB) 20480
Job Start Time 2023-06-20 17:40:52
Job End Time 2023-06-20 17:42:30
Job Runtime (Wall Clock) 1 minute

hostname

hostname vgcnbwc-worker-c36m100-9944.novalocal

AWS estimate

0.03 USD
This job requested 10 cores and 20 Gb. Given this, the smallest EC2 machine we could find is m5d.4xlarge (64 GB / 16 vCPUs /Intel Xeon Platinum 8175). That instance is priced at 1.088 USD/hour.
Please note, that those numbers are only estimates, all jobs are always free of charge for all users.

Job Dependencies

Dependency Dependency Type Version
$bedtools $galaxy_package $2.24
$metilene $conda $0.2.6
$r-base $conda $3.5.1
$coreutils $conda

My history can be accessed via : Galaxy | Europe
Please let me know if you need any further information

Ok, thanks for running all of that. The cluster node is dying during the processing. The EU cluster nodes are the largest available at the public services.

You’ll need to make the job “smaller” per run:

  1. fewer lines in: BED file containing regions of interest, or limit per chromosome.
  2. Adjust advanced parameters to make the job stricter. The number replicates per group seem like a good place to start.
  3. Result can be visualized in a genome browser to help with these decisions, and the author manual and discussions/publications online will help more.

Others may have more advice :slight_smile:

Dear @jennaj

I tried to give just two files but still got the error.

Please could you advise in detail what should be done to the methyl deckle output (fractions) before it can be used for metilene.

I have tried sorting the files by mm10.fa
I have tried converting the files to ENSEMBL to match with the tutorial files used in step 6.

Thanks

Try a very very small test run. Just a few lines in each.

If that fails, then check that those lines will interact and produce output versus the parameters set. You may find a format or content problem during this test that you can then apply to the rest of the data.

When that test is working, you’ll know that the inputs are formatted Ok, and that the parameters are Ok. Then you can try adding in more data to see what the public server can support.

Hi,
I tried to run the tool with 1) methyl dackel output files 2) Sorted by mm10.fa 3) Sorted by local installed genome file . In all the three instances, I did not assign any regions of interest and it executed with same output.
This makes me think that it is not able to carry out when I execute with CpG islands.bed file created by mm10 UCSC table browser.
Any suggestions?

@jennaj
In addition to the above details , I think by just providing the input files- metilene is conducting de-novo DMR analysis. When I tried to download the CpG islands.bed file from the UCSC using table browser for human genome (hg38)>> Convert to ENSEMBL (replace column tool) so that it can be used with the tutorial step 6 input files >> and run the metilene it failed. Hence the CpG islands.bed file in the tutorial is pre-formatted in a way that is different from the former and can be used by metilene. Please the galaxy wrapper creator add details on this so regions of interest can be interrogated.
Thanks.

Hi @akanksha_bafna

Maybe examples would help? These use the tutorial’s data and workflow. Notes are at the top about the sources.

@jennaj
Thanks . I did have a deep look at the input and output files previously, checking the tutorial workflow and datasets. That hinted me to change the files from UCSC to Ensemble (to match the tutorial datasets). Nonetheless, running input files (sorted vs directly from methyl deckle) in either file format generated same output when run without the regions of interest file. The CpGislands.bed file used in the tutorial runs fine with the used tutorial inputs , but when I download the hg38 file from UCSC to run for tutorial inputs or mm10 with my inputs , metilene fails! This prompts me to think that some adjustments are made to the used CpGislands.bed which is not clearly spelled out.

Hi @akanksha_bafna

OK, so the data files are Ok format wise, and the tool works correctly when given proper inputs, but fails when given a job that is “too large”.

Thw CpG_islands.bed is the same source as in the publication, linked from the tutorial: Hierarchical Clustering of Breast Cancer Methylomes Revealed Differentially Methylated and Expressed Breast Cancer Genes. It was downsized to work at the public server. The introduction to the tutorial discusses this.

This is what they state about the data sources. The exact details are not included … that is part of why we made Galaxy … to enable full sharing of data.

External datasets

The UCSC Genes (knownGene), Ensembl Genes (ensGene), Gencode Genes (wgEncodeGencode), CpG Islands Tracks (cpgIslandExt), ENCODE Integrated Regulation Tracks (wgEncodeReg), ENCODE/Broad Institute Histone Modifications (wgEncodeBroadChipSeq), and NKI LaminB1 track (laminB1) were obtained from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/). The bisulfite sequencing data for HMEC and HCC1954 were obtained from NCBI GEO (GSE29127) [38]. The processed bisulfite sequencing Wig files for breast luminal epithelial cells (UCSF-UBC), breast myoepithelial cells (UCSF-UBC), esophagus (UCSD), gastric (UCSD), H1 (UCSD), H9 (UCSD), HUES64 (BI), lung (UCSD), penis foreskin fibroblast primary cells (UCSF-UBC), penis foreskin keratinocyte primary cells (UCSF-UBC) were obtained from the NIH Roadmap Epigenomics data listings at NCBI GEO (NIH Roadmap Epigenomics - GEO - NCBI). The ChIP-seq data for MCF7 RNA PolII binding profile was obtained from GEO (GSE32692). The ChromHMM classification in HMEC and MCF7 cells were obtained from GEO (GSE57498) [51]. The processed “Level 3” DNA methylation and RNA-seq data of TCGA breast invasive carcinoma (BRCA) dataset were obtained from the TCGA Data Portal (https://tcga-data.nci.nih.gov/tcga/) (S13 Table). The fragile sites data were obtained from the GENATLAS database (http://www.dsi.univ-paris5.fr/genatlas/). The CrossMap (v0.1.4) tool was used to convert hg19 coordinates in the BED and Wig files to hg18 coordinates and to perform format conversion when necessary [75].

So, first, make sure you are using data from mm10.

Then, you can try limiting the number of regions to avoid the “segmentation fault” error (means the data is too large/dense for the public resources).

From your original error:

/data/jwd01/main/060/893/60893411/tool_script.sh: line 61: 666148 Segmentation fault (core dumped)

And my original reply. This loops fully back to the first item I mentioned.

The public services cannot process all very large work. The CpG islands are very dense tracks. Try making that file smaller. The Table Browser will allow for complex joins. So, if you are only interested in TSS start sites, then filter by known genes that overlap with those regions, and that becomes your CpG reference data.

Dear colleagues: @jennaj @akanksha_bafna
Thanks for the detailed troubleshooting.
As mentioned on the very beginning of this discussion, I was trying Metilene on ENCODE bigWig (converterd to BedGraph by “bigWigToBedGraph”) which was hg38 whole genome methylation data with replicates (2 Gb each). It failed in all cases, and I gave up.
The information that you provide here is very valuable to prevent people losing time on some “mission impossible”.
The tool developers, Galaxy admins or managers should be able to evaluate and state these facts within each tool, where possible, to prevent us losing time and keeping the server busy for unachievable tasks.
Thanks a lot, and best of luck with your work.

@jennaj
Thank you very much for the detailed explanation.
I ran my dataset by limiting to chr1 CpGislands.bed from UCSC and it worked.
Hence , the tool cannot run as you mentioned with large data points.

I , now have just one query .
Q1. I gave the two input files from the tutorial step 6 (Galaxy | Europe) without specifying any regions of interest and got output of 2578 regions. Does that mean the tool is running on advanced parameter -f = 1(de-novo)? This parameter cannot be entered in the galaxy wrapper so I would just like to know .

Q2. If so, when I did the same for my input files , I got just 7 regions?
Output can be accessed: Galaxy | Europe

Could you share if the execution is fine and in de-novo mode ? and this is actual observation and no problem with the dataset.

In both cases as mentioned in Q1 and Q2 , the tool report suggested to sort files but in Q1 the tutorial files were sorted. So, I am not sure if that has anything to do with the output. As mentioned earlier that report message didn’t change when I used sorted files or upstream output ones.

The command-line is on the job details page. Maybe check there to see if the flags you want to use were applied?