Using Arriba for Fusion detection on Galaxy

I was trying to detect fusions on Galaxy, using FASTQ files from a public GEO dataset for practice, but every time I run Arriba the job fails with the error “This job was terminated because it used more memory than it was allocated”.

I used RNA STAR as my aligner, with the GENCODE FASTA file as the reference genome and the GENCODE GTF as the annotation file. I used the same two files for Arriba.

STAR command on Galaxy:
gunzip -c '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_2eb47a98-5fc8-4320-89a9-75bdae642e92.dat' > refgenome.fa && mkdir -p tempstargenomedir && STAR --runMode genomeGenerate --genomeDir 'tempstargenomedir' --genomeFastaFiles refgenome.fa --sjdbOverhang '100' --sjdbGTFfile '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_6af21bbd-87c1-4ca3-8940-08daff76b9eb.dat' --sjdbGTFfeatureExon 'exon' --genomeSAindexNbases 12 --runThreadN ${GALAXY_SLOTS:-4} --limitGenomeGenerateRAM $((${GALAXY_MEMORY_MB:-31000} * 1000000)) && STAR --runThreadN ${GALAXY_SLOTS:-4} --genomeLoad NoSharedMemory --genomeDir tempstargenomedir --readFilesIn '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_7bbd3e7b-773a-4c2b-8de6-d4a2c13c3322.dat' '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_52a6a41c-8d00-47b4-948f-4977ce20c733.dat' --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None --quantMode - --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore --outSAMmapqUnique 50 --outSAMunmapped Within --outFilterType Normal --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 50 --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.3 --outFilterMismatchNoverReadLmax 1.0 --outFilterScoreMin 0 --outFilterScoreMinOverLread 0.66 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0.66 --outSAMmultNmax -1 --outSAMtlen 1 --seedSearchStartLmax 50 --seedSearchStartLmaxOverLread 1.0 --seedSearchLmax 0 --seedMultimapNmax 10000 --seedPerReadNmax 1000 --seedPerWindowNmax 50 --seedNoneLociPerWindow 10 --alignIntronMin 21 --alignIntronMax 0 --alignMatesGapMax 0 --alignSJoverhangMin 5 --alignSJstitchMismatchNmax 0 -1 0 0 --alignSJDBoverhangMin 5 --alignSplicedMateMapLmin 0 --alignSplicedMateMapLminOverLmate 0.66 --alignWindowsPerReadNmax 10000 --alignTranscriptsPerWindowNmax 100 --alignTranscriptsPerReadNmax 10000 --alignEndsType Local --peOverlapNbasesMin 0 --peOverlapMMp 0.01 --chimSegmentMin 5 --chimScoreMin 0 --chimScoreDropMax 200 --chimScoreSeparation 5 --chimScoreJunctionNonGTAG -1 --chimSegmentReadGapMax 0 --chimFilter banGenomicN --chimJunctionOverhangMin 5 --chimMainSegmentMultNmax 10 --chimMultimapNmax 0 --chimMultimapScoreRange 1 --limitOutSJoneRead 1000 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 --outBAMsortingThreadN ${GALAXY_SLOTS:-4} --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 --limitBAMsortRAM $((${GALAXY_MEMORY_MB:-0}*1000000)) --chimOutType WithinBAM && samtools view -b -o '/jetstream2/scratch/main/jobs/68966065/outputs/dataset_33b1e4ef-3de9-4352-87f3-981a3da0bb8b.dat' Aligned.sortedByCoord.out.bam

Arriba command on Galaxy:

ln -sf '/corral4/main/objects/9/6/b/dataset_96b1bdc6-dedc-47a3-9529-9ec40f5fc78f.dat' genome.fa && ln -sf '/corral4/main/objects/6/a/f/dataset_6af21bbd-87c1-4ca3-8940-08daff76b9eb.dat' genome.gtf && arriba -x '/corral4/main/objects/3/3/b/dataset_33b1e4ef-3de9-4352-87f3-981a3da0bb8b.dat' -a 'genome.fa' -g 'genome.gtf' -f 'blacklist' -o fusions.tsv -O fusions.discarded.tsv && samtools sort -@ ${GALAXY_SLOTS:-1} -m 4G -T tmp -O bam '/corral4/main/objects/3/3/b/dataset_33b1e4ef-3de9-4352-87f3-981a3da0bb8b.dat' > Aligned.sortedByCoord.out.bam && samtools index Aligned.sortedByCoord.out.bam && convert_fusions_to_vcf.sh 'genome.fa' fusions.tsv fusions.vcf && mkdir fusion_bams && extract_fusion-supporting_alignments.sh fusions.tsv Aligned.sortedByCoord.out.bam 'fusion_bams/fusion' && draw_fusions.R --fusions='fusions.tsv' --alignments='Aligned.sortedByCoord.out.bam' --annotation='/corral4/main/objects/6/a/f/dataset_6af21bbd-87c1-4ca3-8940-08daff76b9eb.dat' --output=fusions.pdf --transcriptSelection=provided

Is there any way to get Arriba to finish and detect fusions? And would trimming my annotation and FASTA files down to just what is needed for fusion detection be an efficient way to use less memory and time?

Welcome @Daghshy

Yes, the tool is running out of memory at runtime. GENCODE was a good choice for reference data.

What you can try:

  1. Make sure your BAM file is filtered so that it retains only high-quality alignments: keep proper pairs and primary alignments, and remove unmapped reads. GEO data is known to sometimes have quality problems, so you’ll need to control for that, keeping the useful signal while removing the excess noise. Filtering against the reference genome is a good strategy.

  2. If that is not enough, consider filtering the haplotype and alt chromosomes out of the BAM, and maybe chrY (PAR regions), so that only chr1-22, chrX, and chrM remain. Then rerun to see what happens.

  3. Then, if you are running this at UseGalaxy.org, you can copy the input files into a new, smaller history, transfer that over to the UseGalaxy.eu server, and try the run there. Each server hosts slightly different cluster resources, so it is worth trying at each.
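
For reference, steps 1 and 2 could look roughly like this on the command line. This is only a sketch under assumptions: the function names and file names are made up for illustration, the contig names assume GENCODE-style `chr` prefixes, and the flag choices are one reasonable reading of the advice above, not a tested recipe for this dataset. Requiring proper pairs is left as an optional comment, because read pairs that span a fusion are often not flagged as proper.

```shell
#!/usr/bin/env bash
# Sketch only: function and file names are placeholders, flag choices are assumptions.

# Print the primary contig names: chr1..chr22, chrX, chrM (GENCODE-style naming).
primary_contigs() {
  local i
  for i in $(seq 1 22); do printf 'chr%s ' "$i"; done
  printf 'chrX chrM\n'
}

# Keep mapped, non-secondary alignments that sit on the primary contigs.
# -F 260 = 0x4 (unmapped) + 0x100 (secondary).
# Supplementary alignments (0x800) are kept on purpose, because STAR's
# "--chimOutType WithinBAM" writes chimeric segments as supplementary records.
# Add "-f 2" to also require proper pairs, at the risk of dropping
# discordant pairs that support a fusion.
filter_bam() {
  local in_bam=$1 out_bam=$2
  samtools index "$in_bam"    # region queries need an index
  # $(primary_contigs) is intentionally unquoted so each contig is its own argument
  samtools view -b -F 260 -o "$out_bam" "$in_bam" $(primary_contigs)
  samtools index "$out_bam"
}

# Example call (placeholder file names):
# filter_bam Aligned.sortedByCoord.out.bam filtered.bam
```

On Galaxy itself, the same filters are available through the Samtools view tool, so no command line is needed.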

If the job fails in all of these cases, it might actually be too large to process on the public servers. But you are welcome to share it back, and I can review the parameters and data more closely, then pass that along to the developers to see if they can increase the resource allocation.

Hope this provides some ideas, and we can follow up! :slight_smile:

I tried rerunning the workflow on UseGalaxy.eu and it totally worked. Thank you very much!