Uploading problem of BAM file in Galaxy history

kdsung · September 23, 2024, 9:05pm

I am attempting to upload a BAM file (141 GB) from bacterial genome sequencing using PacBio to my Galaxy history. However, after several hours, the upload has not completed. Could you advise me on how to resolve this issue?

Additionally, could you guide me through the steps to assemble the BAM file in Galaxy once it is successfully uploaded?

Thank you for your assistance.

jennaj · September 23, 2024, 9:52pm

Welcome, @kdsung

That BAM is a bit large to load, but it might be possible, depending on where you are working. One of the UseGalaxy servers would be a good choice. See Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub

UseGalaxy.org is one choice, and that is how you tagged your question. When loading, make sure the connection on your end is stable and fast, and allow the job to process. You can resume if it is interrupted.

Note: personal/home internet connections are optimized for “down” speeds, and tend to have slower “up” speeds. The “up” speed is what governs how fast your data reaches our server, and it is the likely bottleneck. Getting on a stronger “up” connection could help with a file this large.
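As a rough illustration of why the “up” speed matters for a 141 GB file, the sketch below estimates the transfer time at a few upload speeds. The speeds are hypothetical examples rather than measurements, and the estimate assumes decimal units (1 GB = 10^9 bytes), a constant rate, and no protocol overhead.

```python
def upload_hours(size_gb: float, mbps: float) -> float:
    """Estimated hours to send size_gb gigabytes at mbps megabits per second."""
    bits = size_gb * 1e9 * 8        # decimal gigabytes -> bits
    seconds = bits / (mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 3600


# Hypothetical upload speeds in Mbit/s; substitute the result of your own speed test.
for speed in (10, 50, 500):
    print(f"{speed:>4} Mbit/s: about {upload_hours(141, speed):.1f} hours")
```

At 10 Mbit/s the estimate is a little over 31 hours, which is consistent with an upload that needs to run overnight or longer.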

If you plan to do an assembly, you may only need the fastq reads from the BAM file (the file is a “sequence only” BAM, correct?). You can extract those once the data is in Galaxy (Samtools SamToFastq), or you could extract the reads before loading the data into Galaxy. If you extract locally, you could split up the fastq data and load it in smaller files, then merge later.
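If the reads are extracted locally first (for example with `samtools fastq`), splitting the result into smaller files could look something like the sketch below. The function name, chunk size, and output naming are hypothetical, and it assumes an uncompressed FASTQ with the common four-lines-per-record layout.

```python
import gzip
from pathlib import Path


def split_fastq(src: Path, out_dir: Path, reads_per_chunk: int = 50_000) -> list:
    """Split a 4-line-per-record FASTQ into numbered gzip chunks."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    out = None
    with open(src, "rt") as handle:
        for i, line in enumerate(handle):
            # Start a new chunk every reads_per_chunk records (4 lines each).
            if i % (4 * reads_per_chunk) == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}.part{len(chunks) + 1:03d}.fastq.gz"
                out = gzip.open(path, "wt")
                chunks.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks


# Example call (hypothetical file names):
# split_fastq(Path("reads.fastq"), Path("chunks"))
```

The file is streamed line by line so that long reads do not need to be held in memory, and each chunk is compressed to reduce the amount of data sent over the “up” connection.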

More about Upload → Getting Data into Galaxy

For assembly itself, please start with these tutorials.

Hope this helps!

kdsung · September 24, 2024, 1:48pm

Thank you so much for your kind response. It helps a lot!!!

kdsung · September 24, 2024, 3:38pm

The BAM file was successfully uploaded overnight. But when I ran SamToFastq, bedtools Convert from BAM to FastQ, and Convert BAM to FASTA multiple sequence alignment, the following error appeared at step #2.
Convert BAM on data 1
Traceback (most recent call last):
  File "/usr/local/bin/bam2msa", line 43, in main
    samfile = pysam.Samfile(bam_file, 'rb')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.Alignm
If you don’t mind, would you please let me know how I can fix the problem? I am very sorry to bother you with this matter and really appreciate your great help.

jennaj · September 24, 2024, 4:41pm

Hi @kdsung

It looks like the tool is detecting a problem with the BAM – maybe the format.

Are you able to click on the eye icon in the BAM dataset to open the peek view in the center panel? If yes, what does that plain text look like?

Another check is to confirm that the Upload process was complete. One simple way to do this is to compare the overall size of the file before and after. You could also do something like a checksum – use the tool Secure Hash / Message Digest (link at ORG) in Galaxy, run the same on your computer, and compare.
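For the local half of that checksum comparison, a minimal sketch using Python's standard `hashlib` is below. The default of SHA-256 is an assumption: use whichever algorithm you select in the Galaxy tool, then compare the two hex strings.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hex digest of a file, read in chunks so a very large BAM never sits in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Example call (hypothetical file name):
# print(file_digest("hifi_reads.bam"))
```

If the local digest and the Galaxy digest differ, the uploaded copy is not byte-identical to the local file.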

You could also back up and confirm that the file is intact on your computer, since an upstream transfer could have corrupted the file. Samtools is the general utility to use – if you install that on your own computer, it will make working with this file type much easier, even if just used for “sanity checks” like this. You don’t need to convert to SAM.
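With Samtools installed, `samtools quickcheck` is the usual first test. If it is not available, a rough stand-in is sketched below, assuming the file is a standard BGZF-compressed BAM: it only checks that the file starts with the gzip magic bytes and ends with the empty end-of-file block defined in the SAM/BAM specification. The function name is hypothetical, and passing this check does not prove the content in between is valid.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"
# 28-byte empty BGZF block that marks the end of a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def looks_complete(path: str) -> bool:
    """True if the file has a gzip header and the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        starts_ok = handle.read(2) == GZIP_MAGIC
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        ends_ok = handle.read() == BGZF_EOF
    return starts_ok and ends_ok
```

A file that was cut off during a transfer will normally be missing the end-of-file block, so this catches the most common kind of damage cheaply.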

Given the size of the file, I would suggest that you put this into the larger storage space at UseGalaxy.org. You will be getting rid of this BAM once the sequences are extracted anyway. This post explains more about that space: What should I do if my data exceeds the given 250GB of storage?

And finally, UseGalaxy.eu can sometimes process the largest data due to how they configure their public clusters. You could try a cross-comparison to see what results. You can either Upload directly there, or attempt to transfer the data between servers. This might be needed for the Assembly step too, but you won’t know until you try. The post above has instructions for moving data between servers.

SAM/BAM format (specification – an internet search will find much more)

- If the file is actually “sequence only”, these usually have a single header line, followed by data lines – one per sequence. All of the “mapping” statistics will have default placeholder values, and the nucleotide sequence and quality score “sequence” will both be present. The sequences are what you are extracting. Now, there might be some variation here, and the source of the file may have more details about what to expect, but the basic format constraints should be intact.
- If the file is from a prior alignment run, then you’ll have more header lines followed by data lines with statistics – one per alignment. This is what most of the online resources about the BAM/SAM formats are describing.
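To make the “placeholder values” concrete, the sketch below checks one data line in tab-separated SAM text form (as printed by `samtools view`) for the fields the SAM specification uses on unaligned records: the 0x4 flag bit, `*` as the reference name, position 0, and `*` as the CIGAR. The example record is made up for illustration.

```python
def is_unaligned(sam_line: str) -> bool:
    """True if a SAM data line carries only placeholder mapping fields."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])    # bitwise FLAG; 0x4 means "segment unmapped"
    rname = fields[2]        # reference name, "*" when there is none
    pos = int(fields[3])     # 1-based position, 0 when unavailable
    cigar = fields[5]        # alignment string, "*" when unavailable
    return bool(flag & 0x4) and rname == "*" and pos == 0 and cigar == "*"


# Hypothetical records: one unaligned read, one aligned read.
unaligned = "read1\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\tIIII"
aligned = "read2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
print(is_unaligned(unaligned), is_unaligned(aligned))
```

If every data line looks like the first record, the BAM is carrying sequences only and the reads can be extracted without worrying about alignment information.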

Please give those a check. It is hard to guess more without seeing the actual tool runs, and your file details.

kdsung · September 24, 2024, 5:09pm

I greatly appreciate your kind advice. Your guidance has been very helpful as I work on uploading the BAM file to Galaxy.
Unfortunately, I wasn’t able to check using the eye icon as I had already deleted the BAM file due to its large size.
Once again, thank you for your prompt response and support.

kdsung · September 25, 2024, 1:38pm

I ran SamToFastq on a HiFi BAM file from a bacterial genome sequence, and it produced five outputs:

2: SamToFastq on data 1: reads as fastq
3: Interleaved pairs from SamToFastq on data 1
4: Paired-end forward strand from SamToFastq on data 1
5: Paired-end reverse strand from SamToFastq on data 1
6: Paired-end unpaired reads from SamToFastq on data 1

I expected to obtain paired-end forward and reverse strand data. However, only the interleaved pairs (#2) contained data (5.5 GB in fastqsanger format), while the other outputs are all 0 GB. No errors were reported.

Could you help me understand why SamToFastq did not provide separate paired-end forward (#4) and reverse (#5) strand data as expected? Additionally, can I use the interleaved pairs (5.5 GB in fastqsanger format) for assembly? If so, could you please guide me on how to proceed with the assembly using the interleaved pairs?
I apologize for the trouble and greatly appreciate your help with this matter.
Thank you!

jennaj · September 25, 2024, 6:32pm

Hi @kdsung

Thanks for posting to the main thread.

Most downstream tools will expect the forward and reverse reads in separate files, not the interleaved organization.

You can run the interleaved reads through a tool to separate them. You can use a few of the Seqtk tools to do this (search the tool panel with that keyword to find them), or you can use a dedicated tool like FASTQ de-interlacer on paired end reads (link at ORG).

Why the tool wrote the data out the way it did involves a few factors, including the way the BAM was originally organized. But that doesn’t matter – you have the data you need, and changing the “shape” of your data is certainly possible (and expected!).

Galaxy hosts tools from all kinds of open source groups with diverse authors and diverse data expectations. Making small adjustments so that these all work together is super common. You’ll find most command line utilities in Galaxy, and you can even go into an interactive environment to use R, Python, Notebooks, and others directly if that is what you are more familiar with.

When and if you are interested in short tutorials about what is possible, I would suggest these:

Introduction to Galaxy Analyses / Tutorial List
- Hands-on: NGS data logistics / NGS data logistics / Introduction to Galaxy Analyses << this one will answer most of this question with more details
GTN Materials Search (query=olympics) – SQL, R, JQ, plus the GUI tools
- Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Foundations of Data Science << this one covers common data manipulations with examples and explanations
- Hands-on: Galaxy Basics for everyone / Galaxy Basics for everyone / Introduction to Galaxy Analyses << simple introduction to workflows

Later on you can extract a workflow from your history, so you can run all of the tedious intermediate steps in a batch without worrying about making small data-entry errors and such.

Hope this helps, and so glad you were able to get this BAM loaded and the reads extracted!! I was a bit worried about the size so it is great to learn you have gotten this far along!

kdsung · September 25, 2024, 7:14pm

Thank you so much for your kind help. I will try FASTQ de-interlacer. In addition, I will take the short tutorials. Again, I really appreciate your great kindness.

kdsung · September 25, 2024, 7:15pm

After SamToFastq only provided Interleaved pairs, I ran bam2fastx and it gave 1.6 GB of fastqsanger.gz. Then I ran Flye for assembly. But there was the following error in all steps.

Flye on data: consensus
Flye on data: assembly graph
Flye on data: graphical fragment assembly
Flye on data: assembly info

Execution resulted in the following messages: Fatal error: Exit code 1 ()

Tool generated the following standard error:

[2024-09-25 17:23:17] INFO: Starting Flye 2.9.5-b1801

[2024-09-25 17:23:17] INFO: >>>STAGE: configure

[2024-09-25 17:23:17] INFO: Configuring run

[2024-09-25 17:24:11] INFO: Total read length: 2699188740

[2024-09-25 17:24:11] INFO: Reads N50/N90: 6925 / 4548

[2024-09-25 17:24:11] INFO: Minimum overlap set to 5000

[2024-09-25 17:24:11] INFO: >>>STAGE: assembly

[2024-09-25 17:24:11] INFO: Assembling disjointigs

[2024-09-25 17:24:11] INFO: Reading sequences

[2024-09-25 17:25:04] INFO: Counting k-mers:

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

[2024-09-25 17:27:48] INFO: Filling index table (1/2)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

[2024-09-25 17:30:32] INFO: Filling index table (2/2)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

[2024-09-25 17:34:47] INFO: Extending reads

[2024-09-25 17:46:46] INFO: Overlap-based coverage: 741

[2024-09-25 17:46:46] INFO: Median overlap divergence: 0.0441275

0% 100%

[2024-09-25 20:43:24] INFO: Assembled 0 disjointigs

[2024-09-25 20:43:24] INFO: Generating sequence

[2024-09-25 20:43:24] INFO: Filtering contained disjointigs

[2024-09-25 20:43:24] INFO: Contained seqs: 0

[2024-09-25 20:43:25] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct

[2024-09-25 20:43:25] ERROR: Pipeline aborted

Galaxy job runner generated the following standard error:

WARNING:galaxy.model:Datatype class not found for extension 'gfa'

If you don’t mind, would you please help me fix the problem? I really appreciate your great help!!!

kdsung · September 25, 2024, 8:41pm

I ran FASTQ de-interlacer but it looks like it didn’t work.

Please find outputs below.

FASTQ de-interlacer right singles from data 3:

There were 582275 reads with no mate.

De-interlaced 0.0 pairs of sequences.

FASTQ de-interlacer left singles from data 3:

There were 582275 reads with no mate.

De-interlaced 0.0 pairs of sequences.

FASTQ de-interlacer right mates from data 3:

There were 582275 reads with no mate.

De-interlaced 0.0 pairs of sequences.

FASTQ de-interlacer left mates from data 3:

There were 582275 reads with no mate.

De-interlaced 0.0 pairs of sequences.

Would you please let me know why FASTQ de-interlacer didn’t work? What should I do at this point? Thank you so much for your help in advance.

jennaj · September 25, 2024, 9:02pm

Hi @kdsung

I can’t tell what is going on with just this information. Would you please share your history? How to do that is in the banner topic at this forum. Thanks!

kdsung · September 26, 2024, 1:57pm

How can I share my history with you? I have tried copying the history but couldn’t paste it. In addition, I have tried to share my history with you but couldn’t add your email address, since it (notifications@galaxy.discoursemail.com) was not recognized. I am very sorry to bother you with this matter.

jennaj · September 26, 2024, 5:27pm

Hi @kdsung

Click into the Sharing your History link in here → How to get faster help with your question

To share a history link, you seem to be on the right view. You do not need to share with an email address. Just toggle the top slider for “sharing”. A link will be generated. You can copy and paste that back here. Once we are done, you can unshare.

Very very few people out of all the people in the world would understand what you are sharing, and those people are unlikely to be interested in doing your research for you so it is totally fine to share publicly here. And, if your data does have more stringent security requirements (protected patient data or similar), then working at a public Galaxy server is probably not appropriate.

Thanks!

kdsung · September 26, 2024, 5:55pm

Here is the history link: Galaxy
Thank you so much for your great help!!!

jennaj · September 26, 2024, 10:23pm

Hi @kdsung

Your starting data appear to be the hifi_reads.bam described here: reads.bam | CCS Docs.

HiFi reads are not paired-end sequencing. They can be thought of as really long, high-quality “single end” reads … but in actuality each read is a mini-assembly.
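One way to see this in extracted reads is to look at the read names. The sketch below counts FASTQ headers ending in `/1` or `/2`, which is one common paired-end naming convention and the kind of thing a de-interleaving step looks for. Whether a particular tool relies on exactly these suffixes is an assumption here, and the example names are made up.

```python
def count_paired_names(fastq_lines):
    """Return (total reads, reads whose name ends in /1 or /2)."""
    total = paired = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:                      # header line of each 4-line record
            total += 1
            name = line[1:].split()[0]      # drop "@" and any description
            if name.endswith(("/1", "/2")):
                paired += 1
    return total, paired


# Hypothetical single-end style records: names do not end in /1 or /2.
single_end = ["@movie/7/ccs", "ACGT", "+", "IIII",
              "@movie/9/ccs", "GGCC", "+", "IIII"]
print(count_paired_names(single_end))
```

When none of the names follow a mate convention, a report that every read has “no mate” is what you would expect, and the single-end extraction route is the appropriate one.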

So – punch line – extract the reads using the single end options. Maybe review how others are doing this, since the defaults might not produce exactly what you want.

Beyond that, you seem to have at least two different sample replicates in your history so far.

We have an example of this sort of workflow in our GTN Tutorials.

With all Assembly tutorials here

And a protocol developed by the UseGalaxy.org.au and other Australian scientists (the Galaxy workflows should work on any public UseGalaxy server)

With the top level VGP introduction here. I’m not sure of your species, but maybe interesting anyway.

Summary of what I would suggest

Run through at least this tutorial. Really, it will make using the other Galaxy resources so much easier. It takes maybe an hour? SO worth it.

Then:

- Copy your original two uploaded BAMs into a brand new history and then purge your original history (you don’t need any of that since you are starting over from a correct data extraction step).
- If you have more BAMs from other samples, also load those into your new history. You can process them all together.
- Then put the BAMs into a collection folder (a list of similar datasets), then extract the reads.
- Remember to do read QA on any raw read data. Assembly is very sensitive to read quality! You won’t like the results if you skip this, assuming the job will even process with raw data (very good chance it won’t).
- Now, with prepared HiFi reads, you can think about any reference data and any short reads you may want to incorporate.
- Then you can think about how to assemble: choosing the correct tool for your read types and reference data (if any).
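Alongside a dedicated QA tool, a quick length summary of the extracted reads can be a useful sanity check before assembly. The sketch below computes the read count, total bases, and read N50 (the same kind of figure Flye prints as “Reads N50”) from FASTQ text. It assumes four lines per record, and the function names are hypothetical.

```python
def fastq_lengths(fastq_lines):
    """Sequence lengths from a 4-line-per-record FASTQ."""
    return [len(line.strip()) for i, line in enumerate(fastq_lines) if i % 4 == 1]


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Example call (hypothetical file name):
# with open("reads.fastq") as handle:
#     lengths = fastq_lengths(handle)
# print(len(lengths), sum(lengths), n50(lengths))
```

Comparing these numbers between samples, and against what the sequencing provider reported, is a cheap way to spot an extraction step that dropped or duplicated reads.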

Hope this helps!

kdsung · September 27, 2024, 2:39pm

I will follow your instruction. Thank you so much for your great help!!!