My problem is about converting my whole genome sequencing data (BAM files) to FASTA/FASTQ format. I did the sequencing with Illumina and now have genomes from several bacterial samples. I'd like to look at four genes and their upstream and downstream regions to see if there are any changes compared with the NCBI reference. Later, I'd also like to build a phylogenetic tree from these genes across different strains. The first step, as I've heard, is to convert the data into paired-end (PE) reads and do the genome assembly. I've tried this, but the output of the PE step is still a .bam file, which gives only a few bases when converted to FASTA. If I try to convert the whole sequencing data of one sample (BAM file) to FASTA so that I can copy out the gene sequence, the dataset is too large and cannot be converted. Do you have a hint for me? I don't know how to deal with this.
This is a very broad question, and not easy (or even possible) to answer with a simple reply. There are many steps needed to achieve your goal. I can outline some steps to give you an idea of what may be needed, or as inspiration, but this is not exactly what you need to do for your project (I simply don't know all the details). So, off the top of my head, and I hope someone corrects me if I am wrong, you could start with:
QC (can be done with FastQC)
Primer trimming and/or quality trimming (there are many tools for this, and quality trimming is not always necessary)
Then you have two choices: you can map your reads against a known reference or do a de novo assembly. Because you already mention that you want to compare with other sequences in NCBI, I think mapping would be a good option. For mapping you generally do:
Annotation can be done with tools like Prokka or Bakta or (maybe even better) by using the existing annotation information of your chosen reference, often available as a GFF3 file.
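To make the GFF3 suggestion a bit more concrete, here is a minimal Python sketch of how a gene's coordinates could be read from GFF3 text and padded with flanking bases (for example 1000 bp on each side). The function names `gene_region` and `region_string`, the gene names `abcA` and `abcB`, the contig name, and the example lines are all hypothetical. It assumes the gene is listed as a `gene` feature with a `Name` or `gene` attribute, which is common in NCBI annotation but not guaranteed.

```python
def gene_region(gff3_text, gene_name, flank=1000, contig_lengths=None):
    """Return (seqid, start, end, strand) for a gene plus flanking bases.

    Coordinates are 1-based and inclusive, as in GFF3.
    contig_lengths is an optional {seqid: length} dict used to clamp the end.
    """
    contig_lengths = contig_lengths or {}
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only look at well-formed 'gene' features
        # Column 9 holds key=value pairs separated by ';'
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        if gene_name not in (attrs.get("Name"), attrs.get("gene")):
            continue
        seqid, strand = cols[0], cols[6]
        start = max(1, int(cols[3]) - flank)  # do not run off the contig start
        end = int(cols[4]) + flank
        if seqid in contig_lengths:
            end = min(end, contig_lengths[seqid])  # nor off the contig end
        # Note: for a '-' strand gene, "upstream" lies at the higher coordinates.
        return seqid, start, end, strand
    raise KeyError(f"gene {gene_name!r} not found")


def region_string(seqid, start, end):
    """Format a region as 'seqid:start-end'."""
    return f"{seqid}:{start}-{end}"


# Hypothetical two-gene annotation, for illustration only.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "contig1\tRefSeq\tgene\t1500\t2400\t.\t+\t.\tID=gene-0001;Name=abcA\n"
    "contig1\tRefSeq\tgene\t300\t900\t.\t-\t.\tID=gene-0002;Name=abcB\n"
)

seqid, start, end, strand = gene_region(EXAMPLE_GFF, "abcA")
print(region_string(seqid, start, end), strand)
```

A `seqid:start-end` string like this is the usual way to restrict a tool to one region of a mapped BAM or a reference FASTA, so the same coordinates can be reused for every sample mapped to that reference.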
For doing an assembly there are also many options. It may help to check out this page: Assembly / Tutorial List
Hi @bacterial_dna – The only thing I would add is some technical help for that first step. Did the sequencing center deliver the Illumina WGS reads in BAM format to start? That is pretty common, and getting the reads out into fastq datasets is a single step.
Search the tool panel at a UseGalaxy server (https://galaxyproject.org/ → Regions) with some keywords, something like “bam to fastq”, to find appropriate tools. Either of the two tools below is a good choice. If one fails, try the other one, or try different settings. If you get errors, you can share back your history for more help.
bedtools Convert from BAM to FastQ
SamToFastq extract reads and qualities from SAM/BAM dataset and convert to fastq
The reads won’t be mapped yet, so the file is not really a full BAM yet but a special class of “reads-only BAM”. You want all the data to start with. Once extracted, organize your reads, then do QA and all the other downstream steps.
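For anyone curious what such a conversion does under the hood, here is a minimal Python sketch. It works on the plain-text SAM view of the records (a real BAM is compressed binary, so this is not a replacement for the Galaxy tools above). The function name `sam_to_fastq` and the example reads are hypothetical, and it assumes every record stores its sequence and qualities (no `*` placeholders).

```python
# Translation table for complementing bases (case preserved).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_text):
    """Convert SAM text records into a list of FASTQ record strings."""
    records = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # header lines start with '@'; read names never do
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary/supplementary lines (duplicate reads)
        if flag & 0x10:
            # Read was stored reverse-complemented: restore the original
            # orientation of both the bases and the quality string.
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & 0x40:
            name += "/1"  # first read of a pair
        elif flag & 0x80:
            name += "/2"  # second read of a pair
        records.append(f"@{name}\n{seq}\n+\n{qual}\n")
    return records


# Hypothetical pair of mapped reads, for illustration only.
EXAMPLE_SAM = (
    "@SQ\tSN:ref\tLN:100\n"
    "r1\t83\tref\t10\t60\t4M\t=\t1\t-13\tAACG\tIIH#\n"
    "r1\t163\tref\t1\t60\t4M\t=\t10\t13\tGGTA\tABCD\n"
)

print("".join(sam_to_fastq(EXAMPLE_SAM)))
```

The point of the sketch is that the output is one short record per read. That is expected at this stage: a long gene-length sequence only appears after assembly or consensus calling.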
No, I did the sequencing myself, and a bioinformatician gave me the BAM data. I haven't worked with that format before.
The BAM→FASTA tool is not really working, because the BAM file is too large. When I use only a part of the genome, just a few bases appear, much shorter than the gene located in that part, and I think I mostly get the single reads in one file.
It seems like your BAM contained sequencing reads! This is what is expected.
From here, you can do things like:
assemble the reads into contigs (consensus sequences based on read evidence)
compare the reads to known genomes to identify variants
classify the reads against known species
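To illustrate what "consensus sequence based on read evidence" means, here is a toy Python sketch. It is not an assembler: it assumes the reads are already placed at known 0-based positions with no insertions or deletions, and simply takes a majority vote per position. The function name `consensus` and the example reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads, length, min_depth=1):
    """Majority-vote consensus from (start, sequence) pairs.

    start is the 0-based position of the read's first base.
    Positions covered by fewer than min_depth reads are reported as 'N'.
    """
    columns = [Counter() for _ in range(length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos][base] += 1  # count each base seen at this position
    result = []
    for counts in columns:
        if sum(counts.values()) < min_depth:
            result.append("N")  # not enough read evidence here
        else:
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Four short overlapping reads; one of them carries a mismatch at position 3.
reads = [(0, "ACGT"), (2, "GTTA"), (2, "GATA"), (4, "TAC")]
print(consensus(reads, length=8))
```

Short reads that overlap are merged into one longer sequence, and a single disagreeing read is outvoted. Real tools also weigh base qualities and handle indels, which this sketch ignores.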
For your question here:
This falls into an assembly protocol. The topics at this forum with the assembly tag are good resources. One prior topic is here:
Please give that a review and let us know if we can help more! Later on, should you run into tool issues, you can start up new topics for community help.
I've tried many times to convert. What I did was use the BAM→FASTA tool, setting the coordinates to the position of the gene I want to analyze. That way I was able to load it into the tool; otherwise it would have been too large. What I got were single reads in a FASTA file, but no long consensus sequence. So I tried to start the assembly and used the tool to make paired ends, and I had the same problem there: I got a file with many small pieces, but no long sequence. Is there any way to get a consensus sequence out of the BAM file, so that I have one long sequence for a gene (or in my case for four genes), plus maybe a region of 1000 bp upstream and downstream, which I can analyze in programs such as MEGA X or SnapGene (like with Sanger sequencing)? I just need a complete FASTA for these genes, and I couldn't find any way to get a complete sequence.
Is there somebody out there who can help me with that?
I have bacterial DNA sequenced with Illumina NovaSeq1000. I did the practical part, but I received the .bam data already processed by a bioinformatician (the owner of the machine), already mapped against the reference genome from NCBI. My project is about different Enterobacteriaceae: I want to take a closer look at four genes and compare their genetic patterns as well as their promoter regions within the same species and with other species, and later build a tree and maybe find conserved regions. But I don't have the FASTA data to proceed with the programs I know from Sanger sequencing. There I would know how to go on.
This protocol generates Whole Genome Sequencing reads, often termed WGS. The software can produce reads, but also some of the downstream data files like mapping results and variant calls. It depends on what choices were made.
Your BAM file will always include the original fastq reads. It may contain just the reads, or it may also contain the mapping results. It now sounds like you have the latter.
The reference genome used for the mapping will matter, since you will need a copy of it to run the BAM through a variant calling tool.
Did the bioinformatician also give you a copy of the reference genome fasta? Or, do you know the accession identifiers for it? If not, we may be able to locate it anyway. And, you could also just do the mapping all over again (using the reads in the BAM – I can help with this again).
Where to look: the “header” or top portion of your BAM will likely specify the genome used, in some special data lines that start with the @ character. The goal here is to identify the reference genome used, then to source it from NCBI and get it into your Galaxy history. This will allow you to use it with the next steps.
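As a rough guide to reading those header lines, here is a minimal Python sketch that pulls the reference sequence names and lengths (`@SQ` lines) and the recorded program command lines (`@PG` lines) out of header text copied from the Raw view. The function name `parse_sam_header`, the accession `NC_000000.1`, and the example header are hypothetical. Real headers vary, and the `@PG` line is optional.

```python
def parse_sam_header(header_text):
    """Collect reference sequences (@SQ) and programs (@PG) from SAM header text."""
    references, programs = [], []
    for line in header_text.splitlines():
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        tag, *fields = line.split("\t")
        if tag == "@CO":
            continue  # free-text comment line, no TAG:value pairs
        pairs = dict(f.split(":", 1) for f in fields if ":" in f)
        if tag == "@SQ":
            # SN = sequence name (often an accession), LN = sequence length
            references.append((pairs.get("SN"), int(pairs.get("LN", 0))))
        elif tag == "@PG":
            # CL = command line, if recorded; fall back to the program ID
            programs.append(pairs.get("CL") or pairs.get("ID"))
    return references, programs


# Hypothetical header followed by one alignment line, for illustration only.
EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:NC_000000.1\tLN:5000000\n"
    "@PG\tID:bwa\tPN:bwa\tCL:bwa mem ref.fa r1.fq r2.fq\n"
    "r1\t0\tNC_000000.1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)

refs, progs = parse_sam_header(EXAMPLE_HEADER)
print(refs)
print(progs)
```

If the `SN` value looks like an NCBI accession, that is usually enough to find the matching reference FASTA. If it is a generic name such as `chromosome`, the sequence length and any recorded command line are the next clues to check.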
The graphic you posted first is from a display application tool in Galaxy. If you go to this file again, and click on the eye icon again, you will be able to toggle into the Raw data tab, and be able to see the header in plain text. Would you like to post back a screenshot of that? Try to get all of the header lines and the first few data lines underneath (2 or 3 of those is enough). It is ok if this is several screenshots.
Worst case, we can help you to back up and run from step 1. What you have is enough to do this, and we can solve the current extraction error since I think I see what was likely going wrong.
So, please post back some screenshots of the BAM header so we can locate your reference genome. You could also generate a history share link and post that back into a reply since that will make our advice quicker and more specific! If you would rather share your history in a chat message, I am going to start one up now. I would really like to get this solved for you. What you want to do is definitely possible in Galaxy, including generating the types of output you could export for the other applications!