I’m using Bowtie2 via Galaxy to align my RNAseq reads against a published transcriptome. If I want more stringent conditions for the alignment, what parameters I should adjust from the default?
Thank you very much
Bowtie2 already reports the “best” alignment(s) by default.
To end up with a set of “more stringent alignments” than the default, you can either tune parameters for the mapper, filter the result, or both. What type of data you are mapping (single-end/paired-end) will impact which mapping parameters or result filters can be applied.
To tune mapping parameters, see the tool form for a summary of advanced options and what each does. Tool form help is based on the manual, which is also worth reviewing: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
To filter the output result, retaining reads up a certain MAPQ value (30 indicates a unique alignment) and/or only properly-paired alignments (if using paired-end reads) are commonly applied. Example tool:
Bamtools >> Fiilter.
Thank you jennaj for your reply.
This is a paired end RNAseq. and I wish to use the aligner to eliminate reads that are originated from a contamination, so they won’t be included in the assembly. So i’m trying to align against the in silico transcriptome of the contaminant organism, and further use only the unaligned reads to the assembly.
but when I try to blast in NCBI random results from the ‘unaligned reads’ output of the Bowtie2 I still get results for the same contaminant. Moreover, if I just test and run again for the second time, the unaligned reads to the same in silico transcriptome of the contaminant, I still get high alignment rates. so something doesn’t make sense.
I’m not sure which parameter I should change, as there are so many options…
It sounds like you are not capturing enough hits with Bowtie2. Meaning, you need more permissive alignment results, not stricter.
Bowtie2 and a contaminate reference transcriptome target isn’t the best way to positively identify all contaminate reads. NGS mappers are designed to find and retain high-quality, same organism, matches. Any reads (contaminate or not) with lower quality or insufficient coverage won’t be retained as a positive hit.
If the target organism genome/transcriptome does not exist, then using a more permissive mapper to identify contaminates would be a better choice. Is there a reason why you are not using BLASTN for this step?
If the target organism genome does exist, do the filtering the other way around by mapping against it with a spliced-mapper like HISAT2. Contaminate reads won’t pass mapping criteria (along with any failed reads from the target organism).
- yes, the genomes are available for both the contaminate organism and the organism-of-interest. BUT the genome of the organism-of interest is considered as a draft genome and is not complete, that is why I preferred to do a de novo assembly, rather than aligning to the existing genome.
- I gave both Hisat2 and Bowtie2 a try, and run them against the contaminant in silico transcriptome, got the similar results.
- you reckon that BLASTN is a better tool for this? with the genome or in silico transcriptome?
but how should I set the threshold for a good match? eval<0.0005 / % coverage / % identity ? or change something from the default parameters before running the blastn?
Thanks for all your help!
what would you use than: BLASTN/HISAT2/BOWTIE2 with the reads? and what parameters / thresholds to achieve a more permissive alignments?
BLASTN as wrapped for Galaxy is already set to be very permissive (no minimum percent ID or coverage). E-value (p-value) is not a useful metric in my opinion, it is based on the query/target size and HSP-specific. I would stick with the default for that one: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
It may be too permissive, but that depends on how similar the two organisms are, whether it is important for you to remove all of the contaminates at the risk of losing any target genome reads that map to both (IF any do). Raise (strict, perfect complete matches) then lower the percent identity/coverage at increments, keeping in mind that contaminate reads could be degraded in some way, it depends on factors like what they represent and how they were introduced during library construction or during sequencing.
The more similar the two organisms, the more likely some cross-over mapping is to occur at some rate, but I wouldn’t expect HISAT2 to capture high MAPQ/Properly paired result that includes contaminant reads.
So, you’ll need to test out thresholds, examine the results, and make some decisions. This is not an exact process, especially since the target genome is in an unfinished state.
Example: You might choose to keep reads that map well to the target genome first (with HISAT2), and then evaluate the remainder of reads as potential contaminants based on the mapping results to the other transcriptome. You might also want to test different thresholds at this point, assemble, the re-evaluate how those longer assembled contigs map to both targets to make your final decision.
Thanks for all your advice!
I’m thinking maybe the way to go is to create a database of both the contaminant and the organism of interest and plus a close species to the organism of interest that its genome is more complete, and to use that to blastn my reads against. than filter only the desired ones.
But someone also recommended me on STAR aligner. The output I’m getting is only a bam file, any idea if there is an option to output also a fastq file with the mapped reads and a fastq with the unmapped reads? I could not find that option in the settings in Galaxy.
Try these tools to manipulate BAM results. Alternatives can be found with keyword searches at the top of the left tool panel:
Convert, Merge, Randomize BAM
Thank you jennaj.
I have assembled the reads using Trinity, after filtering them according to the blastn results. but now that I try to align back those filtered reads to the assembled transcriptome using Bowtie2 (to assess the quality of the assembly), I get an error message: " Error, fewer reads in file specified with -2 than in file specified with -1". I noticed that the number of reads in each forward doesn’t match the number of its reverse.
What shall I do??
The paired reads must be a match for the R1(forward) and R2(reverse) input to use
During your processing, if one read from a pair was filtered out with BLASTN (determined to be contamination) you will also need to filter out the other read from the pair before continuing.
Other tools (besides
Trimmomatic) that can create paired
FastQ Interlacerfollowed by
problem is these tools require an input of fastq. and my filtered read files are fasta (because I converted them fastq–>fasta to be able to blastn them). is there some other tool I can input with fasta files and will eliminate unpaired reads?
Thanks for all your help!
It would be better to gain back the Fastq formatted data to take advantage of the quality scores during mapping (and also the Trinity assembly – so you might want to re-do that step with the Fastq version). Plus, this allows use of the
Interlacer/Deinterlacer tools after.
Convert the original forward fastq and your new forward (contamination filtered) fasta data with
Fastq To Tabularand
Fasta To Tabular.
The sequence identifier for each tabular result will be in the first column. Join on those columns/identifiers using the tool
Join two files. Set the option for Output lines to be just the matching lines from the fastq input.
Convert the result back to Fastq with the tool
Tabular to Fastq.
Repeat for the reverse reads.
There are other ways to do the manipulation but this is the quickest way I know of.
I have just noticed that the option: " Rename sequence names in output file" was ticked as YES when I converted all the fastq read files to fasta. which means the identifiers won’t match between the filtered fasta files to the original fastq (after conversion to tabular).
any smarter ways how to solve this? other than repeating all the process…
Create fastq datasets with “neutral” quality score values with the tool
Combine FASTA and QUAL into FASTQ. Just enter the fasta dataset and leave the quality dataset input empty.
I would consider that a “last resort” option if you plan to use the reads again for other analysis purposes. Losing the quality scores greatly decreases the utility of the data.
I might try to join the two databases according to the sequence instead of the identifier, if it doesn’t work I will have to repeat, I hope what you suggest will work.
You have been a life saver, thanks so much!