I’m new to RNA-seq and I’m trying to use blastx on Galaxy to search a FASTA file of possibly novel lncRNAs against the NCBI nr database. I want to remove transcripts with significant homology to known proteins (e-value < 1e-10, target coverage > 80%, and identity > 90%) so that I keep only the transcripts most likely to be lncRNAs. I changed the e-value (expectation value) cutoff to 0.0000000001 and the query coverage to 80%, but I cannot find a setting for pident (percentage of identical matches). How do I set this? Also, when I run the tool with default settings, it runs for more than a day.
Hi @bart_joosten,
could you describe your pipeline so far? Do you have raw RNA-seq data or assembled transcripts? Which species do the samples come from?
In addition to @gallardoalba 's question -
yes, runtime can be substantial for this tool and will depend on how many input sequences you’re trying to match against the database. This is simply how it is.
Regarding your identity threshold: is % identity part of the output you’re getting? If so, you can use the standard Galaxy Filter (and possibly Text Processing) tools to filter the results after the search.
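Outside Galaxy, the same kind of filtering can be sketched in a few lines of Python. This is only a minimal sketch, assuming the output is tab-separated with the standard 12 BLAST tabular columns in their usual order; the function name `protein_like_queries` and the example rows are made up for illustration. Target coverage is not one of those 12 columns, so it is left out here.

```python
import csv
import io

# Assumption: standard 12-column BLAST tabular output, in this order.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def protein_like_queries(handle, min_pident=90.0, max_evalue=1e-10):
    """Return the query IDs that have at least one hit passing both thresholds."""
    flagged = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        if float(hit["pident"]) > min_pident and float(hit["evalue"]) < max_evalue:
            flagged.add(hit["qseqid"])
    return flagged


# Made-up example rows: only tx1 passes both thresholds.
example = io.StringIO(
    "tx1\tprotA\t95.0\t120\t6\t0\t1\t360\t1\t120\t1e-50\t250\n"
    "tx2\tprotB\t45.0\t80\t44\t0\t1\t240\t5\t84\t1e-12\t60\n"
    "tx3\tprotC\t98.0\t30\t1\t0\t1\t90\t1\t30\t0.001\t40\n"
)
print(protein_like_queries(example))
```

The returned IDs are the transcripts that look protein-like; everything else would be kept as lncRNA candidates.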
I can also add my two cents. As already mentioned, a long runtime is expected. The NCBI nt and nr databases are growing insanely fast. If I need to blast a large number of sequences, I always make a subselection first, for example by searching only against Homo sapiens sequences. But this may not be so easy to do in Galaxy.
As an alternative, you could check out the diamond tool. I don’t have experience with it, but they claim it is 2,500 times faster than blastx. I think the key to your question is what @gallardoalba is suggesting. You could also reduce your input, for example by removing duplicate sequences.
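Removing duplicate sequences before the search could look roughly like this. It is a minimal sketch that treats two records as duplicates only when their sequences are identical (ignoring case); the helper names `read_fasta` and `deduplicate` are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Made-up example: record "b" repeats the sequence of "a" and is dropped.
fasta = [">a", "ACGT", ">b", "acgt", ">c", "GG", "TT"]
for header, seq in deduplicate(read_fasta(fasta)):
    print(f">{header}\n{seq}")
```

This only catches exact duplicates; near-identical transcripts would need a clustering tool instead.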
@gallardoalba Hi, thanks for your reply. I’m trying to find novel lncRNAs. I started from FASTQ files from human samples and performed trimming (trimmomatic), alignment (hisat2), and assembly (stringtie), then used the merged assembly (stringtie-merge) in the FEELnc package to find potential new lncRNAs (output as a GTF file). I converted the GTF file of possible novel lncRNAs into a FASTA file and uploaded it to Galaxy to run blastx with the parameters I specified. As has been mentioned, the running time is very long (now 2 days) because my FASTA file is very large. When I tried to use blastx on the NCBI website (blastx: search protein databases using a translated nucleotide query), it would only allow files with a total query length of 100k, so I understand what’s causing the delay.
@wm75 Yes, you are correct. If I can get an output from blastx on Galaxy, I’ll filter it afterwards. Thanks for the suggestion.
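For reference, the last step of that filtering (dropping the protein-like transcripts from the FASTA file) could be sketched as follows. This assumes the set of flagged query IDs is already known and that each ID matches the first whitespace-delimited word of a FASTA header; the function name `drop_flagged` and the example data are made up for illustration.

```python
def drop_flagged(fasta_lines, flagged_ids):
    """Return the FASTA lines of all records whose ID is not in flagged_ids."""
    kept = []
    keep_record = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the record ID is the first word after '>'.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep_record = record_id not in flagged_ids
        if keep_record:
            kept.append(line)
    return kept


# Made-up example: tx1 had a significant protein hit, so only tx2 remains.
fasta = [">tx1 candidate", "ACGTACGT", ">tx2 candidate", "GGCCGGCC"]
print("\n".join(drop_flagged(fasta, {"tx1"})))
```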
@gbbio I like your suggestion of using the diamond tool! It has all the parameters that I want to use on blastx and should perform faster. I’m trying it right now on galaxy.
@gbbio I have tried diamond and it works well. However, by default diamond reports matching proteins from all taxa, and I am of course only interested in human ones. I’m having problems selecting the human taxon ID in the diamond tool: entering the human taxon ID 9606 gives an error in the diamond aligner tool (use --taxonmap parameter for the makedb command).
From what I understand, I have to build a database with the diamond makedb tool in Galaxy, using the NCBI nr protein database as a FASTA file plus several other files (taxonmap, taxonnames, taxonnodes). However, I have problems uploading these files. When I try to upload the taxonmap file through the “choose remote file” option (ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz), I get an error: Unsupported Media Type (415). Do you know how to solve this problem?