I have a Trinity assembly with more than 200,000 sequences. I wanted to run it through Blast2GO after running TransDecoder. However, after completing TransDecoder, I found that Blast2GO is no longer available on Galaxy, so I ran BLASTN separately. Because my assembly contains far more than 200,000 sequences, I asked earlier at “usegalaxy-eu” and was told that, since my data is so large, BLASTN might never finish running. I was advised to ask here which tool I should use for further analysis. Could you please tell me how to annotate the CDS or the assembled nucleotide sequences? Which tool should I use?
I would suggest splitting the query sequences into smaller batches. If you output tabular results from BLASTN, those can be concatenated afterwards.
Tools involved:
- Split file to dataset collection
- BLASTN
- Concatenate datasets tail-to-head or Collapse Collection into single dataset in order of the collection
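For anyone who wants to see the same split, run, and concatenate idea outside Galaxy, here is a minimal Python sketch. It is not how the Galaxy tools are implemented; the function names (`read_fasta`, `split_into_batches`, `concatenate_tabular`) and the batch size are hypothetical and only illustrate the logic.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_into_batches(records, batch_size):
    """Yield consecutive lists of at most batch_size records (order preserved)."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def concatenate_tabular(chunks):
    """Join per-batch tabular outputs (lists of lines) in batch order, dropping blank lines."""
    merged = []
    for chunk in chunks:
        merged.extend(line for line in chunk if line.strip())
    return merged


if __name__ == "__main__":
    # Tiny in-memory example; a real assembly would be read from a file.
    fasta = [">t1\n", "ACGT\n", ">t2\n", "GGCC\n", ">t3\n", "TTAA\n"]
    for i, batch in enumerate(split_into_batches(read_fasta(fasta), batch_size=2)):
        print(f"batch {i}: {[name for name, _ in batch]}")
```

Each batch would be searched as its own job, and the per-batch tabular outputs joined at the end with something like `concatenate_tabular`.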
You might need to experiment to see how many collection elements (files) are needed to break the data into jobs that will run on the public clusters. Also, be careful with the BLASTN parameters: it is very easy to “blow up” the results by making the match criteria too permissive. You can always filter your results and run BLASTN again on a smaller set of target sequences if you are interested in sub hits (this gets rid of reads that only capture non-specific hits).
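As a rough illustration of filtering tabular results, here is a small Python sketch. It assumes the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `filter_hits` and the default thresholds are hypothetical, not recommended values; pick cutoffs that suit your data.

```python
def filter_hits(lines, max_evalue=1e-5, min_identity=90.0):
    """Keep tabular BLAST hits at or below max_evalue and at or above min_identity.

    Assumes the standard 12-column tabular format, where column 3 is the
    percent identity and column 11 is the e-value. The thresholds are
    illustrative defaults only.
    """
    kept = []
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column row
        pident = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and pident >= min_identity:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    rows = [
        "q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t350\n",
        "q1\ts2\t75.0\t40\t10\t0\t1\t40\t1\t40\t0.5\t30\n",
    ]
    for hit in filter_hits(rows):
        print(hit)
```

In Galaxy itself the same kind of filtering can be done on the tabular output with a column filter tool rather than a script.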
Hope this helps!
Thank you. I will try to follow these steps. If I face any problem, I will contact you further. Please assist me.
I have split the file into a collection of 50 datasets. Then I ran BLASTN with 1 file, but the BLASTN jobs are still running. The job is yellow. When will it finish? Is it running OK?
It sounds like the jobs are executing. They are processed like jobs from any other tool and turn green once done. Since you are running a collection, the jobs are processed individually and will be in different states until they are all done. FAQ: Understanding job statuses
However, after talking with the person from the EU server who helped you before, it came up that running BLAST against a large, public, mixed reference is probably not going to produce the results you are most interested in.
To get your assembled transcripts annotated, the annotation tools described in this tutorial may be a better choice. Hands-on: De novo transcriptome assembly, annotation, and differential expression analysis / Transcriptomics. You may also want to scroll up to the prior section, which covers assembly, if you haven’t done any post-assembly quality filtering yet.