Sequence to gene name

Hi all!

I have a list of ~5000 sequences of length 80, and I want to obtain gene symbols for them. Initially I used R and retrieved gene symbols for more than half of the sequences. As that process was very time-consuming, I thought I could use methods from RNA-seq, which I am not familiar with. I tried Galaxy, and the results I got are correct (I checked them against my partial R results, parts of which I had in turn manually verified using the UCSC Genome Browser).
What I do is as follows:
1- Upload the sequences in FASTA format.
2- Run HISAT2 with these options:
   Source for the reference genome: use a genome from history
   Select the reference genome: hg38 ncRNA+CDS
   Is this a single or paired library: single end
   Specify strand information: F
3- Pass the results to HTSeqCount with the following options:
   GFF = hg38.gtf
   Stranded = NO

I also tried StringTie instead of HTSeqCount, and Salmon instead of HISAT2, but with no success.

Hi @Nothing

Option A:

  1. Directly map with BLASTN to the genome, to get the mapping coordinates
  2. Filter the results so each has a unique hit
  3. Compare those coordinates with a gene/transcript annotation dataset’s coordinates (BED, GTF, etc)
  4. Rename the transcript identifiers with gene identifiers/symbols
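
In case it helps, here is a rough Python sketch of steps 2 and 3 of Option A. It assumes the BLASTN results are in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that the annotation has already been reduced to (chromosome, start, end, symbol) tuples with 1-based inclusive coordinates, as in GTF. The function names, the toy rows, and the gene names GENE_A/GENE_B/GENE_C are made up for illustration; they are not from any real dataset or tool.

```python
from collections import defaultdict

def unique_hits(blast_rows):
    """Keep only queries that have exactly one hit (step 2)."""
    by_query = defaultdict(list)
    for row in blast_rows:
        by_query[row[0]].append(row)  # column 0 is the query ID
    return {q: rows[0] for q, rows in by_query.items() if len(rows) == 1}

def genes_for_hit(hit, genes):
    """Return symbols of genes whose interval overlaps the hit (step 3)."""
    chrom = hit[1]  # column 1 is the subject (chromosome) ID
    # BLAST reports sstart > send for minus-strand hits, so sort them.
    start, end = sorted((int(hit[8]), int(hit[9])))
    return [
        symbol
        for g_chrom, g_start, g_end, symbol in genes
        if g_chrom == chrom and g_start <= end and start <= g_end
    ]

def annotate(blast_rows, genes):
    """Map each uniquely mapped query to its overlapping gene symbols."""
    return {q: genes_for_hit(hit, genes)
            for q, hit in unique_hits(blast_rows).items()}

# Toy data only: hypothetical hits and hypothetical gene intervals.
blast_rows = [
    "seq1 chr1 100.0 80 0 0 1 80 1000 1079 1e-30 150".split(),
    "seq2 chr1 100.0 80 0 0 1 80 5080 5001 1e-30 150".split(),  # minus strand
    "seq3 chr2 100.0 80 0 0 1 80 200 279 1e-30 150".split(),
    "seq3 chr2 98.0 80 1 0 1 80 900 979 1e-25 140".split(),     # second hit
]
genes = [
    ("chr1", 900, 2000, "GENE_A"),
    ("chr1", 4900, 6000, "GENE_B"),
    ("chr2", 100, 500, "GENE_C"),
]

symbols = annotate(blast_rows, genes)
print(symbols)  # {'seq1': ['GENE_A'], 'seq2': ['GENE_B']}
```

The linear scan over all genes is fine for ~5000 sequences against a small gene list, but for a full annotation an interval-based tool would likely be faster. Note that seq3 is dropped here because it has two hits; you may prefer to keep the best-scoring hit instead.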

Option B:

  1. Use the Jupyter Interactive Environment to run R (and other packages) directly in Galaxy. Launch it from an expanded dataset by clicking the "Visualize this dataset" icon.

Thank you very much for your response.