Missing gene names in DropletUtils/RaceID output

Hello, I have gone from RNA STARSolo through DropletUtils and RaceID with a single-cell 10X dataset, based on these tutorials:

I seem to have missed a step where the genes should be labelled, though, as the heat maps etc. from RaceID all have ‘ENSG’ numbers rather than gene names. Is this at the DropletUtils step? Could you please advise?

Hi Carolyn,
Yes, most likely you lost the gene symbols in DropletUtils. The gene information file produced by STARSolo (RNA STARSolo on bla bla: Genes raw) contains the gene ID in the 1st column and the gene symbol in the 2nd column, and DropletUtils takes the gene IDs rather than the gene symbols to construct its output matrix. You could cut c2 from the “RNA STARSolo on bla bla: Genes raw” file and use that for DropletUtils. The problem is that gene symbols are not always unique, so they have to be made unique before they are passed to DropletUtils. This can be done with an AWK program, but it can also be solved using the anndata tools. A possible solution is the following:

  • First use Import Anndata. Please check out the Anndata step in this tutorial. The important parameters are Variables index: gene_symbols and ticking Make the variable index unique by appending '-1', '-2'?. This will create an Anndata object with unique gene symbols as var_names.
  • Then extract the gene symbols from that Anndata object using the Inspect Anndata tool with the parameter What to inspect?: Key-indexed annotation of variables/features (var). This will extract the unique gene symbols, but the output starts with a header line (with “gene_ids” in the second column).
  • Remove that unwanted header with the Select last tool, using the parameters Operation: Keep everything from this line on and Number of lines: 2.
  • Then run DropletUtils using the newly created file for the Genes List parameter. The resulting matrix should have gene symbols.
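
If you would rather do the renaming outside Galaxy, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not what the Galaxy tools run internally: the helper names make_unique and rewrite_genes_tsv and the file paths are made up for this example, and it assumes a tab-separated genes file with the gene ID in column 1 and the gene symbol in column 2, as described above. The first occurrence of a symbol is kept as-is and later repeats get '-1', '-2', and so on.

```python
def make_unique(names, sep="-"):
    """Return names with repeats suffixed -1, -2, ... (first occurrence unchanged)."""
    taken = set()    # every name already emitted
    counts = {}      # how many suffixes have been tried for each base name
    result = []
    for name in names:
        candidate = name
        # Keep bumping the suffix until the candidate is not already in use.
        while candidate in taken:
            counts[name] = counts.get(name, 0) + 1
            candidate = f"{name}{sep}{counts[name]}"
        taken.add(candidate)
        result.append(candidate)
    return result


def rewrite_genes_tsv(in_path, out_path):
    """Copy a genes TSV, replacing column 2 (gene symbol) with unique symbols.

    Assumes: no header line, tab-separated, gene ID in column 1 and
    gene symbol in column 2. Any further columns are passed through.
    """
    with open(in_path) as fh:
        rows = [line.rstrip("\n").split("\t") for line in fh if line.strip()]
    symbols = make_unique([row[1] for row in rows])
    with open(out_path, "w") as fh:
        for row, symbol in zip(rows, symbols):
            fh.write("\t".join([row[0], symbol] + row[2:]) + "\n")


if __name__ == "__main__":
    # Toy input: the symbol "A" appears three times.
    print(make_unique(["A", "B", "A", "A"]))
```

Calling rewrite_genes_tsv("genes.tsv", "genes_unique.tsv") (both paths hypothetical) would write a copy of the file with unique symbols. Which column DropletUtils actually reads for the row names is not something this sketch checks, so please verify the output before relying on it.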

Hmm, pretty interesting. Anyway, thanks @pavanvidem for providing the solution.