UMI-tools deduplicate

I am analyzing 3′ Tag RNA-seq data using UMIs. I used UMI-tools extract or fastp to extract the UMIs from the read sequence and append them to the read name. However, during the mapping step with STAR, the extracted UMI is removed from the read name, causing the deduplication tool to fail.

I have tried multiple solutions, but none have resolved the issue.

Thank you in advance for your help.

Welcome @Dania_Shikhani

Would you like to share your history? See: How to get faster help with your question. With the example, we can suggest modifications to the protocol to avoid duplicated read naming.

Let’s start there! It is hard to guess more for this one. :slight_smile:

Here you go.

In this history, I used fastp for extracting UMIs as RX tags (https://usegalaxy.eu/u/daniash/h/i-wish-it-s-the-final-trial )

In this history, I used UMI extract tools for extracting UMIs (https://usegalaxy.eu/u/daniash/h/soft-tissue-sarcoma-trial-5-step-of-umis-using-umi-extract-tools)

If you need further information, please let me know.

Thank you in advance

Hi @Dania_Shikhani

Great! Very helpful!

I see the problem now. Yes, RNA STAR is removing everything after the / delimiter in the sequence identifier lines! The STAR tool has an option to control this (a “keep everything” option), but it is not available in the Galaxy wrapper (yet!). I’ll make an enhancement request, but you don’t need to wait for that cycle to complete.

Instead, I suggest building the identifier reformatting into the very start of your workflow, when the accessions are extracted from SRA and written into the history.

I created a small test here with my suggestions below. Technical details are sometimes better with an example! This worked already but a clean example seemed good too!

1. Faster Download and Extract Reads in FASTQ → Advanced Options

  • Default format: @$ac.$sn/$ri
  • Suggested modified format: @$ac.$sn
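To make the difference between the two formats concrete, here is a small Python sketch. It only illustrates the idea: the UMI string `ACGTACGT`, the helper names, and the way the UMI is appended are assumptions for the example, and `drop_after_slash` is a toy model of the truncation described above, not STAR's actual code.

```python
from string import Template

# Identifier templates copied from the tool form.
# $ac = accession, $sn = spot name, $ri = read index.
DEFAULT_FORMAT = Template("@$ac.$sn/$ri")
SUGGESTED_FORMAT = Template("@$ac.$sn")


def add_umi(name: str, umi: str, prefix: str = "UMI") -> str:
    """Assumed tagging style: append ':<prefix>_<umi>' to the read name."""
    return f"{name}:{prefix}_{umi}"


def drop_after_slash(name: str) -> str:
    """Toy model of the truncation reported in this thread (hypothetical)."""
    return name.split("/", 1)[0]


fields = {"ac": "SRR19543607", "sn": "1", "ri": "1"}
for label, template in (("default", DEFAULT_FORMAT), ("suggested", SUGGESTED_FORMAT)):
    tagged = add_umi(template.substitute(fields), "ACGTACGT")
    print(label, tagged, "->", drop_after_slash(tagged))
```

With the default format the UMI sits after the `/` and is lost; with the suggested format there is no `/` in the name, so the UMI survives.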


Then, continuing, this is what I needed to do to get the rest to run through successfully. If this is too much extra you can stop reading here! :slight_smile:

2. fastp → Read modification Options → UMI processing

I suggest continuing to use a prefix but avoiding special characters. I used UMI.

3. RNA Star

No special options! The UMI is not being interpreted or added as a tag to the BAM output.

For UMI tags added to the BAM, consider using RNA StarSolo instead.

4. UMItools → Umi extract method

Any tool in the suite can be configured the same way to interpret a UMI in the sequence names. Your sequence names have an _ (underscore) between the label and the UMI string (from fastp).
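As a rough illustration of what "interpreting a UMI in the sequence name" means, the sketch below takes everything after the last separator as the UMI. The function name and the example read name are made up for this example; it is not the tool's own code.

```python
def umi_from_read_name(read_name: str, separator: str = "_") -> str:
    """Return the text after the last separator, treated here as the UMI."""
    if separator not in read_name:
        raise ValueError(f"no {separator!r} in read name: {read_name}")
    return read_name.rsplit(separator, 1)[1]


# Hypothetical read name: accession.spot, then the UMI label and the UMI string.
print(umi_from_read_name("SRR19543607.1:UMI_ACGTACGT"))
```

If the separator is missing from the read name (for example because the UMI was stripped earlier in the workflow), there is nothing to extract, which matches the kind of failure reported at the top of this thread.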




Please give this a review and let us know how it works for you! :slight_smile:

I tried your suggestion, but an error occurred while downloading the data.

Here is the history (Galaxy)

Hello @Dania_Shikhani

You can try again whenever SRA rejects a query. Their server just gets busy and it impacts everyone, not just Galaxy users.

The short answer is to try again! We know that these accessions are valid since they worked before.

I’m going to use this opportunity to explain the details for anyone else reading along. :slight_smile:

How to interpret an error from Faster Download and Extract Reads in FASTQ format from NCBI SRA

  1. Review the job logs. → FAQ: Troubleshooting errors

  2. Click on the i-info icon for one of the red datasets.

  3. Scroll down into the detailed Tool Standard Output (stdout) log. These are technical/processing errors discovered by the Galaxy wrapper.

  4. Also see the Tool Standard Error (stderr). This is where to find reports about the processing details discovered by the underlying tool. Examples are content and parameter issues.

  5. These sections expand if you click on them!

  6. If the stderr has content, go into the Error tab and see if the Galaxy Wizard can describe what is happening.

  7. The Wizard did answer this one correctly (there wasn’t a fastq file to sort into a collection) but the message could be clearer about why and what to do, so I’m glad you asked! We’ll get that tuned up!

  8. Example of what to review. Whenever this is seen, the problem is either with the accessions (do not exist) or the SRA service itself.

    stdout

    Downloading accession: SRR19543607…
    Failed to call external services.
    Prefetch attempt 1 of 3 exited with code 1
    Failed to call external services.
    Prefetch attempt 2 of 3 exited with code 1
    Failed to call external services.
    Prefetch attempt 3 of 3 exited with code 1
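
If you have many failed datasets, the same check can be scripted. This is only a sketch: it assumes the stdout lines look exactly like the ones pasted above, and the function and variable names are hypothetical.

```python
import re

ATTEMPT = re.compile(r"Prefetch attempt (\d+) of (\d+) exited with code (\d+)")

# Log text copied from the stdout shown above.
SAMPLE_STDOUT = """Downloading accession: SRR19543607…
Failed to call external services.
Prefetch attempt 1 of 3 exited with code 1
Failed to call external services.
Prefetch attempt 2 of 3 exited with code 1
Failed to call external services.
Prefetch attempt 3 of 3 exited with code 1
"""


def all_prefetch_attempts_failed(stdout: str) -> bool:
    """True when every allowed attempt is reported with a non-zero exit code."""
    attempts = [tuple(map(int, m.groups())) for m in ATTEMPT.finditer(stdout)]
    if not attempts:
        return False
    total = attempts[0][1]
    failed = {number for number, _, code in attempts if code != 0}
    return failed == set(range(1, total + 1))


print(all_prefetch_attempts_failed(SAMPLE_STDOUT))
```

A `True` result only says that every attempt failed; it cannot tell you whether the accession or the SRA service was the cause.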


What to do

  1. How to confirm that the accession is valid? Reviewing at NCBI is one way.
  2. How to confirm that the file is formatted correctly?
     • the datatype should be txt (FAQ: Changing the datatype)
     • one accession per line
     • extra whitespace (tabs, lines) will be stripped by our wrapper, but you could also clean it up with a tool like Convert delimiters to TAB followed by Cut to isolate a single column
  3. If this is all correct or this same query worked previously, you can proceed directly to trying again! Waiting 10-15 minutes is usually enough. :rocket:
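
If you prefer to check the accession list outside of Galaxy before uploading it, the formatting rules above can be sketched in a few lines of Python. The pattern for run accessions (SRR, ERR, or DRR followed by digits) is an assumption made for this example, and `SRR19543608` is a made-up second accession.

```python
import re

# Assumption: run accessions look like SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def clean_accessions(text: str) -> list[str]:
    """Strip whitespace, drop blank lines, and keep one accession per line."""
    cleaned = []
    for line in text.splitlines():
        token = line.strip()
        if not token:
            continue
        if not RUN_ACCESSION.match(token):
            raise ValueError(f"unexpected accession line: {token!r}")
        cleaned.append(token)
    return cleaned


print(clean_accessions("SRR19543607 \n\n\tSRR19543608\n"))
```

A line holding two accessions, or anything that is not a run accession, raises an error instead of being passed along silently.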

Please give this a try and see how it works now!

Note: I do see a problem with the final tool in my testing history above. Now that the tool is finding UMIs, it needs to know how to group them. How to group is a scientific decision for the protocol. I had used the exact same parameter as you were using, and the log message states that a different parameter combination is needed. I would try the suggestion! Once it works, you can modify a workflow to suit your goals (using my template or extracting your own!).