About tutorial files and their sources

I’m trying to practice the tutorial titled ‘Whole transcriptome analysis of Arabidopsis thaliana,’ but I have a few questions. In the tutorial, the files below were used:

annotation_AtRTD2.gtf https://zenodo.org/record/4710649/files/annotation_AtRTD2_19April2016.gtf.gz
transcriptome.fasta https://zenodo.org/record/4710649/files/transcriptome_AtRTD2_12April2016.fasta.gz
star_miRNA_seq.fasta https://zenodo.org/record/4710649/files/star_miRNA_seq.fasta
mature_miRNA_AT.fasta https://zenodo.org/record/4710649/files/mature_miRNA_AT.fasta
miRNA_stem-loop_seq.fasta https://zenodo.org/record/4710649/files/miRNA_stem-loop_seq.fasta

Imagine I would like to do the analysis in the tutorial for a different plant species. Assume that I have found miRNA and RNA-seq files from NCBI, but what about the files similar to those mentioned above? Where can I get them? Which platforms provide these data?

Lastly, imagine I need to use SRA files like SRR26587422 and SRR26587423. In rule-based file uploads, it is necessary to input corresponding links for the SRA files. How am I going to get these links (I am not asking specifically for these links but generally for any SRA files)?

Hello @aaak

These are good questions, and I hope the information below helps!

Reference datasets: annotation, fasta

Sources of common annotation for items such as known gene bounds, transcribed regions, and various regulatory features can be found at UCSC, GenBank, Ensembl, GENCODE, and others (including species-specific websites). When working with a model organism, you’ll likely see more convergence and overlap between these data providers. Which one to use is a matter of scientific preference, along with compatibility considerations with respect to analysis tool choices and any downstream browsers you might want to incorporate.

With that context: miRNA references are a bit more specialized. I would suggest reviewing current publications to see what others used, plus any reviews that compare alternative sources (when there happens to be more than one). Some might be proprietary, others public, or there might be a single lab focusing on this annotation discovery. This means these resources may or may not be available for a particular species, so you’ll need to investigate.

Sequence Read Archives (SRA)

For an overview description of SRA and how to navigate, search and access the files → Home - SRA - NCBI

In practical use, this usually involves searching for a publication you are interested in, learning the identifiers involved, then retrieving the reads for analysis. These tutorials have examples of the process, but other tutorials may include parts of those methods (or route to Zenodo for downsampled versions).
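One possible way to go from run identifiers to links, offered as a minimal sketch and not as an official Galaxy or NCBI method: the European Nucleotide Archive (ENA) mirrors SRA runs, and its file report can list compressed FASTQ locations in a `fastq_ftp` column. The sketch assumes such a report was already saved as tab-separated text with `run_accession` and `fastq_ftp` columns. The function names, the accessions `SRR0000001` and `SRR0000002`, and the `ftp.example.org` paths are hypothetical placeholders, not real file locations.

```python
import csv
import io


def links_from_filereport(report_text, scheme="ftp://"):
    """Turn an ENA-style file report (TSV text) into (name, url) rows."""
    rows = []
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    for record in reader:
        run = record["run_accession"]
        # Paired-end runs list several files separated by semicolons.
        paths = [p for p in record["fastq_ftp"].split(";") if p]
        for index, path in enumerate(paths, start=1):
            # Add a scheme only when the report left it out.
            url = path if "://" in path else scheme + path
            # Number the files only when a run has more than one.
            name = run if len(paths) == 1 else f"{run}_{index}"
            rows.append((name, url))
    return rows


def as_tabular(rows):
    """Join (name, url) rows into tab-separated lines for pasting."""
    return "\n".join("\t".join(row) for row in rows)


# Hypothetical report text: placeholder accessions and placeholder paths.
SAMPLE_REPORT = (
    "run_accession\tfastq_ftp\n"
    "SRR0000001\tftp.example.org/SRR0000001_1.fastq.gz;"
    "ftp.example.org/SRR0000001_2.fastq.gz\n"
    "SRR0000002\tftp.example.org/SRR0000002.fastq.gz\n"
)

if __name__ == "__main__":
    print(as_tabular(links_from_filereport(SAMPLE_REPORT)))
```

The resulting two-column table (name, URL) is the kind of input that could be pasted into Upload → Rule Based Uploader, assuming the listed URLs are reachable from the Galaxy server you use.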

Why downsampled data? Practical reasons. Smaller files process faster in a tutorial setting. Also, cleaner data can bring the focus back to the larger scientific goals of the tutorial, although you’ll find some tutorials with a focus on technical considerations, as well as many embedded comments about those across all of them. The creation of these “smaller yet representative” data subsets was usually done by the tutorial authors: scientists and educators who applied their domain experience.

Dear @jennaj,

Thank you for your prompt answer! I am very familiar with the tutorial and have been using all the databases you have been referring to. Please allow me to elaborate on my problem in more detail so your answers will be more specific to my questions.

As you know, when an RNA-seq analysis is conducted on Galaxy, many files are produced depending on the data and tools used. To reduce this mess, we label the data. There are different labeling processes, and I am very familiar with them as well. One of these labeling methods is particularly useful. This method helps us see the labels not only in the history but also in the drop-down lists of the input sections of any tool. Therefore, when the data is further processed, there is no need to determine which data belongs to a specific file because they are also labeled. However, to label data like this, as you understand from the tutorial, we need links to the datasets. Please check the tutorial link and the picture I took from it so that you will understand me better.

image

I generally work with NCBI SRA data. Yesterday, I contacted them and asked if it was possible to have the links so I could use the rule-based uploader. I was told that NCBI does not offer such services and to contact the Galaxy project team! Consequently, my question remains the same: How am I going to find links for SRA files so I can use the rule-based uploader? I have tried all the methods that came to mind, but I either encountered an error or the FASTQ files did not appear in my history, since none of the links available for NCBI SRA datasets point to .gz files!

Regarding my second question, ChatGPT has identified some sources of files used in the tutorial titled “Whole Transcriptome Analysis of Arabidopsis thaliana”. For example, the two files listed below were obtained from the following link: https://ics.hutton.ac.uk/atRTD/

annotation_AtRTD2.gtf: https://zenodo.org/record/4710649/files/annotation_AtRTD2_19April2016.gtf.gz
transcriptome.fasta: https://zenodo.org/record/4710649/files/transcriptome_AtRTD2_12April2016.fasta.gz

However, I am still looking for other files including star_miRNA_seq.fasta, mature_miRNA_AT.fasta, and miRNA_stem-loop_seq.fasta. Additionally, I am interested in conducting similar analyses with datasets from rice. How can I find the corresponding files for rice to conduct both miRNA and RNA-seq analyses?

Thank you for your help in advance !!!

Hi @aaak

Thanks for explaining.

If you organize your own data into a collection first, you can add group tags using Tool Panel → Collection Operations → Apply Rules.

For where to source the annotation for rice, try asking at the forums where other people who work on rice discuss analysis topics. This forum is focused on using the Galaxy application, so you’ll find many people who can help with that here, but not so much for narrower scientific topics.

When a link is not defined, the rule-based uploader mentioned above will not work since it is based on the links that will be provided. Additionally, collection operations do not generate labels for selections from drop-down menus. I just used rice as an example, hoping to learn something, since my field of expertise is agricultural crops. I am ending my search for the answer to my question. Thank you for your answers

You don’t need a URL/link when using Apply Rules.

The labels can be added with that same tool. The content can come from a tabular file with the metadata that you want to create tags from (values in the tabular file turn into group labels).
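As a rough illustration of such a tabular file, here is a minimal sketch that writes one row per collection element, followed by the value that should become a group label. The sample names, the condition values, and the helper `build_tag_table` are hypothetical, and the two-column layout is an assumption that would need to match the rules you define in Apply Rules.

```python
import csv
import io


def build_tag_table(samples):
    """Return tab-separated text: element identifier, then the group value."""
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for identifier, condition in samples:
        # Each row must be matched to one element, so reject duplicates.
        if identifier in seen:
            raise ValueError(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        writer.writerow([identifier, condition])
    return buffer.getvalue()


# Hypothetical sample names and conditions, for illustration only.
SAMPLES = [
    ("sample_A", "control"),
    ("sample_B", "control"),
    ("sample_C", "treated"),
]

if __name__ == "__main__":
    print(build_tag_table(SAMPLES), end="")
```

Saving the output as a tabular dataset in the history gives Apply Rules a source for the values, so no URL is involved at any point.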

Yes, this is what I understood. Which species the files are from is not part of this yet, since it is just data organization. In short, it is putting the files into folders and tagging them.

Ok. If you change your mind you can come back for more help :slight_smile:

Check it out and see for yourself whether I need it or not!

I have no problem with adding the tags to the history. What I want is to be able to see the tags in a drop-down list while using any tool. This is explained in the tutorial (Using Galaxy and Managing your Data / Hands-on: Name tags for following), but it cannot be used without a data link…

Hi @aaak

Thanks for posting the screenshot! It is helpful, and I see where the confusion is coming from.

There are two ways to “apply rules” to data.

  1. With Upload → Rule Based Uploader
    • this is the view in your screenshot
  2. With Collection Operations → Apply Rules
    • this is a different but very similar tool found in the tool panel, and is in my screenshot below

Galaxy homepage → Click on the ‘house’ icon (Tools and Current History) → Collection Operations (tool group) → Apply Rules (tool)

Screenshot

Using the two is nearly the same. The only difference is that one includes a URL for data loading, and the other uses a collection in your history that you have already created (with data already loaded). This includes adding in name: and group: tags to your data, which will propagate “into labels” on tool form drop-down select menus, and anything else that collection tags do.

Hopefully this helps :slight_smile: and you can ask more questions.

I will give it a try and inform you about my experience soon. Thanks for your answers
