I am trying to educate myself on this NGS thing and have some questions regarding my data: paired-end RNA-seq. I have mouse tissue from knockout and control animals, 4 biological replicates each, and the libraries were prepared with TruSeq kits (single index). I read on the Illumina page (TruSeq Single Indexes) that for trimming I only need the following sequences:
Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Questions:
Should I use those 2 sequences for all my datasets?
Or do I only need to trim the index sequence, i.e. 6 bp? I know the index sequence for each sample, but then it seems inconvenient, since I could not trim all my datasets at once…
The aim of my sequencing is to find differential gene expression between knockout and control, as well as to identify splice variants. Is trimming still necessary? I read in some discussions that for that aim, if I align with RNA STAR, trimming is not necessary anymore.
Try using either Trimmomatic or Cutadapt for trimming. Both already include these TruSeq adapter sequences in their built-in adapter lists.
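If it helps to see what the tool is doing under the hood, here is a minimal command-line sketch of paired-end adapter trimming with Cutadapt using the two TruSeq sequences you quoted. The file names are placeholders for one of your samples, and the quality/length cutoffs are just common starting values, not a recommendation specific to your data:

```
# Trim the TruSeq Read 1 / Read 2 adapters from one paired-end sample.
# sample1_R1.fastq.gz / sample1_R2.fastq.gz are hypothetical input names.
cutadapt \
  -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  -q 20 --minimum-length 20 \
  -o sample1_R1.trimmed.fastq.gz \
  -p sample1_R2.trimmed.fastq.gz \
  sample1_R1.fastq.gz sample1_R2.fastq.gz
```

The Galaxy Cutadapt tool exposes the same settings through its form, so you can configure this once and run it across a paired collection of all your samples.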
This tutorial has more help about common QA steps. Some of these should be run even if you decide that the reads do not need trimming.
You can run QA on all of your datasets in batch if you use dataset collections and workflows. At a minimum, consider using collections, since these help to keep data organized. Workflows add faster, reproducible processing that you can layer in later (or adapt one of our workflows – most tutorials have at least one).
Please see our tutorials for example workflows that do exactly this. Some cover the QA steps and explain the reasoning behind each. You should at least run a tool like FastQC to make sure the read data is intact and fully uploaded.
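For reference, that check is roughly the following on the command line (file names are again placeholders); running it before and after trimming makes it easy to compare adapter content and quality between the two reports:

```
# Generate FastQC reports for one paired-end sample (hypothetical file names).
mkdir -p fastqc_reports
fastqc sample1_R1.fastq.gz sample1_R2.fastq.gz -o fastqc_reports/
```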
There is a lot of prior Q&A at this forum about the tools you will be using. Search with keywords – tool names, datatypes, error messages – to find it. I also added a few more tags to this post.
@jennaj great! Thank you for the direction, that really helps. One last question: my datasets’ database/build was set to mm10. I tried to change it to mm39 with the pencil icon and saved, but it doesn’t budge. Will this be an issue later on when I map the reads to mm39 as the reference?
Hi @zee_azizah – do you mean that the fastq reads you uploaded had a database assignment?
Reads can be associated with a particular sample or species but are not associated with a specific genome/build until they are mapped against a known genome build or are assembled into a novel genome/transcriptome.
To remove the database assignment, click the pencil icon (Edit Attributes), set the Database/Build menu back to unassigned, then save. Next time you load read data, leave the database unassigned during Upload.
It is best to leave the “database” metadata unassigned until a tool assigns it, or you are specifically labeling data that is already dependent on a database. Adding a “database” will not cause tool problems, but it could certainly lead to confusion.
reference genome (fasta, or an index) → has a database
reference annotation (gtf, gff3) → has a database
bed or other coordinate data (bam, vcf) → has a database
fastq reads → no database until they are mapped or assembled into something new/novel that you give a name to
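To make that last point concrete: outside Galaxy, mapping to mm39 would look something like the sketch below (the index path, file names, and thread count are placeholders, and it assumes a STAR index already built from the mm39 genome FASTA and annotation). The resulting BAM is the dataset that genuinely carries the mm39 database, because its coordinates refer to that build.

```
# Align one trimmed paired-end sample against a pre-built mm39 STAR index.
# /path/to/star_index_mm39 and the sample file names are hypothetical.
STAR --runThreadN 8 \
     --genomeDir /path/to/star_index_mm39 \
     --readFilesIn sample1_R1.trimmed.fastq.gz sample1_R2.trimmed.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix sample1_mm39_
```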
If you want to add extra labels that won’t get lost/replaced as tools rename datasets, consider adding a tag or #tag instead.
Examples: sample ID, read type/source, forward/reverse… anything that helps you to know what the data is.
The #hashed version flows down through tools.
Tags can be searched on within a history to subset the history view.
The managing data tutorials explain how to use these, including special use-cases when they can be used as an extra layer of metadata to “group” data in a way that tools can act on.