How to use the TPM data from StringTie, Sailfish, or kallisto quant

Dear all, I’m trying to get TPM data from FASTQ files (mouse, Illumina, paired-end, 150 bp, 20-30M). I tried StringTie, Salmon, Sailfish, and kallisto quant, and I did get results, but I have a couple of questions about the data:

  1. StringTie reports multiple TPM values for one gene. My understanding is that these are reads mapped to different sites of the same gene (the output includes the chromosome coordinates). Should I just sum all the TPM values to get the final TPM for each gene?
    |Rb1cc1| cov 143.169418| FPKM 8.402312| TPM 17.528645|
    |Rb1cc1| cov 0.007920| FPKM 0.000465| TPM 0.000970|
    |Rb1cc1| cov 0.353770| FPKM 0.020762| TPM 0.043313|
    |Rb1cc1| cov 0.006247| FPKM 0.000367| TPM 0.000765|
    |Rb1cc1| cov 0.608617| FPKM 0.035718| TPM 0.074515|
    |Rb1cc1| cov 1.899041| FPKM 0.111451| TPM 0.232505|
    |Rb1cc1| cov 0.131482| FPKM 0.007716| TPM 0.016098|
    |Rb1cc1| cov 0.025351| FPKM 0.001488| TPM 0.003104|
    |Rb1cc1| cov 0.081178| FPKM 0.004764| TPM 0.009939|
    |Rb1cc1| cov 0.661685| FPKM 0.038833| TPM 0.081012|

  2. The data from Salmon, Sailfish, and kallisto quant are more confusing. The target IDs look like the following. Which one should I use to count TPM? Or is it an issue with the reference genome I used (GRCm38)?
    ENSMUST00000130201.7|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072660.1|Mrpl15-203|Mrpl15|1894|protein_coding|
    ENSMUST00000156816.6|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072659.1|Mrpl15-206|Mrpl15|4203|protein_coding|
    ENSMUST00000045689.13|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072661.1|Mrpl15-201|Mrpl15|497|nonsense_mediated_decay|
    ENSMUST00000115538.4|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072664.1|Mrpl15-202|Mrpl15|910|processed_transcript|
    ENSMUST00000192286.1|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000127355.1|Mrpl15-207|Mrpl15|4600|retained_intron|
    ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-205|Mrpl15|1569|protein_coding|
    ENSMUST00000132625.1|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072663.1|Mrpl15-204|Mrpl15|654|retained_intron|

  3. Honestly, this is the first time I have worked with RNA-seq data, so I always used the default settings for all the programs. Given the sequencing parameters (mouse, Illumina, paired-end, 150 bp, 20-30M), can anyone give me some advice about those parameters?

Thank you so much, everyone!

Dear @yanjiezhu,
For both 1. and 2., what is happening is that your tools quantify each individual transcript, e.g., ENSMUST00000130201.7 and ENSMUST00000156816.6 are two different transcripts of your gene Mrpl15. Case (1) is a bit more confusing: StringTie shows just the gene name, but each row probably stands for one transcript of the gene. Some tools, like Sailfish, let you set the specific ID they should use for aggregation (look at the Sailfish option "The key for aggregating transcripts during gene-level estimates"). Other tools do not support such an option; to count per gene only, you would need an annotation file with just one row per gene.
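To make the transcript-versus-gene relationship concrete, here is a minimal Python sketch that splits the pipe-delimited target IDs from the question and groups transcripts by gene. It assumes the IDs follow the GENCODE transcript FASTA header layout (transcript ID, gene ID, two Havana IDs, transcript name, gene name, length, biotype); the field labels and the function name `parse_target_id` are my own, not part of any tool.

```python
from collections import defaultdict

# Three of the target IDs quoted in the question.
TARGET_IDS = [
    "ENSMUST00000130201.7|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072660.1|Mrpl15-203|Mrpl15|1894|protein_coding|",
    "ENSMUST00000156816.6|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072659.1|Mrpl15-206|Mrpl15|4203|protein_coding|",
    "ENSMUST00000045689.13|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072661.1|Mrpl15-201|Mrpl15|497|nonsense_mediated_decay|",
]


def parse_target_id(target_id):
    """Split one pipe-delimited header into named fields (assumed layout)."""
    fields = target_id.rstrip("|").split("|")
    return {
        "transcript_id": fields[0],
        "gene_id": fields[1],
        "havana_gene": fields[2],
        "havana_transcript": fields[3],
        "transcript_name": fields[4],
        "gene_name": fields[5],
        "length": int(fields[6]),
        "biotype": fields[7],
    }


# Group transcript IDs under the gene name they belong to.
transcripts_by_gene = defaultdict(list)
for tid in TARGET_IDS:
    record = parse_target_id(tid)
    transcripts_by_gene[record["gene_name"]].append(record["transcript_id"])

for gene, transcripts in transcripts_by_gene.items():
    print(gene, len(transcripts), transcripts)
```

All three rows share the same gene ID and gene name, which is why they show up as separate lines for one gene.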

Have a nice day and best wishes,
Florian

Hi Florian,
Thank you so much for the kind explanation. You are right: in the StringTie output each row stands for one transcript of the same gene. I can see the details like this:
chr1 StringTie transcript 3206523 3216968 1000 - . gene_id MSTRG.2 transcript_id "ENSMUST00000159265.1" ref_gene_name Xkr4 cov "0.011633" FPKM "0.000683" TPM "0.001424"
chr1 StringTie transcript 3214482 3671850 1000 - . gene_id MSTRG.2 transcript_id "ENSMUST00000070533.4" ref_gene_name Xkr4 cov "0.117157" FPKM "0.006876" TPM "0.014344"
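
For rows like these, one possible way to get a per-gene value is to add up the transcript TPMs that share a gene name. This is a rough Python sketch, not something StringTie does for you here; it assumes summing is the aggregation you want, and the helper names (`parse_attributes`, `sum_tpm_per_gene`) are made up for illustration. The regex accepts straight or curly quotes, since pasted output often ends up with either.

```python
import re
from collections import defaultdict

# The two Xkr4 rows from this post (curly quotes kept on purpose to show
# that the parser tolerates them).
GTF_LINES = [
    'chr1 StringTie transcript 3206523 3216968 1000 - . gene_id MSTRG.2 transcript_id “ENSMUST00000159265.1” ref_gene_name Xkr4 cov “0.011633” FPKM “0.000683” TPM “0.001424”',
    'chr1 StringTie transcript 3214482 3671850 1000 - . gene_id MSTRG.2 transcript_id “ENSMUST00000070533.4” ref_gene_name Xkr4 cov “0.117157” FPKM “0.006876” TPM “0.014344”',
]

# key, whitespace, optional opening quote, value, optional closing quote
ATTR_RE = re.compile(r'(\w+)\s+["“”]?([^\s"“”;]+)["“”]?')


def parse_attributes(attr_text):
    """Turn the attribute column into a dict of key -> value strings."""
    return dict(ATTR_RE.findall(attr_text))


def sum_tpm_per_gene(lines):
    """Sum TPM over transcript rows, keyed by ref_gene_name (or gene_id)."""
    totals = defaultdict(float)
    for line in lines:
        # The first 8 columns are fixed; everything after is the attributes.
        parts = line.split(None, 8)
        if len(parts) < 9 or parts[2] != "transcript":
            continue
        attrs = parse_attributes(parts[8])
        gene = attrs.get("ref_gene_name", attrs.get("gene_id"))
        totals[gene] += float(attrs["TPM"])
    return dict(totals)


gene_tpm = sum_tpm_per_gene(GTF_LINES)
for gene, tpm in gene_tpm.items():
    print(f"{gene}\t{tpm:.6f}")
```

Because `split(None, 8)` splits on any whitespace, the same code should work whether the columns are separated by tabs (as in a real GTF file) or by spaces (as pasted here).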

And I checked the option you suggested for Sailfish. The default setting is gene_id; I changed it to gene_name and re-ran it. The job is still waiting, so I’m not sure what I will get.
It looks like what I got from these programs is actually a TPM for each transcript? Do you have any suggestions for getting the TPM for each gene?

Thanks,
Yanjie

Dear @yanjiezhu,
There are different approaches to get the count per gene.

A simple way is to download (or modify) your annotation so that each row stands for one gene, and then run StringTie (or another counting tool) with it.

Another way is to use the output of Salmon or kallisto and pick the transcript with the highest TPM.
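
As a rough illustration of that second approach, the Python sketch below keeps, for each gene, only the transcript with the largest value. The input is a list of (gene, transcript, TPM) tuples; as stand-in data it reuses the two Xkr4 rows quoted earlier in the thread rather than real Salmon or kallisto output, and the function name `top_transcript_per_gene` is hypothetical.

```python
def top_transcript_per_gene(rows):
    """Return {gene: (transcript_id, tpm)} for the highest-TPM transcript."""
    best = {}
    for gene, transcript, tpm in rows:
        # Keep the first transcript seen, then replace it only if a later
        # transcript of the same gene has a strictly higher TPM.
        if gene not in best or tpm > best[gene][1]:
            best[gene] = (transcript, tpm)
    return best


# Stand-in transcript-level values taken from the Xkr4 rows in this thread.
rows = [
    ("Xkr4", "ENSMUST00000159265.1", 0.001424),
    ("Xkr4", "ENSMUST00000070533.4", 0.014344),
]

best = top_transcript_per_gene(rows)
for gene, (transcript, tpm) in best.items():
    print(gene, transcript, tpm)
```

Note that this reports only the dominant isoform, so the value can never exceed the sum over all transcripts of the gene.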

Cheers,
Florian

Hi @yanjiezhu,
in the case of Salmon, if you provide an annotation file, it generates two outputs: a quantification file (with transcript-level information) and a gene quantification file (with gene-level information). The latter is probably the one you are interested in.

Regards