Proteogenomics help needed: How to get genomic coordinates of identified proteins?

nathanieltay · May 11, 2022, 2:25pm

Hi,

I am trying to fetch the genomic coordinates of mass spec identified proteins. The process of identifying these proteins is as follows:

1. Align RNA-Seq paired-end read files to the human reference genome (input of 43 x 2 paired-end read files → output of 43 x BAM files).
2. Assemble transcripts via StringTie with reference to the Ensembl GTF (output = 43 x GTF files).
3. Merge all 43 GTFs into a single GTF file (output = 1 x merged.gtf file).
4. Extract the transcript sequences of merged.gtf using the human reference genome (output = 1 x merged.fasta file).
5. Translate the sequences in 3 frames (output = 1 x 3-frame-translated.fasta).
6. Split the translated sequences at every stop codon (output = 1 x protein_db.fasta).
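
For concreteness, steps 5 and 6 amount to something like the sketch below. This is only an illustration, not the exact tooling used in the pipeline: it assumes the standard genetic code, translates only the three forward frames of the transcript sequence, and uses hypothetical function names (translate_frame, split_at_stops, three_frame_pieces). It also records each peptide's nucleotide offsets within its transcript (0-based, end-exclusive), since those offsets are what would later have to be projected through the exons in merged.gtf to reach genomic coordinates.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TCAG x TCAG x TCAG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, frame):
    """Translate seq starting at offset `frame` (0, 1 or 2); unknown codons become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def split_at_stops(protein, frame):
    """Split a translated frame at '*' and keep transcript-relative nucleotide offsets.

    Returns a list of (peptide, nt_start, nt_end) tuples, 0-based and end-exclusive.
    """
    pieces = []
    aa_start = 0
    for segment in protein.split("*"):
        if segment:
            nt_start = frame + 3 * aa_start
            pieces.append((segment, nt_start, nt_start + 3 * len(segment)))
        aa_start += len(segment) + 1  # +1 skips the stop codon itself
    return pieces


def three_frame_pieces(seq):
    """Run both steps for frames 0, 1 and 2 of a single transcript sequence."""
    results = []
    for frame in range(3):
        for peptide, nt_start, nt_end in split_at_stops(translate_frame(seq, frame), frame):
            results.append((frame, peptide, nt_start, nt_end))
    return results


if __name__ == "__main__":
    # Toy transcript, made up purely for illustration.
    for record in three_frame_pieces("ATGGCCTAAATGTTT"):
        print(record)
```

Keeping the frame and nucleotide offsets next to each peptide (for example in the FASTA header) means the link between a protein_db.fasta entry and its transcript is not lost at step 6.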

I would now like to map the proteins from step 6 back to their genomic coordinates. I tried to follow this Proteogenomics tutorial, starting from the “Transcript Assembly” section. At the “Evaluate the assembly with annotated transcripts” section, I used merged.gtf from step 3 above as input. I then ran the “Translate transcripts” section, which takes merged.gtf and Homo_sapiens.GRCh38.dna.primary_assembly.2bit as input. Next, I tried to follow the instructions in the “Creating FASTA Databases” section, but I could not, because I do not have a genomic_mapping.sqlite database to use as input.

Can anyone advise how I might go about solving this problem? I am open to using non-Galaxy tools as well, although I have had much less success there.