Human genome 'primary assembly' as reference for mapping?

lakshmi_s · June 7, 2021, 11:31am

Hello,
I am a beginner in NGS analysis; kindly help.
Earlier I was using http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz as the reference annotation and the Galaxy platform's built-in human genome 38 as the reference genome for mapping (as suggested in various tutorials).

But the GENCODE helpdesk recommended that I use the files with the comprehensive annotation on the primary assembly.
For example:
http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.primary_assembly.annotation.gtf.gz

http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.primary_assembly.genome.fa.gz

So my question is: what should I use? What is the significance of primary_assembly? Does that make my earlier analysis all wrong?

David · June 7, 2021, 1:57pm

Welcome @lakshmi_s!

The README at Ensembl states that

In the case of human and mouse, the GTF files found here are equivalent to the GENCODE gene set.

Gencode FAQ:

What is the difference between GENCODE GTF and Ensembl GTF?

The gene annotation is the same in both files. The only exception is that the genes which are common to the human chromosome X and Y PAR regions can be found twice in the GENCODE GTF, while they are shown only for chromosome X in the Ensembl file.

In addition, the GENCODE GTF contains a number of attributes not present in the Ensembl GTF, including annotation remarks, APPRIS tags and other tags highlighting transcripts experimentally validated by the GENCODE project or 3-way-consensus pseudogenes (predicted by Havana, Yale and UCSC). See our complete list of tags for more information.

Please note that the Ensembl GTF covers the annotation in all sequence regions whereas GENCODE produces a similar file but also a GTF file with the annotation on the reference chromosomes only.

Which are the reference chromosomes?
The reference chromosomes are those in the primary genome assemblies, ie. chromosomes 1 to 22, X and Y in human; chromosomes 1 to 19, X and Y in mouse. The mitochondrial chromosome is also considered as part of the reference chromosomes. Some GENCODE files contain annotation on reference chromosomes only, thus excluding other sequence regions as unlocalized and unplaced scaffolds, assembly patches and alternate loci (haplotypes).
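
To see which sequence regions a given GTF actually annotates, one option is to tally its first column and compare the names against the reference chromosomes described above. The following is a minimal sketch, not an official GENCODE or Ensembl tool. The helper names (`count_seqnames`, `split_reference`, `summarize_gtf`) are made up for illustration, the name list is human-only, and it assumes GENCODE's `chr`-prefixed names (`chr1`, `chrM`) or Ensembl's bare names (`1`, `MT`).

```python
import gzip
from collections import Counter

# Human reference chromosome names in both naming styles (assumption):
# GENCODE uses "chr1".."chr22", "chrX", "chrY", "chrM";
# Ensembl uses "1".."22", "X", "Y", "MT".
_AUTOSOMES = [str(i) for i in range(1, 23)]
REFERENCE_NAMES = (
    {"chr" + name for name in _AUTOSOMES + ["X", "Y", "M"]}
    | set(_AUTOSOMES + ["X", "Y", "MT"])
)


def count_seqnames(gtf_lines):
    """Count GTF records per sequence name (column 1), skipping '#' header lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts


def split_reference(counts):
    """Separate reference chromosomes from all other sequence regions."""
    reference = {n: c for n, c in counts.items() if n in REFERENCE_NAMES}
    other = {n: c for n, c in counts.items() if n not in REFERENCE_NAMES}
    return reference, other


def summarize_gtf(path):
    """Open a plain or gzipped GTF and return (reference, other) record counts."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return split_reference(count_seqnames(handle))
```

Calling `summarize_gtf` on a downloaded annotation file returns two dictionaries; any names in the second one are sequence regions beyond the reference chromosomes (scaffolds, patches or alternate loci, depending on the file).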

Did the GENCODE helpdesk explain why you should use those specific files rather than the Ensembl ones? Or maybe they are not ruling out the Ensembl files.

lakshmi_s · June 9, 2021, 6:18am

Dear sir,
Thanks for the reply.

Actually, I mailed them specifically to ask which GTF file I should download for analysing RNA-seq data using Galaxy (there were a lot of files in their download section and I was confused). I was unable to download from Ensembl (Index of /pub/release-104/gtf/homo_sapiens/; here too there are many files) or UCSC in the first place. That is how I reached the GENCODE website, and they suggested that I use the 'primary_assembly' files for RNA-seq analysis. They did not say that I should use only their files.
I have already used the GRCh38.primary_assembly.genome.fa.gz and gencode.v38.primary_assembly.annotation.gtf.gz files from this link, Index of /pub/databases/gencode/Gencode_human/latest_release/, for analysis. Can you please tell me whether these files contain all the information (reference chromosomes, mitochondrial genome, patches, scaffolds, haplotypes, etc.)? Also, are these files updated to support Ensembl's new changes to gene naming?
Thanks in advance
Lakshmi S
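
One way to check for yourself which sequences a genome FASTA contains is to list its header lines. This is only a rough sketch: the function names (`fasta_sequence_names`, `read_fasta_names`) are hypothetical, and it assumes the sequence name is the first whitespace-delimited token after `>`, which is the usual FASTA convention.

```python
import gzip


def fasta_sequence_names(lines):
    """Return the sequence names from FASTA header lines, in file order.

    The name is taken as the first whitespace-delimited token after '>'.
    """
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def read_fasta_names(path):
    """Open a plain or gzipped FASTA file and list its sequence names."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fasta_sequence_names(handle)
```

For example, `read_fasta_names("GRCh38.primary_assembly.genome.fa.gz")` would list every sequence in the downloaded file, so you can see directly whether anything beyond chromosomes 1 to 22, X, Y and the mitochondrial chromosome is present. Reading a whole human genome this way takes a little while, since every line is scanned.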

David · June 9, 2021, 6:37pm

I’m 99% sure this is defined in the README files, but I can’t confirm it now.

I don’t know about this. Maybe someone else here at the Galaxy Help or Gencode team can answer this better.

lakshmi_s · June 10, 2021, 4:47am

OK, thank you.

David · June 10, 2021, 2:53pm

You’re welcome.
Please let us know when you get an answer from the GENCODE folks, so that more people with the same questions can get help too.

lakshmi_s · June 12, 2021, 6:38am

The GENCODE helpdesk says: "The gene names in the GENCODE files are normally up-to-date with Ensembl. There may be a few differences in the latest release though, since clone-based names have been deprecated in Ensembl, being replaced with the ENS stable ids. Clone-based names used to be assigned to genes and transcripts where there was no name provided by external sources such as HGNC or EntrezGene. I just checked the GENCODE files and some of these genes do not have the expected ENS stable id as gene name but an old name that was stored in our internal annotation database. I have made a note of this to ensure that these genes have matching names with Ensembl in the next GENCODE release."
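
To get a rough idea of how many genes in a GTF have no external name, one could scan the `gene` records and flag those whose `gene_name` is missing or is itself an ENSG stable id. The sketch below is an illustration under assumptions, not the helpdesk's method: the function names are hypothetical, it assumes the usual `key "value";` attribute layout in column 9, and it does not try to recognise old clone-based names, which have no single pattern.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000223972.5";
_ATTR = re.compile(r'(\w+) "([^"]*)"')
# A gene_name that is just a human Ensembl gene stable id.
_ENSG = re.compile(r"^ENSG\d+")


def gene_names(gtf_lines):
    """Yield (gene_id, gene_name) per 'gene' record; gene_name is None if absent."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(_ATTR.findall(fields[8]))
        yield attrs.get("gene_id"), attrs.get("gene_name")


def unnamed_genes(gtf_lines):
    """Return gene_ids whose gene_name is missing or is an ENSG stable id."""
    return [
        gene_id
        for gene_id, name in gene_names(gtf_lines)
        if name is None or _ENSG.match(name)
    ]
```

Running `unnamed_genes` over the lines of both an Ensembl and a GENCODE GTF from matching releases and comparing the two lists would show where the naming differs.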

David · June 12, 2021, 3:13pm

@lakshmi_s Thank you and thanks to GENCODE