How to find NCBI Reference data?
Q: Where can I find the URL links to a reference genome FASTA, annotation, or GenBank file?
A: Use the NCBI Genome Assembly page for the organism of interest — for example:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000819615.1/
This page provides:
- Download links for genome sequences (FASTA)
- Annotation files (GFF, GTF)
- GenBank format files (used with SnpEff build)
- Direct links to the associated FTP directories
Screenshot of an NCBI genome assembly view for GCF_000819615.1
Screenshot of an NCBI genome FTP directory view for GCF_000819615.1
File name | Description |
---|---|
genomic.gbff.gz | Reference GenBank record (for SnpEff build) |
genomic.gtf.gz | Reference gene annotation (GTF format) |
genomic.fna.gz | Reference genome FASTA sequence |
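If you need links for several files, a small helper can assemble them from the accession. The sketch below is illustrative only: the function name `ncbi_ftp_urls` is hypothetical, the directory layout is an assumption based on how NCBI FTP paths commonly look, and the assembly name (`ASSEMBLY_NAME` here is a placeholder) must be copied from the assembly page. Always confirm the resulting links against the FTP directory linked from the assembly page before using them.

```python
def ncbi_ftp_urls(accession: str, assembly_name: str) -> dict:
    """Build candidate download URLs for the three files in the table above.

    Assumption: the FTP directory follows the pattern
    genomes/all/<prefix>/<3 digits>/<3 digits>/<3 digits>/<accession>_<assembly_name>/
    """
    prefix, rest = accession.split("_", 1)       # e.g. "GCF", "000819615.1"
    digits = rest.split(".")[0]                  # drop the version suffix
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    stem = f"{accession}_{assembly_name}"
    base = "https://ftp.ncbi.nlm.nih.gov/genomes/all/" + "/".join([prefix, *parts, stem])
    suffixes = ("genomic.gbff.gz", "genomic.gtf.gz", "genomic.fna.gz")
    return {suffix: f"{base}/{stem}_{suffix}" for suffix in suffixes}


# "ASSEMBLY_NAME" is a placeholder; replace it with the name shown on the assembly page.
for name, url in ncbi_ftp_urls("GCF_000819615.1", "ASSEMBLY_NAME").items():
    print(name, url)
```

Each printed URL can then be pasted into the Upload tool, one per line.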
How to get the data into Galaxy?
- Capture the URL link for your file(s)
- Paste into the Upload tool
- Click on Start using all default settings
- The files will be loaded into your current Active history
Standardizing the format
- With some tools, getting the data loaded is enough!
- With others, you may want to remove the description content from the FASTA `>` lines and/or remove the `#` header lines from the GTF.
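The two cleanup steps can also be done outside Galaxy before uploading. The following is a minimal sketch, assuming uncompressed text input read line by line; the function names `simplify_fasta_headers` and `drop_gtf_comments` and the sample records are hypothetical, made up for illustration.

```python
def simplify_fasta_headers(lines):
    """Keep only the identifier (text up to the first whitespace) on '>' lines."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "") + "\n"
        else:
            yield line


def drop_gtf_comments(lines):
    """Skip every line that starts with '#' (header and comment lines)."""
    for line in lines:
        if not line.startswith("#"):
            yield line


# Made-up sample records, for illustration only.
fasta = [">seq1 example description text\n", "ACGTACGT\n"]
gtf = ["#gtf-version 2.2\n", 'seq1\tsource\tgene\t1\t8\t.\t+\t.\tgene_id "g1";\n']

print("".join(simplify_fasta_headers(fasta)), end="")
print("".join(drop_gtf_comments(gtf)), end="")
```

Trimming the `>` lines to the identifier keeps sequence names in the FASTA matching the first column of the GTF, which is usually what downstream tools compare.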
How to use the data with tools or a workflow?
- Use the input option to Choose Reference Data from the History
- Select your dataset file.
- If the data requires a tool index (example: mapping tools), the index will be automatically created at runtime.
- If you need a custom database key for visualization options, you can create and assign one!
Resources
- Getting Data into Galaxy
- FAQ: How to use Custom Reference Genomes?
- FAQ: Working with GFF GFT GTF2 GFF3 reference annotation
- FAQ: Extended Help for Differential Expression Analysis Tools
- FAQ: Adding a custom database/build (dbkey)
- Assembly Version? See → Reference genomes at public Galaxy servers: GRCh38/hg38 example