Hi, I am using paired-end data. I downloaded the sra file from NCBI and uploaded it to the Galaxy server. Can I use this data directly for assembly, or do I first have to split the files and then use them?
It depends on which assembly tool you are using, but some do require that interleaved reads are split into forward/reverse first. None accept an sra archive directly (that I am aware of). QA/QC is also strongly recommended.
Please see these FAQs, in particular, the last one:
- How to format fastq data for tools that require .fastqsanger format?
- Understanding compressed fastq data (fastq.gz)
- Reformatting fastq data loaded with NCBI SRA
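To illustrate what "splitting interleaved reads into forward/reverse" means, here is a minimal Python sketch. It assumes plain 4-line fastq records that strictly alternate forward, reverse, forward, reverse. The function names and the tiny in-memory example are hypothetical and only for illustration; on Galaxy you would use an existing tool rather than a script like this.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # header, sequence, '+' line, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line fastq records from an iterable of lines."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError(f"malformed fastq record: {record!r}")
        yield record


def split_interleaved(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split records ordered R1, R2, R1, R2, ... into forward and reverse lists."""
    forward: List[Record] = []
    reverse: List[Record] = []
    for index, record in enumerate(read_fastq(lines)):
        (forward if index % 2 == 0 else reverse).append(record)
    if len(forward) != len(reverse):
        raise ValueError("odd number of records: input is not cleanly interleaved")
    return forward, reverse


# Hypothetical two-pair example; real data would be read from a file.
example = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
    "@read2/1", "GGCC", "+", "IIII",
    "@read2/2", "CCGG", "+", "IIII",
]
forward, reverse = split_interleaved(example)
print([record[0] for record in forward])
print([record[0] for record in reverse])
```

The sketch keeps everything in memory for clarity; for real read files you would stream the records to two output files instead.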
Also, there are two primary tools to extract fastq data from NCBI. The first will output data as it is formatted from NCBI. The second will split any pairs and place all data into Dataset Collections, and can extract fastq sequences from sra data that has been Uploaded (probably the option you need).
- Download and Extract Reads in FASTA/Q format from NCBI SRA
- Faster Download and Extract Reads in FASTQ format from NCBI SRA
Thank you for the reply. I used Faster Download and Extract Reads in FASTQ format from NCBI SRA and it yielded 4 files. However, according to the metadata, the run contains data from two different tissues, but I only see paired-end data with separate forward and reverse files. Where can I find the data for each tissue separately, given that it is presented as a single SRA file at NCBI?
I’m not sure that I understand.
- Did you enter a distinct accession for each distinct sample? The form accepts individual or lists of accessions.
- Or, did you extract data from the SRA archive that you had already loaded? Which archive (URL)?
I used one accession, which I extracted using Faster Download and Extract Reads in FASTQ format from NCBI SRA. Does that single accession contain data for both tissues? Below is the metadata for reference.
Run, Assay Type, AvgSpotLen, Bases, BioProject, BioSample, BioSampleModel, Bytes, Center Name, Collection_Date, Consent, DATASTORE filetype, DATASTORE provider, DATASTORE region, dev_stage, Experiment, geo_loc_name_country, geo_loc_name_country_continent, geo_loc_name, Instrument, Isolate, Isolation_source, Library Name, LibraryLayout, LibrarySelection, LibrarySource, Organism, Platform, ReleaseDate, Sample Name, sample_type, SRA Study, tissue
SRR4299636, CLONE, 298, 9557003942, PRJNA344442, SAMN05823576, Plant, 3924705078, TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, 2015-11-27, public, sra, "ncbi,gs,s3", "gs.US,s3.us-east-1,ncbi.public", Young leaf, SRX2194189, China, Asia, China, NextSeq 500, Leaves, soil, PRJNA344442SAMN05823576, PAIRED, cDNA, TRANSCRIPTOMIC, Azadirachta indica, ILLUMINA, 2017-04-26T00:00:00Z, Neem advewntitioius root, Tropic plant, SRP090539, Neem adventitious root in MS medium
admin edit: text reformat
The RUN accession represents a single sequencing event. In this case, it is paired-end fastq data. At the highest level, the BioProject accession only includes one sequencing run. Same for the SRA Study accession. There is no information indicating that more than one tissue was sequenced (just Neem adventitious root in MS medium).
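If it helps to check the sample-related fields yourself, the metadata posted above can be paired up column by column. This is a small Python sketch that assumes the posted text is one comma-separated header row followed by one value row for the single run; the variable names are only illustrative.

```python
import csv

# Header and value rows copied from the metadata posted in this thread.
header_line = (
    "Run, Assay Type, AvgSpotLen, Bases, BioProject, BioSample, BioSampleModel, "
    "Bytes, Center Name, Collection_Date, Consent, DATASTORE filetype, "
    "DATASTORE provider, DATASTORE region, dev_stage, Experiment, "
    "geo_loc_name_country, geo_loc_name_country_continent, geo_loc_name, "
    "Instrument, Isolate, Isolation_source, Library Name, LibraryLayout, "
    "LibrarySelection, LibrarySource, Organism, Platform, ReleaseDate, "
    "Sample Name, sample_type, SRA Study, tissue"
)
value_line = (
    'SRR4299636, CLONE, 298, 9557003942, PRJNA344442, SAMN05823576, Plant, '
    '3924705078, TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, 2015-11-27, '
    'public, sra, "ncbi,gs,s3", "gs.US,s3.us-east-1,ncbi.public", Young leaf, '
    'SRX2194189, China, Asia, China, NextSeq 500, Leaves, soil, '
    'PRJNA344442SAMN05823576, PAIRED, cDNA, TRANSCRIPTOMIC, Azadirachta indica, '
    'ILLUMINA, 2017-04-26T00:00:00Z, Neem advewntitioius root, Tropic plant, '
    'SRP090539, Neem adventitious root in MS medium'
)

# skipinitialspace lets the quoted, comma-containing fields parse as one value.
header, values = csv.reader([header_line, value_line], skipinitialspace=True)
record = dict(zip(header, values))

# Print the fields that describe the run layout and the sampled material.
for field in ("Run", "LibraryLayout", "dev_stage", "Isolate", "Sample Name", "tissue"):
    print(f"{field}: {record[field]}")
```

There is a single value row, so the table describes one run (SRR4299636) with one paired-end library.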
Thanks again for the reply. Their research manuscript mentions both neem leaf and neem adventitious root transcriptomes, so I thought that they had sequenced both leaf and root samples. I have attached the image; please have a look. Can you suggest any way to compare differential gene expression between these two tissues?
Publications should clearly reference all data.
The SRR accession appears to be a match for the AR reads, which are the only reads submitted for the BioProject.
- Contact the author(s) for clarification.
- Consider the publication, and associated data, flawed and avoid it. The SRR submission has some issues:
  - Library Name is oddly annotated.
  - Sample Name has a typo.
  - ReleaseDate is after the paper’s publication date.