I am working with bunch of datasets and one of them wont map to hg38 reference genome. I attached the fastq formats which i think there is a problem with that. I tried every thing else( changing references, or mapping options)
it seems i should do some thing to fix the fastq formats.
Also, It is miRNA-seq dataset.
do you get an error for the mapping step, or reads do not map to the genome? If the job failed, check the error message. Run FastQC on file with reads. If the reads are all unmapped, check several sequences through UCSC Blat search, to make sure the reads have a mach to the human genome. Alternatively, run Blast search against the human genome for several sequences.
The reads are longer than miRNA and presumably contain adapter sequences. You can trim reads before mapping. I assume all other samples have similar untrimmed sequences.
Thanks to your response.
No the job was not failed.
Just the mapping stats was less than 1% and it is unusual.
I did trimmed the adapters but still the results were the same.
The best guess is a read quality problem. This accession is from an very early GEO release using an older sequencing protocol translated by NCBI into Illumina fastq format. The GEO data were known to be difficult to work with (always, not just now).
- Run FastQC on the reads – largest flags to me are the highly variable quality scores, the low % when deduplicated, plus over 5% of the reads contain a single “overrepresented” sequence.
- I started up a BLASTN with that top “overrepresented” sequence as a query versus wgs. It is included in the shared history below and is still running.
- Web BLAT at UCSC didn’t produce any hits for that “overrepresented” sequence string versus human but that could be due to many reasons (splicing, length, not actually being human). I still totally agree that is worth trying in general.
- Consider checking how well QA worked for you: FastQC before/after QA for this accession and between all the accessions you are using from this source to decide if the data has enough quality and length to be useable.
The fastq format does seem Ok.
- NCBI SRA will report back with slightly different fastq formats/read labels depending on how the data is retrieved.
- Several examples are in the test history below.
- Technically, if the @ and + identifier lines are the same per fastq record or if the @ line has content and the + doesn’t, the format should be interpreted correctly by tools.
- Problems have come up when those lines mismatch – but that isn’t how NCBI reports the read data anymore (was present in an older subset of submissions, and now those are standardized too).
Test history if interested: Galaxy | Accessible History | test SRR6700254
If the reads are not usable, there is not much that can be done other than exclude them.