Issues with data format and running HiSat2

I’m currently in the middle of my uni project and I’m trying to follow an RNA-Seq HiSat pipeline to analyse reads obtained from NCBI Gene Expression Omnibus (GEO). I’ve got the data and tried to put it through HiSat. It didn’t work and came back with error messages, and I realised that I probably should have put it through FastQC/Trimmomatic first.
I tried that, but it didn’t accept or recognise any fastq files, and I couldn’t understand why until I looked at the data files. It seems that I had “” instead of normal fastq files. From what I’ve been told, the files I’ve got from GEO give mixed messages. The first part, “fastq.gz”, would represent raw NGS reads, which should either be in fastqsanger format or be convertible to fastqsanger.
If FastQC doesn’t run and won’t accept the files then it is unlikely they are in a fastq format. The second part “” suggests these are aligned sequences with gene counts.

So I’ve now been trying to determine whether these files are in fastq format or not. I tried to convert the files using FastQ Groomer, but it didn’t accept them. FastQC didn’t accept them either, so I’m assuming they’re not in fastq format. I tried running them through HiSat again and I just encounter the same errors as before. I’ve looked almost everywhere to see if I can fix this problem, and I’ve emailed again to ask what to do, as I’m relatively new to using this tool. Does anyone know how I can resolve this issue?

This was all done on my uni’s Galaxy server.

The filename suggests to me that it is something other than a fastq file. How did you get the files? Would it be possible for you to open the files on your local computer? That is, download them first, unzip (gunzip) them and just open one with a text editor.

So initially, I downloaded each of the files individually from GEO rather than getting the accession list and putting that into Galaxy. Okay, I will give that a try. What should I do once I open it with the text editor?

Okay, I’ve tried to open it on my computer. The files aren’t zipped, so I tried using the 7-Zip option and extracting the file. I tried to open it with a text editor but I get an error saying Windows cannot open this type of file.

I don’t know exactly what you did; something may have gone wrong, because you can basically open even zip files in a text editor if you want. Fastq files are just text files with a certain format, so if you open the file in a text editor and see that it has the fastq layout, you know it is a fastq file. Do you have a link to the page you are downloading from, or an accession?
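
If opening the file in a text editor is awkward, the same check can be scripted. The following is only a rough sketch, not part of any Galaxy tool: the function names `open_maybe_gzip` and `looks_like_fastq` are made up for illustration, and it assumes the common four-line-per-read fastq layout (a header starting with `@`, the sequence, a separator starting with `+`, and a quality string of the same length as the sequence). Older multi-line fastq files would fail this check even though they are valid.

```python
import gzip

# The first two bytes of any gzip-compressed file.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Open a file as text, decompressing it first if it is gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def looks_like_fastq(path, max_records=100):
    """Return True if the first records follow the four-line fastq layout.

    Only the first `max_records` records are inspected, so this is a quick
    sniff test rather than a full validation.
    """
    records_seen = 0
    with open_maybe_gzip(path) as handle:
        while records_seen < max_records:
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a blank line): stop checking
            sequence = handle.readline().rstrip("\r\n")
            separator = handle.readline()
            quality = handle.readline().rstrip("\r\n")
            if not header.startswith("@"):
                return False
            if not separator.startswith("+"):
                return False
            if len(sequence) == 0 or len(sequence) != len(quality):
                return False
            records_seen += 1
    return records_seen > 0
```

Calling `looks_like_fastq("my_download.fastq.gz")` (a hypothetical filename) should return `True` for a real fastq file, compressed or not, and `False` for something like a tab-separated table of gene counts.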

Link to GEO where I got the data: GEO Accession viewer
I was specifically using the RNA-Seq runs only, Accession No: GSE152547

Maybe I downloaded the files incorrectly, I’m not sure. I’ve cleared my history and downloaded them again, this time using the SRA (Sequence Read Archive) in GEO. It’s given me my paired-end data (all the data is here, 324 items), single-end data, other data and the fasterq-dump log. So far everything looked fine. I tried to run it through FastQC and it gave me an error message saying a certain read in the data collection doesn’t exist as it’s been deleted. I’m a little confused, seeing as I just finished downloading the data this morning. I checked the file in question and both the forward and reverse reads are there, no problem. But when I click on the forward read to view the file, an error message comes up informing me that ‘this file does not exist as it has been purged’.

I haven’t purged or deleted any of this data, so I’m confused as to why I’m facing problems now.

This sounds similar to another problem I helped with about a month ago.

Try this:

  1. Create a new history and give it a distinct name
  2. Download the data from SRA into that history
  3. Avoid using tools like “Export datasets to remote files source” while you are still processing any data in that history.

I’m going to close this topic out. If the problem can be reproduced again following the advice above, please ask a new question and include a shared history link for context. The data at that point would all be public anyway, and it would help us when reviewing for potential bugs.