FASTA file begins with a greater-than sign (>) and not @

I have a FASTA file which begins with a greater-than sign (>) and not @. Hence, FastQC gives the following error:

Picked up _JAVA_OPTIONS: -Xmx28g -Xms256m
Failed to process AK1_clean_fa ID line didn’t start with ‘@’

How do I fix my FASTA file to begin with @? Thanks!

Hi @abhay03

FASTA files start with a > on the title line of each sequence. These data do not have quality scores, which are what most of the FastQC modules summarize. See: Datatypes - Galaxy Community Hub
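
If it helps to check which format a file is in before picking tools, here is a minimal Python sketch that looks only at the first character of the title line. The function name `guess_sequence_format` is hypothetical (it is not part of Galaxy or FastQC), and a real check would validate the whole record rather than one character.

```python
def guess_sequence_format(first_line: str) -> str:
    """Guess the format from the first non-empty line of a sequence file.

    Hypothetical helper for illustration only: it inspects a single
    character and does not validate the rest of the file.
    """
    if first_line.startswith(">"):
        return "fasta"  # FASTA title lines start with ">"
    if first_line.startswith("@"):
        return "fastq"  # FASTQ ID lines start with "@"
    return "unknown"
```

A file whose first line is `>seq1` would be reported as `fasta`, which is why FastQC, expecting `@`, rejects it.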

The tool Fasta Statistics can generate statistics and do some basic QA.
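
To give an idea of what basic statistics on a FASTA file look like, here is a rough Python sketch. It is not the implementation of the Fasta Statistics tool, and the names `parse_fasta` and `fasta_stats` as well as the choice of metrics are my own assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted lines."""
    title = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if title is not None:
        yield title, "".join(chunks)


def fasta_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute a few simple per-file statistics (illustrative metric set)."""
    lengths = []
    gc_count = 0
    for _title, seq in parse_fasta(lines):
        lengths.append(len(seq))
        upper = seq.upper()
        gc_count += upper.count("G") + upper.count("C")
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "total_length": total,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "gc_fraction": gc_count / total if total else 0.0,
    }
```

Usage would be something like `with open("my.fasta") as handle: print(fasta_stats(handle))`, where `my.fasta` is a placeholder file name.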

The FAQs here might help → datatypes. You can search that entire page in your browser with keywords like datatypes. You can also search this forum.

This quote from one of those FAQs describes one way to convert fasta to fastqsanger:

If your data is FASTA, but you want to use tools that require FASTQ input, then use the tool Combine FASTA and QUAL. This tool will create “placeholder” quality scores that fit your data. On the output, click on the pencil icon to reach the Edit Attributes form. In the center panel, click on the “Datatype” tab, enter the datatype fastqsanger, and save.
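
For anyone curious what "placeholder" quality scores mean in practice, here is a small Python sketch of the same idea outside Galaxy. It is not how Combine FASTA and QUAL is implemented. The function names and the placeholder character `I` (Phred 40 in Sanger/Phred+33 encoding) are assumptions for illustration, and the Galaxy tool may use a different value.

```python
from typing import Iterable, Iterator


def format_fastq_record(title: str, seq: str, placeholder: str) -> str:
    """Build one four-line FASTQ record with a constant quality string."""
    return f"@{title}\n{seq}\n+\n{placeholder * len(seq)}\n"


def fasta_to_fastq(lines: Iterable[str], placeholder: str = "I") -> Iterator[str]:
    """Yield FASTQ records for FASTA input, using fake quality scores.

    The quality values carry no information; they only satisfy tools
    that insist on FASTQ-formatted input.
    """
    title = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                yield format_fastq_record(title, "".join(chunks), placeholder)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # join wrapped sequence lines
    if title is not None:
        yield format_fastq_record(title, "".join(chunks), placeholder)
```

Because every base gets the same made-up score, any quality-based report on the result reflects the placeholder and not the sequencing run.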

Whether any of that is needed depends on the content of the original data, what your analysis goals are, and the tool(s) that are rejecting the fasta as a valid input. FastQC results would not be very meaningful for a fasta file converted to fastq except to generate a few basic statistics that don’t involve quality scores.

And this recent Q&A explains how to check which datatypes a tool expects as valid input: fastq unavailable -- Tool does not recognize inputs? How to check why - #2 by jennaj

If you want to explain more about your goals and target tools, we can help more.

Hi @jennaj,

Thanks a lot for your reply and for asking me about my project. This was quite helpful.

Converting FASTA to FASTQ won’t be of much help to me for FastQC, right?

In my current project, I have six different knockout samples made in MCF10A cells, one sample with a non-targeting guide RNA, and one wild-type sample, 3 replicates each, so a total of 24 samples sequenced for small RNA-seq and in FASTA format.

Hence, I would like to use tools like miRDeep2 and Cutadapt (Trim Galore) to identify differentially expressed microRNAs in my knockouts vs. the control cell line.

Please let me know if you have any further questions. Thank you!

Correct, using the original fasta is appropriate for those tools, and converting to fastq doesn’t have any benefits that I can think of.