What kinds of errors should a validator check for when validating biological file formats like GFF and FASTA?

I’m working on a project to create a library (in Java) that can validate various biological file formats such as GFF, FASTA, and OBO.

As I’m not from this field, I’m a little confused about what kind of validation the validator program should perform.

There are some online tools, like GenomeTools, that can validate the GFF file format. Can anyone help me understand what kind of validation rules should be applied to each of these files?

Welcome, @Deepak_Singh

There was a dependency package created a few years ago for the open source tools you reference. I don’t think it is used by any current tools, but maybe it could help? https://toolshed.g2.bx.psu.edu/view/iuc/package_genometools_1_5_7/6c1cc41af15b

Most Galaxy tools now use conda for dependency resolution. So, check with the IUC about why this was dropped or to find out what alternative they might suggest. IUC Gitter chat: https://gitter.im/galaxy-iuc/iuc

Galaxy does check the format of many datatypes as part of autodetect/assignment of datatype (Upload, Edit Attributes). But those checks are not (yet) comprehensive validators that collect all the potential problems into a report. You might be able to repurpose parts of that functionality into wrapped, standalone tools. https://github.com/galaxyproject/galaxy
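To make the "report back all the potential problems" idea concrete, here is a minimal Java sketch of a FASTA check that keeps going after the first problem and returns everything it found. This is not Galaxy's code. FASTA has no single formal specification, so the rules below (an identifier directly after `>`, a sequence alphabet of letters plus `*` and `-`, no empty records) are assumptions, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; the rules are assumptions, not an official spec.
class FastaReport {

    /** Returns every problem found, each prefixed with a 1-based line number. */
    public static List<String> validate(List<String> lines) {
        List<String> problems = new ArrayList<>();
        boolean seenHeader = false;        // has any '>' line appeared yet?
        boolean recordHasSequence = true;  // does the current record have residues?
        int lineNo = 0;

        for (String line : lines) {
            lineNo++;
            if (line.isEmpty()) {
                continue; // assumption: blank lines are tolerated
            }
            if (line.startsWith(">")) {
                if (seenHeader && !recordHasSequence) {
                    problems.add("line " + lineNo + ": record before this header has no sequence");
                }
                if (line.length() == 1 || Character.isWhitespace(line.charAt(1))) {
                    problems.add("line " + lineNo + ": header has no identifier after '>'");
                }
                seenHeader = true;
                recordHasSequence = false;
            } else {
                if (!seenHeader) {
                    problems.add("line " + lineNo + ": sequence data before first '>' header");
                }
                // Assumed alphabet: letters, '*' (stop) and '-' (gap).
                if (!line.matches("[A-Za-z*-]+")) {
                    problems.add("line " + lineNo + ": unexpected character in sequence");
                }
                recordHasSequence = true;
            }
        }
        if (seenHeader && !recordHasSequence) {
            problems.add("end of input: last record has no sequence");
        }
        return problems;
    }
}
```

Calling `FastaReport.validate(...)` on a file's lines gives a list you can print as a report. An empty list means no problems were found under these assumed rules.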

Many datatypes are also defined in this FAQ, along with links to 3rd party sites that are involved in setting format specifications. See https://galaxyproject.org/support/#getting-inputs-right

There are also several wrapped Galaxy tools that check dataset formats (and produce a report). One example is Picard > ValidateSamFile, which assesses the validity of a SAM/BAM dataset. Searching the ToolShed would be the best way to find them all.

In the end, a format validator would need to both test for compatibility with the public file specifications and meet whatever custom format requirements (sometimes stricter, sometimes not) the target tools expect.
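As a rough illustration of that two-layer idea, here is a Java sketch that checks one GFF3 feature line. The first group of checks follows rules from the public GFF3 specification (nine tab-separated columns, positive integer start and end with start <= end, strand one of `+ - . ?`, phase `0`, `1`, `2` or `.`, and a phase required on CDS features). The `requireId` switch stands in for a tool-specific stricter rule. That flag, and the class and method names, are hypothetical and only here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: spec-level checks plus one optional stricter rule.
class Gff3LineCheck {

    /** Checks a single non-comment GFF3 feature line and returns all problems found. */
    public static List<String> check(String line, boolean requireId) {
        List<String> problems = new ArrayList<>();
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length != 9) {
            problems.add("expected 9 tab-separated columns, found " + cols.length);
            return problems; // per-column checks would be meaningless
        }

        long start = parsePositive(cols[3]);
        long end = parsePositive(cols[4]);
        if (start < 1) problems.add("start is not a positive integer: " + cols[3]);
        if (end < 1) problems.add("end is not a positive integer: " + cols[4]);
        if (start >= 1 && end >= 1 && start > end) {
            problems.add("start is greater than end");
        }
        if (!cols[6].matches("[-+.?]")) {
            problems.add("strand must be one of + - . ?");
        }
        if (!cols[7].matches("[012.]")) {
            problems.add("phase must be 0, 1, 2 or .");
        }
        if ("CDS".equals(cols[2]) && ".".equals(cols[7])) {
            problems.add("CDS features require a phase");
        }

        // Assumed tool-specific rule: every feature must carry a non-empty ID attribute.
        if (requireId) {
            boolean hasId = false;
            for (String attr : cols[8].split(";")) {
                if (attr.startsWith("ID=") && attr.length() > 3) hasId = true;
            }
            if (!hasId) problems.add("missing ID attribute (required by strict profile)");
        }
        return problems;
    }

    /** Returns the value if s is a plain positive integer, otherwise -1. */
    private static long parsePositive(String s) {
        if (!s.matches("[0-9]{1,18}")) return -1; // 18 digits always fit in a long
        long v = Long.parseLong(s);
        return v >= 1 ? v : -1;
    }
}
```

Keeping the tool-specific rules behind a switch means the same line can be reported as spec-valid but rejected for a particular target tool, which is often the more useful message for a user.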

How to “validate format” effectively, for all the formats and issues that can come up, so that files are accepted and interpreted correctly across computational tools can vary, and that is part of the reason this Galaxy forum exists. Even with all the format validation Galaxy includes, problems still come up due to incompatible format variations (sometimes intentional on the part of whoever hosts the data, sometimes due to user-introduced error).

Frankly, this is one of the most complicated components of doing work in this field. If you do create a set of new format validators, those would be welcomed as new wrapped tools in the ToolShed.