STR-FM detects 7,500,000 lines in my first Illumina WES x100 chip FASTQ file checked!

Hello, I am surprised by the results after running "Generate all possible combination of STR length profile" because of the large amount of data generated. I’d like to know whether this tabular file can be refined by discarding repeated readings and errors. Thanks


Hello, Given a large fasta input and default parameters, many results will be returned by this tool. Also, I would suggest modifying the genome to be just hg38 instead of the variant hg38Patch11 – the latter is not fully indexed for tools at https://usegalaxy.org and is not the genome version that most data providers base their data on (including UCSC/ENCODE).

Please see the following resources for how to use these tools:

Thanks, hg38Patch11 was chosen automatically by the STR detection app. The only genome I have imported in my history is UCSC hg19 (BED format). I will try running it again with the parameters you suggest.


If your other input data is based on hg19, then use that same database with all tools/analysis.

I do see some other data assigned to hg_g1k_v37.

hg_g1k_v37 and hg19 represent the same genome release, but they are different builds/versions of it. The chromosome identifiers differ between the two, and the 1000 Genomes hg_g1k_v37 build only includes the base chromosomes + X + Y + Mito. The UCSC hg19 Canonical build matches it in which chromosomes are included (but with different identifiers), and some tools offer it as a target choice (example: the BWA mapper). The full UCSC hg19 build includes those chromosomes, plus the haplotype and unplaced sequences. Things can get even more complicated from there (example: the Ensembl GRCh37 human build/version, etc).

FAQ: Mismatched Chromosome identifiers (and how to avoid them)

The main point is that it is important to consistently use the same genome build/version throughout an analysis. If there is a mismatch (differing chromosome identifiers and/or chromosome sizes and/or chromosome content) an error can be produced. Sometimes “empty” or other types of unexpected results will be produced. Both can alert the end user that there is a problem.
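As a rough illustration of the identifier problem, here is a minimal Python sketch that compares the chromosome names found in two BED-style inputs. The function names `chrom_names` and `compare_identifiers` are made up for this example and are not part of Galaxy or any tool mentioned here. It only looks at the first column, so it says nothing about chromosome sizes or content.

```python
def chrom_names(bed_lines):
    """Collect the chromosome identifiers (first column) from BED-style lines."""
    names = set()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the usual BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def compare_identifiers(names_a, names_b):
    """Report shared and unshared chromosome identifiers between two inputs."""
    def strip_chr(name):
        return name[3:] if name.startswith("chr") else name

    only_a = names_a - names_b
    only_b = names_b - names_a
    # Names that would match if the "chr" prefix were ignored: a hint that the
    # two inputs use different naming styles (e.g. UCSC "chr1" vs "1").
    prefix_only = {strip_chr(n) for n in only_a} & {strip_chr(n) for n in only_b}
    return {
        "shared": names_a & names_b,
        "only_a": only_a,
        "only_b": only_b,
        "differ_by_chr_prefix": prefix_only,
    }


# Toy inputs (hypothetical data): one UCSC-style, one 1000 Genomes-style.
ucsc_style = ["chr1\t100\t200", "chrX\t5\t50"]
g1k_style = ["1\t100\t200", "X\t5\t50"]

report = compare_identifiers(chrom_names(ucsc_style), chrom_names(g1k_style))
print(report)
```

An empty `shared` set together with a non-empty `differ_by_chr_prefix` set suggests the two inputs use different naming conventions for the same chromosomes. Note that the mitochondrial chromosome will not be caught by the prefix check, since it is named `chrM` in UCSC builds and `MT` in the 1000 Genomes build.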

Since hg19 and hg38 are so similar, mixing up human versions might not always be obvious if a tool produces a “green” (putatively successful) dataset result. So, double check the actual database version/build for data at the start of the analysis, when introducing new inputs, and/or choosing a target database on a tool form.
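One quick way to double check a build is to look at chromosome lengths, since these differ between hg19 and hg38. The sketch below is only an illustration: `guess_human_build` is a hypothetical helper, and it assumes you already have a name-to-length mapping (for example, from a chromosome sizes file or an alignment header). It checks chromosome 1 only, so treat the answer as a hint, not proof.

```python
# Length of chromosome 1 in the two common human assemblies.
CHR1_LENGTHS = {
    249250621: "hg19 / GRCh37",
    248956422: "hg38 / GRCh38",
}


def guess_human_build(chrom_sizes):
    """Guess the human assembly from the length of chromosome 1.

    chrom_sizes maps chromosome identifier -> length in bases.
    Both the UCSC-style ("chr1") and the plain ("1") identifiers are tried.
    """
    for name in ("chr1", "1"):
        if name in chrom_sizes:
            return CHR1_LENGTHS.get(chrom_sizes[name], "unknown")
    return "unknown"


print(guess_human_build({"chr1": 249250621}))
print(guess_human_build({"1": 248956422}))
```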

Concerning the hg38Patch11 assignment by the STR detection tool: The database was inherited from the input fasta dataset. It looks like the upstream fastq.gz dataset had the database manually assigned after the Cutadapt tool was used. So, be sure to go back upstream and correct whatever datasets you plan to process (or reprocess). You might need to rerun some steps/tools if the goal is to combine results.

In many cases, the “database” assignment for fasta/fastq data doesn’t really matter and can be left unassigned, since at that point it is just “human” sequence read data and is not linked to the chromosomes/coordinates of a particular human genome build/version. That said, some tools do interpret the database metadata of an input (instead of having the user select a target genome on the tool form), so assign the database correctly when you can. At the very least, be aware that a wrong or missing assignment may contribute to later errors or unexpected results, should they come up.

Thanks!