Hello, I am surprised by the results after running "Generate all possible combination of STR length profile" because of the large amount of generated data. I’d like to know if this tabular file can be refined by discarding repeated readings and errors. Thanks
Hello, Given a large fasta input and default parameters, many results will be returned by this tool. Also, I would suggest modifying the genome to be just hg38 instead of the variant hg38Patch11. The latter is not fully indexed for tools at https://usegalaxy.org and is not the genome version that most data providers base their data on (including UCSC/ENCODE).
Please see the following resources for how to use these tools:
Tool Overview/User manual: https://github.com/Arkarachai/STR-FM/blob/master/README.md
Example workflows (scroll down the page): https://toolshed.g2.bx.psu.edu/view/arkarachai-fungtammasan/str_fm/20dc70f85ff7
Thanks, hg38Patch11 was chosen automatically by the STR detection app. I have only the UCSC hg19 genome (BED format) imported in my history. I will try to run again with the parameters you suggest.
If your other input data is based on hg19, then use that same database with all tools/analysis.
I do see some other data assigned to hg_g1k_v37. That build and hg19 represent the same genome release, but a different build/version. The chromosome identifiers are different between the two, and the 1000 Genomes hg_g1k_v37 build only includes the base chromosomes + X + Y + Mito. The UCSC hg19 Canonical build is a match for the chromosomes included (but with the different identifiers); some tools will offer this as a target choice (example: the BWA mapper). The UCSC hg19 build includes those chromosomes, plus haplotype and unplaced sequences. Things can get even more complicated from there (example: the Ensembl GRCh37 human build/version, etc).
The main point is that it is important to consistently use the same genome build/version throughout an analysis. If there is a mismatch (differing chromosome identifiers and/or chromosome sizes and/or chromosome content) an error can be produced. Sometimes “empty” or other types of unexpected results will be produced. Both can alert the end user that there is a problem.
Because human genome builds such as hg38 are so similar to one another, mixing up human versions might not always be obvious if a tool produces a “green” (putatively successful) dataset result. So, double check the actual database version/build for data at the start of the analysis, when introducing new inputs, and/or when choosing a target database on a tool form.
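A quick way to spot the kind of identifier mismatch described above is to compare the chromosome names in a dataset against the names a target build uses. The following is only a minimal sketch: the function names and the example identifier sets are illustrative assumptions, not part of Galaxy or STR-FM, and the real names should be taken from your own datasets and reference.

```python
def chrom_names(lines):
    """Collect first-column identifiers from BED-like, tab-separated lines."""
    names = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and common header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split("\t")[0])
    return names


def compare_chroms(data_names, reference_names):
    """Report which identifiers are shared and which appear only in the data."""
    data_names, reference_names = set(data_names), set(reference_names)
    return {
        "shared": sorted(data_names & reference_names),
        "only_in_data": sorted(data_names - reference_names),
    }


# Illustrative values only (assumed, not taken from any real dataset).
bed_lines = ["chr1\t100\t200", "chr2\t50\t80", "chrM\t1\t10"]
ucsc_style = {"chr1", "chr2", "chrM"}   # identifiers with a "chr" prefix
g1k_style = {"1", "2", "MT"}            # identifiers without the prefix

report_ucsc = compare_chroms(chrom_names(bed_lines), ucsc_style)
report_g1k = compare_chroms(chrom_names(bed_lines), g1k_style)
print(report_ucsc)
print(report_g1k)
```

If "only_in_data" is non-empty, or "shared" is empty, the dataset and the target build most likely use different identifiers, and the database assignment is worth checking before running further tools.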
Regarding the hg38Patch11 assignment by the STR detection tool: the database was inherited from the input fasta dataset. It looks like the upstream fastq.gz dataset had the database manually assigned after the Cutadapt tool was used. So, be sure to go back upstream and correct whatever datasets you plan to process (or reprocess). You might need to rerun some steps/tools if the goal is to combine results.
In many cases, what the “database” assignment should be for fasta/fastq data doesn’t really matter, and it can be left unassigned, as it is just “human” sequence read data at that point and not linked to a particular human genome build/version’s chromosomes/coordinates. That said, some tools do interpret the database metadata of an input (instead of having the user select a target genome on the tool form), so assign the database correctly when you can, or at least be aware that it may be wrong/missing and could contribute to later errors/unexpected results should they come up.