Hello, I am surprised by the results after running "Generate all possible combination of STR length profile" because of the large amount of generated data. I’d like to know if this tabular file can be refined by discarding repeated readings and errors. Thanks
Hello, Given a large fasta input and default parameters, this tool will return many results. Also, I would suggest changing the genome to plain hg38 instead of the variant hg38Patch11. The latter is not fully indexed for tools at https://usegalaxy.org and is not the genome version that most data providers base their data on (including UCSC/ENCODE).
Please see the following resources for how to use these tools:
- Publication: https://www.ncbi.nlm.nih.gov/pubmed/25823460
- Tool Overview/User manual: https://github.com/Arkarachai/STR-FM/blob/master/README.md
- Example workflows (scroll down the page): https://toolshed.g2.bx.psu.edu/view/arkarachai-fungtammasan/str_fm/20dc70f85ff7
Thanks. hg38Patch11 was chosen automatically by the STR detection app. I only have the UCSC hg19 genome (BED format) imported in my history. I will try running it again with the parameters you suggest.
If your other input data is based on hg19, then use that same database with all tools/analysis.
I do see some other data assigned to hg_g1k_v37.
hg_g1k_v37 and hg19 represent the same genome release, but a different build/version. The chromosome identifiers differ between the two, and the 1000 Genomes hg_g1k_v37 build only includes the base chromosomes + X + Y + Mito. The UCSC hg19 Canonical build is a match for the chromosomes included (but with different identifiers); some tools will offer this as a target choice (example: the BWA mapper). The UCSC hg19 build includes those chromosomes, plus haplotype and unplaced. Things can get even more complicated from there (example: the Ensembl GRCh37 human build/version, etc).
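To illustrate the identifier difference, here is a minimal Python sketch. It assumes the common naming conventions (1000 Genomes style "1", "X", "MT" versus UCSC style "chr1", "chrX", "chrM"). The function name to_ucsc_name is made up for this example and is not part of Galaxy or any tool mentioned here.

```python
# Sketch only: relabels base chromosome identifiers from 1000 Genomes style
# to UCSC style. It changes names only and does NOT check that the
# underlying sequences or lengths are the same between builds.
BASE_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def to_ucsc_name(name):
    """Return the UCSC-style name for a base chromosome, or None if unknown."""
    if name in BASE_CHROMS:
        return "chr" + name
    if name == "MT":
        # Assumed convention: mitochondrion is "MT" in one style, "chrM" in the other.
        return "chrM"
    # Haplotype/unplaced sequences follow other naming rules; do not guess.
    return None
```

Anything the function does not recognize comes back as None, so those sequences can be reviewed by hand rather than silently renamed.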
FAQ: Mismatched Chromosome identifiers (and how to avoid them)
The main point is that it is important to consistently use the same genome build/version throughout an analysis. If there is a mismatch (differing chromosome identifiers and/or chromosome sizes and/or chromosome content) an error can be produced. Sometimes “empty” or other types of unexpected results will be produced. Both can alert the end user that there is a problem.
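One way to catch a mismatch early is to compare the chromosome identifiers in two datasets before running a tool on them. The following is a rough Python sketch, assuming both inputs are tab-separated with the chromosome in the first column (as in BED). The function names chrom_names and compare_chroms are hypothetical helpers written for this example.

```python
def chrom_names(lines):
    """Collect the distinct first-column values from tab-separated lines."""
    names = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and common header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split("\t")[0])
    return names


def compare_chroms(a_lines, b_lines):
    """Report which chromosome identifiers are shared or unique to each input."""
    a, b = chrom_names(a_lines), chrom_names(b_lines)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Example: one dataset uses "chr1"-style names, the other uses "1"-style names.
ucsc_style = ["chr1\t100\t200", "chr2\t5\t50"]
g1k_style = ["1\t100\t200", "2\t5\t50"]
report = compare_chroms(ucsc_style, g1k_style)
```

An empty "shared" set, as in this example, is a strong hint that the two datasets are based on different builds or naming schemes. Matching names alone do not prove the builds agree, since chromosome sizes and content can still differ.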
Since hg19 and hg38 are so similar, mixing up human versions might not always be obvious if a tool produces a “green” (putatively successful) dataset result. So, double check the actual database version/build for data at the start of the analysis, when introducing new inputs, and/or when choosing a target database on a tool form.
Concerning the hg38Patch11 assignment by the STR detection tool: The database was inherited from the input fasta dataset. It looks like the upstream fastq.gz dataset had the database manually assigned after the Cutadapt tool was used. So, be sure to go back upstream and correct whatever datasets you plan to process (or reprocess). You might need to rerun some steps/tools if the goal is to combine results.
In many cases, what the “database” assignment should be for fasta/fastq data doesn’t really matter, and it can be left unassigned, as it is just “human” sequence read data at that point and not linked to a particular human genome build/version’s chromosomes/coordinates. That said, some tools do interpret the database metadata of an input (instead of having the user select a target genome on the tool form), so assign the database correctly when you can, or at least be aware that it may be wrong or missing and could contribute to later errors/unexpected results should they come up.
Thanks!