HmmCleaner
Arnaud Di Franco
Version 0.1 / Feb 13, 2018
HmmCleaner detects low similarity segments (LSS) through four steps. First, a profile hidden Markov model (pHMM) is built from the multiple sequence alignment (MSA) using HMMER (Figure 1A). This pHMM can be built from either (i) all sequences of the MSA (complete strategy) or (ii) all sequences except the one currently being analyzed (leave-one-out strategy). Users can control this step with the -profile option.
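The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration of which sequences contribute to the profile in each case; the function name profile_members and the sequence identifiers are hypothetical, and the sketch does not call HMMER or reproduce HmmCleaner's own code.

```python
from typing import List, Sequence


def profile_members(seq_ids: Sequence[str], target: str,
                    strategy: str = "complete") -> List[str]:
    """Return the sequences used to build the pHMM that evaluates `target`.

    Hypothetical helper: it only selects identifiers, it does not build
    a profile.
    """
    if strategy == "complete":
        # Same profile for every sequence: all sequences of the MSA.
        return list(seq_ids)
    if strategy == "leave-one-out":
        # New profile for each sequence: everything except the target.
        return [s for s in seq_ids if s != target]
    raise ValueError(f"unknown strategy: {strategy!r}")


ids = ["seqA", "seqB", "seqC"]
for target in ids:
    print(target, profile_members(ids, target, "leave-one-out"))
```

With the complete strategy the profile is built once, whereas leave-one-out requires one profile per sequence, which is why it costs more time.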
Second, each sequence of the MSA is evaluated with the pHMM (Figure 1B), which yields profile-sequence alignments.
Third, the analysis of each profile-sequence alignment is based on the four discrete categories of column-wise probabilities provided by HMMER. The first two categories represent residues that fit the pHMM poorly: the blank character (null probability, parameter c1) and the '+' character (low probability, parameter c2). In contrast, the last two categories represent residues that fit the pHMM: amino acid characters in lower case (good probability, parameter c3) and in upper case (high probability, parameter c4). A cumulative similarity score increases when the residue is expected from the profile and decreases otherwise (Figure 1C). Parameters c1 and c2 are therefore negative, and parameters c3 and c4 positive. The cumulative score is computed from left to right, starting with a value of 1, and is always kept between 0 and 1 inclusive. When the cumulative score reaches 0, an LSS is opened; it starts at the last position where the cumulative score was 1. It ends at the last position with a score of 0 before the cumulative score returns to 1 or the end of the sequence is reached (Figure 1D).
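The scoring and segment delineation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not HmmCleaner's actual implementation: the input is assumed to be a string with one match symbol per residue (blank, '+', lower case or upper case letter), the names symbol_cost and find_lss are hypothetical, and the start of a segment is taken literally as the index of the last position that scored 1 (or index 0 if no position has scored 1 yet).

```python
from typing import List, Sequence, Tuple

# Default costs (c1, c2, c3, c4) as documented for the -costs option.
DEFAULT_COSTS = (-0.15, -0.08, 0.15, 0.45)
EPS = 1e-9  # tolerance for floating-point comparisons against 0 and 1


def symbol_cost(symbol: str, costs: Sequence[float]) -> float:
    """Map one match symbol to its cost (assumed one symbol per residue)."""
    c1, c2, c3, c4 = costs
    if symbol == " ":
        return c1  # blank: null probability
    if symbol == "+":
        return c2  # '+': low probability
    if symbol.isalpha() and symbol.islower():
        return c3  # lower case: good probability
    if symbol.isalpha() and symbol.isupper():
        return c4  # upper case: high probability
    raise ValueError(f"unexpected match symbol: {symbol!r}")


def find_lss(match_line: str,
             costs: Sequence[float] = DEFAULT_COSTS) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (0-based, inclusive) of candidate LSS."""
    segments: List[Tuple[int, int]] = []
    score = 1.0        # the cumulative score starts at 1
    last_one = 0       # assumption: sequence start if no position scored 1 yet
    last_zero = None   # last index with a null score in the open segment
    for i, symbol in enumerate(match_line):
        # Left-to-right update, kept within [0, 1].
        score = min(1.0, max(0.0, score + symbol_cost(symbol, costs)))
        if score <= EPS:
            score = 0.0
            last_zero = i
        elif score >= 1.0 - EPS:
            score = 1.0
            if last_zero is not None:
                # Score is back to 1: close the segment at the last null position.
                segments.append((last_one, last_zero))
                last_zero = None
            last_one = i
    if last_zero is not None:
        # End of sequence reached while a segment is still open.
        segments.append((last_one, last_zero))
    return segments


print(find_lss("ABCD" + " " * 7 + "EFGH"))
```

With the default costs, seven consecutive blank symbols are needed to bring the score from 1 down to 0, so a run of six blanks between well-matched residues does not open a segment in this sketch.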
USAGE
HmmCleaner.pl <infiles> [options]
REQUIRED ARGUMENTS
<infiles>
List of alignment files to check with HmmCleaner.
OPTIONS
-costs <c1> <c2> <c3> <c4>
Cost parameters that define the low similarity segments detected by
HmmCleaner. Default values are -0.15, -0.08, 0.15, 0.45. Users can
change each value, but the values must be in increasing order:
c1 < c2 < 0 < c3 < c4. Predefined values are also available with the
--large and --specificity options, but user-defined costs take
priority if present.
--changeID
Determine whether the deflines in the output get a generic suffix
(_hmmcleaned).
--noX
Convert X characters to gaps that will not be taken into account by
HmmCleaner.
-profile=<profile>
Determine how the profile is built: complete or leave-one-out
(default: complete).
leave-one-out = without the analyzed sequence (new profile each time)
complete = all sequences (same profile for each sequence)
The leave-one-out strategy is more sensitive but needs more resources
(hence more time).
--large
Load predefined cost parameters optimized for MSAs with at least 50
sequences. Can be used with the --specificity option. User-defined
costs take priority if present.
--specificity
Load predefined cost parameters optimized to give more weight to
specificity. Can be used with the --large option. User-defined costs
take priority if present.
--log_only
Output only the list of removed segments.
--ali
Outputs result file(s) in ali MUST format.
-v[erbosity]=<level>
Verbosity level for logging to STDERR [default: 0]. Available levels
range from 0 to 5.
--version
--usage
--help
--man
Print the usual program information
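
As an illustration of the -costs constraints listed above, the following Python sketch checks the required ordering c1 < c2 < 0 < c3 < c4 and applies the documented priority of user-defined costs over predefined ones. The function name resolve_costs and its arguments are hypothetical; the values loaded by --large and --specificity are not given here, so they are passed in as a parameter rather than hard-coded.

```python
from typing import Optional, Sequence, Tuple

# Default values documented for the -costs option.
DEFAULT_COSTS = (-0.15, -0.08, 0.15, 0.45)


def resolve_costs(
    user_costs: Optional[Sequence[float]] = None,
    preset_costs: Optional[Sequence[float]] = None,
) -> Tuple[float, float, float, float]:
    """Pick the cost set to use and check its ordering.

    Hypothetical helper: user-defined costs take priority over a preset
    (e.g. one loaded by --large or --specificity), which takes priority
    over the defaults.
    """
    if user_costs is not None:
        costs = tuple(user_costs)
    elif preset_costs is not None:
        costs = tuple(preset_costs)
    else:
        costs = DEFAULT_COSTS
    if len(costs) != 4:
        raise ValueError("exactly four costs are required: c1 c2 c3 c4")
    c1, c2, c3, c4 = costs
    if not (c1 < c2 < 0 < c3 < c4):
        raise ValueError("costs must satisfy c1 < c2 < 0 < c3 < c4")
    return (c1, c2, c3, c4)


print(resolve_costs())                             # falls back to the defaults
print(resolve_costs(user_costs=(-0.2, -0.1, 0.1, 0.3)))
```

Rejecting a badly ordered cost set before any scoring keeps c1 and c2 negative and c3 and c4 positive, so the cumulative score moves in the intended direction for each category.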