ABSTRACT

WindowMasker is a program that identifies and masks highly repetitive DNA sequences and low-complexity DNA sequences in a genome, using only the sequence of the genome itself.

WindowMasker is described in [1]. Please cite this paper in any publication that uses WindowMasker.

A statically linked binary of the windowmasker program compiled for Linux is available via FTP at

    ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker

[1] Morgulis A, Gertz EM, Schaffer AA, Agarwala R. WindowMasker: Window based masker for sequenced genomes. Submitted for publication.

SYNOPSIS

    windowmasker -mk_counts [-in input_file_name] [-out output_file_name]
        [-checkdup check_duplicates] [-t_low T_low] [-t_high T_high]
        [-fa_list input_is_a_list] [-mem available_memory]
        [-unit unit_length] [-genome_size genome_size]
        [-exclude_ids exclude_id_list] [-ids id_list]
        [-infmt input_format] [-sformat unit_counts_format]
        [-smem available_memory] [-use_ba use_bit_arrays]
        [-t_low_pct pct] [-t_extend_pct pct] [-t_thres_pct pct]
        [-t_high_pct pct]

    windowmasker -ustat unit_counts [-in input_file_name]
        [-out output_file_name] [-window window_size]
        [-t_thres T_threshold] [-t_extend T_extend] [-t_low T_low]
        [-t_high T_high] [-set_t_low score] [-set_t_high score]
        [-infmt input_format] [-outfmt output_format] [-dust use_dust]
        [-exclude_ids exclude_id_list] [-ids id_list]
        [-text_match text_match_ids] [-use_ba use_bit_arrays]
        [-t_low_pct pct] [-t_extend_pct pct] [-t_thres_pct pct]
        [-t_high_pct pct]

    windowmasker -convert -in input_file_name -out output_file_name
        [-sformat output_format] [-smem available_memory]

DESCRIPTION

WindowMasker has two modules for masking DNA sequences. The WinMask module masks potentially repetitive sequences by counting the number of times different n-mers (units) occur in the genome. The DUST module identifies and masks low-complexity regions.

The WinMask module works in two stages.
During Stage 1, unit counts are collected and stored in a separate file. During Stage 2, that file is used to mask the input sequences. Usually the unit counts file is created once per genome and then used multiple times for masking. In addition, an option is provided to convert between different unit counts formats.

Stage 1 is selected with the "-mk_counts" flag. Stage 1 processes input data in up to 4 passes.

Pass 1 (optional). Check for long, possibly duplicate sequences in the input. This pass is made when "-checkdup true" is selected.

Pass 2 (optional). Compute the total size in bases (bp) of the genome. This pass is made when neither the unit length ("-unit") nor the genome size ("-genome_size") is given on the command line. Knowledge of the genome size is necessary to compute the unit length automatically.

Pass 3 (optional). Compute unit score thresholds T_threshold, T_extend, T_low, and T_high. In particular, the values of T_low and T_high are needed by the final pass to save only the counts of units with counts greater than or equal to T_low, and to replace the counts of units that are above T_high with the value of T_high. This pass can be avoided by using the "-t_low" and "-t_high" command line options, which explicitly set the values of T_low and T_high.

Pass 4. Generate the file containing the unit length, significant (above T_low) unit counts, and threshold values. The counts of units that appear more than T_high times in the genome are assigned the count value of T_high.

Stage 2 is enabled by providing the "-ustat unit_counts" argument. WindowMasker reads the data generated in Stage 1 and a set of input DNA sequences, and outputs information about masked subintervals. If "-dust true" is specified, then the corresponding algorithm of the DUST module is applied to the input sequences in addition to window based masking. When the DUST module is run, the results of the DUST and WinMask modules are merged together in the output.
Specifically, a base is masked if it is masked by either DUST or WinMask.

The unit counts format conversion function is triggered by the "-convert" flag.

OPTIONS

This section explains the command line options of WindowMasker. There are separate subsections for Stage 1 options and Stage 2 options. Some options apply to both stages but may have somewhat different meanings.

Options Selecting WindowMasker Mode of Operation

-mk_counts
    The presence of this flag enables Stage 1 processing (counts creation).

-convert
    If this flag is given, windowmasker functions as a converter between the supported counts formats. In this case options other than "-in", "-out", "-sformat", and "-smem" are not supported.

-ustat unit_counts
    This option enables Stage 2 processing. It specifies the name of the file containing the unit counts for the genome in use. Unit counts are typically created during Stage 1 of WindowMasker processing. The format of the unit counts file is recognized automatically.

Common Options

-exclude_ids exclude_id_list
    default:
    The name of the file containing the list of sequence ids (one per line) in the input that should not be processed. See section FILES for an example. This option and the "-ids" option are mutually exclusive.

-ids id_list
    default:
    The name of the file containing the list of sequence ids (one per line) in the input that should be processed. See section FILES for an example. This option and the "-exclude_ids" option are mutually exclusive.

-use_ba true/false
    default: true
    Optimize unit counts data structure performance using bit arrays if possible. This increases the masking speed by about 20% but also increases the process memory footprint. This optimization is currently available only for unit counts databases created with the "-sformat obinary" option.

Stage 1 Options

-checkdup true/false
    default: false
    If the value is true, then a pass is made to check the input for long, possibly duplicate sequences. The algorithm is heuristic.
    For a pair of sequences, a match is reported if at least five 100 bp segments approximately 10000 bp apart match exactly. "Approximately" means that some fuzziness is allowed in the distance between consecutive 100 bp segments. This criterion will detect all duplicates as well as some false positives. In our experience, duplicates arise when multiple distinct assemblies of some part of the genome (e.g., one chromosome) are included as part of the sequenced genome.

-fa_list true/false
    default: false
    If the value of this parameter is true, then the "-in" option specifies a file containing a list of pathnames (one pathname per line) of the FASTA formatted files that will be used as input. Otherwise the "-in" option specifies a single FASTA formatted file.

-genome_size genome_size
    default:
    Explicitly specify the genome length in number of bases. If this parameter is not used, then the genome length is computed as the sum of the lengths of all input sequences. This parameter has no effect if the "-unit" option is specified. Otherwise it is used to compute the unit length.

-in input_file_name
    default:
    The location of the input file. The contents of the input file depend on the values of the "-fa_list" and "-convert" options. If "-fa_list false" is specified, the input file is a single FASTA formatted file. Unit counts will be generated based on the sequences in that file. If "-fa_list true" is specified, then the input file contains a list of pathnames (one per line) of the FASTA formatted files that form the genome. In this case unit counts will be generated based on all the sequences contained in the listed files. If "-convert" is specified, then the "-in" option is required and its value must be the name of a valid unit counts file.

-infmt input_format
    default: fasta
    The possible values are "fasta" for reading sequence data from a FASTA formatted file, and "blastdb" for reading sequence data from a BLAST database.

-mem available_memory
    default: 1536
    Assume that this much RAM is available to WindowMasker. The value is in megabytes. Depending on the amount of available memory, passes 3 and 4 of Stage 1 could contain additional subpasses. This is especially true for large (>=14) values of the unit length.

-out output_file_name
    default:
    The location of the output file. This parameter is optional; standard output is used if it is not specified. Section FILES describes the format of the file. If the "-convert" option is given, then the "-out" option is required and specifies the name of the target output unit counts file.

-sformat unit_counts_format
    default: ascii
    The format in which the unit counts data should be generated. The possible values are "ascii", "binary", "oascii", and "obinary". "ascii" makes windowmasker generate unit counts in human readable text form, which is highly portable but slow to load during Stage 2. The "binary" format is not portable but loads very fast. The last two formats correspond to optimized (via hash tables) unit counts data structures. They require more memory to create during Stage 1. However, using them results in a 2.5 - 4 times improvement in the performance of the WinMask module. "obinary" stores data in a binary form which is not portable. "oascii" stores data in a portable text format at the expense of startup performance during Stage 2. If the "-convert" option is specified, then "-sformat" determines the conversion target format.

-smem available_memory
    default: 512
    This option is ignored for the "ascii" and "binary" unit counts formats. For "oascii" and "obinary" this option specifies the upper limit (in megabytes) for the size of the unit counts data structure in memory. WindowMasker will try to produce a data structure smaller than the specified size. If that is not possible, an error is reported.

-t_low T_low
    default:
    Save only information about units with counts equal to or greater than T_low.
    If "-t_low" is not specified on the command line, then its value is computed so that 90% of all units present in the genome have counts below T_low.

-t_high T_high
    default:
    The units that appear more than T_high times in the genome are given the count value of T_high. If "-t_high" is not specified on the command line, then its value is computed so that 99.8% of all units present in the genome have counts below T_high.

-unit unit_length
    default:
    Specifies the unit length to use. The value is an integer between 1 and 16. If this parameter is not specified, then the unit length is computed automatically from the genome length L in bases as the smallest number N such that L/4^N < 5.

-t_low_pct pct
    default: 90.0
    Percentage value used to automatically compute the value of t_low. This parameter overrides the -t_low setting.

-t_extend_pct pct
    default: 99.0
    N-mer percentage cutoff value used to determine the n-mer counts necessary for n-mers used for masking window extensions.

-t_thres_pct pct
    default: 99.5
    N-mer percentage cutoff value used to determine the n-mer counts needed to trigger masking.

-t_high_pct pct
    default: 99.8
    Percentage value used to automatically compute the value of t_high. This parameter overrides the -t_high setting.

Stage 2 Options

-dust true/false
    default: false
    Use the symmetric algorithm of the DUST module in addition to the WinMask module to mask out low-complexity regions.

-in input_file_name
    default:
    Specifies the input file for masking. WindowMasker accepts input in FASTA format. If this option is omitted, then standard input is used as the source of input.

-infmt input_format
    default: fasta
    The possible values are "fasta" for reading sequence data from a FASTA formatted file, and "blastdb" for reading sequence data from a BLAST database.

-outfmt output_format
    default: interval
    The possible values are "interval" and "fasta". If "fasta" is selected, the output is a FASTA formatted file with masked regions in lower case letters.
    If the interval format is selected, the output is a file containing, for each processed sequence, a sorted set of contiguous masked subsequences. For the format of the intervals file see section FILES. [NOTE: Case of the input does not matter. All input sequences are considered unmasked and WindowMasker does not preserve the case of the input.]

-out output_file_name
    default:
    Specifies the output file. The output file format depends on the value of the "-outfmt" option. If this option is omitted, then standard output is used.

-set_t_high score
    default: T_high
    The score value for units with unit count above T_high. See [1] for details.

-set_t_low score
    default: (T_low + 1)/2
    The score value for units with unit count below T_low. See [1] for details.

-t_extend T_extend
    default:
    This parameter defines whether the interval between two windows with score higher than T_threshold is masked. The interval will be masked if every base of the interval lies within a window of score greater than or equal to T_extend. For the definition of the window score see the description of the "-window" option. This parameter overrides the value contained in the unit counts file.

-t_high T_high
    default:
    This parameter overrides the value contained in the unit counts file. Since the unit counts file does not contain unit count values larger than the T_high value computed at Stage 1, it only makes sense to specify a T_high value that is less than the one in the unit counts file. Units with counts greater than T_high will not be read from the unit counts file, and the value of "-set_t_high" will be used as the result of unit count lookup.

-t_low T_low
    default:
    This parameter overrides the value contained in the unit counts file. Since the unit counts file does not keep counts of the units that appear fewer than T_low times in the genome, it only makes sense to specify a T_low value that is greater than the one in the unit counts file.
    In that case units with counts that are less than T_low will not be read from the unit counts file, and the value of the "-set_t_low" option will be used as the result of unit count lookup.

-t_thres T_threshold
    default:
    All windows with score above T_threshold are masked. For the definition of the window score see the description of the "-window" option. This parameter overrides the value contained in the unit counts file.

-text_match true/false
    default: true
    This option applies to the "-exclude_ids" and "-ids" options. If set to "false", the sequence ids are compared as instances of CSeq_id classes. If set to "true", the sequence ids are compared as strings. In that case each id is represented as a sequence of words separated by '|' characters. A sequence id is considered found in the "-exclude_ids" or "-ids" set if some element of the set contains a subsequence of the given id that spans a whole number of words. This option makes it possible to work around some problems resulting from direct CSeq_id comparisons.

-window window_size
    default: unit_size + 4
    The WinMask module works by tracking the score of a sliding window, which is defined as the average of the counts of all units within the window. This parameter defines the size of the sliding window. By default the value unit_size + 4 is used, so there are 5 units in a window.

-t_low_pct pct
    Percentage value used to automatically compute the value of t_low. This parameter overrides the t_low setting in the counts statistics file. If given, it also overrides the -t_low command line parameter.

-t_extend_pct pct
    Percentage value used to automatically compute the value of t_extend. This parameter overrides the t_extend setting in the counts statistics file. If given, it also overrides the -t_extend command line parameter.

-t_thres_pct pct
    Percentage value used to automatically compute the value of t_threshold. This parameter overrides the t_threshold setting in the counts statistics file. If given, it also overrides the -t_thres command line parameter.

-t_high_pct pct
    Percentage value used to automatically compute the value of t_high. This parameter overrides the t_high setting in the counts statistics file. If given, it also overrides the -t_high command line parameter.

FILES

This section describes the file formats that are used by WindowMasker and provides some examples (the FASTA file format is not described here).

Genome in multiple FASTA files

If the genome data is contained in multiple FASTA formatted files, then "-fa_list true" should be used and the value of the "-in" option should name the file containing the list of the required FASTA files, one file per line. For example, for human genome build 34, one could have one FASTA file per chromosome named chrX.fa, chrY.fa, chr1.fa, chr2.fa, ..., chr22.fa located in the /human34 directory. Then the file given to the "-in" option should look like this:

    /human34/chr1.fa
    /human34/chr2.fa
    ...
    /human34/chr22.fa
    /human34/chrX.fa
    /human34/chrY.fa

NOTE: Empty lines are ignored.

Unit Counts

This file is the output of WindowMasker Stage 1 processing if the "ascii" unit counts format has been chosen. The file is an ASCII text file. It also serves as input for Stage 2 processing. When WindowMasker reads the unit counts file during Stage 2, empty lines and lines starting with the '#' character are ignored. This way comments can be introduced into the unit counts file.

The first line of the file contains one integer, which is the unit size. Then come the lines containing counts for the units which appeared more than T_low times in the genome (and its reverse complement). These are ordered by the unit numerical value. Each line has the following structure:

    UNIT_VALUE COUNT

where UNIT_VALUE is an integer in hexadecimal format and COUNT is a decimal integer.
The last section of the file contains the computed values for T_threshold, T_extend, T_low, and T_high, each on a separate line with the following structure:

    >PARAM_NAME VALUE

where PARAM_NAME is one of t_low, t_extend, t_threshold, t_high and VALUE is the integer value of the corresponding parameter.

For example, the unit counts file for human build 34 looks like this (only a few lines from the beginning and the end are shown):

    15
    0 154
    1 154
    2 154
    3 154
    4 154
    <...>
    1a0 154
    1a1 47
    1a2 95
    1a3 64
    1a4 71
    1a5 101
    1a7 80
    1a8 101
    1a9 63
    1aa 100
    <...>
    >t_low 16
    >t_extend 57
    >t_threshold 86
    >t_high 154

Optimized Unit Counts

This file is the output of WindowMasker Stage 1 processing when the "oascii" unit counts format has been chosen. The file is also used as input for Stage 2 processing via the "-ustat" command line option.

The first line of the file contains the number AAAA and serves as a file format identifier.

The second line must contain a single integer between 1 and 16, which is the unit size.

The third line contains 4 integer values separated by spaces:

    the number of units with collisions M (32 bit integer)
    the hash key size k (8 bit integer)
    the right offset r of the hash key (8 bit integer)
    the width w (in bits) of the field containing the number of collisions (8 bit integer)

The next 4 lines contain the windowmasker thresholds, one per line, in the following order:

    T_low
    T_extend
    T_threshold
    T_high

Then come 2^k 32-bit integers, one per line, that form the hash table. Each integer has the following form:

    The low order w bits contain the number c of units that have a hash key equal to the index of this integer in the table.
    If c = 0 then the rest of the bits are 0.
    If c = 1 then bits [w,23] contain the count of the unique unit with this hash key. Bits [24,31] contain the bits [0,r-1] and [k + r,31] of the unit value.
    If c > 1 then bits [w,31] contain the index of the start of the list of counts in the counts table (see below) of units with this hash key.

The last M lines contain one 16-bit integer each and form the counts table. Each number has the following form: bits [0,8] contain the unit count; bits [9,15] contain the bits [0,r-1] and [k + r,31] of the corresponding unit value.

Note that a unit counts file in "oascii" format contains exactly 10 + 2^k + M lines.

List of Sequence Ids to Process
List of Sequence Ids to Exclude from Processing

Both files have identical format. WindowMasker uses the first word of the sequence FASTA title (with or without the leading '>' character) as the sequence id. There should be one sequence id per line. Below is an example of such a file.

    gi|20349357|ref|NT_033777.1|
    >gi|20340552|ref|NT_033778.1|

WindowMasker Interval Output Format

By default the output of WindowMasker Stage 2 is in interval format ("-outfmt interval"). The file is an ASCII text file consisting of blocks of information for each input sequence, in the order those sequences appear in the input FASTA file. Each block starts with the FASTA title of the sequence followed by the description of masked intervals, one interval per line. The intervals do not overlap and are sorted by their start position. (NOTE: the positions are numbered starting at 0.) Each line describing a masked interval has the following structure:

    START - END

where START and END are decimal integers representing the end points of the masked interval.

Below is a sample part of WindowMasker output in interval format:

    >AC084726.10.8677.10816
    2 - 27
    45 - 63
    75 - 103
    144 - 191
    201 - 222
    266 - 308
    324 - 355
    398 - 426
    446 - 468
    569 - 628
    647 - 676
    711 - 774
    815 - 865
    897 - 924
    961 - 1004
    1039 - 1126
    1212 - 1261
    1285 - 1310
    1367 - 1473
    1479 - 1509
    1521 - 1600
    1626 - 1673
    1683 - 1757
    1766 - 1809
    1817 - 1948
    1956 - 2013
    2026 - 2052
    2082 - 2139
    >AC084726.10.16360.16465
    >AC084726.10.17911.20089
    10 - 147
    240 - 293
    365 - 444
    460 - 589
    627 - 654
    682 - 712
    756 - 786
    827 - 845
    903 - 928
    950 - 978
    1060 - 1199
    1225 - 1340
    1369 - 1397
    1443 - 1464
    1482 - 1600
    1703 - 1726
    1747 - 1775
    1823 - 1872
    1904 - 1974
    1982 - 2006
    2064 - 2084
    2099 - 2167

SAMPLE WINDOWMASKER SESSION

Assuming the list of genome files for human build 34 is in the file ./fa_list, the following command will run the first stage of WindowMasker. Checking for duplicates is requested, so all candidates for duplicates will be reported on standard error (stderr). The possible duplicates reported below turn out to be false positives. During Stage 1, WindowMasker also shows its progress by printing dots to standard error (stderr).

    ./windowmasker -mk_counts -fa_list true -in fa_list -checkdup true -out ustat.15

    Possible duplication of sequences:
    subject: lcl|Hs1_27103_34 and query: lcl|Hs1_33153_34
    at intervals
    subject: 5539 --- 35539
    query : 36748455 --- 36778455
    Possible duplication of sequences:
    subject: lcl|Hs1_78005_34 and query: lcl|Hs1_78000_34
    at intervals
    subject: 173111 --- 233111
    query : 26192 --- 86192
    <...MANY MESSAGES SKIPPED...>
    Computing the genome length........................done.
    Pass 1................................................................................................
    Pass 2................................................................................................

At this point the file ustat.15 contains the unit counts that can be used in Stage 2.
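
As an illustration of the "ascii" unit counts format described in section FILES, the following Python sketch reads such a file into memory. It is not part of WindowMasker: the function name parse_unit_counts and the returned structure are hypothetical, and the sample data is a shortened copy of the human build 34 excerpt shown in section FILES.

```python
def parse_unit_counts(lines):
    """Parse the lines of an "ascii" unit counts file.

    Returns (unit_size, counts, thresholds), where counts maps the
    numerical unit value to its count and thresholds maps names such
    as "t_low" to integers. Hypothetical helper, not a WindowMasker API.
    """
    unit_size = None
    counts = {}
    thresholds = {}
    for raw in lines:
        line = raw.strip()
        # Empty lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            # Threshold line: >PARAM_NAME VALUE
            name, value = line[1:].split()
            thresholds[name] = int(value)
        elif unit_size is None:
            # First data line: the unit size.
            unit_size = int(line)
        else:
            # Count line: UNIT_VALUE (hexadecimal) COUNT (decimal)
            unit, count = line.split()
            counts[int(unit, 16)] = int(count)
    return unit_size, counts, thresholds


sample = """\
15
0 154
1 154
1a0 154
1a1 47
>t_low 16
>t_extend 57
>t_threshold 86
>t_high 154
""".splitlines()

unit_size, counts, thresholds = parse_unit_counts(sample)
print(unit_size, counts[0x1A1], thresholds["t_threshold"])  # prints: 15 47 86
```

To read the file produced by the command above, an open file object such as open("ustat.15") can be passed as the lines argument. Note that a full genome-wide counts file can be large, so holding all counts in a dictionary may need considerable memory.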

To mask the first human chromosome applying both WinMask and symmetric DUST, the following command can be used, assuming chr1.fa is in the current directory. During Stage 2 no progress information is printed on standard error (stderr).

    ./windowmasker -ustat ustat.15 -in chr1.fa -out chr1.wm -dust true

The output in the interval format is written to chr1.wm.
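
A file in interval format, such as chr1.wm, can be post-processed with a few lines of code. The Python sketch below is an illustration only and not part of WindowMasker: the function names parse_intervals and masked_bases are hypothetical, and the base count assumes that END is an inclusive, 0-based position, which this document does not state explicitly. The sample data is taken from the interval format example in section FILES.

```python
def parse_intervals(lines):
    """Map each sequence title to its list of (start, end) intervals.

    Hypothetical helper for the interval output format: a block starts
    with a '>' title line and is followed by "START - END" lines.
    """
    result = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            result[current] = []
        else:
            if current is None:
                raise ValueError("interval line before any sequence title")
            start, _, end = line.partition("-")
            result[current].append((int(start), int(end)))
    return result


def masked_bases(intervals):
    # Assumption: END is inclusive, so an interval covers end - start + 1 bases.
    return sum(end - start + 1 for start, end in intervals)


sample = """\
>AC084726.10.8677.10816
2 - 27
45 - 63
>AC084726.10.16360.16465
>AC084726.10.17911.20089
10 - 147
""".splitlines()

masks = parse_intervals(sample)
for title, intervals in masks.items():
    print(title, masked_bases(intervals))
# prints:
# AC084726.10.8677.10816 45
# AC084726.10.16360.16465 0
# AC084726.10.17911.20089 138
```

A sequence with no masked intervals, like the second one in the sample, still gets an entry with an empty list, which mirrors how the interval file lists its title with no interval lines.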