Installation
============

Input File Formats
==================

Ribosomal RNA predictions file (parameter -inrrna)
--------------------------------------------------

The format of the ribosomal RNA predictions file is derived from the output format of one of the components of the NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP). Each line of the input file specifies exactly one exon of an rRNA annotation. The format of the line is:

    start stop annotation_id strand type [comments...]

Example:

    100     500     1       +       16S     comment
    600     1600    1       +       16S     comment
    1800    1920    2       +       5S      comment
    2000    4900    3       +       23S     comment

"Start" and "stop" define the location of the exon (1-based) in the nucleotide sequence of the submission in absolute values. "Start" should always be less than "stop".

"annotation_id" specifies an id of the rRNA annotation that is unique within this submission. If the same id appears on different lines of the input file, the annotation has more than one exon.

"strand" is either "+" or "-".

"type" is the type of RNA.

"comments" are optional and are not used in computation.

tRNA predictions file (parameter -intrna)
-----------------------------------------

Please refer to the tRNAscan-SE documentation (e.g. http://lowelab.ucsc.edu/tRNAscan-SE/trnascanseReadme.html) for the "abbreviated" tRNA prediction output format.

BLAST file format (parameters -inblast and -inblastcdd)
-------------------------------------------------------

This format corresponds to the output produced by the command

    blastall -p blastp -m 0 ...

(for the -inblast parameter) or

    rpsblast -m 0 ....

(for the -inblastcdd parameter). For a description, see http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.615

Output File Formats
===================

"bad.strand" and "missing" files
----------------------------------------------

File consists of one or more sections each referring to one "case" of a problem.
Each section consists of the section head:

    ==== ====

a short description of the problem:

    --- ---

and several lines, each representing one exon of either an original or a predicted annotation that is in the vicinity of the problematic annotation. Each line contains several tab-separated fields:

1. Tag. The tag can be one of CENTER_REFERENCE, REFERENCE or VICINITY. It identifies whether the line describes an original or a predicted annotation and its relevance to the particular problem. CENTER_REFERENCE refers to the predicted annotation that is directly relevant to the problem. Depending on the nature of the problem ("missing" or "bad.strand"), it represents either a missing annotation or a predicted annotation that matches an original annotation of the same type but on the wrong strand. REFERENCE refers to a predicted annotation that happens to be in the vicinity. VICINITY refers to an original annotation that happens to be in the vicinity.
2. Size of the location span containing all relevant annotations. Should be the same for all lines in one section. Unit: base.
3. Type of annotation. Can be tRNA, CDS, 5S, 16S or 23S.
4. Unique identifier of the annotation and locus tag. Format ().
5. One-number location of the annotation for the purposes of quicksort. Currently it is redundant, since it coincides with the start of the annotation.
6. Name of the annotation as specified by the /product descriptor in the original ASN.1 file, or an automatically generated name if it is a predicted annotation.
7. Start location.
8. Stop location. Stop is always larger than start.
9. Strand (+ or -). Note that the tool assumes a plus strand for annotations with unspecified strands.

This output format is geared towards graphical representation of the identified problems, as in the web version of the tool.

"frameshift" file
-----------------

File consists of one or more sections each referring to one "case" of a problem.
Each section consists of the section head:

    ==== ====

which specifies the "left" or "right" protein annotation out of two proteins that might potentially constitute one frameshifted protein. In the case of a "left" protein annotation, this section head is followed by one or more subsections, titled

    --- Potential frame shift evidence found ---

each corresponding to one particular BLAST hit that reveals the problem.

^^^^^^^^^^^^^^^^

The next three lines in each subsection list GenBank-style ids for the two adjacent query protein annotations (q1 and q2 at the beginning of the line) and the subject from the BLAST database. Example:

    q1 | lcl | ref|YP_001169943.1||Rsph17025_3775 | hypothetical protein
    q2 | lcl | ref|YP_001169942.1||Rsph17025_3774 | hypothetical protein
    s | gi|146280039|ref|YP_001170196.1| hypothetical protein

^^^^^^^^^^^^^^^^

The next six lines describe a schematic alignment (or juxtaposition) of the two query annotations relative to the subject.

The first line again lists the two query annotations ("left" and "right"), separated by a tab, with the strand in parentheses after each annotation.

The second line specifies the locations of the query annotations in the format:

    [...][...]

Example:

    [24690...24902][24362...24658]

which is visible as

    [24690...24902] [24362...24658]

The third line is empty.

The three subsequent lines list the sizes (in bases) of the pieces of the three involved proteins that align with each other:

    q       qlb     qlh     qla     s       qrb     qrh     qra
    s       slb     slh     ...     ...     ...     sla     ...
    s       ...     srb     ...     ...     ...     srh     sra

For example, with tabs expanded to eight columns, those three lines might look like this:

    q       0       70      0       11      23      26      49
    s       210     70      ...     ...     ...     168     ...
    s       ...     393     ...     ...     ...     26      29

Each protein is broken into three pieces: the piece that enters the BLAST hit alignment and the pieces preceding and following it. In addition, a "spacer" piece separates the adjacent original protein annotations.
The first letter of each three-letter abbreviation (for example, "qlb") refers to the query ("q") or the subject ("s"), the second letter identifies which query ("l" for left, "r" for right) is aligned to the subject, and the third letter refers to the piece before the BLAST hit ("b"), the BLAST hit piece itself ("h") or the piece after the BLAST hit ("a"). For example, "qlb" refers to the piece preceding the BLAST alignment of the first query and the subject ("q" for query, "l" for left query, "b" for before). The standalone "s" in the middle of the first line stands for the spacer between the two queries.

Thus, each column in that table represents an aligned portion of the two schematic alignments of the two queries with the same subject. Lines 1 and 2 represent the alignment of the "left" query and the subject, and lines 1 and 3 represent the alignment of the "right" query and the subject. Note that the "srb" piece (393 in the example above) does not align with a single piece of the "left" query, but with the part of the input nucleotide sequence represented by the pieces qlb, qlh, qla, s and qrb. Similarly, "sla" (subject, left, after; 168 in the example above) aligns with the pieces qla, s, qrb, qrh and qra taken together.

This schematic representation of the alignment is essentially a low-tech, text-only, no-GUI equivalent of the corresponding graphical output of the web version of this tool.

^^^^^^^^^^^^^^^^

The next three lines specify, in a slightly different way, the difference between the alignments of the subject to the "left" and "right" queries. The first two lines are deprecated and will disappear in future versions. The third line, in an intentionally human-readable format, specifies the nature of the potential frameshift, that is, whether it involves an artefact of a deletion or an insertion that occurred during sequencing or evolution. Example:

    diff_left, diff_right: 79, 79
    diff_edge_left, diff_edge_right: 79, 79
    Potential deletetion of a nucleotide sequence equivalent to 79 occurred.
Here, 79 is equal to the shift between the locations of the subject when it is aligned to the left and to the right query.

"overlap", "complete.overlap" or "rna.overlap" file
---------------------------------------------------

File consists of one or more sections each referring to one "case" of a problem. Each section has a section head:

    ==== ====

which specifies the "left" protein annotation out of two overlapping proteins. This section head is followed by a subtitle

    --- Complete overlap found ---

or

    --- Potential overlap found ---

or

    --- Potential RNA overlap found ---

which additionally identifies the type of file.

^^^^^^^^^^^^^^^^

The next two lines list GenBank-style ids for the two adjacent overlapping protein annotations (q1 and q2 at the beginning of the line). Example:

    q1 | lcl | ref|YP_001169716.1||Rsph17025_3541 | hypothetical protein
    q2 | lcl | ref|YP_001169717.1||Rsph17025_3542 | hypothetical protein

^^^^^^^^^^^^^^^^

The next line lists the two annotations again, this time separated by tabs, with the strand in parentheses after each annotation.

^^^^^^^^^^^^^^^^

The next line specifies the locations of the overlapping annotations in the format:

    [...] [...]

Example:

    [12800...13294] -2[13260...13694] -3

which is visible as

    [12800...13294] -2 [13260...13694] -3

Only the difference between the two frame specifiers is important.

The next line is empty, and the last line in the section specifies the overlap, rounded to the number of amino acids.

"partial" file
--------------

File consists of one or more sections each referring to one "case" of a problem. Each section consists of the section head:

    ==== ====

which specifies the problematic protein annotation. This section head is followed by one or more subsections, titled

    --- Potential partial protein annotation found ---

each followed by two lines specifying the problematic protein annotation ("Query") and the subject from the CDD database ("Subject").
Format: where "tag" is "Query" or "Subject", "id" is a GenBank-style name, and "start" and "stop" define the position of the BLAST hit alignment relative to the beginning of the query or subject.
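
As an illustration of the rRNA predictions input format described under "Input File Formats" (parameter -inrrna), the following Python sketch reads such a file and groups exons by annotation_id. It is not part of the tool: the names Exon and parse_rrna_predictions are hypothetical, and the sketch assumes that fields are separated by whitespace (spaces or tabs). It enforces only the two constraints stated above, namely that "start" is less than "stop" and that "strand" is "+" or "-".

```python
from collections import defaultdict
from typing import NamedTuple


class Exon(NamedTuple):
    """One line of the rRNA predictions file (hypothetical container)."""
    start: int
    stop: int
    annotation_id: str
    strand: str
    rna_type: str


def parse_rrna_predictions(lines):
    """Group exons by annotation_id. Illustrative helper, not part of the tool."""
    annotations = defaultdict(list)
    for line_no, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        # Assumption: fields are whitespace-separated; anything after the
        # fifth field is treated as a comment and ignored.
        fields = line.split(None, 5)
        if len(fields) < 5:
            raise ValueError(f"line {line_no}: expected at least 5 fields")
        start, stop = int(fields[0]), int(fields[1])
        annotation_id, strand, rna_type = fields[2], fields[3], fields[4]
        if start >= stop:
            raise ValueError(f"line {line_no}: start must be less than stop")
        if strand not in ("+", "-"):
            raise ValueError(f"line {line_no}: strand must be '+' or '-'")
        annotations[annotation_id].append(
            Exon(start, stop, annotation_id, strand, rna_type)
        )
    return dict(annotations)


# The example lines from the "Ribosomal RNA predictions file" section.
example = """\
100 500 1 + 16S comment
600 1600 1 + 16S comment
1800 1920 2 + 5S comment
2000 4900 3 + 23S comment
"""

annotations = parse_rrna_predictions(example.splitlines())
for annotation_id, exons in annotations.items():
    print(annotation_id, exons[0].rna_type, [(e.start, e.stop) for e in exons])
```

With the example input, annotation 1 ends up with two exons because its id appears on two lines, while annotations 2 and 3 have one exon each.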