Introduction

This document describes blast4, an ASN.1 interface to BLAST. The National Center for Biotechnology Information provides free BLAST services to the public using this interface (over HTTP) and others. NCBI's BLAST source code is in the public domain, so other organizations may choose to run their own BLAST servers.

The functionality provided by this interface is similar to that provided by the URL API. Either interface will work for many applications, but application programmers may find this interface to be more convenient.

For more information on accessing NCBI's public BLAST servers through this interface, please refer to Appendix 2: Using NCBI's BLAST Servers.

We welcome your suggestions, comments, and questions about this specification. Please email them to us at toolbox@ncbi.nlm.nih.gov.

Protocol

blast4 clients send Blast4-request objects and receive Blast4-reply objects. Each request is answered with one reply. The particular encoding used for requests and replies depends on the communication mechanism used and is not part of this specification. Depending on the communication mechanism used, a session may consist of one request and one reply or of multiple requests and their replies.

Requests

A Blast4-request consists of an optional ident, which identifies the application sending the request, and a body, which contains one of several specific requests:

Blast4-request ::= SEQUENCE {
    ident                   VisibleString OPTIONAL,
    body                    Blast4-request-body
}

Blast4-request-body ::= CHOICE {
    finish-params           Blast4-finish-params-request,
    get-databases           NULL,
    get-matrices            NULL,
    get-parameters          NULL,
    get-programs            NULL,
    get-search-results      Blast4-get-search-results-request,
    get-sequences           Blast4-get-sequences-request,
    queue-search            Blast4-queue-search-request
}

The structure of Blast4-reply is similar:

Blast4-reply ::= SEQUENCE {
    errors                  SEQUENCE OF Blast4-error OPTIONAL,
    body                    Blast4-reply-body
}

Blast4-reply-body ::= CHOICE {
    finish-params           Blast4-finish-params-reply,
    get-databases           Blast4-get-databases-reply,
    get-matrices            Blast4-get-matrices-reply,
    get-parameters          Blast4-get-parameters-reply,
    get-programs            Blast4-get-programs-reply,
    get-search-results      Blast4-get-search-results-reply,
    get-sequences           Blast4-get-sequences-reply,
    queue-search            Blast4-queue-search-reply
}

errors contains any informational, warning, or error messages related to the processing of the request. Warnings indicate that the server processed the request successfully, but that the results may differ from what the user anticipated. Errors indicate that the server was unable, in whole or in part, to process the request.
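
The request structure above can be illustrated with a small Python sketch. It models a request or reply as a dict whose body is a (choice-name, payload) tuple. This in-memory representation and the helper names make_request and reply_matches are assumptions made for illustration; they are not part of blast4 or of any NCBI library. The sketch also assumes that a reply uses the alternative with the same name as the request it answers.

```python
# Alternatives of Blast4-request-body, as listed in the CHOICE above.
REQUEST_CHOICES = {
    "finish-params", "get-databases", "get-matrices", "get-parameters",
    "get-programs", "get-search-results", "get-sequences", "queue-search",
}

# Alternatives whose request payload is NULL.
NULL_REQUESTS = {"get-databases", "get-matrices", "get-parameters", "get-programs"}


def make_request(choice, payload=None, ident=None):
    """Build a dict shaped like Blast4-request (hypothetical in-memory form)."""
    if choice not in REQUEST_CHOICES:
        raise ValueError(f"unknown request alternative: {choice}")
    if choice in NULL_REQUESTS and payload is not None:
        raise ValueError(f"{choice} takes no payload (NULL)")
    request = {"body": (choice, payload)}
    if ident is not None:  # ident is OPTIONAL
        request["ident"] = ident
    return request


def reply_matches(request, reply):
    """True if the reply body uses the same alternative as the request body."""
    return request["body"][0] == reply["body"][0]
```

A client could call reply_matches after decoding each reply to catch a mismatched response before interpreting its payload.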

Although there are many requests, the queue-search and get-search-results requests are the most important.

queue-search

The queue-search request is used to initiate a BLAST search:

Blast4-queue-search-request ::= SEQUENCE {
    program                 VisibleString,
    service                 VisibleString,
    queries                 Bioseq-set,
    subject                 Blast4-subject,
    paramset                VisibleString OPTIONAL,
    params                  Blast4-parameters OPTIONAL
}

program and service select a program in the BLAST family and a service offered by that program. The complete set of programs and services offered is returned by the get-programs request.

queries specifies the query sequences, that is, the sequences that will be compared against the subject.

subject specifies the sequences against which the query sequences will be searched. The sequences can be specified indirectly, through databases, or directly, through sequences:

Blast4-subject ::= CHOICE {
    database                VisibleString,
    sequences               SEQUENCE OF Bioseq
}

paramset is used to include a named set of parameters. Including a named set of parameters is equivalent to prepending the parameters in the set to params.

params is used to override default parameter settings selected by the server or parameter settings included via paramset. There are many parameters that can be specified, but none are required; the server will attempt to set reasonable values for those that are not specified. For more information, refer to Appendix 1: Search Parameters. To learn more about default values set by the server, please refer to the finish-params request.
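
The interaction between paramset and params can be sketched in Python. The sketch assumes parameters are represented as (name, value) pairs and that, when a name occurs more than once, the later occurrence wins, which is how the prepend-then-override rule above is read here. The function name effective_params is hypothetical, and the sketch does not model server defaults.

```python
def effective_params(paramset_params, params):
    """Combine the parameters of a named set with explicitly given params.

    The named set is treated as if it were prepended to params, so an
    explicit entry with the same name overrides the entry from the set.
    """
    merged = {}
    for name, value in list(paramset_params) + list(params):
        merged[name] = value  # later occurrences override earlier ones
    return merged
```

For example, if a (made-up) paramset contributes gap-open = 5 and the request's params contain gap-open = 7, the merged result carries gap-open = 7.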

The reply to a queue-search request contains a request-id, which can be used later to retrieve the results of the search:

Blast4-queue-search-reply ::= SEQUENCE {
    request-id              VisibleString OPTIONAL
}

get-search-results

The get-search-results request is used to retrieve the results of a BLAST search:

Blast4-get-search-results-request ::= SEQUENCE {
    request-id              VisibleString
}

Blast4-get-search-results-reply ::= SEQUENCE {
    alignments              Seq-align-set OPTIONAL,
    phi-alignments          Blast4-phi-alignments OPTIONAL,
    mask                    Blast4-mask OPTIONAL,
    ka-blocks               SEQUENCE OF Blast4-ka-block OPTIONAL,
    search-stats            SEQUENCE OF VisibleString OPTIONAL
}

The elements returned are all optional; which ones are included depends on the particular search.
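
The two-step flow of queue-search followed by get-search-results can be sketched as follows. Everything here is a stand-in: FakeBlast4Server is a hypothetical in-memory substitute for a real server, the "RID-n" identifier format is invented, errors are shown as plain strings rather than Blast4-error objects, and the program, service, and database names in the usage note are placeholders. The sketch only shows how the request-id links the two requests.

```python
import itertools


class FakeBlast4Server:
    """In-memory stand-in for a blast4 server (hypothetical, for illustration)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._searches = {}

    def handle(self, request):
        choice, payload = request["body"]
        if choice == "queue-search":
            request_id = f"RID-{next(self._ids)}"
            self._searches[request_id] = payload
            return {"body": ("queue-search", {"request-id": request_id})}
        if choice == "get-search-results":
            if payload["request-id"] not in self._searches:
                return {"errors": ["unknown request-id"],
                        "body": ("get-search-results", {})}
            # All reply elements are OPTIONAL; this fake returns only search-stats.
            return {"body": ("get-search-results", {"search-stats": ["fake run"]})}
        raise ValueError(f"unsupported request: {choice}")


def run_search(server, search):
    """Queue a search, then fetch its results using the returned request-id."""
    queued = server.handle({"body": ("queue-search", search)})
    request_id = queued["body"][1].get("request-id")  # request-id is OPTIONAL
    if request_id is None:
        raise RuntimeError(f"search was not queued: {queued.get('errors')}")
    reply = server.handle({"body": ("get-search-results",
                                    {"request-id": request_id})})
    return request_id, reply["body"][1]
```

Calling run_search(FakeBlast4Server(), {"program": "blastn", "service": "megablast", "subject": ("database", "nt")}) returns the invented request-id together with the fake result elements. A real client would also inspect errors in each reply.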

finish-params

With the queue-search request, the actual parameter values may differ from those explicitly specified by the user: some may be read from a named parameter set (a paramset), while others may be set by the server by default. For some applications and users, it may be important to know exactly which values the server will use to execute a search. The finish-params request takes arguments similar to those of the queue-search request and returns a complete, or finished, set of parameters:

Blast4-finish-params-request ::= SEQUENCE {
    program                 VisibleString,
    service                 VisibleString,
    paramset                VisibleString OPTIONAL,
    params                  Blast4-parameters OPTIONAL
}

Blast4-finish-params-reply ::= Blast4-parameters

Blast4-parameters ::= SEQUENCE OF Blast4-parameter

The params returned in the reply show the values of all search parameters whose values are not zero, false, the empty string, or null.
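
The reporting rule above can be sketched in Python. The sketch assumes parameters are (name, value) pairs of plain Python values, with None standing in for null; the function name reportable is hypothetical and this is not the server's actual implementation.

```python
def reportable(params):
    """Keep only parameters whose value is not zero, false, empty string, or null."""
    kept = []
    for name, value in params:
        if value is None or value is False or value == "":
            continue
        # bool is a subclass of int in Python, so exclude it from the zero test.
        if isinstance(value, (int, float)) and not isinstance(value, bool) and value == 0:
            continue
        kept.append((name, value))
    return kept
```

A consequence of this rule is that a parameter missing from the reply may still have a well-defined value (zero, false, the empty string, or null); clients should not treat absence as "unset".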

get-databases

The get-databases request is used to enumerate the names of databases known to the server. These names are the legal values of the subject.database element of a queue-search request.

Blast4-get-databases-reply ::= SEQUENCE OF Blast4-database-info

Blast4-database-info ::= SEQUENCE {
    database                Blast4-database,
    description             VisibleString,
    last-updated            VisibleString,
    total-length            BigInt,
    num-sequences           BigInt
}

get-matrices

The get-matrices request is used to enumerate the scoring matrices known to the server. These are the matrices that can be specified by name in the matrix search parameter.

Blast4-get-matrices-reply ::= SEQUENCE OF Blast4-matrix-id

Blast4-matrix-id ::= SEQUENCE {
    residue-type            Blast4-residue-type,
    name                    VisibleString
}

get-parameters

The get-parameters request is used to enumerate the search parameters known to the server. This request is not intended to be initiated directly by an end user, and the results are not intended to be displayed to an end user; rather, this request helps clients construct a user interface dynamically so that they can accommodate changes in the set of known search parameters without modification. Clients are not required to use this request; they may choose instead to support just those search parameters that are known when they are written.

Blast4-get-parameters-reply ::= SEQUENCE OF Blast4-parameter-info

Blast4-parameter-info ::= SEQUENCE {
    name                    VisibleString,
    type                    VisibleString
}
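
A client that builds its interface dynamically might work along the following lines. The sketch assumes the get-parameters reply has already been decoded into (name, type) pairs and that the type names match the scalar alternatives of Blast4-value in Appendix 1; both points are assumptions, as are the helper names build_form and parse_input. Only four scalar types are handled.

```python
# Converters from user-entered text to Python values, keyed by the scalar
# type names used in Blast4-value. Only a few types are covered here.
CONVERTERS = {
    "boolean": lambda text: text.strip().lower() in ("true", "t", "1", "yes"),
    "integer": int,
    "real": float,
    "string": str,
}


def build_form(parameter_info):
    """Split (name, type) pairs into supported fields and unsupported names."""
    fields, skipped = {}, []
    for name, type_name in parameter_info:
        if type_name in CONVERTERS:
            fields[name] = CONVERTERS[type_name]
        else:
            skipped.append(name)
    return fields, skipped


def parse_input(fields, user_text):
    """Convert the text entered for each known field to a typed value."""
    return {name: fields[name](text)
            for name, text in user_text.items() if name in fields}
```

Parameters with types the client does not understand end up in skipped, so a newly introduced parameter is ignored rather than breaking the client.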

get-paramsets

The get-paramsets request is used to enumerate the named sets of search parameters (the "paramsets") known to the server. Paramsets may make it easier for users to tailor their searches to achieve specific results, but they are never required.

Blast4-get-paramsets-reply ::= SEQUENCE OF Blast4-paramset-info

Blast4-paramset-info ::= SEQUENCE {
    program                 VisibleString,
    name                    VisibleString
}

Names of paramsets are unique (within the scope of a particular program) and are designed to be descriptive enough that no separate description is needed. Names are not required to follow any particular form.

get-programs

The get-programs request is used to enumerate the combinations of program and service that may be specified in a queue-search request.

Blast4-get-programs-reply ::= SEQUENCE OF Blast4-program-info

Blast4-program-info ::= SEQUENCE {
    program                 VisibleString,
    services                SEQUENCE OF VisibleString
}

Names of services are unique (within the scope of a particular program) and are designed to be descriptive enough that users will be able to make reasonable choices based on program and service alone. Names are not required to follow any particular form and may be relatively long (perhaps 40 characters or more).
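
A client can use the get-programs reply to check a program and service pair before queuing a search. The Python sketch below assumes the reply has been decoded into a list of dicts with "program" and "services" keys mirroring Blast4-program-info; the function name is_offered is hypothetical, and the program and service names in the usage note are placeholders.

```python
def is_offered(program_infos, program, service):
    """True if the (program, service) pair appears in a get-programs reply."""
    for info in program_infos:
        if info["program"] == program and service in info["services"]:
            return True
    return False
```

With a decoded reply such as [{"program": "blastn", "services": ["plain", "megablast"]}], is_offered(reply, "blastn", "megablast") is true, while the same service under a program that does not list it is rejected.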

Appendix 1: Search Parameters

Search parameters are specified as name-value pairs:

Blast4-parameter ::= SEQUENCE {
    name                    VisibleString,
    value                   Blast4-value
}

Blast4-value ::= CHOICE {
    -- scalar types:
    big-integer             BigInt,
    bioseq                  Bioseq,
    boolean                 BOOLEAN,
    cutoff                  Blast4-cutoff,
    integer                 INTEGER,
    matrix                  Blast4-matrix,
    real                    REAL,
    seq-align               Seq-align,
    seq-id                  Seq-id,
    seq-loc                 Seq-loc,
    strand-type             Blast4-strand-type,
    string                  VisibleString,
    -- lists of scalar types:
    big-integer-list        SEQUENCE OF BigInt,
    bioseq-list             SEQUENCE OF Bioseq,
    boolean-list            SEQUENCE OF BOOLEAN,
    cutoff-list             SEQUENCE OF Blast4-cutoff,
    integer-list            SEQUENCE OF INTEGER,
    matrix-list             SEQUENCE OF Blast4-matrix,
    real-list               SEQUENCE OF REAL,
    seq-align-list          SEQUENCE OF Seq-align,
    seq-id-list             SEQUENCE OF Seq-id,
    seq-loc-list            SEQUENCE OF Seq-loc,
    strand-type-list        SEQUENCE OF Blast4-strand-type,
    string-list             SEQUENCE OF VisibleString,
    -- imported collection types:
    bioseq-set              Bioseq-set,
    seq-align-set           Seq-align-set
}
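
When building a Blast4-parameter, a client has to pick the Blast4-value alternative that matches the value's type. The following Python sketch shows one way to do that for plain Python values. It covers only boolean, integer, real, string, and their list forms; big-integer and the sequence-related alternatives are left out, and the function names are hypothetical.

```python
def choose_alternative(value):
    """Return the Blast4-value alternative name suited to a plain Python value."""
    scalar = _scalar_name(value)
    if scalar is not None:
        return scalar
    if isinstance(value, (list, tuple)) and value:
        names = {_scalar_name(item) for item in value}
        # A list maps to a "-list" alternative only if all items share one type.
        if len(names) == 1 and None not in names:
            return names.pop() + "-list"
    raise TypeError(f"no Blast4-value alternative chosen for {value!r}")


def _scalar_name(value):
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "real"
    if isinstance(value, str):
        return "string"
    return None
```

Empty and mixed-type lists are rejected because the sketch cannot infer which list alternative is intended.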

The following table shows the legal names and their corresponding value types:

parameter | type | description
cutoff | cutoff | Only hits with e-values below the cutoff e-value or normalized scores above the cutoff score will be reported.
db-genetic-code | integer | Code used to translate database from nucleotide to protein. (See Table of Genetic Codes.)
culling | boolean | If true, hit lists are culled by keeping at most a certain number (hsp-range-max?) of HSPs in a range. (where is the size of the range set?)
ungapped-alignment | boolean | If true, ungapped alignments are allowed.
entrez-query | string | Used to construct an OID list. (which is used how?)
i-thresh | integer | E-value threshold for inclusion in a PSI-BLAST multiple alignment. (See Gapped BLAST and PSI-BLAST: a new generation of protein database search programs and Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.)
filter | string | A string that specifies when and how the query sequences are to be masked. Please refer to Appendix 3: Filter Strings.
first-db-seq, final-db-seq | integer | Only sequences with OIDs between first-db-seq and final-db-seq will be searched.
gap-open, gap-extend | integer | Penalties applied for opening and extending gaps, respectively. The penalty for a gap of N residues is gap-open + N * gap-extend. Meaningful only if ungapped-alignment is false.
gi-list | integer-list | Collection of sequences, specified by a list of gi numbers, against which queries will be compared.
hitlist-size | integer | Maximum number of database sequences for which to save hits.
hsp-range-max | integer | Maximum number of HSPs to save in any region. Meaningful only when culling is true.
matrix | string, matrix | Substitution matrix containing similarity scores for all possible pairs of residues, specified by either name or value. (See Basic local alignment search tool and Table of Genetic Codes.)
perc-ident | real | Only alignments in which at least this percentage of query residues are identical to the corresponding subject residues will be reported.
nucl-penalty, nucl-reward | integer | Penalty for a nucleotide mismatch and reward for a nucleotide match, respectively. Called the scores for mismatches and identities in DNA sequence comparisons in Basic local alignment search tool.
phi-pattern | string | TBD
pseudocount-weight | integer | Called "beta" in Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. (See Equation 5.)
genetic-code | integer | Code used to translate query from nucleotide to protein. (See Table of Genetic Codes.)
query-mask | seq-loc-list | Locations of query residues to be masked. Words spanning these locations are not included in the initial word table. With hard masking, these locations are also treated as unknown residues during extension.
required-start, required-end | integer | Only alignments that contain this region will be reported.
searchsp-eff | real | User-specified search space; overrides the value calculated by BLAST.
strand-option | strand-type | Specifies whether to search the forward strand, the reverse strand, or both strands of the query sequences.
template-length | integer | Length of a megablast discontiguous words template. Meaningful only for service=megablast.
template-type | integer | Type of a megablast discontiguous words template. Legal values are TBD. Meaningful only for service=megablast.
use-comp-based-stats | boolean | If true, uses composition-based statistics as described in Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.
window-size | integer | Called "w" in Basic local alignment search tool, "W" in Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
word-threshold | integer | Called "T" in Basic local alignment search tool.
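
The gap penalty formula given for gap-open and gap-extend can be written out as a short Python function. The function name gap_penalty and the rejection of gaps shorter than one residue are choices made for this sketch; the numbers in the usage note are arbitrary.

```python
def gap_penalty(n, gap_open, gap_extend):
    """Penalty for a gap of n residues: gap-open + n * gap-extend."""
    if n < 1:
        raise ValueError("a gap spans at least one residue")
    return gap_open + n * gap_extend
```

With gap-open = 5 and gap-extend = 2, a gap of 3 residues costs 5 + 3 * 2 = 11. Note that under this formula even a one-residue gap pays both the opening and one extension penalty.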

Appendix 2: Using NCBI's BLAST Servers

NCBI provides a C++ wrapper to this interface (ncbi::blast::CRemoteBlast, as part of the xblast library). The C++ wrapper automatically encodes requests, decodes replies, and handles communication with the server. The C++ wrapper is the only supported way to use this interface.

Appendix 3: Filter Strings

Filter strings consist of any number of the following options, separated by spaces or semicolons. For options that take parameters, parameters follow the letter which specifies the option and are separated, from the option letter and from each other, by spaces. Default values are shown in parentheses.

Options:

m
Soft masking: residues specified by query-mask are masked only while building initial words, not during extension. (The default is hard masking, in which residues are also masked during extension.)
F
No filtering.
T
Normal filtering (DUST for blastn, SEG for others).
C
Coiled-coil filter, based on the work of Lupas et al. (Science, vol. 252, pp. 1162-4 (1991)) and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)). Optional parameters: window (22), cutoff (40.0), linker (32).
S
SEG filter. Optional parameters: window (12), locut (2.2), hicut (2.5).

Examples:

C 28 40.0 32
Coiled-coil with window = 28, cutoff = 40.0, and linker = 32.
C;S
SEG and coiled-coil, each with default parameters.
m S
SEG filter with default parameters; the query is masked only while the initial words are being built.
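
The filter string syntax described above can be parsed with a few lines of Python. This is a sketch, not NCBI's parser: it assumes parameters are positional and appear in the order listed for each option, as in the example "C 28 40.0 32", and it treats spaces and semicolons interchangeably as separators. The function name parse_filter_string is hypothetical.

```python
import re

# Option letters and their optional parameters with defaults, as listed above.
OPTIONS = {
    "m": [],
    "F": [],
    "T": [],
    "C": [("window", 22), ("cutoff", 40.0), ("linker", 32)],
    "S": [("window", 12), ("locut", 2.2), ("hicut", 2.5)],
}


def parse_filter_string(text):
    """Parse a filter string into a list of (option, {parameter: value}) pairs."""
    parsed = []
    for token in re.split(r"[;\s]+", text.strip()):
        if not token:
            continue
        if token in OPTIONS:
            defaults = OPTIONS[token]
            # Track the parameter names not yet supplied for this option.
            parsed.append((token, dict(defaults), [name for name, _ in defaults]))
        elif parsed and parsed[-1][2]:
            option, values, pending = parsed[-1]
            name = pending.pop(0)
            # Keep the type of the default: int stays int, float stays float.
            values[name] = type(values[name])(token)
        else:
            raise ValueError(f"unexpected token in filter string: {token!r}")
    return [(option, values) for option, values, _ in parsed]
```

Parameters that are omitted keep their defaults, so "C;S" yields both filters with the default values shown in parentheses above.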

References