## Faster

The fast and accurate side-chain topology and energy refinement (FASTER) method (Desmet et al. 2002) is a heuristic iterative optimization method inspired by the previous success of the dead-end elimination (DEE) techniques. It attempts to take advantage of the elimination power of DEE without requiring its large computational run-times, while trading off a certain degree of optimality. In general, it does so by assuming a given rotamer assignment at all but one protein position and then exhaustively optimizing the remaining position [i.e., first-order "quenching" (Voigt et al. 2000)\. This method is related to DEE in the following way: If all positions except one are already assigned the respective rotamer from the GMEC, this quenching approach will correctly obtain the GMEC rotamer for the remaining position.

In detail, FASTER iteratively proceeds in attempting to approach the GMEC. Initially, pairwise interactions are ignored and it trivially optimizes each rotamer position individually, considering only the rotamer self-energies:

yielding the starting rotamer assignment at position i (rÂ° ). At a subsequent iteration 5, the rotamer choice at position i (r/ ) is reoptimized within the context of the previous iteration's rotamer configuration (r5-1, the intermediate minimum energy configuration, IMEC):

This update rule (Figure 14.2C) is employed to simultaneously make new rotamer choices at all positions i.

There exist various optimizations for FASTER (both integral and optional steps). These include the perturbation-relaxation-evaluation (PRE) procedure, where a given residue is fixed to a given rotamer choice ("perturbation") and then Equation 14.7 is used to update all positions other than the currently fixed position ("relaxation"). This is performed for all residues and for each possible rotamer choice, where for each such per-residue rotamer fixation, the rotamer assignment yielded is kept only if it has obtained the minimal energy observed thus far ("evaluation"). Furthermore, PRE can additionally be performed for higher orders (two or more fixed positions and application of Equation 14.7 for nonfixed positions), with an accompanying increase in run-time (e.g., cubic time for two fixed positions).

Additional preprocessing and midprocessing (e.g., considering only pairs of proximate residues in second-order PRE) conditions are applied to obtain further computational speedups. Further significant run-time enhancements (up to two orders of magnitude) were provided in improvements of the initialization method by using MC-derived low energy configurations instead of Equation 14.6. Improvements were

r3: Single rotamer choice at position 3

Pr(r3): Rotamer probability vector at position 3

2(r2): Position 2's rotamer belief vector, from position 3

r3: Single rotamer choice at position 3

Pr(r3): Rotamer probability vector at position 3

2(r2): Position 2's rotamer belief vector, from position 3

FIGURE 14.2 (see color insert following page 178) Comparison of iterative update techniques (FASTER, SCMF, BP) for computational protein design. (A) Toy protein design problem, consisting of four fully interconnected design positions. (B) The corresponding graph structure of this design problem, where each node corresponds to a position and edges to interactions between them (pairwise rotamer-rotamer energy matrices). (C) The calculations performed for position 3 at a given iteration, where dashed arrows indicate relevant incoming data and solid arrows the results of the update calculations performed for position 3. Note that for both FASTER and SCMF, a single calculation is performed for each position, the results of which are then forwarded to all other positions. In BP, however, a distinct calculation is performed for each neighbor of position 3.

also suggested to the relaxation step of the PRE scheme by exclusively relaxing positions that, energetically, interact strongly with the perturbed positions (Allen and Mayo 2006).

FASTER is the overall fastest search algorithm developed thus far and it was in fact shown to obtain the GMEC in many problems (Desmet et al. 2002; Allen and Mayo 2006). Moreover, it has been enhanced to provide sequences satisfying specific biological constraints, such as fixed amino-acid composition (Horn and Mayo 2006), in order to keep reference state energies relatively constant. This method also provides a collection of low energy rotamer assignments, which can also be utilized to approximate positional rotamer and amino-acid probabilities.

The FASTER algorithm has been applied in a number of experimental protein design studies. Some examples include the development of a monoclonal antibody directed against the von Willebrand factor that inhibits its interaction with fibrillar collagen (Staelens et al. 2006) (though it was used only for side-chain optimization of fixed mutants), the full-sequence design of the engrailed homeodomain (Shah et al. 2007), and the comparison of computationally designed single-site substitutions in the triple-hairpin structure of human immunodeficiency virus gp41 with patient data (Boutonnet et al. 2002).

Self-Consistent Mean Field (scmf)

The methodologies based on self-consistent mean field (SCMF) theory (Lee 1994; Koehl and Delarue 1994; Delarue and Koehl 1997; Voigt et al. 2000) calculate positional rotamer probabilities for the protein design problem. The mean field approximation is derived under the assumption that the probability distribution for a full rotamer assignment is the product of positional rotamer probabilities (Yedidia et al. 2005). We note that SCMF was the first method to specifically focus on the calculation of rotamer probabilities (and not specific rotamer assignments), thus setting it aside from previously applied techniques.

Intuitively, for a given position i, its multiple interactions with all other positions is summarized into an average interaction (mean field, MF), based on the current conformational probability vectors at interacting positions. Mathematically, the mean field applied to position i in rotamer state ri is

jeN(i) rj where N(i) denotes the set of positions neighboring (interacting with) position i and Pr(rj the probability of rotamer p at position j. This averaging approximation reduces the computational complexity of the problem, since for a given mean field energy applied to a position, the rotamer probabilities at that position can be directly determined using Boltzmann's law (Huang 1987; Yedidia et al. 2005):

where ai is a normalization factor, T is the system temperature, and k is Boltzmann's constant. Equations 14.8 and 14.9 are iteratively applied at all positions until convergence (Figure 14.2C). At termination, the amino-acid probabilities are calculated from the rotamer probabilities:

r is a rotamer of aa a where Pr(5i = a) is the probability of amino acid a at position i.

Since SCMF provides amino-acid probabilities and it is always guaranteed to converge (albeit to a local optimum in the approximated probability space), it has unique advantages over alternative design methods. There do exist richer and more general mean field approximations [e.g., (Jordan et al. 1999)], which could provide better solutions in practice. However, these have yet to be actively explored or assessed in the world of computational protein design. Nonetheless, similar approaches have been further refined in the context of protein design. For example, the SCADS (statistical computationally assisted design strategy) (Kono and Saven 2001; Calhoun et al. 2003) and related methods (Zou and Saven 2003; Biswas et al. 2005) generalize mean field theory and apply a self-consistent statistical entropy approach to atomic level protein design in order to predict these probabilities, in a way that is often robust to minor backbone changes.

We note that SCMF, which is designed to directly calculate amino-acid probabilities, could also be used to obtain low energy sequences, by independently choosing the most probable amino acids at each position. However, it has been shown that this has only limited success in yielding the GMEC (Voigt et al. 2000). Nonetheless, the SCADS method was successfully applied in the design of 88 residues of a monomeric helical dinuclear metalloprotein, shown to be structured in both the apo and the holo forms (Calhoun et al. 2003). This method was further used to design water-soluble analogues of the potassium channel KcsA (Slovic et al. 2004), a four-helix bundle protein that selectively binds a nonbiological cofactor (Cochran et al. 2005), a redox-active minimal rubredoxin mimic (Nanda et al. 2005), an ultrafast folding Trp-cage mutant (Bunagan et al. 2006), and ferritin-like proteins with hydrophobic cavities (Swift et al. 2006).

belief Propagation (bp)

Belief propagation (BP) (Pearl 1988; Yedidia et al. 2005) is an approximate inference technique applied to a formulation of protein design known as probabilistic graphical models (Lauritzen 1996). Here, we describe the form of BP utilized to calculate positional amino-acid probabilities (Moore and Maranas 2003; Fromer and Yanover 2008). For calculation of low energy sequences, a corresponding algorithm also exists (Hong and Lozano-Perez 2006; Yanover et al. 2006; Fromer and Yanover 2009).

Briefly, a graph is built where nodes represent design positions, node values correspond to rotamer choices, and edges between nodes represent pairwise rotamer interactions between respective positions (Figure 14.2A,B) (Hong and Lozano-Perez 2006; Fromer and Yanover 2009). In belief propagation, messages are passed between interacting positions, whose contents describe one position's "belief' about its neighborâ€”the relative likelihood of each rotamer at the neighboring position (Figure 14.2C).

A message from one node to a neighboring one is proportional to their pairwise energy (as a Boltzmann probability) and to the product of all incoming messages. Specifically, the message passed from position i to position j takes the form n^N(i)\ j n^N(i)\ j

where the contents of m^j indicate the relative likelihood of each rotameric state at position j. N(i) denotes the set of positions neighboring position i, T is the system temperature, and k is Boltzmann's constant.

After convergence of BP, rotamer probabilities at each design position are assigned as follows:

zEM t-T

where ai is a normalization factor. Note that these probabilities attempt to account for all possible sequences at all other designed positions, and are used to compute amino-acid probabilities as in Equation 14.10.

One of the major advantages of BP is that it has been shown to outperform other methods in certain protein design settings: in quickly and accurately predicting amino acid probabilities (Fromer and Yanover 2008) and in precisely discovering an ensemble of low energy sequences (Fromer and Yanover 2009). It thus provides a flexible framework to directly compute either low energy sequences or amino-acid probabilities, or both. As an example of its usefulness, BP was applied to identify residue-residue clashes in computational hybrids of the E. coli and human versions of glycinamide ribonucleotide (GAR) transformylase, with a large degree of success (Moore and Maranas 2003).

However, a major hurdle for BP is the lack of convergence observed, especially for larger design problems (Fromer and Yanover 2009). We note that the tree-reweighted belief propagation (TRBP) algorithm is a variant of BP that guarantees to find the GMEC if it converges to a unique solution at each position (Yanover et al. 2006; Weiss et al. 2007). Intensifying research to find both convergent and accurate BP algorithms (Weiss et al. 2007; Globerson and Jaakkola 2007) indicates that BP will continue to be a promising method for accurate large-scale protein design in the future, even when using relatively sophisticated energy functions.

We now find it instructive to compare the FASTER, SCMF, and BP algorithms, all three of which perform some form of iterative rotamer updates, which are exchanged between positions. In FASTER, at a given iteration each design position is assigned a specific rotamer, which is then used to update the rotamer choices at other positions. However, in both SCMF and BP, each position maintains (or receives) a probability vector over rotamer choices. These probabilities are then used to update the probability vectors at other positions. In contrast to BP, in both FASTER and SCMF, each position decides on specific rotamer choices (a single choice in FASTER or a probability vector in SCMF) and identically presents those choices to all other positions (Figure 14.2C). In BP, however, each position sends different rotamer probability vectors to its various neighbors; this subtlety may partially contribute to the success of BP, since it bypasses some of the inherent problems associated with such iterative updating performed on cycles of positions (e.g., Figure 14.2A,B).

Linear Programming (lp)

Linear programming (LP) is a general mathematical technique for global optimization of a linear objective function, subject to linear equality constraints. Integer linear programming (ILP) adds the additional constraint that the solution must contain integer values, but makes the computational problem intricately harder.

Inspection of Equation 14.1 indicates that it essentially decomposes the energy of a complete rotamer configuration into a linear sum of energies for positions and pairs of positions. The LP/ILP approach takes advantage of this fact and casts the search for the minimal energy rotamer assignment as an integer linear programming (ILP) problem (Eriksson et al. 2001; Kingsford et al. 2005). Variables xr. and xr. r. are introduced for each rotamer rt at position i and for each rotamer pair r^r^ at positions i,j, respectively. The objective function for minimization through linear programming is then

The ILP formulation for exact calculation of the GMEC is obtained by constraining the variables to (1) be binary integers (can assume only the values 0 or 1), and (2) consistently assign a unique rotamer at each position. A value of 1 for a variable indicates that the corresponding rotamer (or rotamer pair) is chosen in the ILP solution. The LP formulation simply ignores the requirement that the variables be integers and allows them to vary continuously in the range between 0 and 1. This linear programming (LP) relaxation is efficiently solved, possibly yielding a solution in which all variables are integers. Otherwise, a computationally more intensive ILP solver is required. Thus, the LP/ILP approach is guaranteed to obtain the lowest energy rotamer assignment, albeit with no guarantee of a nonexponential run-time. In fact, even when using a simple energy function (van der Waals interactions and a statistical rotamer self-energy), the ILP solver was required for most design problems tested, resulting in very large run-times (Kingsford et al. 2005). Nonetheless, similar approaches have been taken to perform protein library redesign, as computationally demonstrated for a 16-member library of E. coli/B. subtilis dihydrofolate reductase hybrids (Saraf et al. 2006).

## Post a comment