## Approaches For Computational Protein Design

Numerous search algorithms have been developed to search the energy landscape for low energy sequences and their preferred amino acids at each position. These algorithms are divided into two classes: stochastic and deterministic. Stochastic algorithms use probabilistic trajectories, where the resulting sequence depends on initial conditions and a random number generator. Stochastic algorithms do not guarantee finding the GMEC sequence, but they can always find an approximate solution or a set of solutions. This may be sufficient, considering that simplifying assumptions in the energy function and in modeling protein flexibility inevitably result in uncertainty in defining the best protein sequence. In contrast, deterministic algorithms always produce the same solution given the same parameters. Many, but not all, of the deterministic algorithms are guaranteed to find the GMEC sequence if they converge. However, convergence is not guaranteed and the frequency of convergence is reduced with increasing problem size. In the following sections, we describe in detail several search algorithms that have been used in protein design studies, and mention some experimental studies in which they were utilized:

• Stochastic algorithms: Monte Carlo, genetic algorithms, FASTER

• Deterministic algorithms: dead-end elimination, self-consistent mean field, belief propagation, linear programming techniques

Monte Carlo (mc)

The Monte Carlo (MC) method (Metropolis et al. 1953; Kirckpatrick et al. 1983) was one of the first search techniques applied to side-chain prediction and protein design, and it often performs well on difficult energy landscapes. In MC, each sequence position is initially assigned a random rotamer. Randomized substitution attempts are then performed in an iterative manner. At a given step, mutation to a different rotamer is made at a random position. The energy of the new sequence is computed and compared to that of the previous sequence, so that this mutation is accepted or rejected as determined by the Metropolis criterion (Metropolis et al. 1953):

1. If the energy of the new sequence is less than the previous energy, the mutation is accepted.

2. If the energy of the new sequence is greater than the previous energy, the mutation still has a probability of being accepted. This (Boltzmann) probability is defined as p = enew-Eprenous', where p= 1/kT, k is Boltzmann's constant and T is the system temperature.

The strategy of accepting higher energy mutations helps to overcome energy barriers in the sequence landscape. To strengthen this effect, the search system is typically annealed from high to low temperatures (for example, 4000 and 150 K, respectively), in a process known as Monte Carlo simulated annealing (MCSA). MCSA can be run for an arbitrary number of temperature annealing cycles (in which the temperature is successively lowered and raised) and an arbitrary number of substitution attempts per such cycle. In the end, the program generates a list of the lowest energy sequences observed throughout.

For rugged energy landscapes, conventional MC methods tend to become trapped in local minima, thus impeding the sampling of low energy sequences. To circumvent this problem, a few modifications to MC approaches have been suggested. One such modification is quenched MC (MCQ), wherein the algorithm quenches the energy of each solution by calculating the lowest energy rotamer for each position for a given sequence while holding all other rotamers fixed (as in Equation 14.7 in the FASTER algorithm). This quench step has been shown to produce a large improvement in performance (Voigt et al. 2000).

Another enhancement to the standard MC method was introduced in biased Monte Carlo (BMC) (Cootes et al. 2000) and mean field biased Monte Carlo (MFBMC) (Zou and Saven 2003). In these approaches, the amino-acid substitution at a particular position i is not chosen uniformly at random, as in standard MC, but with certain biases meant to approximate Pr(S; = a) (see Equation 14.2). In BMC, these biases are computed for each substitution attempt by evaluating a local energy function at design position i (derived from interactions with other positions in the protein). In MFBMC, on the other hand, they are precalculated before performing the MC sampling by using the self-consistent mean field (SCMF) theory approach (see later in this chapter for a full description).

Further improvement to the standard MC approach resulted from the incorporation of a replica exchange method (REM) into MC, producing a combined algorithm termed MCREM (Yang and Saven 2005), which can also be utilized to predict site-specific amino-acid probabilities. This approach was previously applied to protein folding (Hansmann 1997; Sugita and Okamoto 1999). In MCREM, the MC procedure is simultaneously run on multiple replicas of the same protein, where the simulation for each replica is performed at a different temperature. Throughout the simulation, pairs of rotameric assignments from different replicas are exchanged with some transition probability. This probability, similar to that of the Metropolis criterion, is intended to maintain each temperature's equilibrium ensemble distribution. MCREM was shown to enable the system to surpass high energy barriers and thus reach low energy sequences more efficiently than the standard MC procedure.

MC has been repeatedly applied to both side-chain prediction and protein design problems. Among the most notable examples are the design of a protein with a novel fold (Kuhlman et al. 2003), redesign of endonuclease DNA binding and cleavage specificity (Ashworth et al. 2006), and design of a protein that can switch between two conformations (Ambroggio and Kuhlman 2006).

Genetic Algorithms (GAs)

Genetic algorithms (GAs) were designed to mimic natural evolution, wherein genes of proteins are subjected to rounds of random mutation, recombination, and selection (Holland 1992). Initially, a population of random sequences is generated. Mutations are then randomly applied to each sequence at a rate of 1-2 mutations per sequence. The energies of the mutant sequences are evaluated and the sequences are ranked. The top several sequences are chosen for recombination. For each pair of parental sequences, the sequences are recombined by choosing a rotamer from either of the two sequences with equal probability.

The population of sequences for the next round of the GA is generated by randomly choosing S sequences from the library and comparing their energies. The sequence with the lowest energy is propagated to the next round. This selection process is repeated a number of times to produce a final pool of sequences that will continue to the next round of mutation, recombination, and selection. The selection strength, represented by the number of sequences that undergo selection (S), is analogous to the temperature in MC simulations. By starting at low S and annealing to high S, the population distribution in the sequence space starts out very broad and then narrows after each round to finally converge to a single sequence.

The advantages of using GAs are that the population dynamics can overcome large barriers in the sequence space. Beneficial mutations in different sequences can be combined together onto a single sequence, increasing the number of paths that circumvent local minima. However, GAs tend to perform poorly on highly coupled systems such as sequences of protein cores. This is due to the fact that the crossovers performed can result in the existence of a single deleterious mutation that had been beneficial before the crossover (when coupled with another mutation). In fact, GAs were shown to perform comparatively worse than MCQ on several side-chain placement problems (Voigt et al. 2000).

GA was first proposed as a tool for protein design calculations in a purely computational study (Jones 1994) and later applied to designs that were verified experimentally. These include the redesign of core residues of the phage 434 cro protein (Desjarlais and Handel 1995), ubiquitin (Lazar et al. 1997), and a native-like three-helix bundle (Kraemer-Pecore et al. 2003). In addition, GAs have been applied to optimize interaction specificity in coiled coils (Havranek and Harbury 2003).