## The Challenge

The complexity of the computational protein design problem is very large (Park et al. 2004, 2005; Rosenberg and Goldblum 2006; Butterfoss and Kuhlman 2006). Even for a small protein of 100 residues, there are 20100 = 10130 different sequence possibilities. This collection of sequences, each constructed in a single copy, would occupy a space larger than the whole universe. Additional complexity arises if one tries to model protein flexibility. It remains intractable to perform full-scale molecular dynamics simulations within the protein design calculation. Hence, most protein design studies consider only movement of protein side chains while the protein backbone remains fixed. Flexibility of amino-acid side chains is typically modeled by using a discrete set of statistically significant empirical conformations, called rotamers (see Chapter 13). With a larger number of rotamers used to represent each amino acid, the movement of side chains is modeled more accurately; but clearly the design problem becomes more complex. Even if only three rotamers were to be used for each amino acid, this exponentially increases the complexity of designing a 100-residue protein to 60100 = 10178 different rotameric assignments.

In protein design, the goal is to search over this huge sequence space and to find the best (lowest energy) sequence for the particular protein scaffold. The input to the computational protein design problem usually consists of a protein backbone structure, N sequence positions to be designed, the amino acids (and their respective rotamers) permitted at each position, and an energy function. The energy function, used to evaluate candidate protein sequences, is generally pairwise and thus consists of two basic components, corresponding to rotamer-template and rotamer-rotamer interactions. The template can include the fixed backbone atoms, residues not subject to subsequent optimization, and atoms within the rotamer (for which pseudoenergies are derived from rotamer library statistics).

The energy of a rotameric assignment r = (r1v. ,,rN) is calculated by summing up energies of rotamer r at position i interacting with the fixed template [E(ri)\ and energies of rotamer r interacting with a rotamer r^ at a neighboring position j [E(ri,rj)\:

In protein design calculations, these energy terms are precomputed to implicitly generate a discrete energy landscape, where each point represents a rotameric assignment and the energy designated for it. The problem is thus reduced to searching this landscape and finding the lowest energy sequence(s) compatible with the target structure (the global minimum energy configuration, GMEC) (Rosenberg and Goldblum 2006).

An alternative goal of computational protein design is the calculation of probabilities of the various amino acids at each design position (over all possible sequences compatible with the structure) (Park et al. 2005). Such probabilities essentially provide a position-specific amino acid score matrix (PSSM) and thus can be useful in creating biased combinatorial design libraries (see Chapter 4). Formally, amino-acid probabilities at design position i are sought:

where S{ denotes the amino-acid assignment at sequence position i and a is a particular amino acid; the right-hand side sums the probabilities of all sequences S for which amino acid a is chosen at position i.

In computational protein design, the number of possible rotamer choices increases exponentially with the sequence length and the landscape possesses a very large number of local minima. Clearly, simple enumeration and evaluation of each rota-meric assignment remains infeasible. Moreover, protein design has been demonstrated to belong to the computational class of NP-hard combinatorial optimization problems for which no guaranteed polynomial time algorithms are known (Pierce and Winfree 2002). Facing these challenges, computational protein design researchers must resort to algorithms that may compromise on exactness in more difficult cases but nevertheless have been shown to provide satisfying results in practice.

Finally, we note that the ultimate success of the algorithms developed for protein design strongly depends on the input of the problem, including the expressions for the energy function (see Chapter 12) and the size and quality of the rotamer library used (see Chapter 13).

In this chapter, we detail the major protein design algorithms previously applied. We then note some general preprocessing steps often taken in computational protein design. Lastly, we summarize these methods, so that their advantages and weaknesses are explicitly compared.

## Post a comment