Complexity International       /vol01/fullen01/ © Copyright 1994     
Volume 1 Received: 


Genetic Algorithms and Recursive Ensemble Mutagenesis in Protein Engineering

Georg Füllen and Douglas C. Youvan
Department of Chemistry
Massachusetts Institute of Technology
Room 56-213, 77 Massachusetts Avenue, Cambridge, MA 02139, USA

Current Address (George Füllen):
FSPM - Strukturbildungsprozesse
Universität Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany or

Current Address (Douglas C. Youvan):
Palo Alto Institute for Molecular Medicine
2462 Wyandotte Street, Mountain View, CA 94043, USA
Email: fuellen@MIT.EDU



We compare Genetic Algorithms (GAs) and an existing protein engineering method called Recursive Ensemble Mutagenesis (REM), which has proven highly effective both in experimental work and computer simulations. REM tries to find a set of "optimal" amino acids at preselected positions in a given protein so that its functional behavior can be enhanced or altered appropriately. REM utilizes the relationship between genotype (DNA) and phenotype (protein). After a heuristic or experimental evaluation of the fitness of proteins coded by a random initial DNA population, REM calculates a new set of fit DNA sequences, exploiting properties of the genetic code. Moreover, as it would be impractical to resynthesize every single DNA sequence which coded a fit protein, generalized schemata (formae) are calculated that serve as a partially wild-card input for the DNA synthesizer producing the new generation. This procedure has an effect similar to uniform crossover, but it also introduces completely new amino acid residues with related fitness. Although developed independently, REM may be seen as a GA constrained by experimental requirements. However, these constraints give rise to a powerful new optimization and diversification component not found in traditional GAs.

1. Introduction.

One of the most exciting areas in molecular biology is the design of artificial proteins with applications in medicine [10], industrial catalysis [14], and nearly all fields of the life sciences. Originally, computer simulations of REM [2] [22] were meant for the design and prediction of laboratory experiments on proteins of the photosynthetic apparatus [4] [8]. Here, we have a fast way to detect functional proteins. It is relatively easy (although not trivial) to examine the spectra of photosynthetic bacteria by Digital Imaging Spectroscopy [1] [3] [20] [23], to detect whether the artificially mutagenized photosynthetic proteins can bind a bacteriochlorophyll or not. Such new but still functional proteins are studied intensively in photosynthesis research.

In the next section, we will describe the experimental protocol of REM. The computer simulation of these laboratory experiments is difficult because it requires an exact evaluation of a new protein's fitness. Since protein folding is presumably NP-hard [19], only heuristic decision algorithms can be implemented. Using them, REM was simulated as described in section 3. A comparison between REM and GAs is given in section 4. REM's utilization of the genetic code in the optimizing calculation of new genotypes (DNA sequences) from the fit phenotypes is a feature which generalizes GAs to a concept called "generalized REM" (GREM) algorithms. A first discussion of this idea is given in section 5.

2. The experimental protocol of REM.

In our laboratory, the following procedure is executed to engineer functional proteins performing tasks related to an original protein. Given the original sequenced wildtype, the experimenter chooses positions in its amino acid sequence which should be relevant for the protein's performance. These amino acids can be changed by a method called "combinatorial cassette mutagenesis" [11] [15]. Briefly, this method consists of using a DNA synthesizer to produce a gene fragment which then replaces those parts of the original gene that specify the selected amino acids. Synthesizing random DNA leads to an approximately random change at the selected amino acid positions. More exactly, the genetic code's mapping of DNA nucleotide triplets to amino acids determines the amount of each amino acid encoded by random DNA. For example, since Alanine is encoded by 4 of the 64 nucleotide triplets (GCG, GCA, GCC and GCT), a DNA synthesizer producing sequences containing the nucleotides A, C, G, and T with 25% probability at each of the 3 positions gives us about 6% (1/16) Alanine.

Leaving out A and T at every third position, one still encodes every possible amino acid, but their occurrence is distributed more uniformly. We will denote the first kind of DNA synthesizer input by "NNN", the second one by "NN(G,C)". Here, "N" is a wild-card symbol for equiprobable A, C, G, T and (G,C) is the wild-card symbol for 50% G and 50% C. We have adopted the term "doping scheme" to designate a synthesizer input. Note the similarity of this concept to "schemata" in GA theory. In fact, the doping schemes specify "formae", that are generalized schemata [13]. For example, some commercially available DNA synthesizers are able to produce DNA according to the following doping schemes (Fig. 1) :

Nucleotide Mixture             Symbol
25% A, 25% G, 25% C, 25% T     N
33% A, 33% G, 33% C            (A,G,C)
33% A, 33% G, 33% T            (A,G,T)
33% A, 33% C, 33% T            (A,C,T)
33% G, 33% C, 33% T            (G,C,T)
50% A, 50% G                   (A,G)
50% A, 50% C                   (A,C)
50% A, 50% T                   (A,T)
50% G, 50% C                   (G,C)
50% G, 50% T                   (G,T)
50% C, 50% T                   (C,T)
100% A                         A
100% G                         G
100% C                         C
100% T                         T

Figure 1. DNA Synthesizer Doping Schemes

The REM protocol (Fig. 2) starts with synthesizing random [NN(G,C)] DNA, transforming bacteria to produce the new proteins having approximately random amino acids at the chosen positions and then searching for functional proteins, for example by Digital Imaging Spectroscopy (see section 1). The DNA sequences leading to the functional proteins are determined. To avoid duplicates, REM discards those DNA sequences that encode a protein already known to be functional.

Figure 2.The experimental protocol of REM (adapted from [4]).

Now, the GA researcher expects that the DNA sequences encoding functional proteins are viewed as the fit part of the population which will then be recombined by crossover and used to form the new generation. However, in the laboratory this direct manipulation of DNA strands is at least a very tedious if not impossible task. Therefore, we calculate a few or only one new DNA doping scheme as an input for the DNA synthesizer, producing the new generation of DNA sequences in parallel. The new doping scheme is calculated as follows. For each of the amino acid positions where mutagenization takes place, a "target set" is formulated as the multiset of all amino acids that have been observed at this position. Which of the admissible DNA doping schemes can generate the target set as closely as possible? Our DNA synthesizer accepts the nucleotide mixtures of Fig. 1 as input. Three of them form a nucleotide triplet (possibly including wildcard symbols) which in turn determines a multiset of amino acids, the generated set. The generated set should be "close" to the target set. We are using two notions of "closeness":

(1) The target set T is "liberally" closest to the generated set if and only if the sum of squares of the differences

is minimal. Here, i is an index running from 1 to 20, representing the i th amino acid, pT (i) is its frequency in the target set T , and pA (i) is the probability of i if doping scheme A is used.

(2) The target set is "conservatively" closest to the generated set, if and only if the group probability

has maximal value. Here, i takes only the index values of the amino acids in the target set. In this case, all amino acids in the target set must be present in the generated set since,otherwise, PG is zero. In the first case, however, amino acids may be discarded. In both cases, new amino acids may be introduced which are unavoidably encoded by the doping scheme designed to encode the target set. These newly introduced amino acids can play an important role, see below.

Having calculated optimal DNA doping schemes encoding the target sets (one per position) as closely as possible, REM uses them as an input for the DNA synthesizer and generates the new population in parallel, starting the next cycle. Using REM, an up to 10,000,000-fold increase in the observed frequency of functional proteins, compared to random mutagenesis, has been found experimentally [8] [5], mutagenizing up to 16 amino acid positions simultaneously.

3. Computational simulations of REM.

Since some results have been reported elsewhere [22],[2], we will only give a brief summary. As mentioned in the introduction, the difficulty of the protein-folding problem makes an exact simulation of the experiments impossible. However, computer simulations using different heuristic decision algorithms as fitness functions can reveal important aspects of REM (choice of parameters, expected results, etc.). Figure 3, adapted from [22], shows the efficiency of REM in generating protein populations of a desired functionality. In this example, the decision algorithm used evaluates the probability that the mutagenized protein contains a binding site for bacteriochlorophyll (see section 1). The population size is set to 10,000, the maximal cardinality of functional proteins used to determine the DNA doping scheme for the next iteration is 50. The plots are average results of REM starting with 10 different initial populations. Plot B (green) shows results using PG, whereas plot C (red) shows results using SSD. These may be compared with plot A (blue), where a randomly generated population is used for each iteration. Note that in this application of REM our goal is not a single optimum but a diverse population of functional proteins.

Figure 3. Computer Simulations of REM. Formulation of the doping scheme was done by either PG (plot B, green) or SSD (plot C, red). For plot A, no intelligent doping scheme was calculated. (From [22], adapted)

4. Comparisons between GAs and REM.

Experimental constraints are the main reason for the differences between current REM and canonical GAs. These are:

1. In the currently implemented version, the fitness function is set to either 1 (retain) or 0 (discard), just as the protein engineer distinguishes between functional and non-functional proteins. This constraint, however, can be relaxed immediately for those applications of REM which allow a continuous measurement of fitness. The squares of the differences used by the SSD calculation and the probabilities of the amino acids used by the PG calculation will then be multiplied with the fitness value. Both calculations will return a doping scheme more frequently expressing the amino acids belonging to the fit proteins.

2. The recombination procedure of REM is similar to uniform crossover [18] if one views the DNA synthesizer doping schemes as representions of the populations of DNA they are describing. Every doping scheme containing wildcard symbols can be identified with the set of all DNA strands it represents. Then, ignoring the effects of the SSD or PG calculations, the fit nucleotide triplets of the old generation of DNA are faithfully represented by the doping scheme, from which the new generation of DNA is sampled randomly. Indeed, REM can be formulated this way, see [2]. In GA terminology, we are calculating "formae", generalized schemata [13] in an intermediate step, and we use the formae to construct the new generation. Traditional n-point-crossover between DNA strands cannot be employed experimentally, because it would be quite impractical to implement.

3. A distinction between genotype and phenotype appears occasionally in the GA literature (for example, [7] [12] [6]). REM, however, utilizes the natural relationship between DNA and encoded proteins [2]. The genotype-phenotype mapping is given by the genetic code. as discussed in section 2. Being degenerate, the genetic code maps 64 possible nucleotide triplets to 20 possible amino acids (and "stop"). As shown below in Figure 4, information about the physical properties of amino acids is encoded in the second of the three nucleotides. This structural property of the genetic code is exploited by the SSD or PG dependent calculation of the new DNA doping scheme from the target set. As mentioned in section 2, some amino acids not included in the target set are "accidentally" encoded by the doping scheme used. Now there is an enhanced probability that these newly introduced amino acids are similar to the ones in the target set, at least as far as hydropathy and molar volume are concerned. For example, if we try to express Methionine (M), Phenylalanine (F), and Valine (V) optimally, the amino acid set which is "liberally closest" (that is, has a minimal SSD value) with respect to this target set is encoded by the dope "(A,G,T), T, (G,C)", yielding M, F, V and Leucine (L), Isoleucine (I). As Figure 4 shows, L and I have hydropathy values similar to the other three. Since hydropathy and molar volume seem to be the most important parameters for protein folding and function (see, for example, [17]), the mutagenized protein has a higher probability of being functional; any large change of these values would lead to a high risk of obtaining a non-functional protein. This, together with the general feedback mechanism provided by selection explains the advantages of REM reported in the end of both sections 2 and 3.

Fig. 4. Hydropathy versus molar volume of amino acids (single-letter code), grouped with respect to the nucleotide at the second codon position. The outer (yellow) region contains the group with Guanine (G) at the second position (i.e. W, R, G, S, C; doping scheme "NGN"). Here, most of the amino acids have extreme values or special properties. Serine belongs to two groups ("NGN" and "NCN"). (Adapted from [21].)

5. Generalized REM algorithms: a broader view of GAs.

In this section, the observations made at the end of the last section are used to motivate an algorithm optimizing not only by selection and recombination or mutation, but also by changing populations while mapping between different representations of increasing redundancy. The action of a GA can be seen as a flip-flop process between a representational space and a property space, where recombination and/or mutation occur in the former and fitness evaluation takes place in the latter. (Therefore, there are two changes of the population in standard GAs: the selection of the fittest, and the recombination or mutation. In this section we will investigate the third population change arising from REM.)

At least for engineering problems, the property space can be huge: it may be spanned by all sorts of dependent variables ("features") that can be calculated from the proposed solutions. In the protein engineering case, the proposed amino acid sequence is the starting point for calculating physical properties; for instance, tabulating average hydropathy and molar volume values of each amino acid position. Later, one tries to calculate properties non-locally, of the whole protein, and one main goal is the determination of the three-dimensional structure of the protein. These problems are largely unresolved, and in many cases, the cell environment has to be taken into account, making feature calculation even more complex. In REM, exact feedback about a protein's fitness is given by the experiment.

We have just introduced the concept of a more and more detailed property space, and the fitness evaluation can be seen as an exploration of this space. Furthermore, this exploration is a knowledge-dependent decompression of the original information (the amino acid sequence), because one uses more bits to encode the three-dimensional structure than one uses to encode the amino acid sequence. We will now see that this decompression can be mirrored in the representation space.

Looking at the degree of information decompression, one can imagine the more and more detailed (redundant) coding on the left-hand side, and the more and more detailed (evaluated) property space on the right-hand side, with a compact encoding of the phenotype in the center (Fig. 5, the action of a canonical GA serves as an example) :

Figure 5. Action of a GA in an abstract setting. The representation spaces are on the left, and the physical property spaces are on the right. (The "X axis" represents the degree of information decompression (using knowledge about the application domain). This degree is minimal in the middle of the picture, and maximal on the far left or far right. The "Y axis" is used to visualize exactly this degree of decompression.)

On the left of Fig. 5, every vertical line represents one possible representation space. The longer the line, the more redundant is the representation. On the right, every vertical line corresponds to one property space, the length of the line indicating its complexity (that is, number of variables, degree of resolution).

For the protein engineering case, we utilize a redundant genotype (DNA), as shown in Figure 6 :

Figure 6. Action of REM in the abstract setting of Fig. 5. REM's new feature, the calculation of doping schemes using SSD/PG, is marked in blue.

As we observed in sections 3 and 4, the genetic code is not a random mapping from the primary phenotype to the genotype; instead the correlation of special physical properties (hydropathy, molar volume) and the second nucleotide position of the triplet helps in the optimization process. By including "neighboring" DNA nucleotide triplets, one can, with high probability, include new non-disruptive amino acids "neighboring" in the physical property space. This was the side-effect of the SSD or PG calculations, which are marked by a blue arrow in Fig. 6. In other words, a weak quasi-homomorphism (similar to [9] [16]) between maps in the representational and the property space is exploited. Clusters in the representational space tend to correspond to clusters in the physical property space.

The last ingredient of our algorithm is the recombination taking place on the redundant encoding level. In the protein engineering case, this is the DNA level, and uniform crossover is simulated by the calculation of a doping scheme (forma) used for the new generation of DNA.

Generalized REM algorithms optimize in two directions (Fig.6). Moving to the right, fitness is evaluated and used for selection; moving to the left, the mapping between representations of different redundancy is implicitly optimizing the population. On the left, reproduction can take place, hopefully recombining building blocks that contribute to high fitness on the right, and yielding new building blocks that represent structures of even higher fitness in the property space.

If our goal is the construction of a diverse population of fit individuals, generalized REM algorithms are especially suited because the calculated formae are the start of a new diverse generation of fit phenotypes.


The authors would like to thank the referees and S. Delagrave for their helpful comments. This work was funded by NIH GM42645, DOE 9102-025, DOE DE-FG02ER20019, and by the Human Frontiers Science Program.


1 Arkin, A. P., E. Goldman, S. J. Robles, W. J. Coleman, C. A. Goddard, M. M. Yang and D. C. Youvan. 1990. Bio/Technology 8: 746-749.

2 Arkin, A. P., Youvan, D. C. (1992). Proc. Natl. Acad. Sci. U.S.A. 89:7811-7815.

3 Arkin, A. P., Youvan, D. C. (1993) In Deisenhofer H. & Norris JR. (eds) The Photosynthetic Reaction Center, Vol. 1 (pp. 133-155) Academic Press, New York.

4 Delagrave, S., Goldman, E. R., Youvan, D. C. (1993). Protein Eng. 6:327-331.

5 Delagrave, S., Youvan, D. C. (1993) Bio/Technology 11: 1548-1552.

6 Füllen, G. (1992) In: Parallel Problem Solving from Nature 2, R. Männer, B. Manderick (Ed.) (pp. 351-61) Elsevier Science Publishers.

7 Gerrits, M., Hoogeweg, P. (1990). In: Parallel Problem Solving from Nature 1, H.P. Schwefel, R. Männer, (Ed.) (pp. 70-74) Springer, Berlin.

8 Goldman, E. R., Youvan, D. C. (1992). Bio/Technology 10:1557-1561.

9 Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.A. (1986). Induction. Processes of Inference, Learning, and Discovery. MIT Press, Cambridge MA.

10 Houghten, R. A., C. Pinilla, S. E. Blondelle, J. R. Appel, C. T. Dooley and J. H. Cuervo. (1991). Nature 354: 84-86.

11 Oliphant, A. R., Nussbaum, A. L., Struhl, K. (1986). Gene 44:177-183.

12 Polani, D., Uthmann, T. (1992). In: Parallel Problem Solving from Nature 2, R. Männer, Manderick B (Ed.) (pp. 421-29) Elsevier publishing Co. Amsterdam.

13 Radcliffe, N.J. (1991). Complex Systems 5: 183-205.

14 Ratledge, C. and D. Lewis. (1991). J. Chem. Tech. Biotech. 28: 109-110.

15 Reidhaar-Olson, J. F., Bowie, J. U., Breyer, R. M., Hu, J. C., Knight, K. L., Lim, W. A., Mossing, M. C., Parsell, D. A., Shoemaker, K. R., Sauer, R. T. (1991). Meth. Enzym. 208:564-587.

16 Riolo, R. (1991). In: Proc. third International Conference on Genetic Algorithms. J.D. Schaffer, (Ed.). (pp. 322-27) Morgan Kaufmann, San Mateo.

17 Robles, S., Youvan, D.C. (1993). J. Mol. Biol. 232: 242-252.

18 Syswerda, G. (1991). In: Proc. third International Conference on Genetic Algorithms, J.D. Schaffer, (Ed.). (pp. 2-9) Morgan Kaufmann, San Mateo.

19 Unger, R., Moult, J.. (1993). J. Mol. Biol. 231: 75-81.

20 Yang M. M., Youvan, D. C. (1988). Bio/Technology 8:746-749.

21 Yang, M. M., W. J. Coleman and D. C. Youvan. 1990. In Reaction Centers of Photosynthetic Bacteria. M.-E. Michel-Beyerle. (Ed.) (Springer-Verlag, Germany) 209-218.

22 Youvan, D. C., Arkin, A. P., Yang M. M. (1992). In: Parallel problem solving from Nature, 2 R. Männer, Manderick B (Ed.) (pp 401-410) Elsevier publishing Co. Amsterdam.

23 Youvan, D. C., Goldman, E., Delagrave, S., & Yang, M. M. (1993). Meth. Enzym. in press.