|
/vol01/fullen01/ | © Copyright 1994 | |||
| Volume 1 | Received: Accepted: |
00/00/1994 00/00/1994 |
|||
Georg Füllen and Douglas C. Youvan
Department of Chemistry
Massachusetts Institute of Technology
Room 56-213, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
Current Address (George Füllen):
FSPM - Strukturbildungsprozesse
Universität Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany
Email:fuellen@mathematik.uni-bielefeld.de or fuellen@mit.edu
Current Address (Douglas C. Youvan):
Palo Alto Institute for Molecular Medicine
2462 Wyandotte Street,
Mountain View, CA 94043, USA
Email: fuellen@MIT.EDU
We compare Genetic Algorithms (GAs) and an existing protein engineering method called Recursive Ensemble Mutagenesis (REM), which has proven highly effective both in experimental work and computer simulations. REM tries to find a set of "optimal" amino acids at preselected positions in a given protein so that its functional behavior can be enhanced or altered appropriately. REM utilizes the relationship between genotype (DNA) and phenotype (protein). After a heuristic or experimental evaluation of the fitness of proteins coded by a random initial DNA population, REM calculates a new set of fit DNA sequences, exploiting properties of the genetic code. Moreover, as it would be impractical to resynthesize every single DNA sequence which coded a fit protein, generalized schemata (formae) are calculated that serve as a partially wild-card input for the DNA synthesizer producing the new generation. This procedure has an effect similar to uniform crossover, but it also introduces completely new amino acid residues with related fitness. Although developed independently, REM may be seen as a GA constrained by experimental requirements. However, these constraints give rise to a powerful new optimization and diversification component not found in traditional GAs.
1. Introduction.
One of the most exciting areas
in
molecular biology is the design of artificial proteins with
applications in medicine [10], industrial
catalysis [14], and nearly all fields of the
life sciences. Originally, computer simulations of REM [2] [22]
were meant for the design
and prediction of laboratory experiments on proteins of the
photosynthetic
apparatus [4] [8].
Here, we have a fast way to detect functional proteins.
It is relatively easy (although not trivial) to examine the spectra of
photosynthetic bacteria by Digital Imaging Spectroscopy [1] [3] [20] [23],
to detect whether the artificially mutagenized
photosynthetic proteins can bind a bacteriochlorophyll or not. Such
new but still functional proteins are studied intensively in
photosynthesis research.
In the next section, we will describe the experimental protocol of REM. The computer simulation of these laboratory experiments is difficult because it requires an exact evaluation of a new protein's fitness. Since protein folding is presumably NP-hard [19], only heuristic decision algorithms can be implemented. Using them, REM was simulated as described in section 3. A comparison between REM and GAs is given in section 4. REM's utilization of the genetic code in the optimizing calculation of new genotypes (DNA sequences) from the fit phenotypes is a feature which generalizes GAs to a concept called "generalized REM" (GREM) algorithms. A first discussion of this idea is given in section 5.
2. The experimental protocol of REM.
In our laboratory, the following procedure is executed to
engineer functional proteins performing tasks related to an original
protein. Given the original sequenced wildtype, the experimenter
chooses positions in its
amino acid
sequence which should be relevant
for the protein's performance. These amino acids can be changed by a
method called "combinatorial cassette mutagenesis" [11] [15].
Briefly, this
method consists of using a DNA synthesizer to produce a gene fragment
which then replaces those parts of the original gene that specify the
selected amino acids. Synthesizing random DNA leads to an
approximately random change at the selected amino acid positions.
More exactly, the
genetic code's mapping
of DNA nucleotide triplets to
amino acids determines the amount of each amino acid encoded by random
DNA. For example, since Alanine is encoded by 4 of the 64 nucleotide
triplets (GCG, GCA, GCC and GCT), a DNA synthesizer producing
sequences containing the nucleotides A, C, G, and T with 25%
probability at each of the 3 positions gives us about 6% (1/16)
Alanine.
Leaving out A and T at every third position, one still
encodes every possible amino acid, but their occurrence is distributed more
uniformly. We will denote the first kind of DNA synthesizer input by
"NNN", the second one by "NN(G,C)". Here, "N" is a wild-card symbol
for equiprobable A, C, G, T and (G,C) is the wild-card symbol for 50%
G and 50% C. We have adopted the term "doping scheme" to designate a
synthesizer input. Note the similarity of this concept to "schemata"
in GA theory. In fact, the doping schemes specify "formae", that
are generalized schemata
[13]. For example, some
commercially available DNA synthesizers are able to produce DNA
according to the following doping schemes (Fig. 1) :
Figure 1. DNA Synthesizer Doping Schemes
The REM protocol (Fig. 2) starts with synthesizing random
[NN(G,C)] DNA, transforming bacteria to produce the new proteins
having approximately random amino acids at the chosen positions and
then searching for functional proteins, for example by Digital Imaging
Spectroscopy (see section 1). The DNA sequences leading to the
functional proteins are determined. To avoid duplicates, REM discards
those DNA sequences that encode a protein already known to be
functional.
Figure 2.The experimental protocol of REM (adapted from [4]).
Now, the GA researcher expects that the DNA sequences
encoding functional proteins are viewed as the fit part of the population which will then be
recombined by crossover and used to
form the new generation. However, in the laboratory this direct manipulation of DNA strands is at
least
a very tedious if not impossible task. Therefore, we calculate a few
or only one new DNA doping scheme as an input for the DNA synthesizer,
producing the new generation of DNA sequences in parallel. The new
doping scheme is calculated as follows. For each of the amino
acid positions where mutagenization takes place, a "target set" is formulated as the multiset of all
amino acids that have been observed at this position. Which of the admissible DNA
doping schemes can generate the target set as closely as possible?
Our DNA synthesizer accepts the nucleotide mixtures of
Fig. 1 as
input. Three of them form a nucleotide triplet (possibly including wildcard symbols) which in turn
determines a multiset of amino acids, the generated set. The
generated set should be "close" to the target set. We are using two
notions of "closeness": is minimal. Here, i is
an index running from 1 to 20, representing the i th amino
acid, pT (i) is its frequency in the target set T , and
pA (i) is the probability of i if doping scheme A is
used. (2) The target set is "conservatively" closest to the
generated set, if and only if the group probability has maximal value. Here, i takes only the index values of the amino
acids in the target set. In this case, all amino acids in the target
set must be present in the generated set since,otherwise, PG is zero.
In the first case, however, amino acids may be discarded. In both
cases, new amino acids may be introduced which are unavoidably encoded
by the doping scheme designed to encode the target set. These newly
introduced amino acids can play an important role, see below. Having calculated optimal DNA doping schemes encoding the
target sets (one per position) as closely as possible, REM uses them
as an input for the DNA synthesizer and generates the new population
in parallel, starting the next cycle. Using REM, an up to
10,000,000-fold increase in the observed frequency of functional
proteins, compared to random mutagenesis, has been found
experimentally [8] [5],
mutagenizing up to 16 amino acid positions simultaneously.
Figure 3. Computer
Simulations of REM. Formulation of the doping scheme
was done by either PG (plot B, green) or SSD (plot C, red). For plot A,
no intelligent doping scheme was calculated. (From [22], adapted)
Nucleotide Mixture Symbol
-------------------------------------
25% A, 25% G, 25% C, 25% T N
33% A, 33% G, 33% C (A,G,C)
33% A, 33% G, 33% T (A,G,T)
33% A, 33% C, 33% T (A,C,T)
33% G, 33% C, 33% T (G,C,T)
50% A, 50% G (A,G)
50% A, 50% C (A,C)
50% A, 50% T (A,T)
50% G, 50% C (G,C)
50% G, 50% T (G,T)
50% C, 50% T (C,T)
100% A A
100% G G
100% C C
100% T T
3. Computational simulations of REM.
Since some results have been reported elsewhere [22],[2], we
will only give a brief summary. As mentioned in the introduction, the
difficulty of the protein-folding problem makes an exact simulation of
the experiments impossible. However, computer simulations using
different heuristic decision algorithms as fitness functions can
reveal important aspects of REM (choice of parameters, expected
results, etc.). Figure 3, adapted from [22],
shows the efficiency of REM in generating protein populations of a
desired functionality. In this example, the decision algorithm used evaluates the
probability that the mutagenized protein contains a binding site for
bacteriochlorophyll (see section 1). The population size is set to 10,000,
the maximal cardinality of functional proteins used to determine the
DNA doping scheme for the next iteration is 50. The plots are average
results of REM starting with 10 different initial populations. Plot B (green)
shows results using
PG, whereas plot C (red) shows results using SSD. These
may be compared with plot A (blue), where a randomly generated population is
used for each iteration. Note that in this application of REM our goal
is not a single optimum but a diverse population of functional
proteins.
4. Comparisons between GAs and REM.
Experimental constraints are the main reason for the differences
between current REM and canonical GAs. These are:
1. In the currently implemented version, the fitness function is set to either 1 (retain) or 0 (discard), just as the protein engineer distinguishes between functional and non-functional proteins. This constraint, however, can be relaxed immediately for those applications of REM which allow a continuous measurement of fitness. The squares of the differences used by the SSD calculation and the probabilities of the amino acids used by the PG calculation will then be multiplied with the fitness value. Both calculations will return a doping scheme more frequently expressing the amino acids belonging to the fit proteins.
2. The recombination procedure of REM is similar to uniform crossover [18] if one views the DNA synthesizer doping schemes as representions of the populations of DNA they are describing. Every doping scheme containing wildcard symbols can be identified with the set of all DNA strands it represents. Then, ignoring the effects of the SSD or PG calculations, the fit nucleotide triplets of the old generation of DNA are faithfully represented by the doping scheme, from which the new generation of DNA is sampled randomly. Indeed, REM can be formulated this way, see [2]. In GA terminology, we are calculating "formae", generalized schemata [13] in an intermediate step, and we use the formae to construct the new generation. Traditional n-point-crossover between DNA strands cannot be employed experimentally, because it would be quite impractical to implement.
3. A distinction between genotype and phenotype appears occasionally in the GA literature (for example, [7] [12] [6]). REM, however, utilizes the natural relationship between DNA and encoded proteins [2]. The genotype-phenotype mapping is given by the genetic code. as discussed in section 2. Being degenerate, the genetic code maps 64 possible nucleotide triplets to 20 possible amino acids (and "stop"). As shown below in Figure 4, information about the physical properties of amino acids is encoded in the second of the three nucleotides. This structural property of the genetic code is exploited by the SSD or PG dependent calculation of the new DNA doping scheme from the target set. As mentioned in section 2, some amino acids not included in the target set are "accidentally" encoded by the doping scheme used. Now there is an enhanced probability that these newly introduced amino acids are similar to the ones in the target set, at least as far as hydropathy and molar volume are concerned. For example, if we try to express Methionine (M), Phenylalanine (F), and Valine (V) optimally, the amino acid set which is "liberally closest" (that is, has a minimal SSD value) with respect to this target set is encoded by the dope "(A,G,T), T, (G,C)", yielding M, F, V and Leucine (L), Isoleucine (I). As Figure 4 shows, L and I have hydropathy values similar to the other three. Since hydropathy and molar volume seem to be the most important parameters for protein folding and function (see, for example, [17]), the mutagenized protein has a higher probability of being functional; any large change of these values would lead to a high risk of obtaining a non-functional protein. This, together with the general feedback mechanism provided by selection explains the advantages of REM reported in the end of both sections 2 and 3.
Fig. 4. Hydropathy versus molar volume of
amino acids (single-letter code),
grouped with respect to the nucleotide at the
second codon position. The outer (yellow) region contains the group with
Guanine (G) at the second position (i.e. W, R, G, S, C; doping scheme "NGN"). Here, most of the amino
acids have extreme values or special properties.
Serine belongs to two groups ("NGN" and "NCN"). (Adapted from [21].)
At
least for engineering problems, the property space can be huge: it may
be spanned by all sorts of dependent variables ("features") that can
be calculated from the proposed solutions. In the protein engineering
case, the proposed amino acid sequence is the starting point for
calculating physical properties; for instance, tabulating average
hydropathy and molar volume values of each amino acid position.
Later, one tries to calculate properties non-locally, of the whole protein, and one
main goal is the determination of the three-dimensional structure of
the protein. These problems are largely unresolved, and in many cases,
the cell environment has to be taken into account, making feature
calculation even more complex. In REM, exact feedback about a
protein's fitness is given by the experiment.
We have just
introduced the concept of a more and more detailed property space, and
the fitness evaluation can be seen as an exploration of this space.
Furthermore, this exploration is a knowledge-dependent decompression of the original
information (the amino acid sequence), because one uses more
bits
to encode the three-dimensional structure than
one uses to encode the amino acid sequence. We will now see that this
decompression can be mirrored in the representation space.
Looking at the degree of information decompression, one can imagine the
more and more detailed (redundant) coding on the left-hand side, and the
more and more detailed (evaluated) property space on the right-hand
side, with a compact encoding of the phenotype in the center (Fig. 5,
the action of a canonical GA serves as an example) : Figure 5. Action of a GA in an abstract
setting. The representation spaces are on the left, and the physical
property spaces are on the right. (The "X axis" represents the degree of
information decompression (using knowledge about the application domain). This degree is minimal in
the middle of the picture, and maximal
on the far left or far right. The "Y axis" is used to visualize exactly
this degree of decompression.)
On the left of Fig. 5, every vertical
line represents one possible representation space. The longer the
line, the more redundant is the representation. On the right, every
vertical line corresponds to one property space, the length of the
line indicating its complexity (that is, number of variables, degree of
resolution).
For the protein engineering case, we utilize a redundant genotype (DNA), as shown in Figure 6 :
Figure 6. Action of REM in the abstract setting of Fig. 5. REM's new feature, the calculation
of doping schemes using SSD/PG, is marked in blue.
As we observed in sections 3 and 4, the genetic code is not
a random mapping from the primary phenotype to the genotype; instead
the
correlation of special physical properties (hydropathy, molar
volume) and the second nucleotide position of the triplet helps in the
optimization process. By including "neighboring" DNA nucleotide
triplets, one can, with high probability, include new non-disruptive amino acids
"neighboring" in the physical property space. This was the
side-effect of the SSD or PG calculations,
which are marked
by a blue arrow in Fig. 6. In other words, a weak
quasi-homomorphism (similar to [9] [16])
between maps in the representational and the property space is
exploited. Clusters in the representational space tend to correspond
to clusters in the physical property space.
The last ingredient of our algorithm is the recombination taking place on the redundant
encoding level. In the protein engineering case, this is the DNA level, and uniform crossover
is simulated by the calculation of a doping scheme (forma) used for the
new generation of DNA.
Generalized REM algorithms optimize in two directions (Fig.6). Moving to the right, fitness
is evaluated and used for selection; moving to the left, the mapping between representations
of different redundancy is implicitly optimizing the population. On the left, reproduction can
take place, hopefully recombining
building blocks
that contribute to high fitness on the right, and yielding new building blocks that represent
structures of even higher fitness in the property space.
If our goal is the construction of a diverse population of fit individuals, generalized REM
algorithms are especially suited because the calculated formae are the start of a new diverse
generation of fit phenotypes.
1 Arkin, A. P., E. Goldman, S. J. Robles, W. J. Coleman, C. A. Goddard, M. M. Yang and
D. C. Youvan. 1990.
Bio/Technology 8: 746-749.
2 Arkin, A. P., Youvan, D. C. (1992). Proc. Natl. Acad. Sci. U.S.A. 89:7811-7815.
3 Arkin, A. P., Youvan, D. C. (1993) In Deisenhofer H. & Norris JR. (eds)
The Photosynthetic Reaction Center, Vol. 1 (pp. 133-155) Academic Press, New York.
4 Delagrave, S., Goldman, E. R., Youvan, D. C. (1993).
Protein Eng. 6:327-331.
5 Delagrave, S., Youvan, D. C. (1993)
Bio/Technology 11: 1548-1552.
6
Füllen, G. (1992) In: Parallel Problem Solving from Nature
2, R. Männer, B. Manderick (Ed.) (pp. 351-61) Elsevier Science Publishers.
7 Gerrits, M., Hoogeweg, P. (1990).
In: Parallel Problem Solving from Nature 1, H.P. Schwefel, R. Männer, (Ed.)
(pp. 70-74) Springer, Berlin.
8 Goldman, E. R., Youvan, D. C. (1992). Bio/Technology 10:1557-1561.
9 Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.A. (1986).
Induction. Processes of Inference, Learning, and Discovery. MIT Press, Cambridge MA.
10 Houghten, R. A., C. Pinilla, S. E. Blondelle, J. R. Appel, C. T. Dooley and J. H. Cuervo.
(1991).
Nature 354: 84-86.
11 Oliphant, A. R., Nussbaum, A. L., Struhl, K. (1986).
Gene 44:177-183.
12 Polani, D., Uthmann, T. (1992).
In: Parallel Problem Solving from Nature 2, R. Männer, Manderick B (Ed.) (pp. 421-29) Elsevier publishing Co. Amsterdam.
13 Radcliffe, N.J. (1991).
Complex Systems 5: 183-205.
14 Ratledge, C. and D. Lewis. (1991). J. Chem. Tech. Biotech. 28: 109-110.
15 Reidhaar-Olson, J. F., Bowie, J. U., Breyer, R. M., Hu, J. C., Knight, K. L., Lim, W. A.,
Mossing, M. C., Parsell, D. A., Shoemaker, K. R., Sauer, R. T. (1991). Meth. Enzym.
208:564-587.
16 Riolo, R. (1991). In: Proc. third International Conference on Genetic Algorithms.
J.D. Schaffer, (Ed.). (pp. 322-27) Morgan Kaufmann, San Mateo.
17 Robles, S., Youvan, D.C. (1993). J. Mol. Biol. 232: 242-252.
18 Syswerda, G. (1991). In: Proc. third International Conference on Genetic Algorithms,
J.D. Schaffer, (Ed.). (pp. 2-9) Morgan Kaufmann, San Mateo.
19 Unger, R., Moult, J.. (1993). J. Mol. Biol. 231: 75-81.
20 Yang M. M., Youvan, D. C. (1988). Bio/Technology 8:746-749.
21 Yang, M. M., W. J. Coleman and D. C. Youvan. 1990. In Reaction Centers of
Photosynthetic Bacteria. M.-E. Michel-Beyerle. (Ed.) (Springer-Verlag, Germany) 209-218.
22 Youvan, D. C., Arkin, A. P., Yang M. M. (1992). In: Parallel problem solving from Nature,
2 R. Männer, Manderick B (Ed.) (pp 401-410) Elsevier publishing Co. Amsterdam.
23 Youvan, D. C., Goldman, E., Delagrave, S., & Yang, M. M. (1993). Meth. Enzym. in press.
5. Generalized REM algorithms: a broader view of GAs.
In this section, the observations made
at the end of the last section are used to motivate an algorithm
optimizing not only by selection and recombination or mutation, but
also by changing populations while mapping between different
representations of increasing redundancy. The action of a GA can be
seen as a
flip-flop process
between a representational space and a
property space, where recombination and/or mutation occur in the former
and fitness evaluation takes place in the latter. (Therefore, there
are two changes of the population in standard GAs: the selection of
the fittest, and the recombination or mutation. In this section we will
investigate the third population change arising from REM.)
Acknowledgments.
The authors would like to thank the referees and S. Delagrave for their helpful comments.
This work was funded by NIH GM42645, DOE 9102-025, DOE DE-FG02ER20019, and by the Human Frontiers
Science Program.