|
ISSN 1320-0682 | ||||
| Volume 3 | April 1996 | ||||
Xin Yao, Yong Liu & Paul Darwen
This work is partially supported by an ARC small grant.
Email: xin@csadfa.cs.adfa.oz.au
Evolutionary learning has been developing rapidly in the last decade. It is a powerful and general learning approach which has been used successfully in both symbolic systems - for example, rule-based systems - and subsymbolic systems - for example, artificial neural networks. However, most evolutionary learning systems have paid little attention to the fact that they are population-based learning. The common practice is to select the best individual in the last generation as the final learned system. Such practice, in essence, treats these learning systems as optimisation ones. This paper emphasises the difference between a learning system and an optimisation one, and shows that such difference requires a different approach to population-based learning and that the current practice of selecting the best individual as the learned system is not the best choice. The paper then argues that a population contains more information than the best individual and thus should be used as the final learned system. Two examples are presented in this paper to show that even some simple methods which make full use of a population can improve the performance of a learned system greatly. The first example is in the sub-symbolic domain of artificial neural networks. The second example is in the symbolic domain of rule-based systems.
The popularity of evolutionary learning and population-based learning in general, has increased rapidly in recent years. They offer distinct advantages over other machine-learning approaches in dealing with complex and changing environments. In evolutionary learning, the most popular approach is to represent a complete learning system as an individual. The learning system can be either a rule-base system or an artificial neural network. This approach is known as the Pitt approach [22] although the original Pitt approach only referred to rule-based systems. After a certain number of generations of evolving (training), the best individual in the last generation or the best individual found so far is usually chosen as the final learned system. This approach has been so widely used that few people have ever questioned whether there is a better way to make full use of population-based learning without changing the general framework. This paper proposes a simple yet effective method of making best use of population-based learning.
The method proposed in this paper is based on the fact that learning is different from optimisation and that a population contains more useful information than the best individual in the population. We use the last population, rather than any individual to form the final learned system. The results we obtained are better than those obtained by the single best individual.
We will emphasise the difference between evolutionary learning and optimisation in the first section Evolutionary Learning and Optimisation. To support our analysis and arguments, we present our first set of experiments using evolutionary artificial neural networks in the following section Modular Evolutionary Artificial Neural Networks. To take it one step further from just combining an existing population, we can actually encourage formation of species which will be modules of the final combined system. We demonstrate in the section Towards Automatic Modularisation that such a method does work and produces better results than other methods. Finally, we conclude this paper with a summary and future research directions.
Learning is often formulated as an optimisation problem in the machine-learning field. For example, back-propagation (BP) is often used to train feedforward artificial neural networks (ANNs) [19]. This training process is also called the learning process of ANNs. BP is known as one of the most popular learning algorithms. However, BP is in essence, a gradient-based optimisation algorithm which is used to minimise an error function (often a mean square error) of ANNs. The so-called learning problem here is a typical optimisation problem in numerical analysis. Many improvements on the ANN learning algorithm are actually improvements over optimisation algorithms [10], such as conjugate gradient methods [11, 14].
Learning is different from optimisation because we want the learned system to have best generalisation, which is different from minimising an error function. An ANN with the minimum error does not necessarily mean that the ANN has the best generalisation unless there is an accurate way to measure generalisation which can then be maximised. Unfortunately, measuring generalisations accurately is almost impossible in practice, although there are theories and criteria on generalisation, such as the minimum description length (MDL) [17], Akaike information criteria (AIC) [1], and minimum message length (MML) [23]. In practice, these criteria are often used to define a better error function in the hope that minimising the function will correspond to maximising generalisation. While better error functions do lead to better generalisation of learned systems, there is no guarantee. BP or other more advanced learning algorithms are still used as optimisation algorithms. They just optimise different error functions. The nature of the problem is unchanged.
The similar situation occurs in decision tree learning [16] and other machine-learning methods [3]. Criteria always have to be set for learning. However, no criteria can guarantee that they correspond to "true" generalisation. This is a problem faced by all inductive learning methods. There is no way in practice one can get around this. The best we can do is to find better criteria, as mentioned before.
Evolutionary learning is a population-based learning method. Most people just use an evolutionary algorithm to maximise a fitness function or minimise an error function, and thus face the same problem as described above. The evolutionary algorithm is just yet another optimisation algorithm. Some interesting questions arise here. Are we optimising the right function? Why spending time on fancy optimisation algorithms if the function optimised is not what we want? Can we do anything about it? While little can be done for the traditional nonpopulation-based learning methods, there are opportunities for improving population-based learning, such as evolutionary learning.
Since the maximum fitness is not equivalent to best generalisation in evolutionary learning, the best individual with the maximum fitness in a population may not be the one we want. Other individuals in the population may be better and have useful information that will help to improve generalisation of learned systems. It is thus beneficial to make use of the whole population rather than any single individual. A population always contains at least as much information as any single individual in it. The issue now is how to use them effectively. The next section describes four simple methods for combining individuals in a population into an integrated learned system in order to make use of all the information contained in different individuals.
Evolutionary design of ANNs has attracted a lot of attention since the late 1980's. It can be regarded as evolutionary learning of ANN architectures and/or weights. We have developed such a learning system called EPNet based on an evolutionary programming (EP) algorithm [26, 27, 31, 30, 29, 12]. EPNet follows the usual method of selecting the final learned ANN; that is, only one individual is selected as the output. The rest of the final population are abandoned. Although EPNet has achieved excellent results on a number of benchmark problems [26, 27, 31, 30, 28, 12], it has been shown that some simple combination methods which combine different individuals in the last generation to form the final integrated learned ANN can improve the results further [28].
In our experiments [29], we first used EPNet to evolve a population of ANNs, called evolutionary ANNs (EANNs). We then used four combination methods to combine different individuals in the last generation into the final learned EANNs. These systems were then compared with the single best EANN in the last generation. Three benchmark problems were used in our experiments: the Australian credit card, diabetes and heart disease problems. All three problems were obtained from the UCI machine-learning repository.
The main architecture of EPNet can be described by Figure 1.
Figure 1: The main structure of EPNet.
Two validation sets were used in our experiments. The first, V-set 1, was used for fitness evaluation. The second, V-set 2, was used in the last step of EPNet - that is, further training. In the last step, all individuals in the last generation were trained by a modified BP [27] on the combined training and first validation set. V-set 2 was used to stop such training. The best individual with the minimum error rate on V-set 2 was chosen as the final learned EANN. A tie was broken in favour of the individual with the minimum error rate on the combined training and V-set 1. If a tie still exists, the individual with the minimum error on the combined training and V-set 1 will be the final ouput.
For all our experiments, each data set was randomly partitioned into four subsets: a training set, V-set 1, V-set 2 and a testing set. The size of the training set, V-set 1, V-set 2, and testing set was 50%, 12.5%, 12.5%, and 25% of all examples, respectively. The input attributes used for EANNs were rescaled to between 0.0 and 0.1 by a linear function. The output attributes of all the problems were encoded using a 1-of-m output representation for m classes. The winner-takes-all method is used in our EANNs; that is, the output with the highest activation designates the class.
The parameters used in our experiments with the three problems were set to be the same: the population size (20), the maximum number of generations (100), the initial number of hidden nodes (2 to 8), the initial connection density (0.75), the initial learning rate (0.2), the number of mutated hidden nodes (1), and the number of mutated connections (1 to 3). More details about EPNet can be found in other relevant papers [26, 27, 31, 30, 29, 12, 28].
The first combination method we used was the simple majority voting among all the individuals in the final generation. The second method was based on a weighted linear combination of different individuals in the last generation. The weights were calculated according to how good an individual was in the last generation; that is,
where N was the population size and
was a scaling
factor. It was set as 0.75, 0.5, and 0.75 for the Australian credit card,
diabetes and heart disease problem respectively. These numbers were selected
after limited preliminary experiments with
and 0.75. The
ensemble output was:
where
were
outputs from different individuals.
The third method also used linear combination of all the individuals in the last generation. It used the recursive least square (RLS) algorithm [15] to determine the weights. The fourth method was slightly more complicated. It used a genetic algorithm (GA) to search for a near optimal subset of the population so that the final integrated system would contain fewer EANNs. The weights were still determined by the RLS algorithm.
Table 1 shows the average results over 30 runs produced by the four combination methods and those produced by the single best EANN. ESi (Ensemble Structure i) in the table indicates the result produced by the i-th method.

Table 1: Mean error rates produced by the four combination methods and those by the single best EANNs. The results are averaged over 30 runs. ESi (Ensemble Structure i) in the table indicates the result produced by the i-th method.
It is clear from Table 1 that ES3, which used the RLS algorithm to determined the weights, outperforms EPNet consistently. For the other three methods, they all outperform EPNet on the Australian credit card and diabetes problems but are comparable to, or worse than, EPNet on the heart disease problem. This indicates that the best combination method will be the one which can optimally assign weights to different individuals such that better individuals get more say in the linear combination. The RLS algorithm is a good example, although we cannot claim it is optimal.
There are many other algorithms which can be used to combine multiple EANNs, including more powerful nonlinear combination methods. It is, however, not the purpose of this paper to investigate the best algorithm for combination. Instead, this paper wants to emphasise that population-based learning has its own special characteristics which should be exploited in order to improve its performance further.
Combining individuals in a population into an integrated learned system has a close relationship to modular design of ANNs. Each individual in the population can be regarded as a module. The evolutionary process can be regarded as a natural and automatic way to produce different modules. At present, although there are many studies on how to combine different modules together, few studies investigate how modules can be obtained in the first instance. This paper fills this gap in the research.
The work described in the first section is only the first step towards automatic design of modular systems. The approach can be used not only for ANN design, but also for other learning systems - for example, rule-based systems. In other words, the approach we used does not depend on representations used by the learning system. We will describe an example of a evolving modular rule-based system later in this section.
It is worth pointing out that no extra bookkeeping or any other technique was used in evolving ANNs in the experiments described in the first section. The evolutionary process there was the same as an ordinary simulated evolution. It is then logical to ask the question whether there is anything which could be done to form better modules and thus improve the integrated learned system. For example, if a training example is correctly classified by an individual in a population, we would prefer other individuals in the population to not worry about this example and, instead, to concentrate on other examples. Ideally, if each individual can successfully deal with a different aspect (or part) of a complex problem, then the final integrated system can be expected to perform well on the problem as a whole. In this case, the final modular system will be simpler and easier to understand as well because each module is simpler and targetted at a specific part of the complex problem.
It is clear at this stage that we now have a framework for automatic design of modular learning systems based on simulated evolution. All the modules will be evolved, not manually crafted. The number of modules and the composition of each module need not be pre-fixed by human designers, who may or may not have sufficient a priori knowledge to do so.
The next question then is how to encourage formation of good modules and avoid re-inventing wheels among different modules. Fitness-sharing [9] provides one way to form different species around different peaks of a fitness landscape. It discourages individuals to stay at only one high-fitness region. From our point of view, different species are actually different modules. There will be little repetition between any two different species. This is exactly what we want. It is worth emphasising however, that fitness-sharing is just one technique which implements our idea. The automatic design of modular learning systems is not limited to fitness-sharing. Other techniques which encourage speciation and niching, such as crowding [7, 13], can also be used without any modification to our framework. In our experiments, we used implicit fitness-sharing [20].
In the rest of this section, we will describe an example of evolving rule-based systems using a speciated GA [5]. The example illustrates two points. Firstly, the framework we proposed does not depend on any representations. It works with both ANNs and rule-based systems. Secondly, fitness-sharing does help formation of good modules (species) which can then be combined into an integrated system.
The problem we are interested in here is the 2-player iterated prisoner's dilemma (2IPD) [4]. All the rules in the rule-based system encode strategies for playing the game. Figure 2 shows our variant of the payoff matrix of the 2IPD. For the general definition of prisoner's dilemma, see [25]. In 2IPD, the players iterate through many rounds of a prisoner's dilemma, each remembering previous actions.

Figure 2: Payoff to player A in our variant of the 2IPD. Each player can either co-operate (C) or defect (D). Payoff is symmetric for both players.
All previous studies on evolving strategies for the 2IPD only chose the best individual in the last generation as the final output [2, 8, 6, 25], although some of them used speciation to increase and maintain the diversity in populations [21, 18]. Their motivation and purpose for using speciation are totally different from ours.
The implicit fitness-sharing implemented in our experiments can be described by Figure 3.
Figure 3: Payoff function for implicit fitness-sharing.
Other details of the implementation include:
After a population of modules are evolved, we face the same issue as that in modular EANNs - that is, combining them into the final integrated system. We describe one such combination algorithm, also called a gating algorithm, in this subsection. Its main purpose is to decide which module's decision should be taken as the output of the integrated system. In other words, the gating algorithm will decide which module would be the best answer to an unseen opponent.
In our implementation, the final generation contains diverse expert strategies. There are several ways to produce separate modules from these. One could run a second GA for each individual to find the single best counter-strategy for that individual. However, to keep things simple, we will presume that the final generation contains suitable counter-strategies to its own members. Hence, our separate modules will simply be the individuals in the last generation. Even though this is the simplest possible way to obtain modules from a speciated GA, we will see below that even this gives better results.
The gating algorithm does the following:
The gating algorithm finds what high-quality strategy our opponent is currently following, and finds another high-quality strategy which is specialised against the one our opponent uses. This is shown conceptually in Figure 4.
Figure 4: The modular approach to harnessing the specialised
expertise in a speciated GA's final generation.
How do we begin the game before any previous moves tell us what the opponent is like? There is no way of knowing what strategy our opponent is using before the first move. Our gating algorithm takes a vote of the final GA generation for the presumed pre-game moves (included in the genotype, described in the next section).
The framework of our experimental setup was similar to Axelrod's [2],
except for some parameter values. The history lengths (l) of the 2IPD we
considered were 2, 3 and 4. The parameters below were selected after a modest
amount of searching. The different parameters for
were to keep running
times reasonable. We did 30 runs for each value of
.
We compared the modular and non-modular system on a set of test opponents. As
Wolpert [24]
observes, "It may turn out that ... there is no way to measure generalisation
more rigorously than via test problems". To obtain this set, we performed a
partial enumerative search of all possible strategies, for each value of
. For each strategy
tested, we evaluated its performance against a large number of randomly chosen
strategies. Only the best 25 were kept for the test set.
Tables 2 - 4 show the results against the test sets by different systems. Each game of the 2IPD, both during learning and testing, lasted for 100 iterations. The results are for 30 of each of the following strategies (that is, 30 runs were conducted):
Table 2: Results of 30 of each strategy against the best 25
strategies from the partial enumerative search, so each line is the average of
750 games. Remembered history is
. Note that for this
simpler game, speciation gives similar results for both the best individual or
the modular approach.

Table 3: Results against the best 25 strategies from the
partial enumerative search, for 2IPD with remembered history
. This is exactly
the same as used by Axelrod [2].
Note that the modular approach works better than the best individual. For both
methods, improving the speciation method by using the better sample selection
scheme also gives better results.
Table 4: Results against the best 25 strategies from the partial
enumertaive search, for 2IPD with remembered history
. Reverse-assortative
sample selection takes much more time for this larger population, so we only did
two GA experiments (of 30 runs each).
For the simplest 2IPD with history
, where strategies
are only 20 bits in length, both methods (best and gate) give very
similar performance in Table 2. This is because our simple method of finding
test strategies produced a group of very similar strategies for
. A single best
strategy is sufficient to deal with them.
In the more complex games of 2IPD with
, the test sets
contain a number of different strategies. The single best strategy can no longer
handle them well. The advantage of our modular systems is shown clearly in
Tables 3 and 4. In general, the more complex a problem, the better is our
modular system than any single individual.
Our experiments also reveal that using a sample selection scheme that improves sample diversity, gives better results than for random sample selection. This is true for both the best individual approach, and our modular approach. However, the improvement is greater for the modular approach, as shown in Table 3. This indicates that evolving good modules, by a better fitness-sharing scheme or other methods, are important in forming the final integrated system.
This paper emphasises that learning is different from optimisation although most learning problems are formulated as optimisation ones. Few studies try to make use of this difference. This paper shows that such difference can be exploited in evolutionary learning and other population-based learning to improve generalisation of learned systems.
A framework for automatic design of modular systems through simulated evolution is presented in this paper. It can evolve different modules for a larger integrated system without human intervention, and thus addresses the issue of how to design modules in a modular system. We tested this framework on two very different examples - that is, ANNs and rule-based systems. The experimental results show that our proposed framework performs better than existing methods. They also demonstrate the generality of the framework.
We used implicit fitness-sharing to evolve good modules for a modular system. This may not be the best technique to implement our framework. Future studies into better techniques for evolving good modules will be conducted in our project. We also intend to apply such techniques to EANNs.
How to Make Best Use of Evolutionary Learning
This document was generated using the LaTeX2HTML translator Version 96.1 (Feb 5, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html yao_liu.
The translation was initiated by Pam Milliken on Tue Feb 4 16:56:33 EST 1997Tue Feb 4 16:56:33 EST 1997