|
ISSN 1320-0682 | ||
| Complexity International is a
refereed journal for scientific papers dealing with any area of complex
systems research. |
Raymond Lister
School of Computing Science
Queensland University of Technology
GPO Box 2434, Brisbane Q 4001
Email: raymond@fitmail.fit.qut.edu.au
Simulated Annealing can be painfully slow. While the algorithm does exhibit asymptotic convergence (that is, as run time approaches infinity, the probability of finding an optimal solution approaches one), of more practical interest is the rate at which the probability approaches one. In many implementations of annealing, the rate of convergence appears to be extremely slow. Thus, although these implementations work well on small problems, they often do not scale well to interesting problem sizes.
Sorkin [1] has offered a proof that the rate of convergence can be fast for the special case where the energy landscape is a fractal. The energy landscape on the left in Figure 1 illustrates the following intuitive explanation of his proof. The landscape shown is the second generation of a fractal. The first generation simply contains a line between points C and D. The third generation fractal would add, between points c and d, a structure that is self-similar to the first generation.
At the first temperature iteration, the annealing algorithm moves only
upon the first generation structure of the fractal energy landscape;
the finer second generation fractal structure is not part of this
restricted solution space.
(The reader should picture the restricted solution space as consisting only of
the maxima and the midpoints of horizontal line segments.)
This restricted space is small, so the
algorithm will quickly find the lower energy region between C and D.
The temperature for the second temperature iteration is chosen
carefully, so that transitions from the lower region to B or E are
very unlikely, restricting the algorithm to the second generation
structure between C and D.
This is once again a small solution
space, and so the algorithm will quickly find the lower energy region
between c and d.
With each succeeding temperature
iteration, the algorithm explores an increasingly restricted region of
progressively higher generations of the fractal. Sorkin proved that
with this approach, a solution of expected energy no more than
above the global minimum can be found in time polynomial in
.
Note that Sorkin's result is independent of problem size. (In fact, the solution set may even be infinite.) Thus, annealing on fractals should scale well to large problem sizes.
The algorithm in Sorkin's proof, however, is not a practical approach to annealing, for two reasons. Firstly, it is generally not possible to use Sorkin's careful way of altering the temperature parameter. It requires detailed a priori knowledge of the fractal's construction. Secondly, pure fractal energy landscapes are rare.
Figure 1: Fractal (left) and quasi-fractal
(right) energy landscapes
Solla, Sorkin, and White [2] conjectured that annealing may still work well on non-fractal landscapes, provided similar solutions are separated by low maxima, and dissimilar solutions are separated by high maxima. Such a landscape is illustrated by the energy landscape on the right in Figure 1. We refer to such landscapes as being quasi-fractal. We will formalise this concept in the next section.
The shape of an energy landscape is determined by the changing mechanism; that is, the method by which candidate solutions are generated from the current solution. A recursive changing mechanism is one that can be applied to a solution at different spatial or temporal scales, with similar effect at each scale. Recursive changing mechanisms often give quasi-fractal energy landscapes. We will illustrate such mechanisms on two combinatorial optimisation problems, the Ising spin problem and the travelling salesman problem.
The concept of quasi-fractal energy landscapes also offers insight in to how NOT to construct fast practical annealing algorithms. We will review some cases of that in Section 5.
In this section, we formalise the concept of a quasi-fractal energy landscape. The reader may prefer to skip this section on first reading.
A metric space consists of a set and a distance function
, such
that for any three elements x,y,z of the set, three conditions apply:
that
if and only if x = y;
that
; and
that
.
The euclidean plane is a metric space, and for this reason the last of the
above three conditions is known as the triangle inequality. The minima
of an energy landscape also define a metric space, when
is the number of applications of the changing mechanism required to
turn locally minimal solution x into locally minimal solution y.
An ultrametric space satisfies an additional condition stronger
than the triangle inequality:
that
maximum {
,
}.
This condition implies that all triangles in the space are isosceles,
with the third side shorter than or equal to the other two sides.
The archetypal example is a tree, where the distance between nodes is the
height one must ascend the tree to reach a common ancestor.
Any ultrametric
space can be mapped onto a tree structure [3], so all
ultrametric spaces exhibit a recursive structure.
Energy landscapes are also ultrametric spaces, when the metric
is the minimax energy between locally minimal solutions x and y.
The minimax energy is defined thus. Consider the set of all
possible paths from x to y. Each of those paths has a highest
energy maximum. The minimax energy is the lowest of these maxima. It
follows trivially that
defines an ultrametric space, as one can
always move from x to any other locally minimal solution z, via y.
In some metric spaces, the two longer sides of triangles are approximately equal, with high probability. This can be quantified by a statistical correlation function, ranging from minus one to one. When the correlation is exactly one, the space is ultrametric. When the correlation is near one, the space is said to exhibit a high degree of ultrametricity.
Solla, Sorkin, and White [2] conjectured that annealing
works well in landscapes where
and
are highly
correlated; that is, in landscapes where similar solutions
are separated by low minimaxima, and dissimilar solutions are separated
by high minimaxima. This is illustrated by the energy landscape on the
right in Figure 1, with
represented by the horizontal separation of the minima. We refer to
such self-similar energy landscapes as being quasi-fractal.
To empirically determine whether
and
are
highly correlated, it is widely considered sufficient to determine
whether
exhibits a high degree of ultrametricity; that
is, for triples of locally minimal solutions x,y,z, where
minimum {
,
},
determine the correlation between
and
.
Ising spin models are well known within physics, where they are used to
study several phenomena [4].
Computer
scientists would recognise them as a stochastic variation of cellular
automata, similar to the popular Game of Life. Here we use models with
two-state elements arranged in a grid.
The
grid at the bottom left of Figure 2 is such a grid.
Two elements are neighbours if they share a common edge.
Neighbouring elements contribute one energy unit to the total energy of the grid
if the two elements are in the same state. Neighbouring elements contribute
no energy if they are in opposite states. Thus, the
Ising spin models used
in this paper have zero energy when every element has the opposite
state from all of its neighbours, giving a checkered pattern of activity
across the grid. Simulated annealing can find the two global minima,
when the states of elements are changed one-at-a-time, but such a
formulation is slow.
Figure 2: The recursive Ising spin problem (left)
and a graph (right) showing its superior performance
over both the normal and random forms
A faster, more reliable way to find the global minima is to organise
the model recursively, as illustrated in the hierarchy of grids on the left
of Figure 2. In this case, the
bottom or first level is, in isolation, a normal
Ising model.
Each higher level unit is
associated with several contiguous units at the level
immediately below, as indicated by the arrows. A state change for a
higher level element is defined as simultaneous state changes for all
of its associated level 1 elements. The change in energy for the
hierarchy is the energy change for level 1. When calculating the
energy for such a change, we need only examine the energy change for
the level 1 elements that make up the four edges of the higher level
element; the level 1 elements in the interior of the higher level
element do not produce a net energy change. Each higher level of the
recursive structure is a higher-order Ising spin model, and the energy
landscape of the entire system is quasi-fractal.
The graph on the right of Figure 2 shows the superior performance of the
recursive formulation over the normal formulation, on a
Ising problem. Both formulations used the same
initial temperature
, and the same number of attempted changes were
made at each temperature. The temperature was adjusted according to
the rule
, where the
cooling rate
is
constant within any given simulation run. (Note that this is
considerably simpler than the careful way of updating
used in
Sorkin's proof.) All runs were halted when they passed beneath a fixed
lower temperature. Thus, runs with equal
for both the normal
Ising model and the recursive model consumed near equal time.
Furthermore, if
,
then a run with a cooling rate
of
takes very close to half the time of a run with
.
A run with
takes approximately half the time of a run
with
, which in turn takes approximately half the time of a run
with
, and so on, so that as
approaches one, the run time
approaches infinity. One thousand runs were performed for each
algorithm for each cooling rate. As the run time for each algorithm
increases, the probability of finding an optimal solution approaches
one. However, the rate of asymptotic convergence for the recursive
formulation is much faster than for the normal Ising problem.
The reader may be sceptical of the above comparison. The superior performance of the recursive formulation may not be due to its recursive nature, but to the "bigger steps" it takes through the solution space. Such an argument is eliminated by the following experiment. We constructed another hierarchical Ising model, one in which the level i elements were not made from spatially contiguous i-1 elements, but were instead chosen at random from level i-1 elements. The distribution of step sizes for such a random formulation is exactly the same as the distribution of step sizes for the recursive formulation. The performance of the random formulation is also plotted in Figure 2, and is worse than the recursive formulation, being only near equal to the performance of the normal Ising problem.
The reader may still be sceptical. These results only show the frequency of finding optimal solutions, whereas in practice near optimal solutions are also of interest. However, we have plotted similar curves to those in Figure 2, for the frequency of achieving final solutions within some constant of optimality. In such plots, the superiority of the recursive approach is still apparent.
We note in passing that the Ising Spin problem is similar to finding low energy state configurations in the Boltzmann machine [5]. We suspect that the very long run times required by the Boltzmann machine are due, at least in part, to the way its units switch individually, instead of in the recursively decomposed way we advocate here.
Perhaps the most obvious changing mechanism for the Travelling Salesman Problem (TSP) is to remove a single city from the existing path and re-insert it at some other point in the path. This mechanism, however, does not scale well to large problems, as illustrated in Figure 3. In (a) and (b), the 32 cities are grouped into 4 clusters of 8 cities. The path in (a) contains a sub-optimal crossover; that is, the edge between clusters B and D crosses the edge between clusters A and C. Transforming path (a) into path (b) requires the movement of all eight cities forming one cluster, with no obvious improvement in path length until all eight cities have been moved.
Figure 3: The travelling salesman problem.
The path in (a) can be changed into the better path in
(b) by either moving eight cities individually, or
by a single segment reversal. The final solutions from two
annealing runs on Peterson's 200 city TSP, using
segment reversal (c) and moving single cities at a time (d).
Segment reversal is a well known way to change a path [6], and it scales well to large problems. The transition between the paths in Figure 3(a), (b) requires a single segment reversal. The direction of travel along the segment through clusters C and B is reversed. This reversal implies that the pair of edges into and out of this segment must change: the inter-cluster edges from A to C, and from B to D in (a) have been replaced in (b) with edges from A to B and from C to D. Segment reversal scales well because it can be applied to segments of arbitrary length, and thus it may be applied at different spatial scales. In the case of Figure 3 (a) and (b), it is as if the four clusters were "super cities" (hence the circles around the four clusters, to convey this impression).
The cities in Figure 3 (a) and (b) are distributed in a self-similar manner, in order to demonstrate most clearly the effect of segment reversal. Note, however, that segment reversal still gives quasi-fractal landscapes even when the cities are not so distributed. Kirkpatrick and Toulouse [7] numerically studied the distribution of minima in an energy landscape defined by segment reversal, where the distance between each pair of cities was drawn independently from a uniform distribution over the interval (0,1). They found that the distances between the minima exhibited a high degree of ultrametricity.
Figures 3 (c) and (d) contrast the performance of simulated annealing, when using the above two changing mechanisms, on Peterson's 200 city problem [8]. The simulations differed in no respect other than changing mechanism, and consumed the same amount of CPU time. Details of the parallel annealing algorithm used are given elsewhere [9]. The solution obtained by moving single cities (d) contains many locations where the path crosses itself, which is a reliable indicator that a path is poor. The sides of the enclosing squares define the unit of length. The path generated from segment reversal (c) is 1.4 units shorter, which is approximately equal to the length of the diagonal of the enclosing squares. In all, we performed 100 runs for each changing mechanism. The two solutions shown are close to the respective average path lengths. The best solution from moving single cities was 11.2, which is worse than the 10.8 average for segment reversal. The best segment reversal solution was 10.6 units.
We will focus upon two popular pseudo-annealing approaches used in artificial neural networks, which merely give mediocre performance.
This idea has considerable intuitive appeal. We apply some nonlinear filtering process to the landscape. Initially, the smoothing is so great, the only features on the smoothed landscape are remnants of the highest maxima and deepest minima on the original landscape. We perform gradient descent on the smoothed surface. The smoothing is then slowly relaxed, so that progressively finer structure from the original landscape becomes apparent.
The most famous use of smoothing is Hopfield and Tank's neural network [10] for the TSP. Smoothing is implemented by a variable gain parameter, affecting the output of each unit. Unfortunately, the network produces quite mediocre results. Solutions generated by the network typically contain many locations where the path crosses itself. This is the same ill-structure we noted earlier for annealing runs that move single cities. The two algorithms exhibit this same ill-structure for the same fundamental reason: their respective energy surfaces are related. Moving single cities is the nearest discrete approximation to the changing mechanism implemented by the neural network. The analog Hopfield and Tank network can be pictured intuitively as moving portions of cities with each path update, and so it is also slow to move clusters of cities.
There have been many attempts to improve upon the original Hopfield and Tank network, with modest results. The best known improvement uses a variant on standard annealing, known as mean-field annealing [11]. The mean-field network was benchmarked [8], using the same 200-city problem we used earlier. Five runs were performed. The length of the best solution found was 12.1, which is a poor path, being approximately equal to the length of the path shown in Figure 3 (d).
There are many variations on the basic smoothing theme. Many workers have applied it to the error surfaces of multilayer perceptrons, by slowly varying either the error function or the units' activation functions.
Some attempts to implement annealing use noisy gradient descent; that is, the algorithm performs a strict gradient descent calculation, but some form of noise in the data introduces a capacity to make upward steps on the landscape.
The Boltzmann machine [5] uses two types of noise. One type comes from explicitly adding noise to the training patterns. The other type of noise is implicit. It comes from the inevitable sampling error due to the necessarily finite sampling time for collecting co-occurrence statistics.
The above techniques can work if the energy landscape happens to be quasi-fractal. The difficulty is that most straightforward problem representations do not give such landscapes. Usually, the implementor must exercise some ingenuity and devise a recursive changing mechanism. However, such a mechanism implies nonlocalised operations on the problem representation and that is often incompatible with the gradient descent component of the above techniques.
This paper contains good news, and perhaps bad news. The good news is that simulated annealing can be a fast reliable optimisation algorithm, provided the implementor can find a suitable recursive changing mechanism. To this we must add one interesting caveat: even if quasi-fractal landscapes are sufficient to give fast convergence, it has not yet been demonstrated that such landscapes are necessary for fast convergence. There may be other ways.
The possible bad news concerns the implications for annealing within self-organising systems. An attractive feature of annealing has been its ability to support the emergence of coherent global behaviour from unco-ordinated localised interactions. A recursive changing mechanism implies nonlocalised interactions. Furthermore, a recursive changing mechanism would appear to imply an annealing homuncules; that is, a centralised coordinator, which selects from among the many suitable nonlocalised interactions. We hope that the ingenuity of future researchers will render this fear groundless.
Thanks to Russel Stonier and Xing Huo Yu for persevering with an old troff hacker.
Simulated Annealing: Quasi-Fractals and Quasi-Failures
This document was generated using the LaTeX2HTML translator Version 95.1 (Fri Jan 20 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
l2h -dir lister lister.tex.
The translation was initiated by Pam Milliken on Tue Oct 15 09:53:12 EST 1996