This blog post discusses how genetic programming can be used to automatically find a solution to a computational problem. The specific problem I am presenting is that of decision tree induction, and was described in detail in John R. Koza’s Genetic Programming: On the Programming of Computers by Means of Natural Selection, a book I highly recommend for anyone wanting an introductory look into genetic programming. Like most of the problems described in Koza’s book, his description of the decision tree induction problem and its solution are purposely “generic” and theory-oriented so that readers can attempt the problem themselves. This blog post will illustrate an actual concrete implementation of the problem using esec, an evolutionary computation framework for the Python programming language.
esec is an open-source evolutionary computation framework that supports research of both simple and complex ecosystem models. Developed by Steve Dower and Clinton Woodward from Swinburne University, the framework allows one to implement customizable evolutionary systems through the use of an Evolutionary System Definition Language (ESDL). It provides support for genetic programming, and comes with a number of practical examples for one to refer to in order to quickly get started.
I originally attempted this decision tree induction problem whilst undertaking summer scholarship research at my university, and found it a simple, easy-to-understand introduction to genetic programming. Since evolutionary computation can seem like something of a mystery to the uninitiated, I am going to explain things in layman’s terms where possible. If you have read about genetic programming before but never attempted to use it for problem solving, consider this a brief tutorial.
At this point, I am going to assume you have some prerequisite knowledge of the basic concepts behind tree-based genetic programming, specifically program representation. If not, refer to this for some light reading.
Firstly, let’s get acquainted with the problem at hand. If you do not own Koza’s book, you can find a free excerpt of the decision tree induction problem here. Basically, Koza presents a decision tree for classifying Saturday mornings into either class 0 or class 1, which I have copied below:
Saturday mornings can have three attributes: OUTLOOK, HUMIDITY, and WINDY. As illustrated above, these attributes can have different values, e.g. the OUTLOOK attribute can have a value of SUNNY, OVERCAST, or RAIN. A Saturday morning is classified as either class 0 or class 1 depending on its combination of attribute values. Koza asserts that we can use tree-based genetic programming to automatically generate a computer program capable of providing correct classifications for this decision tree for any combination of attribute values.
In genetic programming, we evolve programs over many generations of reproduction and mutation. Like real-world biology, the process of natural selection favors those individuals with higher fitness; the strong survive to breed whilst the weak gradually die out, their genes sometimes disappearing from the gene pool entirely. Given enough generations of breeding in their environment, this selective pressure can cause evolution of a species as a whole to a better-adapted, more “survivable” form (but not always). Therefore, if we are to evolve programs capable of performing correct classification of this decision tree, we need some way of measuring their fitness.
For the decision tree induction problem, the fitness of a program is simply its ability to classify a Saturday morning correctly, whatever the values of its attributes may be. In order to measure the fitness, we need a representative sample of fitness cases to execute a program against. One or two fitness cases are not sufficient, as they do not allow us to differentiate between programs that can perform correct decision tree induction given any input and sub-optimal programs that only work correctly in a couple of situations. For a small problem like this one, the twelve possible combinations of weather attributes that can be made using the given decision tree are a sufficient sample of fitness cases. When a program classifies a particular fitness case correctly, 1.0 is added to its fitness value; hence, the maximum attainable fitness value for any individual is 12.0.
We can represent our fitness cases as instances of a SaturdayMorning class in Python like so:
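The original class listing did not survive in this copy of the post; a minimal sketch of such a class (the attribute and constant names here are my own choices) might look like this:

```python
# Possible values for each weather attribute (names are illustrative).
SUNNY, OVERCAST, RAIN = 'sunny', 'overcast', 'rain'
HIGH, NORMAL = 'high', 'normal'

class SaturdayMorning(object):
    """One fitness case: a combination of weather-attribute values together
    with the classification (0 or 1) that the decision tree assigns to it."""
    def __init__(self, outlook, humidity, windy, expected_class):
        self.outlook = outlook                # SUNNY, OVERCAST or RAIN
        self.humidity = humidity              # HIGH or NORMAL
        self.windy = windy                    # True or False
        self.expected_class = expected_class  # 0 or 1
```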
And then, to create fitness cases for all twelve of the possible combinations of weather attributes, we can do the following in Python:
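That listing is also missing here; one way to enumerate all twelve combinations is shown below. This is a sketch: the classification rule encodes Koza’s target decision tree (after Quinlan’s Saturday-morning data), all names are my own, and a namedtuple stands in for the class so the snippet is self-contained.

```python
from collections import namedtuple
from itertools import product

SUNNY, OVERCAST, RAIN = 'sunny', 'overcast', 'rain'
HIGH, NORMAL = 'high', 'normal'

SaturdayMorning = namedtuple('SaturdayMorning',
                             'outlook humidity windy expected_class')

def classify(outlook, humidity, windy):
    """Koza's target decision tree for Saturday mornings."""
    if outlook == SUNNY:
        return 0 if humidity == HIGH else 1   # sunny: humidity decides
    if outlook == OVERCAST:
        return 1                              # overcast mornings are class 1
    return 0 if windy else 1                  # rain: windiness decides

# 3 outlooks x 2 humidities x 2 windy values = 12 fitness cases.
fitness_cases = [SaturdayMorning(o, h, w, classify(o, h, w))
                 for o, h, w in product((SUNNY, OVERCAST, RAIN),
                                        (HIGH, NORMAL),
                                        (True, False))]
```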
Writing a Fitness Evaluator
The next aspect that we must address is that of fitness evaluation. Previously, I mentioned how fitness can simply be an aggregation of the number of fitness cases correctly classified by an individual program, with each correct classification being worth 1.0 point and a score of 12.0 being the maximum fitness value possible to attain. To execute a program against the fitness cases and perform this fitness aggregation, we need to define a fitness evaluator.
The esec framework allows us to do this easily by prefixing a function with the @esdl_eval tag. The argument to this function should be the individual to evaluate, which in this example will be an instance of type TgpIndividual, since we are using tree-based genetic programming. Similarly, the return value will be the fitness calculated from evaluating the individual against the fitness cases, which will be an instance of the TGPFitness class. To evaluate an individual’s functions and terminals, we can simply call the TgpSpecies.evaluate() function (which is made available to any instance of TgpIndividual). Optionally, we can pass a state object (e.g. a fitness case) as an argument, as well as any terminals. Koza defines only two terminals for the decision tree induction problem, the classifications 0 and 1, so that is what we will use. Our evaluation function should end up looking something like the following:
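The original listing is missing from this copy; the sketch below shows the general shape it would take. The module paths, the evaluate() signature, and the fitness_cases list name are assumptions on my part, so check them against the esec source before use.

```python
from esec.fitness import TGPFitness       # module paths are assumptions
from esec.species.tgp import esdl_eval

@esdl_eval
def decision_tree_induction(indiv):
    """Award 1.0 for each of the twelve fitness cases that the
    program tree classifies correctly (maximum fitness: 12.0)."""
    score = 0.0
    for case in fitness_cases:            # the twelve SaturdayMorning cases
        # Evaluate the tree against this case; the terminals are the
        # classifications 0 and 1. Exact signature per the esec docs.
        result = indiv.evaluate(state=case, terminals=[0, 1])
        if result == case.expected_class:
            score += 1.0
    return TGPFitness(score)
```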
As we know, in tree-based genetic programming an individual program’s genetic makeup is represented as a tree structure comprised entirely of functions and terminals. In his description of the decision tree induction problem, Koza specifies three attribute-testing functions (i.e. as many as there are attributes in the decision tree): OUT, HUM, and WIND. The return value of each function is dependent on the value of the corresponding attribute for a given fitness case. For example, the OUTLOOK attribute can have one of three values, as we know. If a fitness case’s OUTLOOK attribute has a value of SUNNY, the OUT function returns its first argument. Similarly, if the OUTLOOK attribute has a value of OVERCAST, the OUT function returns its second argument. Its third argument is returned if the OUTLOOK attribute has a value of RAIN.
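To make this behaviour concrete before moving to esec, here is a framework-free Python sketch of the OUT semantics just described (all names are my own):

```python
SUNNY, OVERCAST, RAIN = 'sunny', 'overcast', 'rain'

def OUT(outlook, if_sunny, if_overcast, if_rain):
    """Return the argument selected by the fitness case's OUTLOOK value."""
    return {SUNNY: if_sunny, OVERCAST: if_overcast, RAIN: if_rain}[outlook]
```

HUM and WIND work the same way, except with two arguments each, since HUMIDITY and WINDY each have only two possible values.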
Using the esec framework, we can easily define these functions by initializing instances of esec’s Instruction class or one of its many sub-classes. Since each function is executed against the current state (the fitness case), the InstructionWithState sub-class seems appropriate. Here is an example:
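The original example is missing here; the snippet below shows roughly what such a definition would look like. The InstructionWithState constructor arguments shown are assumptions on my part, so verify them against esec’s API before use.

```python
# Hypothetical sketch: OUT returns one of its three (already evaluated)
# arguments, chosen by the OUTLOOK value of the current state.
OUT = InstructionWithState(
    lambda state, if_sunny, if_overcast, if_rain:
        {SUNNY: if_sunny, OVERCAST: if_overcast, RAIN: if_rain}[state.outlook],
    param_count=3, name='OUT')
```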
Furthermore, we can make our attribute-testing functions even more concise and readable if we use the DecisionInstructionWithState class instead, like so:
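Again, the original snippet did not survive; a hedged sketch is below. The idea with the decision-instruction classes is that the wrapped function returns the index of the branch to evaluate, rather than receiving all branches already evaluated; the constructor arguments here are assumptions, so check the esec API.

```python
# Hypothetical sketch: return 0, 1 or 2 to select which argument
# (sub-tree) of OUT should be evaluated for the current state.
OUT = DecisionInstructionWithState(
    lambda state: (SUNNY, OVERCAST, RAIN).index(state.outlook),
    param_count=3, name='OUT')
```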
Creating a System Definition
To quote the esec wiki, “ESDL is a conceptual model and language for describing evolutionary systems efficiently and unambiguously”. We can use it in conjunction with esec to define the ecosystems that we want to evolve and to tweak their parameters as necessary. Since this is an introductory tutorial on esec, I will not elaborate any further on ESDL, but there is an article by Steve Dower here if you are interested in the ESDL syntax.
I have provided an ESDL system definition for Koza’s decision tree induction problem below:
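(The original listing was lost in formatting; the sketch below reconstructs its general shape. The generator, operator, and parameter names are assumptions on my part, so consult the ESDL reference for exact syntax.)

```
FROM random_tgp(instructions=instructions, terminals=terminals) \
     SELECT (size) population
YIELD population

BEGIN generation
    # Select parents in proportion to fitness, then apply crossover,
    # mutation and reproduction at rates of 90%, 2% and 8%.
    FROM population SELECT (size) parents USING fitness_proportional
    FROM parents SELECT offspring USING crossover_one(per_pair_rate=0.9), \
                                        mutate_random(per_indiv_rate=0.02)
    FROM offspring SELECT (size) population
    YIELD population
END generation
```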
Since the ESDL syntax follows a readable, English-like format similar to LINQ, it is largely self-explanatory as to what it specifies. With that said, there are some general points to note:
- The species is defined as random_tgp, indicating the form of tree-based genetic programming we are using.
- The arrays of functions and terminals that we defined earlier are used to construct individuals of the specified species.
- Every iteration, individuals in the population are selected for crossover, mutation, and reproduction in proportion to their fitness.
- The rate of crossover for those selected is 90% of the total population size, the rate of mutation is 2%, and the rate of reproduction is 8%.
Configuring an Experiment
To link everything together into a single cohesive and runnable experiment, we must create an esec configuration file, which is basically a Python script that provides an experiment configuration via a config dictionary variable. Configuration files are relatively straightforward – they are essentially dictionaries of keys mapped to values (which may also be dictionaries). An example configuration for the decision tree induction problem is below:
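(The original listing is missing from this copy; the sketch below shows the general shape of such a configuration. The key names and nesting are assumptions on my part – as are the system_definition and decision_tree_induction names, which refer to the ESDL definition and evaluator from the previous sections – so verify them against esec’s examples.)

```python
config = {
    'system': {
        'definition': system_definition,    # the ESDL definition string
        'instructions': instructions,       # attribute-testing functions
        'terminals': [0, 1],                # the two classifications
        'size': 500,                        # maximum population size
        'evaluator': decision_tree_induction,
    },
    'monitor': {
        'limits': {
            'iterations': 50,               # stop after 50 generations...
            'fitness': 12.0,                # ...or at a perfect score
        },
    },
}
```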
There are a few observations we can glean from this experiment configuration:
- The evaluation function that we defined earlier is referenced as the value of the corresponding evaluator key.
- The instructions we defined earlier are specified as the value of the corresponding instructions key.
- The ESDL system definition we defined earlier is provided as the value of the corresponding system definition key.
- The maximum population size per generation/iteration is 500, as indicated by the corresponding size value.
- By specifying monitor.limits, we define both an iteration limit (50) and a fitness limit. If either is reached, the experiment concludes immediately.
Executing an Experiment
Once we have created the configuration file for our experiment, we can run it by calling the run.py script included in the esec framework, which essentially acts as a command-line interface for running single or multiple experiments. There are numerous command-line options that run.py supports (see the esec quick reference guide), but for the sake of getting things running, the one we are most interested in is the --config option, which allows one to specify a configuration name to use when executing an experiment. For example, assuming our configuration for the decision tree induction problem exists in its own file, decision_tree.py, in the esec/cfgs/ directory, we can execute our experiment on the command line in the following way:
python.exe run.py --config decision_tree
Immediately, one should see some console output concerning the status of the experiment as it churns through each iteration of the evolutionary process, like below:
For the decision tree induction problem, I have generally observed that a maximum fitness solution is found in less than fifteen iterations. In the example output shown above, you will observe that it occurs at generation 5.
As can be seen, using genetic programming as a problem-solving methodology can be quite effective. Provided we can define appropriate sets of functions and terminals, it is a sound method for discovering solutions in a given problem space. Furthermore, the esec framework enables us to easily set up genetic programming experiments via configuration files, which can be customized based on the specific needs of each experiment. Also, with ESDL we can quickly tweak the individual parameters of our evolutionary systems, making esec well-suited for use as a research tool.
Hopefully, this blog post shed some light on genetic programming with the esec framework. If you are interested in esec, check out esecui, a work-in-progress attempt at providing a graphical user interface for designing and conducting esec experiments.
Complete code for the decision tree induction configuration file:
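(The complete listing did not survive; the skeleton below pulls together the pieces developed in each section. It is a hedged reconstruction – esec module paths, constructor signatures, and configuration keys are all assumptions on my part – so treat it as a starting point rather than a verbatim copy.)

```python
from itertools import product

from esec.fitness import TGPFitness            # module paths are assumptions
from esec.species.tgp import DecisionInstructionWithState, esdl_eval

SUNNY, OVERCAST, RAIN = 'sunny', 'overcast', 'rain'
HIGH, NORMAL = 'high', 'normal'

class SaturdayMorning(object):
    def __init__(self, outlook, humidity, windy, expected_class):
        self.outlook, self.humidity, self.windy = outlook, humidity, windy
        self.expected_class = expected_class

def _classify(outlook, humidity, windy):
    # Koza's target decision tree (after Quinlan's Saturday-morning data).
    if outlook == SUNNY:
        return 0 if humidity == HIGH else 1
    if outlook == OVERCAST:
        return 1
    return 0 if windy else 1

fitness_cases = [SaturdayMorning(o, h, w, _classify(o, h, w))
                 for o, h, w in product((SUNNY, OVERCAST, RAIN),
                                        (HIGH, NORMAL), (True, False))]

# Attribute-testing functions (constructor arguments are assumptions).
instructions = [
    DecisionInstructionWithState(
        lambda s: (SUNNY, OVERCAST, RAIN).index(s.outlook),
        param_count=3, name='OUT'),
    DecisionInstructionWithState(
        lambda s: (HIGH, NORMAL).index(s.humidity),
        param_count=2, name='HUM'),
    DecisionInstructionWithState(
        lambda s: 0 if s.windy else 1,
        param_count=2, name='WIND'),
]

@esdl_eval
def decision_tree_induction(indiv):
    score = 0.0
    for case in fitness_cases:
        if indiv.evaluate(state=case, terminals=[0, 1]) == case.expected_class:
            score += 1.0
    return TGPFitness(score)

system_definition = '''
FROM random_tgp(instructions=instructions, terminals=terminals) \
     SELECT (size) population
YIELD population
BEGIN generation
    FROM population SELECT (size) parents USING fitness_proportional
    FROM parents SELECT offspring USING crossover_one(per_pair_rate=0.9), \
                                        mutate_random(per_indiv_rate=0.02)
    FROM offspring SELECT (size) population
    YIELD population
END generation
'''

config = {
    'system': {'definition': system_definition,
               'instructions': instructions, 'terminals': [0, 1],
               'size': 500, 'evaluator': decision_tree_induction},
    'monitor': {'limits': {'iterations': 50, 'fitness': 12.0}},
}
```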