Getting started—Benchmarking an algorithm (generating data) with Python

For benchmarking an algorithm in Python, we use the coco-experiment (cocoex) Python package.

Installation (assuming Python is present)

From a system shell, execute

python -m pip install coco-experiment cocopp
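
To check that the installation succeeded and that both packages are importable, a one-liner like the following can be run from the same shell:

python -c "import cocoex, cocopp; print('cocoex and cocopp imported successfully')"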

Using cocoex

Depending on the algorithm, we have to choose the appropriate benchmark suite:

>>> import cocoex  # install(ed) with "pip install coco-experiment"

>>> cocoex.known_suite_names
['bbob',
 'bbob-biobj',
 'bbob-biobj-ext',
 'bbob-constrained',
 'bbob-largescale',
 'bbob-mixint',
 'bbob-biobj-mixint']

see also here.
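
To get a first impression of a suite's content, it can be instantiated and its problems inspected interactively. The following sketch (assuming the default bbob suite options) prints a few attributes of the first problem:

import cocoex

suite = cocoex.Suite("bbob", "", "")  # default instances and dimensions
print(len(suite))                     # number of problems in the suite
for problem in suite:                 # problems are obtained by iterating over the suite
    print(problem.id)                 # identifier encoding function, instance, and dimension
    print(problem.dimension)          # search space dimension
    print(problem.lower_bounds, problem.upper_bounds)  # region of interest
    break                             # inspect only the first problem here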

An impractical (oversimplified) example

To get an idea of the ultimately necessary code, we first show a working but oversimplified example for running an experiment that benchmarks scipy.optimize.fmin on the bbob suite. Including boilerplate overhead, it has 2 + 2 + 2 + 3 = 9 lines of (effective) code:

import cocoex  # experimentation module
import scipy.optimize  # to define the solver to be benchmarked

### input
suite_name = "bbob"
fmin = scipy.optimize.fmin  # optimizer to be benchmarked

### prepare
suite = cocoex.Suite(suite_name, "", "")
observer = cocoex.Observer(suite_name, "")

### go
for problem in suite:  # this loop may take several minutes or more
    problem.observe_with(observer)  # generates the data for cocopp
    fmin(problem, problem.initial_solution, disp=False)

A practical example

A practical, still short-ish, example that benchmarks scipy.optimize.fmin on the bbob suite with a given experimentation budget looks like this:

import cocoex  # experimentation module
import cocopp  # post-processing module (not strictly necessary)
import scipy.optimize  # to define the solver to be benchmarked

### input
suite_name = "bbob"
fmin = scipy.optimize.fmin  # optimizer to be benchmarked
budget_multiplier = 1  # x dimension, increase to 3, 10, 30,...

### prepare
suite = cocoex.Suite(suite_name, "", "")  # see https://numbbo.github.io/coco-doc/C/#suite-parameters
output_folder = '{}_of_{}_{}D_on_{}'.format(
        fmin.__name__, fmin.__module__ or '', int(budget_multiplier), suite_name)
observer = cocoex.Observer(suite_name, "result_folder: " + output_folder)
repeater = cocoex.ExperimentRepeater(budget_multiplier)  # 0 == no repetitions
minimal_print = cocoex.utilities.MiniPrint()

### go
while not repeater.done():  # while budget is left and successes are few
    for problem in suite:  # loop takes 2-3 minutes x budget_multiplier
        if repeater.done(problem):
            continue  # skip this problem
        problem.observe_with(observer)  # generate data for cocopp
        problem(problem.dimension * [0])  # for better comparability
        xopt = fmin(problem, repeater.initial_solution_proposal(problem),
                    disp=False)
        problem(xopt)  # make sure the returned solution is evaluated
        repeater.track(problem)  # track evaluations and final_target_hit
        minimal_print(problem)  # show progress

### post-process data
cocopp.main(observer.result_folder + ' bfgs!');  # re-run folders look like "...-001" etc

The benchmarking data are written to a subfolder in the exdata folder (see also here). The last line postprocesses the obtained data and compares the result with BFGS (see also Getting started—Analysing and comparing results (postprocess)). The resulting figures and tables can be browsed from the file ./ppdata/index.html.
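
The postprocessing can also be (re-)run later and independently of the experiment; the folder name below is a placeholder for the actual subfolder written by the observer under exdata:

import cocopp

# "exdata/<output_folder>" is a placeholder for the folder created by the observer
cocopp.main('exdata/<output_folder> bfgs!')  # again comparing with archived BFGS data

The cocopp module can typically also be invoked from a system shell, for example as python -m cocopp exdata/<output_folder>.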

For running the benchmarking in several batches, a somewhat expanded example, example_experiment_complete.py, can be found in this subfolder.

Benchmarking “your own” algorithm

For benchmarking another algorithm,

  1. copy-paste the above code or (preferably) download the linked batching example example_experiment_complete.py and the mkl_bugfix.py file

  2. revise the lines following ### input, in particular

    • reassign fmin to run the desired algorithm
    • adapt the calling code fmin(problem, ...) accordingly
    • some algorithms may need a maximal budget as input parameter, like problem.dimension * algorithm_budget_multiplier (see the sketch after this list)
  3. expect to iterate over the following points many times and start with a low budget to reduce waiting time! For first experiments, the suite can be filtered to contain fewer instances and fewer dimensions, like suite = cocoex.Suite(suite_name, "instances: 1-5", "dimensions: 2,3,5,10,20") (for the bbob suite).

  4. execute the file from a system shell like

    python example_experiment_complete.py 1 # under *nix, consider using "nohup nice ..."

    or from an ipython shell or notebook like

    %run example_experiment_complete.py 1

    where the “1” is taken as the above budget_multiplier by example_experiment_complete.py.

  5. inspect the figures (and tables) generated by cocopp.main, see also Getting started—Analysing and comparing results (postprocess).

  6. when using the downloaded example, ideally add an algorithm to compare with in the call to cocopp.main as shown above (last line). The choice depends on the test suite used and on personal preference.

  7. repeat from 4. with an increased budget_multiplier argument
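
As an illustration of the last bullet of step 2, the following sketch benchmarks scipy.optimize.fmin with an algorithm-internal budget on a reduced bbob suite. The name algorithm_budget_multiplier and the result folder name are only illustrative, and fmin's maxfun argument stands in for the respective budget parameter of the algorithm to be benchmarked:

import cocoex
import scipy.optimize

suite_name = "bbob"
fmin = scipy.optimize.fmin        # replace with the algorithm to be benchmarked
algorithm_budget_multiplier = 50  # illustrative multiplier for the algorithm-internal budget

suite = cocoex.Suite(suite_name, "instances: 1-5", "dimensions: 2,3")  # small suite for a quick test
observer = cocoex.Observer(suite_name, "result_folder: fmin_maxfun_test")
for problem in suite:
    problem.observe_with(observer)  # generate data for cocopp
    fmin(problem, problem.initial_solution,
         maxfun=problem.dimension * algorithm_budget_multiplier,  # algorithm-internal budget
         disp=False)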

Practical hints

Benchmarking is usually an iterative process of repeatedly running experiments and inspecting data. The recommended approach is to successively increase the above budget_multiplier, starting with short(er) experiments (which at first take only minutes, then hours, then…), and to correct for bugs or setup inaccuracies along the way by carefully inspecting the resulting data.

Parameters that influence termination conditions of the algorithm may have a significant impact on the performance results. It is advisable to check whether these parameters suit the test suite and the experimental conditions. More specifically,

  • check the termination status of the algorithm, at least on a random test basis (see the sketch after this list)
  • termination is usually advisable when the algorithm has stalled for some period of time. Long stagnations in the empirical runtime distributions indicate too late termination, whereas a still increasing distribution graph right before the budget is exhausted indicates too early termination.1
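
With scipy.optimize.fmin, for example, the termination status can be spot-checked by requesting the full output, where a nonzero warnflag means that the maximal number of function evaluations or iterations was reached (a sketch; for another algorithm, check its own termination reporting instead):

import scipy.optimize

# spot check on a standard test function; in a COCO experiment the problem
# instance is passed instead of scipy.optimize.rosen
xopt, fopt, n_iter, n_evals, warnflag = scipy.optimize.fmin(
    scipy.optimize.rosen, [2.0, 3.0], maxfun=200, full_output=True, disp=False)
print(warnflag)  # 0: converged, 1: maxfun exceeded, 2: maxiter exceeded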

The above budget_multiplier for the experiment is ideally not used as input budget to the algorithm, because it is meant as an experimental condition rather than as an algorithm parameter. For algorithms without any other termination criteria, however, it may be used. When the experimental budget is smaller than the budget the algorithm uses, the latter determines the outcome. When the algorithm terminates earlier without success, another experiment sweep is invoked by the repeater.

In order to get some results as quickly as possible, and besides running batches in parallel, it can be useful in the first experiments to omit some instances and the largest dimension of the suite (in exchange for a larger budget_multiplier) using the suite parameter options. Running only the first 5 instances and passing min_successes=5 to ExperimentRepeater has been found to be a useful practical approach.
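
Such a quick first setup might look as follows, replacing the corresponding ### input and ### prepare lines of the practical example above (min_successes is taken from the remark above; the exact keyword should be checked against the cocoex documentation):

import cocoex

suite_name = "bbob"
budget_multiplier = 3  # can be larger since the suite below is reduced
suite = cocoex.Suite(suite_name, "instances: 1-5",
                     "dimensions: 2,3,5,10,20")  # omit the largest dimension at first
repeater = cocoex.ExperimentRepeater(budget_multiplier,
                                     min_successes=5)  # as suggested above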

Footnotes

  1. The budget crosses in the figures indicate the actually used budget (median), which can be (quite) different from the budget given here.