Motivation
(2–3 minute read)
The COCO framework has been motivated by (a) the highly repetitive nature of benchmarking and (b) the realization that a proper and robust benchmarking methodology is more intricate than we had hoped.
In designing the test suites, our objectives were to
- reflect typical difficulties we have observed in real-world applications;
- test typical and important generic aspects of continuous domain optimization (aspects that are, by their generic nature, hard to ignore);
- allow for interpretation: the functions are comprehensible under human scrutiny;
- allow for comprehensive experimentation: the functions are quick to evaluate.
While our test suites are motivated by real-world difficulties, we do not claim that the distribution of functions in these suites truly reflects any distribution found in the real world.[1] We do not even claim to know such a real-world distribution for any broader class of functions.
Generally, decent lightweight (but somewhat fragile) benchmarking can be done[2] without dealing with the most intricate aspects of COCO, which are
- the systematic creation of different, slightly perturbed function instances (see the sketch after this list),
- the systematic application of restarts with their seamless integration into performance measures and figures, and
- the comprehensive and lossless aggregation of results.
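As a rough illustration of the first point, the following toy sketch derives distinct "instances" of a base function by seeded translation and rotation. It mimics the idea in spirit only; `make_instance` and `sphere` are hypothetical names, and this is not COCO's actual instance-generation code.

```python
import numpy as np

def make_instance(base_f, dim, instance_seed):
    """Derive a translated and rotated variant of base_f.

    Toy illustration only; hypothetical helper, not COCO's
    actual instance-generation code.
    """
    rng = np.random.default_rng(instance_seed)
    x_opt = rng.uniform(-4, 4, dim)    # instance-specific optimum location
    f_opt = rng.uniform(-100, 100)     # instance-specific optimal value
    # random rotation: orthogonal factor of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    def instance(x):
        return base_f(Q @ (np.asarray(x) - x_opt)) + f_opt
    return instance

sphere = lambda z: float(z @ z)
f1 = make_instance(sphere, dim=10, instance_seed=1)
f2 = make_instance(sphere, dim=10, instance_seed=2)  # same function, different instance
```

Running an algorithm on several such instances, rather than on one fixed function, discourages overfitting to accidental features such as the location of the optimum.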
Compared to a one-off benchmarking setup, the COCO framework provides robustness, both in the methodology and in the exploitation of the results.
Additionally, COCO allows for direct comparison with a variety of previous results, which, to us, has become one of its most appealing features.
Footnotes
[1] The ultimately relevant question is not how well a benchmark function (or suite) approximates real-world problems, but how strongly performance on the benchmark function (or suite) is positively correlated with performance on the real-world problem(s) of interest. (In other words, the relevant distance measure is similarity in algorithm performance.) The answer generally depends on the considered algorithms as well. Algorithm invariance properties and other measures against overfitting are likely to increase such a correlation. Possibly due to the systematic instance creation, we have seen surprisingly little overfitting to the COCO test suites over more than a decade.
[2] Given a suite of test functions, we can simply display convergence plots of single runs (and their median), as in the sketch below. However, avoiding any and all possible pitfalls when starting from scratch always remains a challenge: there are many small but potentially significant decisions to be made in the process.
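A minimal sketch of such a lightweight display, assuming the best-so-far f-values of each run have already been recorded (here simulated with random data purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data standing in for real measurements: best-so-far
# f-values of 15 independent runs on one test function
# (rows: runs, columns: function evaluations).
rng = np.random.default_rng(0)
best_so_far = np.minimum.accumulate(rng.lognormal(size=(15, 1000)), axis=1)
evals = np.arange(1, best_so_far.shape[1] + 1)

for run in best_so_far:  # convergence plot of each single run
    plt.loglog(evals, run, color='gray', alpha=0.4)
plt.loglog(evals, np.median(best_so_far, axis=0),
           color='black', lw=2, label='median of 15 runs')
plt.xlabel('function evaluations')
plt.ylabel('best f-value so far')
plt.legend()
plt.show()
```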