Critics and Rebuttals

This is a potentially growing list of critical remarks and our rebuttals.

COCO was not designed with practitioners in mind.

We don’t believe in mind reading. The motivations for COCO are outline here.

The sphere function is not interesting.

Whether and how fast an algorithm can solve simple functions (locally or globally) is a relevant and consequential observation and should be factored in into a decision which algorithm to use on any given problem and setup.

The sphere function is not relevant in practice.

Many real world problems are comparatively simple and very much akin to the sphere function. These are not interesting for challenging research questions but practitioners need to solve them every day nevertheless. If a test suite wanted to reflect a true distribution of all real world problems, the almost-sphere function would need to claim a significant part of this suite.

Highly ill-conditioned functions are not relevant in practice.

We have regularly seen real world problems with condition numbers considerably above 10^3 and have occasionally stumbled over real world problems with condition numbers above 10^{9}. The overwhelming success of second order methods in “classical” optimization is another testament that (highly) ill-conditioned functions are ubiquitous. In the standard bbob test suite, ill-conditioning is a core difficulty in 6/24 functions while multimodality is a core difficult in 11/24 functions.¹

In practice, we often have a fixed, limited budget (time). The experimental setup of COCO does not reflect this situation.

It is important to understand that the goal of benchmarking is to deliver the maximum amount of information to make an informed decision. The goal of running the algorithm in practice is to have the best chances to succeed within the limited budget. These are fundamentally different objectives and it would be very surprising if they would require the very same experimental setup. As the most basic example: in order to make the best informed decision for a given budget, because any generalization involves considerable uncertainties, we need to know the performance for (somewhat) smaller and (somewhat) larger budgets too. Hence, a fixed budget benchmarking setup can not possibly be the correct choice. The budget-free approach, which allows to read performance as a function of budget, appears to be the best setup we can think of.

The aggregation of all results from a single suite is meaningless and misleading.

We agree in parts:

Aggregated results often get more attention than they deserve. While our runtime aggregations literally contain all individual data uncompressed, they still convey much less information than an annotated profile over all individual results from all single functions. However, when there is time or space for a single figure only, the aggregated results are usually the most meaningful choice to display.
The aggregation of the results from a test suite does not reflect “the true real world performance” of an algorithm. Giving the impression that it does is misleading, suggesting that displaying such aggregation gives this impression is unjustified. The aggregation does (perfectly well) reflect the aggregated performance on this test suite, assuming that all functions are deemed as equally important.²

Not reflecting “the true real world performance”, which is an ultimate (but impossible) goal, does not render any aggregation meaningless.

There cannot exist a single universally best algorithm because there is No Free Lunch.

This is not a critics of COCO per se, but a common misconception of the No Free Lunch Theorems and their practical consequences, see here. The claim is wrong (specifically, the clause): No Free Lunch does not provide any indication as to whether a single universally best algorithm can exist in practice. On the other hand, it is trivial to see that no single algorithm can outperform (dominate) all other algorithms on all functions in any nontrivial set of functions, because for each function there exists an algorithm which hits the global optimum in its first iteration. (NFL makes an aggregation claim, not a dominance claim.) Similarly, it seems exceedingly unlikely that a single algorithm will outperform (on aggregate) all other algorithms in each specific function class of practical relevance and interest. On the other hand, it is conceivable to find “universal” algorithms that perform competitive on a wide range of different function classes, hence it is a legitimate research goal to search for these.

Footnotes

9/24 functions have a condition number in [1, 10], 9/24 functions have a condition number in [100, 10^4], 5/24 functions have a (high) condition number of 10^6 and one function is very highly ill-conditioned with \approx10^8 close to the optimum.↩︎
This is a reasonable assumption: largely insignificant functions would not have included into the suite in the first place. Yet, our aggregation methodology allows to give different (importance) weights to different functions and our code is open source. We are not aware of any proposal which argues why another, specific set of weights would make the aggregation over a COCO test suite more realistic compared to equal weights. We would certainly be interested to learn about such an argument! Changing the weight of a single function, say, to zero does very little in the overall performance picture of like 20-100 functions.↩︎