How to Publish Data Within COCO
There are two ways to make data easily accessible to the community:
- proposing inclusion of the data into `cocopp.archives`, or
- hosting a COCO archive of the data.
In both cases, the data must first be prepared. For each dataset, that is, for each benchmarked algorithm or algorithm variant:
- Zip the data folder.
  A data zipfile must contain a single folder under which all data from a single full experiment were collected. The folder can contain subfolders (or subsub…folders), for example with data from different (sub)batches of the complete experiment. Valid formats are .gzip, .tgz, and .zip.
- Rename the zip file.
  The name of the zipfile defines the name of the dataset. The name should represent the benchmarked algorithm and may contain author names (but preferably not the name of the test suite). The name can have any length, but the first ten or so characters should be a meaningful algorithm abbreviation.
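The two preparation steps can be sketched from a system shell as follows; the folder name `my_experiment` and the dataset name `ABC-DE-anne` are made-up examples standing in for a real experiment folder and a real algorithm abbreviation:

```shell
# Create a stand-in experiment folder with a subfolder (in practice,
# this folder is produced by the benchmarking experiment itself)
mkdir -p my_experiment/batch1
echo "sample" > my_experiment/batch1/data.info

# Zip the single top folder; .tgz is one of the valid formats.
# The zipfile name ABC-DE-anne.tgz becomes the dataset name.
tar -czf ABC-DE-anne.tgz my_experiment

# Verify the zipfile content lists the folder and its subfolders
tar -tzf ABC-DE-anne.tgz
```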
Propose Inclusion to the COCO Data Archive¹
This option is available if one or several datasets were used in a publication or in a preprint available for example on arXiv or HAL. For this:
- Upload the above data zipfile(s) to a file sharing site or to an accessible URL.
- Ask for the inclusion into `cocopp.archives` by opening an issue at numbbo/data-archive on Github with
  - the publication reference and a link to the paper
  - a very short description of each dataset including the name of
    - the algorithm
    - the test suite
    - the zip file
  - a link to the dataset zip file(s)
  - (optional but encouraged) a link to the source code to reproduce the dataset
Host an Archive
Hosting an archive means putting one or several data zipfiles, together with an added “archive definition text file”, online in a dedicated folder that can be accessed under a URL, like http://lq-cma.gforge.inria.fr/data-archives/lq-gecco2019. For example, any folder under a personal homepage root will do.
For this:
Move the above data zipfile(s) into a clean folder, possibly with subfolders.
The folder name is only used as part of the URL and can be changed after creating the archive. If desired, subfolders can be created that become part of the names of the datasets under this subfolder. These cannot be changed without repeating the following creation procedure:
Create the archive (two lines of Python code).
Assume the data zipfiles are in the folder elisa_2020 or its subfolders and cocopp is installed (pip install cocopp). In a Python shell, it suffices to type:
```python
import cocopp
cocopp.archiving.create('elisa_2020')
```
thereby “creating” the archive locally by adding an archive definition file to the folder elisa_2020. Archives can contain other archives as subfolders or, the other way around, additional subarchives can be created in any archive subfolder. This is how https://numbbo.github.io/data-archive/ is organized.
Alternative code (from a system shell):

```shell
python -c "import cocopp; cocopp.archiving.create('elisa_2020')"
```
Upload the archive folder and its content to where it can be accessed via a URL. The archive is now accessible with `cocopp.archiving.get('URL')` (see the example below).
Open an issue at the numbbo/data-archive Github repository of COCO (you need to have a Github account), signalling the URL of the archive with a short description of the dataset(s) in the archive.
Example of a resulting archive
For example, the bbob-mixint archive on Github contains five datasets. The folder structure for these datasets looks like this:
```
bbob-mixint/
|-- 2019-gecco-benchmark/
|   |-- CMA-ES-pycma.tgz
|   |-- DE-scipy.tgz
|   |-- RANDOMSEARCH.tgz
|   `-- TPE-hyperopt.tgz
|-- 2022/
|   `-- CMA-ESwM_Hamano.tgz
`-- coco_archive_definition.txt
```
and the corresponding `coco_archive_definition.txt` file looks like
```
[('2019-gecco-benchmark/CMA-ES-pycma.tgz', '0d8e7f2c77f4e43176bc9424ee8f9a0bfe8e7f66fabc95b15ea7a56ad8b1d667', 38514),
 ('2019-gecco-benchmark/DE-scipy.tgz', '494483b1bce9185f8977ce9abf6f6eac3a660efd6fa09321e305dfb79296cd18', 35401),
 ('2019-gecco-benchmark/RANDOMSEARCH.tgz', '14b237093fd1f393871c578b6b28b6f9a6c3d8dc8921e3bdb024b3cc7cdd287d', 26006),
 ('2019-gecco-benchmark/TPE-hyperopt.tgz', '34fede46a00c8adef4c388565c3b759c07a7d7d83366e115632b407764e64bf6', 19633),
 ('2022/CMA-ESwM_Hamano.tgz', 'caaf35f552822bc8376716c6af9f41aaceeebc1e63fece386fa12929c53338ca', 16406)]
```
with hashcodes and filesizes as additional entries.
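These entries can be reproduced with the Python standard library. The following sketch shows how the hash code and file size of each zipfile relate to a definition entry; the helper name `definition_entry` is made up for illustration (in practice, `cocopp.archiving.create` computes these entries for you):

```python
import hashlib
import os

def definition_entry(path):
    """Return a (path, sha256-hexdigest, size-in-bytes) tuple, the
    format of one entry in a coco_archive_definition.txt file.

    Hypothetical helper for illustration only; cocopp.archiving.create
    generates the definition file automatically."""
    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return (path, digest, os.path.getsize(path))
```

The hash allows `cocopp` to verify the integrity of a downloaded zipfile, and the size allows it to report download progress.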
Using an archive by URL
Here’s an example of how to use a (possibly self-hosted) archive from a URL with the `cocopp` package for postprocessing.
```python
import cocopp
url = 'https://numbbo.github.io/data-archive/data-archive/bbob'
arch = cocopp.archiving.get(url)
print(arch)  # `arch` "is" a `list` of relative filenames
```
```
['2009/ALPS_hornby_noiseless.tgz',
 '2009/AMALGAM_bosman_noiseless.tgz',
 '2009/BAYEDA_gallagher_noiseless.tgz',
 '2009/BFGS_ros_noiseless.tgz',
 '2009/BIPOP-CMA-ES_hansen_noiseless.tgz',
 '2009/CMA-ESPLUSSEL_auger_noiseless.tgz',
 '2009/Cauchy-EDA_posik_noiseless.tgz',
 # suppressing remainder output
 [...]
]
```
```python
print(arch == cocopp.archives.bbob)
```
```
True
```
```python
# compare local result with a result from `arch`
# and from the `cocopp.archives.bbob` archive
dsl = cocopp.main([
    # 'exdata/my_local_results',  # in case
    cocopp.archives.bbob.get_first('2020/SLSQP-11'),  # downloads if necessary
    arch.get('2010/IPOP-CMA')])
```
Footnotes
¹ Currently requires a Github account. If this is an issue for you, please contact one of the BBOBies via email.