This live report demonstrates how CK can
help researchers create reproducible and interactive articles from reusable components
- GitHub repositories with shared artifacts in CK format: reproduce-ck-paper, reproduce-ck-paper-large-experiments, ctuning-programs, ctuning-datasets-min, ck-analytics, ck-autotuning, ck-env
- BIB
- Open archive (arXiv) with PDF
- Related DATE'16 article (PDF)
- Wiki describing how to reproduce some experiments via CK
- Some Reddit discussions
- Partially funded by TETRACOM project
- Related Collective Knowledge infrastructure and repository (CK)
- Related Collective Mind infrastructure and repository (deprecated for CK)
- Extends our previous work: 1, 2, 3, 4, 5
- Supports our open publication model
Abstract
Nowadays, engineers often have to develop software without even
knowing which hardware it will eventually run on, across numerous
mobile phones, tablets, desktops, laptops, data centers,
supercomputers and cloud services.
Unfortunately, optimizing compilers are no longer keeping pace with
the ever-increasing complexity of ever-changing computer systems
and may produce severely underperforming executable code
while wasting expensive resources and energy.
We present the first practical, collaborative
and publicly available solution to this problem that we are aware of.
We help the software engineering community gradually implement
and share lightweight wrappers around any piece of software with
more than one implementation or optimization choice available.
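To illustrate the idea (this is a minimal sketch, not CK's actual wrapper API — the function and field names here are hypothetical), such a lightweight wrapper exposes the available implementation choices of one computational species and records the characteristics of each run:

```python
import time

def make_wrapper(implementations):
    """Hypothetical lightweight wrapper: exposes several interchangeable
    implementations of the same computational species and logs the
    observed characteristics (here, only execution time) of each call."""
    log = []
    def run(choice, *args):
        start = time.perf_counter()
        result = implementations[choice](*args)
        log.append({"choice": choice, "time_s": time.perf_counter() - start})
        return result
    return run, log

# Two interchangeable implementations of the same piece of functionality.
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

run, log = make_wrapper({"loop": sum_loop, "builtin": sum})
data = list(range(1000))
assert run("loop", data) == run("builtin", data)  # same result, two choices logged
```

The collected log entries are what a public repository could then aggregate across many users and machines.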
These wrappers are connected with a public
Collective Mind autotuning infrastructure and repository of
knowledge *
to continuously monitor all important characteristics of these pieces
(computational species) across numerous existing hardware
configurations in realistic environments together with randomly
selected optimizations.
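The random selection of optimizations can be sketched as follows (the flag list and probabilities below are illustrative, not CK's actual optimization-space description format):

```python
import random

# Illustrative GCC choices only; a real optimization space is far larger.
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Os"]
EXTRA_FLAGS = ["-fno-if-conversion", "-funroll-loops", "-ftree-vectorize"]

def random_optimization(rng):
    """Draw one random optimization: a base level plus a random
    subset of extra flags, as one point in the search space."""
    flags = [rng.choice(BASE_LEVELS)]
    flags += [f for f in EXTRA_FLAGS if rng.random() < 0.5]
    return " ".join(flags)

rng = random.Random(42)
print(random_optimization(rng))
```

Each participating machine evaluates such randomly drawn points, so the community collectively explores the space without any single user bearing the full cost.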
At the same time, Collective Mind Node
makes it easy to crowdsource time-consuming autotuning across
existing Android-based mobile devices, including commodity
mobile phones and tablets.
Similar to natural sciences, we can now continuously track all
winning solutions (optimizations for a given hardware such
as compiler flags, OpenCL/CUDA/OpenMP/MPI/skeleton parameters,
number of threads and any others exposed by users) that minimize
all costs of a computation (execution time, energy spent, code
size, failures, memory and storage footprint, optimization time,
faults, contentions, inaccuracy and so on) of a given species
on a Pareto frontier along with any unexpected behavior
at c-mind.org/repo.
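A minimal sketch of how winning solutions on a Pareto frontier over two costs can be tracked (the real repository handles many more cost dimensions; the sample points below are taken from the GCC 4.4.4 day-gray measurements reported later in this report):

```python
def pareto_frontier(points):
    """Keep the points that no other point dominates in both
    cost dimensions (lower time AND lower size)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

# (time_s, size_bytes, optimization) measurements:
points = [(4.622, 10776, "-O3"), (4.631, 10168, "-O2"),
          (4.621, 10152, "-O1"), (4.668, 9744, "-Os")]
frontier = pareto_frontier(points)
print(sorted(p[2] for p in frontier))  # ['-O1', '-Os']
```

On this sample, -O1 wins on time and -Os wins on size, while -O3 and -O2 are dominated and can be pruned.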
Furthermore, the community can continuously classify solutions,
prune redundant ones, and correlate them with various features
of the software, its inputs (data sets) and the hardware used, either
manually (similar to Wikipedia) or using available big data
analytics and machine learning techniques.
Our approach can also help the computer engineering community create
the first public, realistic, large, diverse, distributed, representative,
and continuously evolving benchmark with related optimization
knowledge while gradually covering all possible software and
hardware to be able to predict best optimizations and improve
compilers depending on usage scenarios and requirements.
Such continuously growing collective knowledge, accessible via
a simple web service, can become an integral part of the practical
software and hardware co-design of self-tuning computer systems
as we demonstrate in several real usage scenarios validated
in industry.
*
Note that we have moved all our developments to a newer, smaller,
simpler and faster version of Collective Mind, known as Collective Knowledge (CK).
This open source, BSD-licensed framework with a live repository is available here:
http://github.com/ctuning/ck
and http://cknowledge.org/repo.
Documentation with all examples is also available at http://github.com/ctuning/ck/wiki.
P1
Reproducing adaptive filter experiments (collaboratively finding features missing from the system in order to enable run-time adaptation):
CK scripts: reproduce-filter-speedup
CK datasets:
Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 4.4.4
(CK public repo: all experiments,
compiler description,
all compilers)
| Optimization | Binary size | Dataset image-raw-bin-fgg-office-day-gray: min time (s); exp time (s); var (%) | Dataset image-raw-bin-fgg-office-night-gray: min time (s); exp time (s); var (%) |
|---|---|---|---|
| -O3 | 10776 | 4.622; 4.634; 0.7% | 4.630; 4.653; 1.0% |
| -O3 -fno-if-conversion | 10784 | 5.169; 5.193; 1.0% (slowdown over -O3: 1.12) | 4.091; 4.094; 0.2% (speedup over -O3: 1.14) |
| -O2 | 10168 | 4.631; 4.754; 10.2% | 4.623; 4.639; 0.7% |
| -O1 | 10152 | 4.621; 4.633; 0.8% | 4.623; 4.685; 3.6% |
| -Os | 9744 | 4.668; 4.678; 0.6% | 4.666; 4.685; 0.9% |
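The speedup/slowdown figures in the table above are consistent with simple ratios of the expected ("exp") times (assuming that is how they were derived; the exact formula is not shown in this report):

```python
# Expected times from the GCC 4.4.4 table above.
exp_o3_day, exp_fnoif_day = 4.634, 5.193      # day-gray dataset
exp_o3_night, exp_fnoif_night = 4.653, 4.094  # night-gray dataset

# Ratio > 1 means -fno-if-conversion is slower than -O3 on that dataset.
slowdown_day = exp_fnoif_day / exp_o3_day
speedup_night = exp_o3_night / exp_fnoif_night

print(round(slowdown_day, 2), round(speedup_night, 2))  # 1.12 1.14
```

This is exactly the kind of dataset-dependent behavior (one flag hurts on one input and helps on another) that motivates run-time adaptation.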
Note: CK allows the community to keep validating the above results and to share unexpected
behavior in the public repository at cknowledge.org/repo. Some such shared results are shown below:
Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 4.9.1
(CK public repo: all experiments,
compiler description,
all compilers)
| Optimization | Binary size | Dataset image-raw-bin-fgg-office-day-gray: min time (s); exp time (s); var (%) | Dataset image-raw-bin-fgg-office-night-gray: min time (s); exp time (s); var (%) |
|---|---|---|---|
| -O3 | 11008 | 4.619; 4.630; 0.6% | 4.603; 4.628; 1.0% (slower than GCC 4.4.4 -O3 -fno-if-conversion) |
| -O3 -fno-if-conversion | 11008 | 4.615; 4.625; 0.6% | 4.624; 4.628; 0.3% (slower than GCC 4.4.4 -O3 -fno-if-conversion) |
| -O2 | 10880 | 4.632; 4.635; 0.1% | 4.602; 4.647; 2.5% |
| -O1 | 10360 | 4.625; 4.637; 0.7% | 4.630; 4.654; 1.8% |
| -Os | 10376 | 4.635; 4.653; 0.8% | 4.630; 4.652; 0.8% |
Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 5.2.0
(CK public repo: all experiments,
compiler description,
all compilers)
| Optimization | Binary size | Dataset image-raw-bin-fgg-office-day-gray: min time (s); exp time (s); var (%) | Dataset image-raw-bin-fgg-office-night-gray: min time (s); exp time (s); var (%) |
|---|---|---|---|
| -O3 | 10776 | 4.622; 4.632; 0.8% | 4.630; 4.631; 0.2% (slower than GCC 4.4.4 -O3 -fno-if-conversion) |
| -O3 -fno-if-conversion | 10776 | 4.629; 4.649; 0.8% | 4.610; 4.626; 1.1% (slower than GCC 4.4.4 -O3 -fno-if-conversion) |
| -O2 | 10568 | 4.599; 4.610; 0.6% | 4.597; 4.603; 0.4% |
| -O1 | 10032 | 4.613; 4.616; 0.2% | 4.605; 4.616; 0.8% |
| -Os | 10088 | 4.609; 4.630; 1.4% | 4.608; 4.615; 0.4% |
Just for comparison during crowd-benchmarking: Samsung Chromebook 2; Samsung EXYNOS5; ARM Cortex A15/A7; ARM Mali-T628; Ubuntu 12.04 32bit; GCC 4.9.2
(CK public repo: all experiments,
compiler description,
all compilers)
| Optimization | Binary size | Dataset image-raw-bin-fgg-office-day-gray: min time (s); exp time (s); var (%) | Dataset image-raw-bin-fgg-office-night-gray: min time (s); exp time (s); var (%) |
|---|---|---|---|
| -O3 | 7416 | 7.396; 7.513; 2.9% | 7.390; 7.464; 2.6% |
| -O3 -fno-if-conversion | 7424 | 7.345; 7.455; 3.8% | 7.384; 7.490; 2.6% |
| -O2 | 7100 | 7.398; 7.926; 39.2% | 7.450; 7.514; 2.3% |
| -O1 | 7072 | 7.404; 7.444; 1.4% | 7.389; 7.443; 2.9% |
| -Os | 6292 | 7.367; 7.409; 1.5% | 7.375; 7.479; 2.7% |
The above results support our wrapper-based approach (computational species) applied to
the most time-consuming kernels and libraries, combined with exposed features and automatically
built and continuously refined decision trees via CK, as described in this paper and
its first part.
They also demonstrate that CK can be used to balance a decision tree's prediction
rate (accuracy) against its size and speed, to ensure that the ultimate decision tree,
implemented in C and embedded in a kernel wrapper, is fast and compact enough.
Note: we are gradually converting all the code and data related to this paper
from the deprecated Collective Mind Format to the new Collective Knowledge Framework.