Collective Knowledge Aggregator proof-of-concept
Live CK report with shared artifacts and interactive graphs: analyzing, autotuning and adapting OpenCL applications (HOG and SLAMBench) across heterogeneous platforms
Anton Lokhmotov 1, Grigori Fursin 1,2
1 dividiti, UK,   2 cTuning Foundation, France

This live report demonstrates how CK can help researchers create reproducible and interactive articles from reusable components.


Here we provide details on how to reproduce the experiments via the Collective Knowledge Framework and Repository (CK). You can download and install all related CK repositories on your local machine from the command line simply as
    ck pull repo:reproduce-carp-project
    ck pull repo:reproduce-ck-paper-large-experiments
    ck pull repo:reproduce-ck-paper
    ck pull repo:reproduce-pamela-project
    ck pull repo:reproduce-pamela-project-small-dataset
   
We would like to thank the community for validating various results and sharing unexpected behavior with us via our public CK repository!

Artifact and experiment workflow sharing via open-source Collective Knowledge Infrastructure

Please check out our motivation for developing the free, open-source and customizable Collective Knowledge Infrastructure (CK) for collaborative artifact and workflow sharing and reuse.

Briefly, CK allows researchers to organize and share artifacts (code and data) as reusable Python components with a simple JSON API. Unlike other tools, CK uses an agile, schema-free and specification-free approach. This is particularly important in computer engineering, where software changes practically every day: we need to avoid wasting weeks or months on experiment specification when experimental results may become outdated within days. Instead, only once a research idea has been quickly prototyped and validated does CK allow the community to collaboratively add and improve meta-descriptions of interfaces and artifacts via a simple and human-readable JSON format (see the following papers for more details [1, 2]).
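To illustrate this JSON-in/JSON-out convention, here is a minimal sketch of a CK-style component. The function name and dictionary keys below are illustrative (they are not part of the actual CK codebase), but the pattern — a Python dictionary in, a Python dictionary out, with errors signalled via a 'return' code — is the one CK components follow:

```python
# Minimal sketch of the CK JSON-in/JSON-out component convention.
# All names below are illustrative, not actual CK modules or actions.

def compile_program(i):
    """Hypothetical CK-style action: takes a dict, returns a dict.

    A non-zero 'return' code plus an 'error' string mimics how CK
    actions report failures across components instead of raising
    exceptions that callers in other components cannot interpret.
    """
    if 'program' not in i:
        return {'return': 1, 'error': 'program name is not specified'}
    flags = i.get('flags', '-O3')
    # A real action would invoke the compiler here; we just echo the choices.
    return {'return': 0, 'program': i['program'], 'flags': flags}

r = compile_program({'program': 'hog', 'flags': '-O3 -fno-if-conversion'})
if r['return'] > 0:
    raise RuntimeError(r['error'])
print(r['flags'])
```

Because every component speaks the same dictionary protocol, components can be chained into pipelines and their inputs/outputs serialized to JSON for sharing and replay.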




Such components can be easily connected into experimental workflows (pipelines), like LEGO (TM) bricks, to quickly prototype research ideas, crowdsource experimentation, preserve and query results in a third-party Hadoop-based Elasticsearch repository, perform universal and multi-objective autotuning, apply predictive analytics, share artifacts as standard zip archives or via GitHub/Bitbucket for artifact evaluation at conferences, workshops and journals, enable live and interactive articles, etc.:



The CK open format can be used as a template for artifact evaluation:





P1

Reproducing adaptive filter experiments (collaboratively finding features which are not in the system to enable run-time adaptation) via CK:

CK scripts: reproduce-filter-speedup

Data set 1: image-raw-bin-fgg-office-day-gray (features)
Data set 2: image-raw-bin-fgg-office-night-gray (features) (article explaining this unexpected behavior)

Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 4.4.4
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                     10776         4.622 ; 4.634 ; 0.7%                      4.630 ; 4.653 ; 1.0%
-O3 -fno-if-conversion  10784         5.169 ; 5.193 ; 1.0%                      4.091 ; 4.094 ; 0.2%
                                      (slowdown over -O3: 1.12)                 (speedup over -O3: 1.14)
-O2                     10168         4.631 ; 4.754 ; 10.2%                     4.623 ; 4.639 ; 0.7%
-O1                     10152         4.621 ; 4.633 ; 0.8%                      4.623 ; 4.685 ; 3.6%
-Os                      9744         4.668 ; 4.678 ; 0.6%                      4.666 ; 4.685 ; 0.9%

Note: CK allows the community to continue validating the above results and to share unexpected behavior via the public repository at cknowledge.org/repo. Some of these shared results follow:

Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 4.9.1
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                     11008         4.619 ; 4.630 ; 0.6%                      4.603 ; 4.628 ; 1.0% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O3 -fno-if-conversion  11008         4.615 ; 4.625 ; 0.6%                      4.624 ; 4.628 ; 0.3% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O2                     10880         4.632 ; 4.635 ; 0.1%                      4.602 ; 4.647 ; 2.5%
-O1                     10360         4.625 ; 4.637 ; 0.7%                      4.630 ; 4.654 ; 1.8%
-Os                     10376         4.635 ; 4.653 ; 0.8%                      4.630 ; 4.652 ; 0.8%

Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 5.2.0
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                     10776         4.622 ; 4.632 ; 0.8%                      4.630 ; 4.631 ; 0.2% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O3 -fno-if-conversion  10776         4.629 ; 4.649 ; 0.8%                      4.610 ; 4.626 ; 1.1% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O2                     10568         4.599 ; 4.610 ; 0.6%                      4.597 ; 4.603 ; 0.4%
-O1                     10032         4.613 ; 4.616 ; 0.2%                      4.605 ; 4.616 ; 0.8%
-Os                     10088         4.609 ; 4.630 ; 1.4%                      4.608 ; 4.615 ; 0.4%

Just for comparison during crowd-benchmarking: Samsung Chromebook 2; Samsung EXYNOS5; ARM Cortex A15/A7; ARM Mali-T628; Ubuntu 12.04 32bit; GCC 4.9.2
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                      7416         7.396 ; 7.513 ; 2.9%                      7.390 ; 7.464 ; 2.6%
-O3 -fno-if-conversion   7424         7.345 ; 7.455 ; 3.8%                      7.384 ; 7.490 ; 2.6%
-O2                      7100         7.398 ; 7.926 ; 39.2%                     7.450 ; 7.514 ; 2.3%
-O1                      7072         7.404 ; 7.444 ; 1.4%                      7.389 ; 7.443 ; 2.9%
-Os                      6292         7.367 ; 7.409 ; 1.5%                      7.375 ; 7.479 ; 2.7%

P2

Interactive graph of SLAMBench (OpenCL) multi-objective algorithm autotuning (FPS vs accuracy or any other characteristic, with Pareto frontier)

  • Samsung ChromeBook 2; Samsung EXYNOS5; ARM Cortex A15/A7; ARM Mali-T628; Ubuntu 12.04; video 640x480
  • Clang 3.6.0 -O3
  • Big green dot: default algorithm configuration
  • Small blue dots: random algorithm configurations (see CK JSON autotuning description with ranges or all scripts):
    • Param c (compute-size-ratio, default 1)
    • Param r (integration-rate, default 1)
    • Param l (icp threshold)
    • Param m (default is 0.1)
    • Param y1 (pyramid-levels X, default=10)
    • Param y2 (pyramid-levels Y, default=5)
    • Param y3 (pyramid-levels Z, default=4)
    • Param v (volume-resolution, default=256)
  • Big red dots: Pareto frontier
  • Requires 2 more CK repositories for reproducibility:
  • Note: we have not yet implemented visualization of variation in D3 graphs - help is appreciated!
  • Table with Pareto results is available here
  • Table with crowdbenchmarking results across multiple shared platforms vs FPS is available here
Note that with minimal changes we can monitor and balance any characteristic exposed via CK, such as energy (currently on Odroid platforms), code size, compilation time, and anything else, across multiple heterogeneous platforms running Android, Linux, Windows, MacOS, etc. We can also tune any parameter exposed via CK, such as compiler flags/passes, hardware configuration (e.g. frequency or number of cores used), OpenCL parameters such as worksize, CUDA/OpenMP/MPI parameters, and algorithm parameters. The community can gradually add and share more parameters and characteristics for tuning.
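The Pareto frontier extraction behind the graph above can be sketched as follows. This is a minimal illustration only: we assume two characteristics (FPS and accuracy), both to be maximized, and the sample points are invented, not taken from the actual SLAMBench runs:

```python
# Minimal sketch of extracting a Pareto frontier from autotuning results.
# Each point is (fps, accuracy); both are maximized. Sample data is invented.

def pareto_frontier(points):
    """Return the points not dominated by any other point
    (q dominates p if q is at least as good on every objective)."""
    frontier = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# Invented (fps, accuracy) pairs for random algorithm configurations:
configs = [(10.0, 0.80), (12.0, 0.75), (9.0, 0.90), (11.0, 0.85), (8.0, 0.70)]
print(sorted(pareto_frontier(configs)))
# → [(9.0, 0.9), (11.0, 0.85), (12.0, 0.75)]
```

The frontier points correspond to the "big red dots": no other configuration improves one objective without losing on the other.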


P3

Non-interactive graph analyzing CPU/GPU (with memory transfer) and GPU-only time of the HOG image processing application (machine learning algorithms) versus different OpenCL optimizations and hardware configurations

  • Samsung ChromeBook 1/2; ARM Cortex A15/A7; ARM Mali-T604/T628; Debian; random images; calc_histogram CPU/OpenCL kernel
  • Example of experiments not executed via CK but exported to CK by Anton Lokhmotov from his scripts in January 2015 (before CK was mature enough) to take advantage of predictive analytics (active learning, decision trees, predictive scheduling). Since June 2015, however, it has been possible to perform the same experimentation fully in CK (see the HOG examples in the CK Getting Started Guide).
  • ARM Mali OpenCL 5.0 (default optimization)
  • Blue line: calc_histogram kernel CPU time (s.)
  • Red line: calc_histogram kernel GPU with memory transfer time (s.)
  • Yellow line: calc_histogram kernel GPU time (s.)
  • Green dashed line: adaptation threshold (left - better to run on CPU; right - better to run on GPU)
  • Tuned/varied parameters
    • GWS_X
    • GWS_Y
    • GWS_Z
    • LWS_X
    • LWS_Y
    • LWS_Z
    • OFFSET_X
    • OFFSET_Y
    • OFFSET_Z
    • CPU frequency
    • GPU frequency
    • image cols
    • image rows
    • worksize
  • Requires 2 more CK repositories for reproducibility:
  • Raw experiments

Samsung ChromeBook 1; ARM A7; ARM Mali T604

Samsung ChromeBook 2; ARM A15/A7; ARM Mali T628


Note: CK allows the community to validate the results and share unexpected behavior via the public repository at cknowledge.org/repo.


P4

3 interactive graphs of (CPU time) / (GPU with memory transfer time) for the above HOG application versus experiment number, used for further predictive analytics to enable predictive scheduling at run-time (see our paper for motivation).
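The run-time decision this ratio enables can be sketched as follows. This is a minimal illustration: the function name and sample timings are hypothetical, while the 1.07 threshold is the one used for the objective function in the modeling section below:

```python
# Sketch of adaptive CPU/GPU scheduling based on a measured time ratio.
# Timings below are invented for illustration.

THRESHOLD = 1.07  # (CPU time) / (GPU with memory transfer time)

def choose_device(cpu_time, gpu_with_copy_time):
    """Pick where to run the kernel: the GPU pays off only when CPU time
    exceeds GPU-plus-transfer time by more than the threshold."""
    ratio = cpu_time / gpu_with_copy_time
    return 'GPU' if ratio > THRESHOLD else 'CPU'

print(choose_device(7.4, 5.1))  # large image: GPU wins despite transfer cost
print(choose_device(0.9, 1.2))  # small image: transfer cost dominates, stay on CPU
```

In practice the ratio is not measured at run-time but predicted from features (image size, frequencies, worksize), which is exactly what the decision trees below are trained to do.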



P5

Building predictive models via CK (scikit-learn and decision trees with different features and depth)



The unification of experiments in CK makes it easy to connect experimental results with various powerful machine learning techniques to help understand correlations between 'characteristics', 'features', 'choices' and the 'state'. Furthermore, researchers and the community can focus their effort on explaining unexpected behavior (which can be replayed in CK), finding missing features (where the model fails to explain the results) and improving models to make better predictions.

Next, we demonstrate such an example: first building a predictive model for adaptive CPU/GPU scheduling for the above HOG example for Samsung Chromebook 1, then trying to reuse it for Samsung Chromebook 2, then building a new model for Samsung Chromebook 2, finding better features (combinations of existing ones, or new ones that were not even in the system) and reaching a nearly 99% prediction rate, at least on the current setup (it can be continuously improved by the community or via active learning - see our previous papers for more details [1], [2]).
Experimental setup (see above):

Objective function (decision):
  • CPU/GPU COPY > 1.07 (true/false)? Needed for predictive run-time scheduling
SF1: Features (set 1):
  • worksize
SF2: Features (set 2, includes set 1):
  • GWS0
  • GWS1
  • GWS2
  • cpu_freq
  • gpu_freq
  • image cols
  • image rows
  • image size
SF3: Designed features (set 3, includes set 2):
  • image_size_div_by_cpu_freq
  • image_size_div_by_gpu_freq
  • cpu_freq_div_by_gpu
  • image_size_div_by_cpu_div_by_gpu_freq
  • image_size_div_by_cpu_freq

Example of a decision tree built for Chromebook 1 to schedule the HOG kernel at run-time either on the CPU or the GPU, with feature set 2 and depth 3 (achieving 98% correct predictions on 296 points and 100% on 20 points):
Scripts to build/validate above model in CK (t604-second-set-of-features-depth3)

Note that a few features were enough to achieve nearly perfect predictions (to some extent, a decision tree can be used like PCA to quickly and continuously explain correlations while exposing unexpected behavior to researchers in the workgroup or to the community, in order to find more features and feature combinations or to improve models).
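This modeling step can be sketched with scikit-learn. The sketch below is a minimal illustration, not the actual CK scripts: the training data is synthetic (we simply assume larger images favor the GPU, mimicking the adaptation threshold), and the feature subset is in the spirit of feature set 2:

```python
# Sketch of building a depth-limited decision tree for CPU/GPU scheduling.
# Synthetic data, not the actual HOG measurements: we assume larger images
# favor the GPU. Features: [image_cols, image_rows, worksize]; label 1 = GPU.

from sklearn.tree import DecisionTreeClassifier

X = [[64, 64, 256], [128, 96, 256], [640, 480, 1024],
     [800, 600, 1024], [96, 64, 512], [1024, 768, 2048]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

print(model.score(X, y))                      # prediction rate on training points
print(model.predict([[720, 576, 1024]])[0])   # hypothetical new image -> device
```

Limiting `max_depth` is the knob varied in the table below: deeper trees fit more points but grow larger and slower, which matters once the tree is embedded in a run-time wrapper.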

P6

Table with correct prediction rate for adaptive CPU/GPU scheduling of HOG kernel via CK on two platforms

Results have either 2 or 3 numbers:
Prediction rate with training on 20 points / prediction rate with training on 276 points (/ prediction rate with training on 460 points)

It is useful to see the decision tree refinement during active learning (improving the model on the fly as more points become available). Click on a prediction rate number to see the decision tree as a PNG (or change png to pdf in the URL to see it as a scalable PDF). Note that such a CK-generated decision tree can also be converted to C to be integrated inside adaptive wrappers for kernels.
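The tree-to-C conversion can be sketched as follows. This is a minimal illustration using scikit-learn's fitted-tree arrays, not the actual CK converter; the emitted function name and feature names are hypothetical:

```python
# Sketch: emit a C decision function from a fitted scikit-learn tree.
# Not the actual CK converter; names below are illustrative.

from sklearn.tree import DecisionTreeClassifier

def tree_to_c(model, feature_names, fn_name='predict_device'):
    """Walk the fitted tree's arrays and emit nested C if/else blocks."""
    t = model.tree_
    lines = ['int %s(const double *f) {' % fn_name]

    def walk(node, indent):
        pad = '  ' * indent
        if t.children_left[node] == -1:  # leaf: return the majority class
            lines.append(pad + 'return %d;' % int(t.value[node][0].argmax()))
        else:
            lines.append(pad + 'if (f[%d] <= %.6f) {  /* %s */'
                         % (t.feature[node], t.threshold[node],
                            feature_names[t.feature[node]]))
            walk(t.children_left[node], indent + 1)
            lines.append(pad + '} else {')
            walk(t.children_right[node], indent + 1)
            lines.append(pad + '}')

    walk(0, 1)
    lines.append('}')
    return '\n'.join(lines)

# Tiny synthetic model: features [image_cols, worksize], label 1 = GPU.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit([[64, 256], [640, 1024], [96, 512], [800, 2048]], [0, 1, 0, 1])
print(tree_to_c(model, ['image_cols', 'worksize']))
```

The generated function takes the run-time feature vector and returns the device choice, so the compiled wrapper pays only a few comparisons per call.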

Columns per platform row: Feature set 1 (Depth 2); Feature set 2 (Depths 2, 3, 4, 5); Feature set 3 (Depths 2, 3, 4, 5, 6)
Samsung Chromebook 1 (Mali T604); own models 95.0% / 97.8% 97.8% / 97.8% 100.0% / 97.8% 100.0% / 99.3% 100.0% / 99.6% 100.0% / 100.0%
Samsung Chromebook 2 (Mali T628); models from Chromebook 1 51.1% / 51.7% 30.0% / 51.5% 30.0% / 51.8% / 52.6%
Samsung Chromebook 2 (Mali T628); own models 85.0% / 82.3% / 88.3% 95.0% / 90.2% / 93.0% 95.0% / 91.0% / 93.5% 95.0% / 90.2% / 93.0% 95.0% / 90.2% / 93.0% 95.0% / 90.2% / 93.0% 100.0% / 95.3% / 96.1% 100.0% / 98.2% / 98.3% 100.0% / 99.3% / 98.9% 100.0% / 100.0% / 99.6%

The above results support our wrapper-based approach (computational species) across the most time-consuming kernels and libraries, combined with exposed features and automatically built and continuously refined decision trees via CK, as described in this paper and its first part. They also demonstrate that it is possible to use CK to balance a decision tree's prediction rate (accuracy) against its size and speed, to ensure that the ultimate decision tree implemented in C and embedded in a kernel wrapper is fast and compact enough.

Developed by dividiti,
cTuning foundation,
and the community
          
Implemented as a CK workflow