Collective Knowledge Aggregator proof-of-concept
Live CK report with shared artifacts and interactive graphs: analyzing, autotuning and adapting OpenCL applications (HOG and SLAMBench) across heterogeneous platforms
Anton Lokhmotov 1, Grigori Fursin 1,2
1 dividiti, UK,   2 cTuning Foundation, France

This live report demonstrates how CK can help researchers create reproducible and interactive articles from reusable components.


Here we provide details on how to reproduce the experiments via the Collective Knowledge Framework and Repository (CK). You can download and install all related CK repositories on your local machine from the command line simply as
    ck pull repo:reproduce-carp-project
    ck pull repo:reproduce-ck-paper-large-experiments
    ck pull repo:reproduce-ck-paper
    ck pull repo:reproduce-pamela-project
    ck pull repo:reproduce-pamela-project-small-dataset
   
We would like to thank the community for validating various results and sharing unexpected behavior with us via our public CK repository!

Artifact and experiment workflow sharing via open-source Collective Knowledge Infrastructure

Please check out our motivation for developing the free, open-source and customizable Collective Knowledge Infrastructure (CK) for collaborative artifact and workflow sharing and reuse.

Briefly, CK allows researchers to organize and share artifacts (code and data) as reusable Python components with a simple JSON API. Unlike other tools, CK uses an agile, schema-free and specification-free approach. This is particularly important in computer engineering, where software changes practically every day: we need to avoid wasting weeks or months on experiment specification when experimental results may become outdated within days. Instead, only once a research idea has been quickly prototyped and validated does CK allow the community to collaboratively add and improve meta-descriptions of interfaces and artifacts via a simple and human-readable JSON format (see the following papers for more details [1, 2]).
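To illustrate this JSON-in/JSON-out convention, here is a minimal sketch of a CK-style component. The function name and dictionary keys below are illustrative (they are not part of the actual CK codebase), but the pattern — a Python dictionary in, a Python dictionary out, with errors signalled via a 'return' code — is the one CK components follow:

```python
# Minimal sketch of the CK JSON-in/JSON-out component convention.
# All names below are illustrative, not actual CK modules or actions.

def compile_program(i):
    """Hypothetical CK-style action: takes a dict, returns a dict.

    A non-zero 'return' code plus an 'error' string mimics how CK
    actions report failures across components instead of raising
    exceptions that callers in other components cannot interpret.
    """
    if 'program' not in i:
        return {'return': 1, 'error': 'program name is not specified'}
    flags = i.get('flags', '-O3')
    # A real action would invoke the compiler here; we just echo the choices.
    return {'return': 0, 'program': i['program'], 'flags': flags}

r = compile_program({'program': 'hog', 'flags': '-O3 -fno-if-conversion'})
if r['return'] > 0:
    raise RuntimeError(r['error'])
print(r['flags'])
```

Because every component speaks the same dictionary protocol, components can be chained into pipelines and their inputs/outputs serialized to JSON for sharing and replay.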




Such components can be easily connected into experimental workflows (pipelines), like LEGO (TM) bricks, to quickly prototype research ideas, crowdsource experimentation, preserve and query results in a third-party Hadoop-based Elasticsearch repository, perform universal and multi-objective autotuning, apply predictive analytics, share artifacts as standard zip archives or via GitHub/Bitbucket for artifact evaluation at conferences, workshops and journals, enable live and interactive articles, etc.:



The CK open format can be used as a template for artifact evaluation:





P1

Reproducing adaptive filter experiments (collaboratively finding features which are not in the system to enable run-time adaptation) via CK:

CK scripts: reproduce-filter-speedup

Data set 1: image-raw-bin-fgg-office-day-gray (features)
Data set 2: image-raw-bin-fgg-office-night-gray (features) (article explaining this unexpected behavior)

Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 4.4.4
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                     10776         4.622 ; 4.634 ; 0.7%                      4.630 ; 4.653 ; 1.0%
-O3 -fno-if-conversion  10784         5.169 ; 5.193 ; 1.0%                      4.091 ; 4.094 ; 0.2%
                                      (slowdown over -O3: 1.12)                 (speedup over -O3: 1.14)
-O2                     10168         4.631 ; 4.754 ; 10.2%                     4.623 ; 4.639 ; 0.7%
-O1                     10152         4.621 ; 4.633 ; 0.8%                      4.623 ; 4.685 ; 3.6%
-Os                      9744         4.668 ; 4.678 ; 0.6%                      4.666 ; 4.685 ; 0.9%

Note: CK allows the community to continue validating the above results and to share unexpected behavior via the public repository at cknowledge.org/repo. Some of these shared results follow:

Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 4.9.1
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                     11008         4.619 ; 4.630 ; 0.6%                      4.603 ; 4.628 ; 1.0% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O3 -fno-if-conversion  11008         4.615 ; 4.625 ; 0.6%                      4.624 ; 4.628 ; 0.3% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O2                     10880         4.632 ; 4.635 ; 0.1%                      4.602 ; 4.647 ; 2.5%
-O1                     10360         4.625 ; 4.637 ; 0.7%                      4.630 ; 4.654 ; 1.8%
-Os                     10376         4.635 ; 4.653 ; 0.8%                      4.630 ; 4.652 ; 0.8%

Lenovo X240; Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz; Ubuntu 14.04 64bit; GCC 5.2.0
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                     10776         4.622 ; 4.632 ; 0.8%                      4.630 ; 4.631 ; 0.2% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O3 -fno-if-conversion  10776         4.629 ; 4.649 ; 0.8%                      4.610 ; 4.626 ; 1.1% (slower than GCC 4.4.4 -O3 -fno-if-conversion)
-O2                     10568         4.599 ; 4.610 ; 0.6%                      4.597 ; 4.603 ; 0.4%
-O1                     10032         4.613 ; 4.616 ; 0.2%                      4.605 ; 4.616 ; 0.8%
-Os                     10088         4.609 ; 4.630 ; 1.4%                      4.608 ; 4.615 ; 0.4%

Just for comparison during crowd-benchmarking: Samsung Chromebook 2; Samsung EXYNOS5; ARM Cortex A15/A7; ARM Mali-T628; Ubuntu 12.04 32bit; GCC 4.9.2
(CK public repo: all experiments, compiler description, all compilers)

Optimization            Binary size   image-raw-bin-fgg-office-day-gray         image-raw-bin-fgg-office-night-gray
                                      min (s) ; exp (s) ; var (%)               min (s) ; exp (s) ; var (%)
-O3                      7416         7.396 ; 7.513 ; 2.9%                      7.390 ; 7.464 ; 2.6%
-O3 -fno-if-conversion   7424         7.345 ; 7.455 ; 3.8%                      7.384 ; 7.490 ; 2.6%
-O2                      7100         7.398 ; 7.926 ; 39.2%                     7.450 ; 7.514 ; 2.3%
-O1                      7072         7.404 ; 7.444 ; 1.4%                      7.389 ; 7.443 ; 2.9%
-Os                      6292         7.367 ; 7.409 ; 1.5%                      7.375 ; 7.479 ; 2.7%

P2

Interactive graph of SLAMBench (OpenCL) multi-objective algorithm autotuning (FPS vs accuracy or any other characteristic, with Pareto frontier)

  • Samsung ChromeBook 2; Samsung EXYNOS5; ARM Cortex A15/A7; ARM Mali-T628; Ubuntu 12.04; video 640x480
  • Clang 3.6.0 -O3
  • Big green dot: default algorithm configuration
  • Small blue dots: random algorithm configurations (see CK JSON autotuning description with ranges or all scripts):
    • Param c (compute-size-ratio, default 1)
    • Param r (integration-rate, default 1)
    • Param l (icp threshold)
    • Param m (default is 0.1)
    • Param y1 (pyramid-levels X, default=10)
    • Param y2 (pyramid-levels Y, default=5)
    • Param y3 (pyramid-levels Z, default=4)
    • Param v (volume-resolution, default=256)
  • Big red dots: Pareto frontier
  • Requires 2 more CK repositories for reproducibility:
  • Note: we have not yet implemented visualization of variation in D3 graphs - help is appreciated!
  • Table with Pareto results is available here
  • Table with crowdbenchmarking results across multiple shared platforms vs FPS is available here
Note that with minimal changes we can monitor and balance any characteristic exposed via CK, such as energy (currently on Odroid platforms), code size, compilation time, and anything else, across multiple heterogeneous platforms running Android, Linux, Windows, MacOS, etc. We can also tune any parameter exposed via CK, such as compiler flags/passes, hardware configuration (e.g. frequency or number of cores used), OpenCL parameters such as worksize, CUDA/OpenMP/MPI parameters, and algorithm parameters. The community can gradually add and share more parameters and characteristics for tuning.
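The Pareto frontier extraction behind the graph above can be sketched as follows. This is a minimal illustration only: we assume two characteristics (FPS and accuracy), both to be maximized, and the sample points are invented, not taken from the actual SLAMBench runs:

```python
# Minimal sketch of extracting a Pareto frontier from autotuning results.
# Each point is (fps, accuracy); both are maximized. Sample data is invented.

def pareto_frontier(points):
    """Return the points not dominated by any other point
    (q dominates p if q is at least as good on every objective)."""
    frontier = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# Invented (fps, accuracy) pairs for random algorithm configurations:
configs = [(10.0, 0.80), (12.0, 0.75), (9.0, 0.90), (11.0, 0.85), (8.0, 0.70)]
print(sorted(pareto_frontier(configs)))
# → [(9.0, 0.9), (11.0, 0.85), (12.0, 0.75)]
```

The frontier points correspond to the "big red dots": no other configuration improves one objective without losing on the other.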


P3

Non-interactive graph analyzing CPU/GPU (with memory transfer) and GPU-only time of the HOG image processing application (machine learning algorithms) versus different OpenCL optimizations and hardware configurations

  • Samsung ChromeBook 1/2; ARM Cortex A15/A7; ARM Mali-T604/T628; Debian; random images; calc_histogram CPU/OpenCL kernel
  • Example of experiments not executed via CK but exported to CK by Anton Lokhmotov from his scripts in January 2015 (before CK was mature enough) to take advantage of predictive analytics (active learning, decision trees, predictive scheduling). Since June 2015, however, it has been possible to perform the same experimentation fully in CK (see the HOG examples in the CK Getting Started Guide).
  • ARM Mali OpenCL 5.0 (default optimization)
  • Blue line: calc_histogram kernel CPU time (s.)
  • Red line: calc_histogram kernel GPU with memory transfer time (s.)
  • Yellow line: calc_histogram kernel GPU time (s.)
  • Green dashed line: adaptation threshold (left - better to run on CPU; right - better to run on GPU)
  • Tuned/varied parameters
    • GWS_X
    • GWS_Y
    • GWS_Z
    • LWS_X
    • LWS_Y
    • LWS_Z
    • OFFSET_X
    • OFFSET_Y
    • OFFSET_Z
    • CPU frequency
    • GPU frequency
    • image cols
    • image rows
    • worksize
  • Requires 2 more CK repositories for reproducibility:
  • Raw experiments

Samsung ChromeBook 1; ARM A7; ARM Mali T604

Samsung ChromeBook 2; ARM A15/A7; ARM Mali T628


Note: CK allows the community to validate the results and share unexpected behavior via the public repository at cknowledge.org/repo.


P4

3 interactive graphs of (CPU time) / (GPU with memory transfer time) for the above HOG application versus experiment number, used for further predictive analytics to enable predictive scheduling at run-time (see our paper for motivation).
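The run-time decision this ratio enables can be sketched as follows. This is a minimal illustration: the function name and sample timings are hypothetical, while the 1.07 threshold is the one used for the objective function in the modeling section below:

```python
# Sketch of adaptive CPU/GPU scheduling based on a measured time ratio.
# Timings below are invented for illustration.

THRESHOLD = 1.07  # (CPU time) / (GPU with memory transfer time)

def choose_device(cpu_time, gpu_with_copy_time):
    """Pick where to run the kernel: the GPU pays off only when CPU time
    exceeds GPU-plus-transfer time by more than the threshold."""
    ratio = cpu_time / gpu_with_copy_time
    return 'GPU' if ratio > THRESHOLD else 'CPU'

print(choose_device(7.4, 5.1))  # large image: GPU wins despite transfer cost
print(choose_device(0.9, 1.2))  # small image: transfer cost dominates, stay on CPU
```

In practice the ratio is not measured at run-time but predicted from features (image size, frequencies, worksize), which is exactly what the decision trees below are trained to do.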



P5

Building predictive models via CK (scikit-learn and decision trees with different features and depth)



The unification of experiments in CK makes it easy to connect experimental results with various powerful machine learning techniques to help understand correlations between 'characteristics', 'features', 'choices' and the 'state'. Furthermore, researchers and the community can focus their effort on explaining unexpected behavior (which can be replayed in CK), finding missing features (where the model fails to explain the results) and improving models to make better predictions.

Next, we demonstrate such an example: first building a predictive model for adaptive CPU/GPU scheduling for the above HOG example for Samsung Chromebook 1, then trying to reuse it for Samsung Chromebook 2, then building a new model for Samsung Chromebook 2, finding better features (combinations of existing ones, or new ones that were not even in the system) and reaching a nearly 99% prediction rate, at least on the current setup (it can be continuously improved by the community or via active learning - see our previous papers for more details [1], [2]).
Experimental setup (see above):

Objective function (decision):
  • CPU/GPU COPY > 1.07 (true/false)? Needed for predictive run-time scheduling
SF1: Features (set 1):
  • worksize
SF2: Features (set 2, includes set 1):
  • GWS0
  • GWS1
  • GWS2
  • cpu_freq
  • gpu_freq
  • image cols
  • image rows
  • image size
SF3: Designed features (set 3, includes set 2):
  • image_size_div_by_cpu_freq
  • image_size_div_by_gpu_freq
  • cpu_freq_div_by_gpu
  • image_size_div_by_cpu_div_by_gpu_freq
  • image_size_div_by_cpu_freq

Example of a decision tree built for Chromebook 1 to schedule the HOG kernel at run-time either on the CPU or the GPU, with feature set 2 and depth 3 (achieving 98% correct predictions on 296 points and 100% on 20 points):
Scripts to build/validate above model in CK (t604-second-set-of-features-depth3)

Note that a few features were enough to achieve nearly perfect predictions (to some extent, a decision tree can be used like PCA to quickly and continuously explain correlations while exposing unexpected behavior to researchers in the workgroup or to the community, in order to find more features and feature combinations or to improve models).
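This modeling step can be sketched with scikit-learn. The sketch below is a minimal illustration, not the actual CK scripts: the training data is synthetic (we simply assume larger images favor the GPU, mimicking the adaptation threshold), and the feature subset is in the spirit of feature set 2:

```python
# Sketch of building a depth-limited decision tree for CPU/GPU scheduling.
# Synthetic data, not the actual HOG measurements: we assume larger images
# favor the GPU. Features: [image_cols, image_rows, worksize]; label 1 = GPU.

from sklearn.tree import DecisionTreeClassifier

X = [[64, 64, 256], [128, 96, 256], [640, 480, 1024],
     [800, 600, 1024], [96, 64, 512], [1024, 768, 2048]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

print(model.score(X, y))                      # prediction rate on training points
print(model.predict([[720, 576, 1024]])[0])   # hypothetical new image -> device
```

Limiting `max_depth` is the knob varied in the table below: deeper trees fit more points but grow larger and slower, which matters once the tree is embedded in a run-time wrapper.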

P6

Table with correct prediction rate for adaptive CPU/GPU scheduling of HOG kernel via CK on two platforms

Results have either 2 or 3 numbers:
Prediction rate with training on 20 points / prediction rate with training on 276 points (/ prediction rate with training on 460 points)

It is useful to see the decision tree refinement during active learning (improving the model on the fly as more points become available). Click on a prediction rate number to see the decision tree as a PNG (or change png to pdf in the URL to see it as a scalable PDF). Note that such a CK-generated decision tree can also be converted to C to be integrated inside adaptive wrappers for kernels.
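The tree-to-C conversion can be sketched as follows. This is a minimal illustration using scikit-learn's fitted-tree arrays, not the actual CK converter; the emitted function name and feature names are hypothetical:

```python
# Sketch: emit a C decision function from a fitted scikit-learn tree.
# Not the actual CK converter; names below are illustrative.

from sklearn.tree import DecisionTreeClassifier

def tree_to_c(model, feature_names, fn_name='predict_device'):
    """Walk the fitted tree's arrays and emit nested C if/else blocks."""
    t = model.tree_
    lines = ['int %s(const double *f) {' % fn_name]

    def walk(node, indent):
        pad = '  ' * indent
        if t.children_left[node] == -1:  # leaf: return the majority class
            lines.append(pad + 'return %d;' % int(t.value[node][0].argmax()))
        else:
            lines.append(pad + 'if (f[%d] <= %.6f) {  /* %s */'
                         % (t.feature[node], t.threshold[node],
                            feature_names[t.feature[node]]))
            walk(t.children_left[node], indent + 1)
            lines.append(pad + '} else {')
            walk(t.children_right[node], indent + 1)
            lines.append(pad + '}')

    walk(0, 1)
    lines.append('}')
    return '\n'.join(lines)

# Tiny synthetic model: features [image_cols, worksize], label 1 = GPU.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit([[64, 256], [640, 1024], [96, 512], [800, 2048]], [0, 1, 0, 1])
print(tree_to_c(model, ['image_cols', 'worksize']))
```

The generated function takes the run-time feature vector and returns the device choice, so the compiled wrapper pays only a few comparisons per call.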

Columns per platform row: Feature set 1 (Depth 2); Feature set 2 (Depths 2, 3, 4, 5); Feature set 3 (Depths 2, 3, 4, 5, 6)
Samsung Chromebook 1 (Mali T604); own models 95.0% / 97.8% 97.8% / 97.8% 100.0% / 97.8% 100.0% / 99.3% 100.0% / 99.6% 100.0% / 100.0%
Samsung Chromebook 2 (Mali T628); models from Chromebook 1 51.1% / 51.7% 30.0% / 51.5% 30.0% / 51.8% / 52.6%
Samsung Chromebook 2 (Mali T628); own models 85.0% / 82.3% / 88.3% 95.0% / 90.2% / 93.0% 95.0% / 91.0% / 93.5% 95.0% / 90.2% / 93.0% 95.0% / 90.2% / 93.0% 95.0% / 90.2% / 93.0% 100.0% / 95.3% / 96.1% 100.0% / 98.2% / 98.3% 100.0% / 99.3% / 98.9% 100.0% / 100.0% / 99.6%

The above results support our wrapper-based approach (computational species) across the most time-consuming kernels and libraries, combined with exposed features and automatically built and continuously refined decision trees via CK, as described in this paper and its first part. They also demonstrate that it is possible to use CK to balance a decision tree's prediction rate (accuracy) against its size and speed, to ensure that the ultimate decision tree implemented in C and embedded in a kernel wrapper is fast and compact enough.

Developed by dividiti,
cTuning foundation,
and the community
          
Implemented as a CK workflow