18 with optimization results for GCC 7.1.0 also confirms that this version
was considerably improved in comparison with GCC 4.9.2
(the latest live results are available in our public optimization repository
at this link):
fewer efficient optimization solutions are found during crowd-tuning
(14 vs. 23), showing the overall improvement of the default
optimization level.
Nevertheless, GCC 7.1.0 still misses many optimization opportunities,
simply because, as our long-term experience suggests, it is infeasible
to prepare a single universal and efficient optimization heuristic
with good multi-objective trade-offs for all continuously
evolving programs, data sets, libraries, optimizations and platforms.
That is why we hope that our approach of combining a common workflow framework
adaptable to software and hardware changes, a public repository of optimization knowledge,
universal and collaborative autotuning across multiple hardware platforms
(e.g. provided by volunteers or by HPC providers), and community involvement
should help make optimization and testing of compilers
more automatic and sustainable [6, 35].
Rather than spending a considerable amount of time on writing their own autotuning and crowd-tuning
frameworks, students and researchers can quickly reuse shared workflows,
reproduce and learn from already existing optimizations, try to improve optimization heuristics,
and have their results validated by the community.
Since CK automatically adapts to the user environment, it is also possible
to reproduce the same bug using a different compiler version.
Compiling the same program with the same combination of flags on the same platform
using GCC 7.1.0 showed that this bug has been fixed in the latest compiler version.
We hope that our extensible and portable benchmarking workflow will help students and engineers
prototype and crowdsource different types of fuzzers.
It may also help existing projects [81, 82]
crowdsource fuzzing across diverse platforms and workloads.
For example, we collaborate with colleagues from Imperial College London
to develop CK-based, continuous and collaborative OpenGL and OpenCL compiler
fuzzers [83, 84, 85]
while aggregating results from users in public or private repositories
(link to public OpenCL fuzzing results across diverse desktop and mobile platforms).
To demonstrate our approach, we converted all our past research artifacts
on machine learning based optimization and SW/HW co-design
into CK modules.
We then assembled them into the universal Collective Knowledge workflow
shown in Figure 28.
If you are new to machine learning based compiler optimization,
we suggest starting with our MILEPOST GCC paper [24]
to become familiar with the terminology
and the methodology for machine learning training
and prediction used in the rest of this section.
Next, we briefly demonstrate how this customizable workflow
can continuously classify the shared workloads presented in this report
in terms of the most efficient compiler optimizations,
using MILEPOST models and features.
Note that our CK crowd-tuning workflow also continuously applies
such optimizations to all shared workloads.
This allows us to analyze the "reaction" of any given workload
to all the most efficient optimizations.
We can then group together those workloads which exhibit similar reactions.
Having such groups of labeled objects (where labels are the most efficient optimizations
and objects are workloads) allows us to use standard machine learning classification methodology.
One must find a set of object features and a model which maximize
correct labeling of previously unseen objects or, in our case, correctly predict
the most efficient software optimization and hardware design for a given workload.
As an example, we extracted the 56 so-called MILEPOST features described in [24]
(static program properties extracted from GCC's intermediate representation)
from all shared programs, stored them in program.static.features,
and applied a simple nearest neighbor classifier to the above data.
We then evaluated the quality of such a model (its ability to predict) using the prediction accuracy
obtained with the standard leave-one-out cross-validation technique: each workload is in turn removed
from the training set, a model is built on the remaining workloads and used to predict the optimization
for the held-out workload; the number of correct predictions is then divided by the total number of workloads.
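The following is a minimal sketch of this leave-one-out evaluation using scikit-learn; the feature matrix X (one row of 56 MILEPOST features per workload) and the label vector y (the index of the most efficient optimization for each workload) are assumed to have been exported from the CK repository beforehand, and the file names are hypothetical.

# Minimal sketch of leave-one-out evaluation of a 1-nearest-neighbor classifier.
# X (N x 56 MILEPOST features) and y (index of the most efficient optimization
# per workload) are assumed to be exported from the CK repository beforehand.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut

def loo_accuracy(X, y):
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = KNeighborsClassifier(n_neighbors=1)
        model.fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

if __name__ == '__main__':
    X = np.loadtxt('milepost_features.csv', delimiter=',')  # hypothetical export
    y = np.loadtxt('best_optimizations.csv', dtype=int)     # hypothetical export
    print('LOO prediction accuracy: %.2f' % loo_accuracy(X, y))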
Though relatively low, this accuracy can now serve as a
reference point to be further improved by the community.
It is similar in spirit to the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) [89],
which reduced the image classification error rate from 25%
in 2011 to just a few percent with the help of the community.
Furthermore, we can also keep just a few representative
workloads for each group, as well as the misclassified ones, in
a public repository, thus producing a minimized, realistic and
representative training set for systems researchers.
As a proof of concept of such a collaborative learning approach,
we shared a number of customizable CK modules (see ck search module:*model*)
for several popular classifiers, including nearest neighbor,
decision trees and deep learning.
These modules serve as wrappers with a common CK API for
TensorFlow, scikit-learn, R and other machine learning frameworks.
We also shared several feature extractors (see ck search module:*features*)
assembling several groups of program features
which may influence predictions.
We then attempted to autotune various parameters
of the machine learning algorithms exposed via the CK API.
Figure 31
shows an example of autotuning the depth of a decision tree
(available as a customizable CK plugin) with all shared groups of features,
and its impact on the prediction accuracy of compiler flags when using the MILEPOST
features from the previous section for GCC 4.9.2 and GCC 7.1.0
on RPi3.
Blue round dots, obtained using leave-one-out validation, suggest
that decision trees of depth 8 and 4 are enough
to achieve the maximum prediction accuracy of 0.4 for GCC 4.9.2
and GCC 7.1.0 respectively.
Model autotuning thus helped improve prediction accuracy in comparison
with the original nearest neighbor classifier from the MILEPOST project.
Turning off cross-validation can also help developers understand
how well models can perform on all available workloads
(in-sample data; red dots in Figure 31).
In our case, for GCC 7.1.0, a decision tree of depth 15 (shown in Figure 32)
is enough to capture all compiler optimizations for the 300 available workloads.
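A minimal sketch of such a depth sweep, assuming the same X and y as in the previous sketch; for each depth it reports both the leave-one-out accuracy (the blue dots in Figure 31) and the in-sample accuracy on all available workloads (the red dots).

# Sketch of autotuning the decision tree depth (cf. Figure 31).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def sweep_depths(X, y, max_depth=20):
    for depth in range(1, max_depth + 1):
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        loo = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()  # out-of-sample (blue dots)
        insample = model.fit(X, y).score(X, y)                       # in-sample (red dots)
        print('depth=%2d  loo=%.2f  in-sample=%.2f' % (depth, loo, insample))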
To complete our demonstration of CK concepts for collaborative machine learning and optimization,
we also evaluated a deep learning based classifier from TensorFlow [102]
(see ck help module:model.tf)
with 4 random configurations of hidden layers ([10,20,10], [21,13,21], [11,30,18,20,13], [17])
and of the number of training steps (300..3000).
We also evaluated the nearest neighbor classifier used in the MILEPOST project, but with different groups of features,
and aggregated all results in Table 5.
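A sketch of this evaluation is shown below; for brevity it uses scikit-learn's MLPClassifier as a stand-in for the CK TensorFlow module (model.tf), with the same four hidden-layer configurations and max_iter standing in for the number of training steps.

# Sketch of evaluating a few fully-connected network configurations (cf. Table 5).
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

CONFIGS = [(10, 20, 10), (21, 13, 21), (11, 30, 18, 20, 13), (17,)]

def evaluate_dnn_configs(X, y, max_iter=3000):
    for layers in CONFIGS:
        model = MLPClassifier(hidden_layer_sizes=layers, max_iter=max_iter,
                              random_state=0)
        acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
        print('hidden layers %-22s  loo accuracy %.2f' % (str(layers), acc))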
Finally, we automatically reduced the complexity of the nearest neighbor classifier (1) by iteratively removing, one by one,
those features whose removal does not degrade prediction accuracy, and (2) by iteratively adding features one by one to maximize prediction accuracy.
It is interesting to note that our nearest neighbor classifier achieves
a slightly better prediction accuracy with the reduced feature set than with
the full set of features, showing that the MILEPOST features are not equally informative
and that the full set leads to overfitting.
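The two greedy feature-reduction strategies can be sketched as follows (a simplified illustration, not the actual CK module): backward elimination drops features whose removal does not degrade leave-one-out accuracy, while forward selection repeatedly adds the feature that improves it the most.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_acc(X, y, features):
    # Leave-one-out accuracy of a 1-NN classifier on a subset of features.
    model = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(model, X[:, features], y, cv=LeaveOneOut()).mean()

def backward_elimination(X, y):
    kept = list(range(X.shape[1]))
    best = loo_acc(X, y, kept)
    for f in list(kept):
        trial = [i for i in kept if i != f]
        if not trial:
            continue
        acc = loo_acc(X, y, trial)
        if acc >= best:            # removing f does not degrade accuracy
            kept, best = trial, acc
    return kept, best

def forward_selection(X, y):
    kept, best = [], 0.0
    while True:
        candidates = [f for f in range(X.shape[1]) if f not in kept]
        if not candidates:
            break
        acc, f = max((loo_acc(X, y, kept + [f]), f) for f in candidates)
        if acc <= best:            # no remaining feature improves accuracy
            break
        kept, best = kept + [f], acc
    return kept, best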
As expected, deep learning classification achieves a better prediction accuracy of 0.68
and 0.45 for GCC 4.9.2 and GCC 7.1.0 respectively on RPi3, among the currently shared models,
features, workloads and optimizations.
However, since deep learning models are much more computationally intensive, resource hungry
and difficult to interpret than decision trees, one must carefully balance accuracy vs. speed vs. size.
That is why we suggest using hierarchical models, where
high-level, coarse-grain program behavior is quickly captured
using decision trees, while fine-grain behavior is captured
by deep learning and similar techniques.
Another possible use of deep learning is to automatically capture
influential features from the source code, data sets and hardware.
All these data sets will be immediately visible to all related programs
via the CK autotuning workflow.
For example, if we now run the susan corners program, CK will prompt the user
to choose one of the 20 related images from the above data sets.
Next, we can apply all the most efficient compiler optimizations
to a given program with all data sets.
Figure 33 shows such reactions
(the ratio of the execution time with a given optimization to the execution time
with the default -O3 compiler optimization) for a jpeg decoder across
20 different jpeg images from the above MiDataSet on RPi3.
One can observe that the same combination of compiler flags can either
considerably improve or degrade execution time for the same program
across different data sets.
For example, data sets 4, 5, 13, 16 and 17 benefit from the most
efficient combination of compiler flags found by the community,
with speedups ranging from 1.2 to 1.7.
On the other hand, it is better to run all other data sets with
the default -O3 optimization level.
Unfortunately, finding data set and other features which can easily differentiate
the above optimizations is often very challenging.
Even deep learning may not help if a feature is not yet exposed.
We explain this issue in [6]
when optimizing a real B&W filter kernel: we managed to improve
predictions only via human intervention, by exposing a "time of the day" feature.
However, yet again, the CK concept is to bring the interdisciplinary
community on board to share such cases in a reproducible way
and then collaboratively find features that improve predictions.
Another aspect which can influence the quality of predictive models
is that combinations of compiler flags are too coarse-grained:
the same flags can trigger different internal optimization decisions
for different programs.
Therefore, we need access to fine-grain optimizations
(inlining, tiling, unrolling, vectorization, prefetching, etc.)
and related features to continue improving our models.
This, however, follows the top-down optimization and modeling methodology
which we implemented in the Collective Knowledge framework.
We first want to analyze, optimize and model the coarse-grain behavior of shared workloads
together with the community and students while gradually adding more workloads,
data sets, models and platforms.
Only when we reach the limit of prediction accuracy
do we start gradually exposing finer-grain optimizations
and features via the extensible CK JSON interface,
while avoiding an explosion of the design and optimization spaces
(see [4] for details on Collective Mind, the previous
version of our workflow framework).
This is much in the spirit of how physicists moved from Newton's
three coarse-grain laws of motion to fine-grain quantum mechanics.
To demonstrate this approach, we shared a simple skeletonized
matrix multiply kernel from [104] in the CK format,
with the blocking (tiling) parameter and a data set feature
(the square matrix size) exposed via the CK API.
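The actual CK entry wraps the kernel and its JSON meta-description; the following is only an illustrative Python sketch of a blocked (tiled) matrix multiply in which the tile size (the exposed optimization parameter) and the square matrix size N (the exposed data set feature) are plain arguments.

import numpy as np

def matmul_blocked(A, B, tile=32):
    """Blocked (tiled) matrix multiply: tile is the exposed optimization
    parameter; the square matrix size N is the exposed data set feature."""
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    for ii in range(0, N, tile):
        for kk in range(0, N, tile):
            for jj in range(0, N, tile):
                C[ii:ii+tile, jj:jj+tile] += (
                    A[ii:ii+tile, kk:kk+tile] @ B[kk:kk+tile, jj:jj+tile])
    return C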
We can then reuse universal autotuning (exploration) strategies
available as CK modules or implement specialized ones to explore
exposed fine-grain optimizations versus different data sets.
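A specialized exploration strategy can then be sketched as a simple random search over tile sizes for several matrix sizes (again only an illustration of the idea, reusing the matmul_blocked() sketch above).

# Sketch of random exploration of the blocking parameter (cf. Figure 34):
# for each square matrix size we time randomly chosen tile sizes and record GFLOPS.
import random
import time
import numpy as np

def explore(sizes=(64, 128, 256, 512), trials=10):
    results = []
    for N in sizes:
        A, B = np.random.rand(N, N), np.random.rand(N, N)
        for _ in range(trials):
            tile = random.choice([8, 16, 32, 64, 128, 256])
            start = time.perf_counter()
            matmul_blocked(A, B, tile=min(tile, N))
            gflops = 2.0 * N ** 3 / (time.perf_counter() - start) / 1e9
            results.append({'N': N, 'tile': tile, 'gflops': gflops})
    return results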
Figure 34 shows matmul performance
in GFLOPS during random exploration of a blocking parameter for different square
matrix sizes on RPi3.
These results are in line with multiple past studies showing that
unblocked matmul is more efficient for small matrix sizes (less than 32
on RPi3), since all data fits in the cache, and for sizes between 32 and 512 (on RPi3)
which are not powers of 2.
In contrast, tiled matmul is better on RPi3 for power-of-2 matrix sizes between 32 and 512,
since tiling helps reduce cache conflict misses, and for all matrix sizes larger than 512,
where tiling helps optimize accesses to slow main memory.
Our customizable workflow can help teach students how to build efficient,
adaptive and self-optimizing libraries including BLAS, neural networks and FFT.
Such libraries are assembled from the most efficient routines
found during continuous crowd-tuning across numerous data sets and platforms,
and combined with fast and automatically generated decision trees
or other more precise classifiers [105, 106, 107, 6].
The most efficient routine is then selected at run time
depending on the data set, hardware and other features, as conceptually shown
in Figure 35.
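Such run-time selection can be sketched as follows, with a purely hypothetical training set that maps the matrix size to the fastest routine variant (matmul_blocked() is the tiled sketch shown earlier).

# Sketch of run-time adaptation (cf. Figure 35): a small decision tree is trained
# offline on crowd-tuned results mapping a data set feature (here, the matrix
# size) to the most efficient routine, and is then used to dispatch calls at run time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def matmul_unblocked(A, B):
    return A @ B

# matmul_blocked() is the tiled sketch kernel shown earlier.
ROUTINES = [matmul_unblocked, lambda A, B: matmul_blocked(A, B, tile=64)]

# Hypothetical crowd-tuned results: matrix size -> index of the fastest routine.
SIZES = [[16], [24], [48], [100], [256], [512], [1024], [2048]]
BEST  = [0,    0,    0,    0,     1,     1,     1,      1]

selector = DecisionTreeClassifier(max_depth=3).fit(SIZES, BEST)

def adaptive_matmul(A, B):
    """Select the most efficient routine for this matrix size at run time."""
    variant = ROUTINES[int(selector.predict([[A.shape[0]]])[0])]
    return variant(A, B)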
[1] | The HiPEAC vision on high-performance and embedded architecture and
compilation (2012-2020).
http://www.hipeac.net/roadmap, 2017.
|
[2] | J. Dongarra et al.
The international exascale software project roadmap.
Int. J. High Perform. Comput. Appl., 25(1):3--60, Feb. 2011.
|
[3] | PRACE: Partnership for Advanced Computing in Europe.
http://www.prace-project.eu.
|
[4] | G. Fursin, R. Miceli, A. Lokhmotov, M. Gerndt, M. Baboulin, A. D. Malony,
Z. Chamski, D. Novillo, and D. D. Vento.
Collective Mind: Towards practical and collaborative auto-tuning.
Scientific Programming, 22(4):309--329, July 2014.
|
[5] | G. Fursin.
Collective Tuning Initiative: automating and accelerating
development and optimization of computing systems.
In Proceedings of the GCC Developers' Summit, June 2009.
|
[6] | G. Fursin, A. Memon, C. Guillon, and A. Lokhmotov.
Collective Mind, Part II: Towards performance- and cost-aware
software engineering as a natural science.
In 18th International Workshop on Compilers for Parallel
Computing (CPC'15), January 2015.
|
[7] | Artifact Evaluation for Computer Systems Conferences including CGO, PPoPP, PACT
and SuperComputing: developing common experimental methodology and tools for
reproducible and sustainable research.
Link, 2014-cur.
|
[8] | B. R. Childers, G. Fursin, S. Krishnamurthi, and A. Zeller.
Artifact Evaluation for Publications (Dagstuhl Perspectives Workshop
15452).
5(11), 2016.
|
[9] | G. Fursin and C. Dubach.
Community-driven reviewing and validation of publications.
In Proceedings of the 1st Workshop on Reproducible Research
Methodologies and New Publication Models in Computer Engineering (ACM SIGPLAN
TRUST'14). ACM, 2014.
|
[10] | R. Whaley and J. Dongarra.
Automatically tuned linear algebra software.
In Proceedings of the Conference on High Performance Networking
and Computing, 1998.
|
[11] | F. Matteo and S. Johnson.
FFTW: An adaptive software architecture for the FFT.
In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, volume 3, pages 1381--1384,
Seattle, WA, May 1998.
|
[12] | K. Cooper, P. Schielke, and D. Subramanian.
Optimizing for reduced code space using genetic algorithms.
In Proceedings of the Conference on Languages, Compilers, and
Tools for Embedded Systems (LCTES), pages 1--9, 1999.
|
[13] | M. Voss and R. Eigenmann.
ADAPT: Automated de-coupled adaptive program transformation.
In Proceedings of International Conference on Parallel
Processing, 2000.
|
[14] | G. Fursin, M. O'Boyle, and P. Knijnenburg.
Evaluating iterative compilation.
In Proceedings of the Workshop on Languages and Compilers for
Parallel Computers (LCPC), pages 305--315, 2002.
|
[15] | C. Ţăpuş, I.-H. Chung, and J. K. Hollingsworth.
Active harmony: towards automated performance tuning.
In Proceedings of the 2002 ACM/IEEE conference on
Supercomputing, Supercomputing '02, pages 1--11, Los Alamitos, CA, USA,
2002. IEEE Computer Society Press.
|
[16] | P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey,
Y. Paek, and K. Gallivan.
Finding effective optimization phase sequences.
In Proceedings of the Conference on Languages, Compilers, and
Tools for Embedded Systems (LCTES), pages 12--23, 2003.
|
[17] | B. Singer and M. Veloso.
Learning to predict performance from formula modeling and training
data.
In Proceedings of the Conference on Machine Learning, 2000.
|
[18] | J. Lu, H. Chen, P.-C. Yew, and W.-C. Hsu.
Design and implementation of a lightweight dynamic optimization
system.
In Journal of Instruction-Level Parallelism, volume 6, 2004.
|
[19] | C. Lattner and V. Adve.
LLVM: A compilation framework for lifelong program analysis &
transformation.
In Proceedings of the 2004 International Symposium on Code
Generation and Optimization (CGO'04), Palo Alto, California, March 2004.
|
[20] | Z. Pan and R. Eigenmann.
Fast and effective orchestration of compiler optimizations for
automatic performance tuning.
In Proceedings of the International Symposium on Code Generation
and Optimization (CGO), pages 319--332, 2006.
|
[21] | S. S. Shende and A. D. Malony.
The tau parallel performance system.
Int. J. High Perform. Comput. Appl., 20(2):287--311, May 2006.
|
[22] | D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth,
P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, and
H. You.
PERI auto-tuning.
Journal of Physics: Conference Series, 125(1):012089, 2008.
|
[23] | A. Hartono, B. Norris, and P. Sadayappan.
Annotation-based empirical performance tuning using orio.
In 23rd IEEE International Symposium on Parallel and
Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009, pages
1--11, 2009.
|
[24] | G. Fursin, Y. Kashnikov, A. W. Memon, Z. Chamski, O. Temam, M. Namolaru,
E. Yom-Tov, B. Mendelson, A. Zaks, E. Courtois, F. Bodin, P. Barnard,
E. Ashton, E. Bonilla, J. Thomson, C. Williams, and M. F. P. O'Boyle.
MILEPOST GCC: Machine learning enabled self-tuning compiler.
International Journal of Parallel Programming (IJPP),
39:296--327, 2011.
10.1007/s10766-010-0161-2.
|
[25] | S. Tomov, R. Nath, H. Ltaief, and J. Dongarra.
Dense linear algebra solvers for multicore with GPU accelerators.
In Proc. of the IEEE IPDPS'10, pages 1--8, Atlanta, GA, April
19-23 2010. IEEE Computer Society.
DOI: 10.1109/IPDPSW.2010.5470941.
|
[26] | Open benchmarking: automated testing & benchmarking on an open platform.
Link, 2017.
|
[27] | G. Ren, E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt.
Google-wide profiling: A continuous profiling infrastructure for data
centers.
IEEE Micro, 30(4):65--79, July 2010.
|
[28] | S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos.
Auto-tuning a high-level language targeted to GPU codes.
In Innovative Parallel Computing (InPar), 2012, pages 1--10,
May 2012.
|
[29] | D. Grewe, Z. Wang, and M. F. P. O'Boyle.
Portable mapping of data parallel programs to opencl for
heterogeneous systems.
In Proceedings of the 2013 IEEE/ACM International Symposium on
Code Generation and Optimization, CGO 2013, Shenzhen, China, February
23-27, 2013, pages 22:1--22:10, 2013.
|
[30] | M. Khan, P. Basu, G. Rudy, M. Hall, C. Chen, and J. Chame.
A script-based autotuning compiler system to generate
high-performance CUDA code.
ACM Trans. Archit. Code Optim., 9(4):31:1--31:25, Jan. 2013.
|
[31] | J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M.
O'Reilly, and S. Amarasinghe.
Opentuner: An extensible framework for program autotuning.
In International Conference on Parallel Architectures and
Compilation Techniques, Edmonton, Canada, August 2014.
|
[32] | Y. M. Tsai, P. Luszczek, J. Kurzak, and J. J. Dongarra.
Performance-portable autotuning of OpenCL kernels for convolutional
layers of deep neural networks.
In 2nd Workshop on Machine Learning in HPC Environments,
MLHPC@SC, Salt Lake City, UT, USA, November 14, 2016, pages 9--18, 2016.
|
[33] | A. Abdelfattah, A. Haidar, S. Tomov, and J. J. Dongarra.
Performance, design, and autotuning of batched GEMM for GPUs.
In High Performance Computing - 31st International Conference,
ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016,
Proceedings, pages 21--38, 2016.
|
[34] | Collective Knowledge: open-source, customizable and cross-platform workflow
framework and repository for computer systems research.
Link, 2016.
|
[35] | G. Fursin, A. Lokhmotov, and E. Plowman.
Collective Knowledge: towards R&D sustainability.
In Proceedings of the Conference on Design, Automation and Test
in Europe (DATE'16), March 2016.
|
[36] | Introducing JSON.
http://www.json.org.
|
[37] | D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer.
SETI@home: An experiment in public-resource computing.
Commun. ACM, 45(11):56--61, Nov. 2002.
|
[38] | Open collective knowledge repository with shared optimization results from
crowdsourced experiments across diverse platforms and data sets.
Link.
|
[39] | Public optimization results when crowd-tuning GCC 7.1.0 across Raspberry Pi3
devices.
Link.
|
[40] | ReQuEST: open tournaments on collaborative, reproducible and pareto-efficient
software/hardware co-design of emerging workloads such as deep learning using
collective knowledge technology.
Link, 2017.
|
[41] | MILEPOST project archive (MachIne Learning for Embedded PrOgramS
opTimization).
Link.
|
[42] | cTuning.org: public portal for collaborative and reproducible computer
engineering.
Link.
|
[43] | Zenodo - research data repository.
Link, 2013.
|
[44] | Figshare - online digital repository where researchers can preserve and share
their research outputs, including figures, datasets, images, and videos.
Link, 2011.
|
[45] | Digital Object Identifier or DOI - a persistent identifier or handle used to
uniquely identify objects, standardized by the ISO.
Link, 2000.
|
[46] | Proceedings of the 1st Workshop on Reproducible Research Methodologies and New
Publication Models in Computer Engineering (ACM SIGPLAN TRUST'14).
ACM, 2014.
|
[47] | T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver,
K. Glover, M. R. Pocock, A. Wipat, and P. Li.
Taverna: a tool for the composition and enactment of bioinformatics
workflows.
Bioinformatics, 20(17):3045--3054, 2004.
|
[48] | J. E. Smith and R. Nair.
The architecture of virtual machines.
Computer, 38(5):32--38, May 2005.
|
[49] | Docker: open source lightweight container technology that can run processes
in isolation.
Link.
|
[50] | TETRACOM - eu fp7 project to support technology transfer in computing
systems.
Link, 2014.
|
[51] | Blog article: CK concepts by Michel Steuwer.
Link.
|
[52] | CK repository with multi-platform software and package manager implemented as
ck modules which detect or install various software (compilers, libraries,
tools).
Link, 2015.
|
[53] | T. Gamblin, M. LeGendre, M. R. Collette, G. L. Lee, A. Moody, B. R.
de Supinski, and S. Futral.
The Spack package manager: Bringing order to HPC software chaos.
In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, SC '15, pages
40:1--40:12, New York, NY, USA, 2015. ACM.
|
[54] | K. Hoste, J. Timmerman, A. Georges, and S. D. Weirdt.
Easybuild: Building software with ease.
In 2012 SC Companion: High Performance Computing, Networking
Storage and Analysis, Salt Lake City, UT, USA, November 10-16, 2012, pages
572--582, 2012.
|
[55] | S. Ainsworth and T. M. Jones.
Software prefetching for indirect memory accesses.
In Proceedings of the 2017 International Symposium on Code
Generation and Optimization, CGO '17, pages 305--317, Piscataway, NJ, USA,
2017. IEEE Press.
|
[56] | Artifacts and experimental workflows in the Collective Knowledge Format for
the CGO'17 paper "Software Prefetching for Indirect Memory Accesses".
Link, 2017.
|
[57] | D. Wilkinson, B. Childers, R. Bernard, W. Graves, and J. Davidson.
ACM pilot demo 1 - Collective Knowledge: Packaging and sharing.
version 3.
2017.
|
[58] | B. Aarts et al.
OCEANS: Optimizing compilers for embedded applications.
In Proc. Euro-Par 97, volume 1300 of Lecture Notes in
Computer Science, pages 1351--1356, 1997.
|
[59] | B. Calder, D. Grunwald, M. Jones, D. Lindsay, J. Martin, M. Mozer, and B. Zorn.
Evidence-based static branch prediction using machine learning.
ACM Transactions on Programming Languages and Systems (TOPLAS),
1997.
|
[60] | A. Nisbet.
Iterative feedback directed parallelisation using genetic algorithms.
In Proceedings of the Workshop on Profile and Feedback Directed
Compilation in conjunction with International Conference on Parallel
Architectures and Compilation Technique (PACT), 1998.
|
[61] | T. Kisuki, P. Knijnenburg, and M. O'Boyle.
Combined selection of tile sizes and unroll factors using iterative
compilation.
In Proceedings of the International Conference on Parallel
Architectures and Compilation Techniques (PACT), pages 237--246, 2000.
|
[62] | M. Stephenson, S. Amarasinghe, M. Martin, and U.-M. O'Reilly.
Meta optimization: Improving compiler heuristics with machine
learning.
In Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI'03), pages 77--90, June 2003.
|
[63] | B. Franke, M. O'Boyle, J. Thomson, and G. Fursin.
Probabilistic source-level optimisation of embedded programs.
In Proceedings of the Conference on Languages, Compilers, and
Tools for Embedded Systems (LCTES), 2005.
|
[64] | K. Hoste and L. Eeckhout.
Cole: Compiler optimization level exploration.
In Proceedings of the International Symposium on Code Generation
and Optimization (CGO), 2008.
|
[65] | D. Bailey et al.
PERI auto-tuning.
Journal of Physics: Conference Series (SciDAC 2008), 125:1--6,
2008.
|
[66] | V. Jimenez, I. Gelado, L. Vilanova, M. Gil, G. Fursin, and N. Navarro.
Predictive runtime code scheduling for heterogeneous architectures.
In Proceedings of the International Conference on High
Performance Embedded Architectures & Compilers (HiPEAC 2009), January 2009.
|
[67] | J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and
S. Amarasinghe.
Petabricks: a language and compiler for algorithmic choice.
In Proceedings of the 2009 ACM SIGPLAN conference on Programming
language design and implementation, PLDI '09, pages 38--49, New York, NY,
USA, 2009. ACM.
|
[68] | J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa.
Contention aware execution: Online contention detection and response.
In Proceedings of the 8th Annual IEEE/ACM International
Symposium on Code Generation and Optimization, CGO '10, pages 257--265, New
York, NY, USA, 2010. ACM.
|
[69] | R. W. Moore and B. R. Childers.
Automatic generation of program affinity policies using machine
learning.
In CC, pages 184--203, 2013.
|
[70] | J. Shen, A. L. Varbanescu, H. J. Sips, M. Arntzen, and D. G. Simons.
Glinda: a framework for accelerating imbalanced applications on
heterogeneous platforms.
In Conf. Computing Frontiers, page 14, 2013.
|
[71] | R. Miceli et al.
Autotune: A plugin-driven approach to the automatic tuning of
parallel applications.
In Proceedings of the 11th International Conference on Applied
Parallel and Scientific Computing, PARA'12, pages 328--342, Berlin,
Heidelberg, 2013. Springer-Verlag.
|
[72] | I. Manotas, L. Pollock, and J. Clause.
Seeds: A software engineer's energy-optimization decision support
framework.
In Proceedings of the 36th International Conference on Software
Engineering, ICSE 2014, pages 503--514, New York, NY, USA, 2014. ACM.
|
[73] | A. H. Ashouri, G. Mariani, G. Palermo, E. Park, J. Cavazos, and C. Silvano.
Cobayn: Compiler autotuning framework using bayesian networks.
ACM Transactions on Architecture and Code Optimization (TACO),
13(2):21, 2016.
|
[74] | F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.
Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825--2830, 2011.
|
[75] | G. Fursin, M. O'Boyle, O. Temam, and G. Watts.
Fast and accurate method for determining a lower bound on execution
time.
Concurrency: Practice and Experience, 16(2-3):271--292, 2004.
|
[76] | K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer,
D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick.
The landscape of parallel computing research: a view from Berkeley.
Technical Report UCB/EECS-2006-183, Electrical Engineering and
Computer Sciences, University of California at Berkeley, Dec. 2006.
|
[77] | M. Hall, D. Padua, and K. Pingali.
Compiler research: The next 50 years.
Commun. ACM, 52(2):60--67, Feb. 2009.
|
[78] | J. W. Duran and S. Ntafos.
A report on random testing.
In Proceedings of the 5th International Conference on Software
Engineering, ICSE '81, pages 179--183, Piscataway, NJ, USA, 1981. IEEE
Press.
|
[79] | A. Takanen, J. DeMott, and C. Miller.
Fuzzing for Software Security Testing and Quality Assurance.
Artech House, Inc., Norwood, MA, USA, 1 edition, 2008.
|
[80] | X. Yang, Y. Chen, E. Eide, and J. Regehr.
Finding and understanding bugs in C compilers.
In Proceedings of the 32Nd ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI '11, pages 283--294, New York, NY,
USA, 2011. ACM.
|
[81] | Microsoft security risk detection.
Link, 2016.
|
[82] | Continuous fuzzing of open source software.
Link, 2016.
|
[83] | C. Lidbury, A. Lascu, N. Chong, and A. F. Donaldson.
Many-core compiler fuzzing.
In Proceedings of the 36th ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI '15, pages 65--76, New York, NY,
USA, 2015. ACM.
|
[84] | A. Lascu and A. F. Donaldson.
Integrating a large-scale testing campaign in the CK framework.
CoRR, abs/1511.02725, 2015.
|
[85] | CK workflow for OpenCL crowd-fuzzing.
Link, 2015.
|
[86] | C. M. Bishop.
Pattern Recognition and Machine Learning (Information Science
and Statistics).
Springer, 1st ed. 2006. corr. 2nd printing 2011 edition, Oct. 2007.
|
[87] | C. Sammut and G. Webb.
Encyclopedia of Machine Learning and Data Mining.
Springer reference. Springer Science + Business Media.
|
[88] | J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. O'Boyle, and O. Temam.
Rapidly selecting good compiler optimizations using performance
counters.
In Proceedings of the International Symposium on Code Generation
and Optimization (CGO), March 2007.
|
[89] | O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li.
ImageNet Large Scale Visual Recognition Challenge.
CoRR, abs/1409.0575, 2014.
|
[90] | A. Monsifrot, F. Bodin, and R. Quiniou.
A machine learning approach to automatic production of compiler
heuristics.
In Proceedings of the International Conference on Artificial
Intelligence: Methodology, Systems, Applications, LNCS 2443, pages 41--50,
2002.
|
[91] | G. Marin and J. Mellor-Crummey.
Cross-architecture performance predictions for scientific
applications using parameterized models.
SIGMETRICS Perform. Eval. Rev., 32(1):2--13, June 2004.
|
[92] | M. Stephenson and S. Amarasinghe.
Predicting unroll factors using supervised classification.
In Proceedings of the International Symposium on Code Generation
and Optimization (CGO). IEEE Computer Society, 2005.
|
[93] | M. Zhao, B. R. Childers, and M. L. Soffa.
A model-based framework: an approach for profit-driven optimization.
In Third Annual IEEE/ACM International Conference on Code
Generation and Optimization, pages 317--327, 2005.
|
[94] | F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson,
M. Toussaint, and C. Williams.
Using machine learning to focus iterative optimization.
In Proceedings of the International Symposium on Code Generation
and Optimization (CGO), 2006.
|
[95] | C. Dubach, T. M. Jones, E. V. Bonilla, G. Fursin, and M. F. O'Boyle.
Portable compiler optimization across embedded programs and
microarchitectures using machine learning.
In Proceedings of the IEEE/ACM International Symposium on
Microarchitecture (MICRO), December 2009.
|
[96] | E. Park, J. Cavazos, L.-N. Pouchet, C. Bastoul, A. Cohen, and P. Sadayappan.
Predictive modeling in a polyhedral optimization space.
International Journal of Parallel Programming, 41(5):704--750,
2013.
|
[97] | H. Leather, E. V. Bonilla, and M. F. P. O'Boyle.
Automatic feature generation for machine learning-based optimising
compilation.
TACO, 11(1):14:1--14:32, 2014.
|
[98] | C. Cummins, P. Petoumenos, Z. Wang, and H. Leather.
End-to-end deep learning of optimization heuristics.
In 26th International Conference on Parallel Architectures and
Compilation Techniques, PACT 2017, Portland, OR, USA, September 9-13,
2017, pages 219--232, 2017.
|
[99] | A. H. Ashouri, A. Bignoli, G. Palermo, C. Silvano, S. Kulkarni, and J. Cavazos.
MiCOMP: Mitigating the compiler phase-ordering problem using
optimization sub-sequences and machine learning.
ACM Trans. Archit. Code Optim., 14(3):29:1--29:28, Sept. 2017.
|
[100] | Artifact evaluation for computer systems research.
Link, 2014-cur.
|
[101] | Static Features available in MILEPOST GCC V2.1.
Link.
|
[102] | M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp,
G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur,
J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah,
M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker,
V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden,
M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng.
Tensorflow: Large-scale machine learning on heterogeneous distributed
systems.
CoRR, abs/1603.04467, 2016.
|
[103] | G. Fursin, J. Cavazos, M. O'Boyle, and O. Temam.
MiDataSets: Creating the conditions for a more realistic evaluation
of iterative optimization.
In Proceedings of the International Conference on High
Performance Embedded Architectures & Compilers (HiPEAC 2007), January 2007.
|
[104] | G. Fursin.
Iterative Compilation and Performance Prediction for Numerical
Applications.
PhD thesis, University of Edinburgh, United Kingdom, 2004.
|
[105] | M. Püschel, J. M. F. Moura, B. Singer, J. Xiong, J. R. Johnson, D. A.
Padua, M. M. Veloso, and R. W. Johnson.
SPIRAL: A generator for platform-adapted libraries of signal
processing algorithms.
IJHPCA, 18(1):21--45, 2004.
|
[106] | Y. Liu, E. Z. Zhang, and X. Shen.
A cross-input adaptive framework for GPU program optimizations.
In 2009 IEEE International Symposium on Parallel Distributed
Processing, pages 1--10, May 2009.
|
[107] | L. Luo, Y. Chen, C. Wu, S. Long, and G. Fursin.
Finding representative sets of optimizations for adaptive
multiversioning applications.
In 3rd Workshop on Statistical and Machine Learning Approaches
Applied to Architectures and Compilation (SMART'09), colocated with HiPEAC'09
conference, January 2009.
|
[108] | Kaggle: platform for predictive modelling and analytics competitions.
Link, 2010.
|
[109] | ImageNet challenge (ILSVRC): ImageNet Large Scale Visual Recognition
Challenge, where software programs compete to correctly classify and detect
objects and scenes.
Link, 2010.
|
[110] | LPIRC: low-power image recognition challenge.
Link, 2015.
|
[111] | Public repositories with artifact and workflows in the Collective Knowledge
format.
Link.
|
[112] | Association for Computing Machinery (ACM).
Link.
|
[113] | P. Flick, C. Jain, T. Pan, and S. Aluru.
A parallel connectivity algorithm for de Bruijn graphs in metagenomic
applications.
In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, SC '15, pages
15:1--15:11, New York, NY, USA, 2015. ACM.
|
[114] | L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J.
Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber.
Introducing SLAMBench, a performance and accuracy benchmarking
methodology for SLAM.
In IEEE Intl. Conf. on Robotics and Automation (ICRA), May
2015.
arXiv:1410.2167.
|
[115] | Collective Knowledge workflows for SLAMBench.
Link.
|
[116] | G. Fursin, A. Lokhmotov, and E. Upton.
Collective knowledge repository with reproducible experimental
results from collaborative program autotuning on Raspberry Pi (program
reactions to most efficient compiler optimizations).
https://doi.org/10.6084/m9.figshare.5789007.v2, Jan 2018.
|