Software Performance Benchmarking with JMH

Measuring, predicting, and optimizing software performance in Java using JMH

Performance is one of the central non-functional properties of modern software. And yet we all experience the applications we use daily becoming slower, less reliable, and more bloated.

One reason for this is that testing performance is much harder than testing functional correctness, and is hence done much more rarely.

For the last 10 years, ICET-lab has studied how Java developers can use the Java Microbenchmark Harness (JMH) to continuously benchmark their systems, for example as part of their CI pipelines.

A JMH benchmark example from the Protobuf project
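For readers unfamiliar with the harness, a minimal JMH benchmark has roughly the following shape. This is an illustrative sketch, not the actual Protobuf benchmark: the class, fields, and method are made up, and the `org.openjdk.jmh` dependency must be on the classpath.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Illustrative JMH benchmark: measures the average time per invocation.
// JMH itself handles forking, warmup, and measurement iterations.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringConcatBenchmark {

    String a = "hello";
    String b = "world";

    @Benchmark
    public String concat() {
        // Returning the result hands it to JMH's Blackhole,
        // which prevents the JIT from eliminating the work as dead code.
        return a + b;
    }
}
```

JMH runs each `@Benchmark` method in a loop across several warmup and measurement iterations, typically in a forked JVM, and reports aggregate statistics rather than a single timing.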

Concrete research results include detecting anti-patterns in JMH benchmarks that can lead to misleading measurement results (Costa et al., 2021), demonstrating that statistical methods can significantly reduce the number of required benchmark repetitions (Laaber et al., 2020), and experimenting with coverage-based benchmark selection (Laaber et al., 2021).
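One classic anti-pattern of this kind is dead-code elimination, adapted here from the official JMH samples (class and method names are illustrative): if a benchmark never consumes its result, the JIT may optimize the computation away entirely, so the benchmark ends up measuring an empty method.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class DeadCodeExample {

    double x = Math.PI;

    // ANTI-PATTERN: the computed value is never used, so the JIT is free
    // to eliminate Math.log entirely -- the reported time is misleading.
    @Benchmark
    public void measureWrong() {
        Math.log(x);
    }

    // FIX: sink the result into a Blackhole so the computation survives.
    @Benchmark
    public void measureRight(Blackhole bh) {
        bh.consume(Math.log(x));
    }
}
```

Returning the value from the benchmark method is an equivalent fix, since JMH implicitly feeds return values to a Blackhole.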

In this line of research, we have also developed multiple open source tools that can support benchmarking research and practice, including Junit-to-JMH, a tool to generate performance benchmark suites from unit tests (Jangali et al., 2022), and Bencher, a tool to analyse static and dynamic coverage of JMH benchmarks.

The impact of bad JMH practices on benchmark results
Dynamically reconfiguring JMH to reduce benchmark execution time
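The core idea behind dynamic reconfiguration can be sketched as a stopping criterion in plain Java: instead of running a fixed number of measurement iterations, keep collecting samples only until the variability of the results stabilizes. The coefficient-of-variation threshold and iteration bounds below are illustrative assumptions, not the exact criteria from the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a variability-based stopping criterion, in the spirit of
// dynamically reconfiguring benchmark iterations (Laaber et al., 2020).
public class StoppingCriterion {

    // Coefficient of variation: standard deviation relative to the mean.
    static double cov(List<Double> samples) {
        double mean = samples.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = samples.stream()
                .mapToDouble(s -> (s - mean) * (s - mean))
                .sum() / samples.size();
        return Math.sqrt(var) / mean;
    }

    // Consume measurements one by one; stop as soon as the CoV of all
    // samples so far drops below `threshold` (after at least `minIters`
    // iterations), or when the iteration budget `maxIters` is exhausted.
    static int iterationsNeeded(double[] measurements, double threshold,
                                int minIters, int maxIters) {
        List<Double> samples = new ArrayList<>();
        for (int i = 0; i < Math.min(measurements.length, maxIters); i++) {
            samples.add(measurements[i]);
            if (samples.size() >= minIters && cov(samples) < threshold) {
                return samples.size();
            }
        }
        return samples.size();
    }

    public static void main(String[] args) {
        // A stable benchmark: iteration times cluster tightly around 100 ns,
        // so the criterion stops early and saves execution time.
        double[] stable = {101, 99, 100, 100, 101, 99, 100, 100, 100, 100};
        System.out.println("stable, iterations used: "
                + iterationsNeeded(stable, 0.02, 5, 20));

        // A noisy benchmark never stabilizes and uses its full budget.
        double[] noisy = {100, 300, 50, 400, 120, 80, 500, 60, 350, 90};
        System.out.println("noisy, iterations used: "
                + iterationsNeeded(noisy, 0.02, 5, 20));
    }
}
```

Stable benchmarks thus finish after only a few iterations, while noisy ones keep their full measurement budget, which is how execution time can be reduced without sacrificing result quality.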

In our ongoing work in this research theme, we are particularly interested in:

  • How to bootstrap performance testing in a project by generating (initial) performance test suites. Junit-to-JMH (Jangali et al., 2022) is a first step in this direction.

  • How to predict the execution time of benchmarks (and, hence, performance) prior to execution. We have already achieved initial success in predicting the execution time of small pieces of code using graph neural networks (Samoaa et al., 2022). The ultimate vision, of course, is to warn developers about slow code before they commit it, without the need for expensive performance testing.

  • How to make performance testing easier, through performance assessment bots (Markusse et al., 2022) or good visualizations (Cito et al., 2019).

Contacts:

Dr. Christoph Laaber (probably the world’s foremost expert on academic research about JMH benchmarking)

Dr. Philipp Leitner


  1. What’s Wrong with My Benchmark Results? Studying Bad Practices in JMH Benchmarks
    Diego Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak
    IEEE Transactions on Software Engineering, 2021
  2. Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality
    Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner
    In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Virtual Event, USA, 2020
  3. Applying Test Case Prioritization to Software Microbenchmarks
    Christoph Laaber, Harald C. Gall, and Philipp Leitner
    Empirical Software Engineering, 2021
  4. Automated Generation and Evaluation of JMH Microbenchmark Suites from Unit Tests
    Mostafa Jangali, Yiming Tang, Niclas Alexandersson, Philipp Leitner, Jinqiu Yang, and Weiyi Shang
    IEEE Transactions on Software Engineering, 2022
  5. TEP-GNN: Accurate Execution Time Prediction of Functional Tests using Graph Neural Networks
    Hazem Peter Samoaa, Antonio Longa, Mazen Mohamad, Morteza Haghir Chehreghani, and Philipp Leitner
    In Proceedings of the 23rd International Conference on Product-Focused Software Process Improvement (PROFES), 2022
  6. Using Benchmarking Bots for Continuous Performance Assessment
    Florian Markusse, Philipp Leitner, and Alexander Serebrenik
    IEEE Software, 2022
    To appear.
  7. Interactive Production Performance Feedback in the IDE
    Jürgen Cito, Philipp Leitner, Martin Rinard, and Harald Gall
    In Proceedings of the 41st International Conference on Software Engineering (ICSE), 2019