I was reading about a new benchmarking framework for quantum computing software called Benchpress. Looking through the code, it seems to be built on top of pytest-benchmark, the benchmarking plugin for pytest. If you want to quickly benchmark your own code, it’s quite easy to do with pytest-benchmark.

Running benchmarks

If you know a bit about pytest, it’s actually very easy to add some benchmarking to your tests by just using the benchmark fixture. Suppose we have these two functions:

def fibonacci_recursive(n):
    if n <= 1:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)


def fibonacci_iterative(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

that we would like to test and benchmark. We can do this by adding the benchmark fixture to our test functions:

import pytest

def test_fibonacci_recursive(benchmark):
    result = benchmark(fibonacci_recursive, 30)
    assert result == 832040

def test_fibonacci_iterative(benchmark):
    result = benchmark(fibonacci_iterative, 30)
    assert result == 832040
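
As a side note, if you need finer control over how the timing is done, the fixture also has a pedantic mode. A minimal sketch, with the round and iteration counts picked arbitrarily:

def test_fibonacci_iterative_pedantic(benchmark):
    # 10 measured rounds of 100 iterations each; the numbers are arbitrary
    result = benchmark.pedantic(fibonacci_iterative, args=(30,), rounds=10, iterations=100)
    assert result == 832040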

Then, after installing the pytest-benchmark package, we can run our tests with the --benchmark-enable flag:

pytest --benchmark-enable tests/

This will run the tests and print out the results in a nice table.


---------------------------------------------------------------------------------------------------------------- benchmark: 2 tests ---------------------------------------------------------------------------------------------------------------
Name (time in ns)                        Min                        Max                       Mean                    StdDev                     Median                       IQR            Outliers             OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fibonacci_iterative            387.8466 (1.0)           7,455.1521 (1.0)             435.0220 (1.0)             33.3827 (1.0)             435.9217 (1.0)              6.4655 (1.0)    4121;31660  2,298,734.2926 (1.0)      181819          13
test_fibonacci_recursive     60,989,416.0298 (>1000.0)  65,767,124.9937 (>1000.0)  63,742,859.3145 (>1000.0)  1,336,402.2177 (>1000.0)  63,621,042.0053 (>1000.0)  1,510,000.5257 (>1000.0)       5;0         15.6880 (0.00)         16           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================================================================================= 2 passed in 3.33s =============================================================================================

(note that benchmarks run by default once the plugin is installed; --benchmark-enable is mostly useful to override a --benchmark-disable that you have set in your pytest configuration)
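
For example, with a (hypothetical) pytest config like the one below, everyday test runs skip the timing and you opt back in with --benchmark-enable:

# pytest.ini: keep benchmarks off during everyday test runs
[pytest]
addopts = --benchmark-disable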

Surprisingly, the recursive version is roughly 150,000 times slower than the iterative version. This is because the naive recursion recomputes the same Fibonacci numbers over and over, so the number of function calls grows exponentially with n, and each of those calls carries Python’s function-call overhead, which is far more expensive than looping and adding two numbers (today I learned… lol).
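
To get a feel for the numbers, here is a quick sketch (count_calls is just an illustrative helper, not part of the benchmark):

def count_calls(n):
    # count how many function calls fibonacci_recursive(n) ends up making
    if n <= 1:
        return 1
    return 1 + count_calls(n - 1) + count_calls(n - 2)


print(count_calls(30))  # 2,692,537 calls, versus 29 loop iterations in the iterative version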

Comparing results

With pytest-benchmark, you can compare results to previous runs by using the --benchmark-save and --benchmark-compare flags.

# run and save results
pytest --benchmark-save=baseline tests/

# make changes, run and compare
pytest --benchmark-compare=baseline tests/

which is convenient if you want to compare the effects of some refactor work:

------------------------------------------------------------------------------------------------------------------------ benchmark: 4 tests ------------------------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                        Max                       Mean                    StdDev                     Median                       IQR              Outliers             OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fibonacci_iterative (NOW)                     392.3293 (1.0)           6,972.2494 (1.91)            435.3606 (1.0)             31.0298 (1.03)            437.4985 (1.0)              6.9995 (1.0)      8191;37003  2,296,946.3097 (1.0)      198334          12
test_fibonacci_iterative (0002_baselin)            395.8315 (1.01)          3,645.8320 (1.0)             440.7425 (1.01)            30.2541 (1.0)             441.0006 (1.01)             6.9995 (1.0)     11175;30137  2,268,898.7304 (0.99)     196697          12
test_fibonacci_recursive (NOW)              62,625,999.9699 (>1000.0)  67,687,292.0012 (>1000.0)  65,359,044.3762 (>1000.0)  1,253,637.0277 (>1000.0)  65,587,125.0201 (>1000.0)  1,420,166.4890 (>1000.0)         4;0         15.3001 (0.00)         16           1
test_fibonacci_recursive (0002_baselin)     63,488,584.0118 (>1000.0)  66,528,582.9455 (>1000.0)  65,248,984.4360 (>1000.0)    910,938.2318 (>1000.0)  65,386,229.0201 (>1000.0)  1,530,999.4924 (>1000.0)         5;0         15.3259 (0.00)         16           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================================================================================= 2 passed in 3.38s =============================================================================================

So seemingly, my refactor made no real difference: the new iterative version is within a few nanoseconds of the baseline, well inside the measurement noise. oops.
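
As a side note, if you ever want a real regression to fail the run (handy in CI), pytest-benchmark also has a --benchmark-compare-fail option. A sketch, where 0001 is the number of the saved run and the 5% threshold is just an example:

# fail the run if the mean of any benchmark regresses by more than 5%
pytest --benchmark-compare=0001 --benchmark-compare-fail=mean:5% tests/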

Visualizing results

I don’t recommend this, but there’s a --benchmark-histogram flag to create a very simple histogram of the results and save that as an svg:
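
The invocation might look something like this (the value is used as a filename prefix for the generated SVG files, and you may need pygal installed for it to work):

pytest --benchmark-enable --benchmark-histogram=benchmark_hist tests/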

The resulting chart is nice for a quick look at the results, but frankly: it’s ugly :D

Alternatively, you can use matplotlib to make a much nicer plot:

import json
import matplotlib.pyplot as plt


def compare_benchmarks(
    baseline_file: str, refactor_file: str
) -> tuple[plt.Figure, plt.Axes]:
    with open(baseline_file, "r") as f:
        baseline_data = json.load(f)

    with open(refactor_file, "r") as f:
        refactor_data = json.load(f)

    baseline_dict = {b["name"]: b["stats"]["mean"] for b in baseline_data["benchmarks"]}
    refactor_dict = {b["name"]: b["stats"]["mean"] for b in refactor_data["benchmarks"]}

    all_names = sorted(set(baseline_dict.keys()) | set(refactor_dict.keys()))

    means_baseline = [baseline_dict.get(name, 0) for name in all_names]
    means_refactor = [refactor_dict.get(name, 0) for name in all_names]

    y = range(len(all_names))
    width = 0.35

    fig, ax = plt.subplots(figsize=(10, max(8, len(all_names) * 0.3)))
    ax.barh([i - width / 2 for i in y], means_baseline, width, label="Baseline")
    ax.barh([i + width / 2 for i in y], means_refactor, width, label="Refactor")

    ax.set_xscale("log")
    ax.set_xlabel("Mean Time (seconds) - Log Scale")
    ax.set_yticks(y)
    ax.set_yticklabels(all_names)
    ax.legend()

    ax.grid(True, which="both", ls="-", alpha=0.2)

    plt.tight_layout()
    return fig, ax


plt.style.use("dark_background")

fig, ax = compare_benchmarks(
    ".benchmarks/Darwin-CPython-3.11-64bit/0001_baseline.json",
    ".benchmarks/Darwin-CPython-3.11-64bit/0003_refactor.json",
)

plt.show()

This works because pytest-benchmark stores each saved run as a JSON file inside the .benchmarks folder.
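
If you don’t feel like naming every run yourself, there is also an autosave flag that numbers the runs for you:

# automatically number and save the run under .benchmarks/
pytest --benchmark-autosave tests/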

Memory profiling

One last thing I want to mention is that there is a similar plugin called pytest-memray that can be used to do memory profiling.

pytest --memray tests/
=============================================================================================== MEMRAY REPORT ===============================================================================================
Allocation results for rsokolewicz.github.io/content/posts/24_pytest_benchmark/benchpress.py::test_fibonacci_iterative at the high watermark

         📦 Total memory allocated: 6.4MiB
         📏 Total allocations: 2
         📊 Histogram of allocation sizes: |█ |
         🥇 Biggest allocating functions:
                - fibonacci_iterative:/Users/rsoko/Documents/private/rsokolewicz.github.io/content/posts/24_pytest_benchmark/benchpress.py:15 -> 5.0MiB
                - update:/Users/rsoko/miniforge3/envs/dev/lib/python3.11/site-packages/pytest_benchmark/stats.py:45 -> 1.4MiB


Allocation results for rsokolewicz.github.io/content/posts/24_pytest_benchmark/benchpress.py::test_fibonacci_recursive at the high watermark

         📦 Total memory allocated: 3.5KiB
         📏 Total allocations: 4
         📊 Histogram of allocation sizes: |█  ▄|
         🥇 Biggest allocating functions:
                - _raw:/Users/rsoko/miniforge3/envs/dev/lib/python3.11/site-packages/pytest_benchmark/fixture.py:180 -> 1.2KiB
                - __call__:/Users/rsoko/miniforge3/envs/dev/lib/python3.11/site-packages/pytest_benchmark/fixture.py:156 -> 1.1KiB
                - _calibrate_timer:/Users/rsoko/miniforge3/envs/dev/lib/python3.11/site-packages/pytest_benchmark/fixture.py:318 -> 586.0B
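
Beyond the report above, pytest-memray can also fail a test that allocates more than a given budget, via a marker. A minimal sketch that would sit next to the tests above (the 10 MB limit is arbitrary; run it with --memray as before):

import pytest


@pytest.mark.limit_memory("10 MB")  # fail this test if it allocates more than 10 MB
def test_fibonacci_iterative_memory():
    assert fibonacci_iterative(30) == 832040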

This post was co-written with Lan Chu.