Measuring and Visualizing GPU Power Usage in Real Time with asyncio and Matplotlib

In this post we will learn how to periodically measure the power usage of our GPU and plot it in real time with a single Python program. For this we need concurrency between the measuring and the plotting parts of our code. Concurrency here means that the measuring task goes to sleep after taking a measurement. While the measuring task is asleep, the plotting task can update the plot and then goes to sleep as well. After a defined amount of time the measuring task wakes up and takes the next measurement as soon as the CPU allows it, then the plotting task runs again, and so on. We achieve concurrency with asyncio and the plotting is done with Matplotlib. To measure the GPU power we use pynvml (Python bindings for the NVIDIA Management Library). Before we get into the code, here is a video showing the interface in action.

This video shows the power in watts drawn by my GPU over a twenty-second time window. The measurements are taken every 100 milliseconds and the plot is updated every 200 milliseconds. Everything happens in one Python script. I ran this by simply passing the script below to python on my command line.
import pynvml
import matplotlib.pyplot as plt
import time
import numpy as np
import asyncio

"""Initialize GPU measurement and parameters"""
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
measurement_interval = 0.1  # in seconds
plotting_interval = 0.2  # in seconds
time_span = 20  # time span on the plot x-axis in seconds
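m = np.full(int(time_span / measurement_interval), np.nan)  # power samples in W
t = np.full_like(m, np.nan)  # elapsed time in s; a separate array so that m and t do not alias
mW_to_W = 1e3  # nvmlDeviceGetPowerUsage reports milliwatts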

"""Initialize the plot"""
plt.ion()
plt.rcParams.update({'font.size': 18})
figure, ax = plt.subplots(figsize=(8,6))
line1, = ax.plot(t, m, linewidth=3)
ax.set_xlabel("Time (s)")
ax.set_ylabel("GPU Power (W)")

async def measure():
    while True:
        measure = pynvml.nvmlDeviceGetPowerUsage(handle) / mW_to_W
        dt = time.time() - ts
        m[:-1] = m[1:]
        m[-1] = measure
        t[:-1] = t[1:]
        t[-1] = dt
        await asyncio.sleep(measurement_interval)

async def plot():
    while True:
        line1.set_data(t, m)
        tmin, tmax = np.nanmin(t), np.nanmax(t)
        mmin, mmax = np.nanmin(m), np.nanmax(m)
        margin = (np.abs(mmax - mmin) / 10) + 0.1
        ax.set_xlim((tmin, tmax + 1))
        ax.set_ylim((mmin - margin, mmax + margin))
        figure.canvas.flush_events()
        await asyncio.sleep(plotting_interval)

async def main():
    t1 = loop.create_task(measure())
    t2 = loop.create_task(plot())
    await asyncio.gather(t1, t2)  # wait on both tasks; "await t2, t1" only awaits t2
    
if __name__ == "__main__":
    ts = time.time()
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())

We will start with the functions async def measure() and async def plot() since they are central to the program. First, note that neither of them is an ordinary function because of the async keyword. This keyword was added in Python 3.5. In earlier Python versions we could have instead decorated the functions with the @asyncio.coroutine decorator, which has since been removed in Python 3.11. The async keyword turns our function into a coroutine, which allows us to use the await keyword inside. With the await keyword we can put the coroutine to sleep with await asyncio.sleep(measurement_interval). While the coroutine is asleep, the asyncio event loop can run other coroutines that are not asleep. More on the asyncio event loop later. Because we want to keep measuring until someone terminates the program, we wrap everything in measure in an infinite loop, while True:.
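To make the sleeping and waking more concrete, here is a minimal sketch that is separate from the GPU script. The names worker and demo, the labels and the intervals are made up for illustration. Two coroutines append to a shared list and sleep for different amounts of time, so the list shows how the event loop interleaves them.

```python
import asyncio


async def worker(name, interval, repeats, log):
    """Append a label to the shared log, then sleep; do this `repeats` times."""
    for i in range(repeats):
        log.append(f"{name}{i}")
        # await hands control back to the event loop while this coroutine sleeps
        await asyncio.sleep(interval)


async def demo():
    log = []
    # Run a fast and a slow worker concurrently in the same event loop
    await asyncio.gather(
        worker("fast", 0.01, 4, log),
        worker("slow", 0.02, 2, log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The fast worker gets to run while the slow one is asleep, which is the same relationship our measure and plot coroutines have. The exact interleaving depends on timer resolution, so it is best not to rely on a particular order beyond the first entries.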

So what do we do while measuring? Outside of the coroutine we define two arrays, m and t: one holds the measured power and the other holds the elapsed time since the program started. Recording time is important because energy is power integrated over time, and because the timestamps let us check that the coroutine is not staying asleep much longer than we want it to. When we take a measurement we shift the current elements of the measurement array one position to the left with m[:-1] = m[1:]. We then write the newly measured value into the last position with m[-1] = measure. The time array t is updated in the same way. That is all there is to our measurements.
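The shifting buffer and the link between power and energy can be sketched in isolation. The helpers push and energy_joules are hypothetical names that are not part of the script above, and the constant 50 W samples are made-up values. The energy estimate uses the trapezoidal rule and skips slots that still contain NaN.

```python
import numpy as np


def push(buffer, value):
    """Shift the buffer one slot to the left in place and store value at the end."""
    buffer[:-1] = buffer[1:]
    buffer[-1] = value


def energy_joules(t, p):
    """Trapezoidal estimate of energy in joules from time (s) and power (W)."""
    valid = ~np.isnan(t) & ~np.isnan(p)  # ignore slots that were never filled
    t, p = t[valid], p[valid]
    if t.size < 2:
        return 0.0
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))


# Two separate buffers, one for time and one for power
t = np.full(5, np.nan)
p = np.full(5, np.nan)
for step in range(3):
    push(t, step * 0.1)  # samples at 0.0 s, 0.1 s and 0.2 s
    push(p, 50.0)        # made-up constant draw of 50 W

print(energy_joules(t, p))  # 50 W over 0.2 s, so about 10 J
```

Keeping time and power in two separate arrays matters here: if both names pointed at the same array, every timestamp written would overwrite a power sample.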

Our plot coroutine works just like the measure coroutine except that it plots whatever is in the time and measurement arrays before it goes to sleep. The plotting itself is basic Matplotlib, but it is important to note that figure.canvas.flush_events() is critical for updating the plot in real time. Furthermore, when we initialize the plot, plt.ion() turns on interactive mode, which is important for the figure to show up without blocking the rest of the program.

Coroutines are not called like normal functions. They do their work as tasks within an asyncio event loop. This event loop knows which coroutines are asleep and decides which coroutine runs next. Scheduling by hand may seem manageable with two coroutines, but with three it already becomes tedious: when one coroutine goes to sleep, two others may be awake and waiting to run, and something has to decide which one goes next. Luckily asyncio takes care of these details for us and we can focus on the work we want to get done instead. We do, however, need to create an event loop with loop = asyncio.new_event_loop() and then start it with loop.run_until_complete(main()). The coroutines only get to work once the loop is running. Both of our coroutines are scheduled as tasks in main(), so both become part of the event loop. Because of the event loop I recommend running the code from the command line. Running it in interactive environments can cause problems because another event loop might already be running there.
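Here is a small sketch of that life cycle with coroutines that actually finish, so the loop can stop on its own. The coroutine count_to and its arguments are invented for this example. It shows that calling a coroutine function only builds a coroutine object, and that the work happens once an event loop runs it.

```python
import asyncio
import inspect


async def count_to(n, interval=0.001):
    """Add up 1..n, sleeping briefly after every step."""
    total = 0
    for i in range(1, n + 1):
        total += i
        await asyncio.sleep(interval)
    return total


async def main():
    # create_task schedules both coroutines on the running event loop
    t1 = asyncio.create_task(count_to(3))
    t2 = asyncio.create_task(count_to(4))
    # gather waits for both tasks and returns their results in order
    return await asyncio.gather(t1, t2)


# Calling the coroutine function does not run its body yet
coro = count_to(3)
is_coroutine = inspect.iscoroutine(coro)
coro.close()  # discard it so Python does not warn that it was never awaited

# Create the loop explicitly, run main() to completion, then clean up
loop = asyncio.new_event_loop()
try:
    results = loop.run_until_complete(main())
finally:
    loop.close()

print(is_coroutine, results)
```

On Python 3.7 and later, asyncio.run(main()) creates and closes the loop in a single call, which is usually the shorter option when no handle on the loop is needed.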

With that, we have already covered the most important parts of the code. There are several things we could do differently and some of those might make the code better. For one, we could use a technique called blitting (explained here) to improve the performance of the plotting. We could also do the plotting with FuncAnimation (explained here) instead of writing our own coroutine. I tried that for a while but was not able to make the animation and the measure() coroutine work together in the same event loop. There probably is a way to do it that I did not find. Let me know if you have other points for improvement.

You can find pynvml here. asyncio is part of the Python installation and you can find the docs here. I was inspired to do this project by a package called codecarbon that you can find here. It estimates the carbon footprint of computation and I plan to blog about it soon.

Spiking neural networks for a low-energy future

Spiking neural networks (SNNs) have some disadvantages compared to artificial neural networks (ANNs) but they have the potential to run on a fraction of the energy. Whether SNNs will be able to replace ANNs and how much energy they will use depends on many engineering and neuroscience advances. Here I will go through some of the technical background of the SNN energy advantages and some of the current numbers.

Energy efficient SNN features

The energy efficiency of SNNs comes primarily from two features. Firstly, the spike is a discrete event and energy is only used when a spike occurs. This is probably the most fundamental feature that distinguishes SNNs from ANNs. It means that the energy efficiency of an SNN depends not only on the number of neurons but also on the number of spikes the model requires to perform its task. The second feature is local memory. At the heart of all models are parameters. On traditional hardware such as CPUs and GPUs, the part of the chip that performs the calculations is not the same part that stores the parameters. Loading the parameters onto the chip is much more energy intensive than the computation itself. Therefore, when the parameters can be stored locally on the chip that computes, efficiency advantages result. This is not unique to spiking neural networks. Some tensor processing units (TPUs) also feature local memory and they are specifically designed for ANNs. When most people speak about the energy advantages of SNNs, they assume local memory.
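To illustrate how the number of spikes enters the picture, here is a back-of-the-envelope sketch for one fully connected layer. It assumes a simplified operation-counting model that is not taken from any of the works cited here: the ANN pays one multiply-accumulate per connection, while the SNN pays one accumulate per connection for every input spike. The per-operation energies E_MAC and E_AC are placeholder values chosen only for illustration, not measurements, and memory access is ignored entirely.

```python
def ann_layer_energy(n_in, n_out, e_mac):
    """Energy of one dense ANN layer: every connection does one multiply-accumulate."""
    return n_in * n_out * e_mac


def snn_layer_energy(n_in, n_out, spikes_per_input, e_ac):
    """Energy of one dense SNN layer: every input spike triggers n_out accumulates."""
    return n_in * spikes_per_input * n_out * e_ac


# Placeholder per-operation energies in joules (illustrative assumptions only)
E_MAC = 4.6e-12
E_AC = 0.9e-12

ann = ann_layer_energy(1000, 1000, E_MAC)
sparse_snn = snn_layer_energy(1000, 1000, 0.5, E_AC)   # few spikes per inference
dense_snn = snn_layer_energy(1000, 1000, 10.0, E_AC)   # many spikes per inference

print(f"sparse SNN advantage: {ann / sparse_snn:.1f}x")
print(f"dense SNN advantage:  {ann / dense_snn:.2f}x")
```

Under these assumptions the advantage comes entirely from sparse spiking. With ten spikes per input the hypothetical SNN layer uses more energy than the ANN layer, which is why the spike count matters as much as the neuron count.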

SNNs also require specialized hardware to run efficiently. That hardware is called neuromorphic. It makes efficient use of the binary nature of spikes and local memory. Neuromorphic hardware is so far only available for research purposes and making it more widely available will be one of the challenges to SNN adoption. Next come some numbers on energy efficiency.

How efficient are we talking?

How much more efficient SNNs are depends on many factors of the comparison. What is the task? What are the model architectures? What is the hardware? Making projections into the future is even harder, since machine learning advances are made quickly on both SNNs and ANNs. Projecting the absolute amount of energy that could be saved is harder still because it requires AI demand predictions, which can change non-linearly with technical advances. I would be interested in finding formal work on some of these uncertainties, or in working on some myself, but for now here are some numbers.

The Loihi processor from Intel Labs is a recent piece of neuromorphic hardware. Depending on the size of their example problem, Davies et al. (2018) find that Loihi is 2.58x, 8.08x or 48.74x more energy efficient than a 1.67-GHz Atom CPU.

Yin et al. (2020) present a method to train SNNs (backpropagation of surrogate gradients). They calculate the theoretical energy consumption for a spiking recurrent network they train with the method and some ANN architectures. Depending on the task, their SNN was 126.2x, 935x, 1602x, 1776x or 3353.3x more efficient than a Long Short-Term Memory network (LSTM; also depends on some details of the LSTM implementation). Their network was 41.3x more efficient compared to a recurrent ANN. Here is a talk from the last author Sander Bohte where he summarizes the findings as >100x more efficient than best recurrent ANN and 1000x more efficient than LSTM. All their calculations assume local memory.

Panda et al. (2019) tried several methods to generate SNNs for image classification and calculated the theoretical energy consumption. They estimate that their SNNs are 6.52x, 7.7x, 10.6x, 74.9x, 81.3x or 104.8x more efficient, depending on model architecture and parameter space.

Merolla et al. (2014) present the TrueNorth neuromorphic architecture. They compare the synaptic operations per second (SOPS) of their architecture to the floating-point operations per second (FLOPS) of traditional chips. They report that TrueNorth can deliver 46 billion SOPS per watt, whereas the most energy-efficient supercomputer at the time of their writing achieved 4.5 billion FLOPS per watt.

These numbers highlight the potential for some massive energy savings, but benchmarks are always complicated. Making good comparisons can be hard, especially when the units of computation are fundamentally different, as with SOPS and FLOPS. Either way, SNNs on neuromorphic hardware are extremely energy efficient, but to truly save energy they must become better at the tasks ANNs already solve.

References

Davies et al. 2018. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro. 10.1109/MM.2018.112130359

Yin, Corradi & Bohte 2020. Effective and Efficient Computation with Multiple-timescale Spiking Recurrent Neural Networks. https://arxiv.org/abs/2005.11633

Panda, Aketi & Roy 2019. Towards Scalable, Efficient and Accurate Deep Spiking Neural Networks with Backward Residual Connections, Stochastic Softmax and Hybridization. https://arxiv.org/abs/1910.13931

Merolla et al. 2014. A million spiking-neuron integrated circuit with a scalable communication network and interface. https://science.sciencemag.org/content/345/6197/668