A Scarcity Mentality - Thinking like the Silicon

Low-level programming is tedious. But with a scarcity mentality, you can have your cake and eat (most of) it too.

September 29, 2019

Intro

Low-level programming is tedious. While getting closer to the silicon can make for more efficient programs, it seldom makes for more efficient programming. Luckily for those who dabble in software development, there's a myriad of higher-level paradigms to choose from - functional, imperative, object-oriented, and more - all designed to allow the engineer to think less like the silicon and more like a human. Nowadays, those who tango with the x86 instruction set find themselves in a niche corner of software development. It's fantastic: it lets people specialise and ultimately leads to better software.

Whether they're hardware engineers wrangling FPGAs and ASICs, or software engineers dabbling with CPUs and GPUs, those who possess a fondness for their silicon can avoid common pitfalls during the development of their systems - pitfalls which can result in headaches if their system begins to cripple their silicon. That's not to say a fondness for the underlying hardware is necessary, but just like a nurturing parent encouraging their children to eat their vegetables, it's for their own good.

While most engineers are aware of this, it's understandable that they neglect to contemplate the development processes associated with low-level design, instead putting their faith in their compiler, interpreter, or synthesiser to elegantly compose their high-level code into a symphony of machine code. For the most part, this is perfectly acceptable - modern compilers, interpreters, and synthesisers do this spectacularly. A perfect example of this is the numpy library for Python, which allows the user to efficiently manipulate matrices of data without resorting to sluggish Python operations and data structures. numpy presents an elegant, pythonic, high-level interface to efficient low-level implementations of the abstracted operations - no tedium necessary.
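
As a concrete (if simplistic) illustration of what numpy hides, consider summing the squares of a million numbers. A hypothetical sketch - the loop churns through Python objects one at a time, while the numpy call dispatches to a single compiled routine:

import numpy as np

data = np.random.rand(1_000_000)

# Pure Python: the interpreter boxes, unboxes, and dispatches on every element.
total_slow = sum(x * x for x in data)

# numpy: one call, one tight compiled loop over a contiguous buffer.
total_fast = np.dot(data, data)

assert np.isclose(total_slow, total_fast)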

With that in mind, numpy isn't going to be the solution to all of a software engineer's numerical-manipulation woes if their objective is to get the most out of their silicon, or equivalently, their dollar. High-level compilers, interpreters, and synthesisers work best with fastidious engineers.

What does getting the most out of the silicon look like?

At first glance, the answer to this question seems like it would depend on the configuration of the silicon. And this is true to some extent. While software engineers program processors, hardware engineers manipulate programmable logic. A processor, such as a CPU, is a highly optimised piece of logic, designed to follow a set of specific instructions and to do so very efficiently. However, a Field Programmable Gate Array (FPGA) is a device whose configuration is left entirely in the hands of the programmer. Just as a software engineer's code is compiled into machine code, a hardware engineer's code is synthesised into a logic circuit.

While the job of software compilers and hardware synthesisers is to efficiently and effectively translate the engineer's intentions onto the device (and these translators can go as far as to silently optimise an engineer's sloppy work), getting the most out of the silicon begins with the engineer. Just as a writer must write with their audience in mind, an engineer must write with the silicon in mind. After all, the silicon can only do so much - it has a capacity. There are only so many instructions a CPU can execute every second, and there are only so many Look Up Tables (LUTs) on an FPGA.

To a software engineer, getting the most out of the silicon means maximising utilisation of the CPU's resources and implementing efficient, elegant, and concise algorithms. To hardware engineers, getting the most out of their silicon is no different, but the imperative is very different. In the realm of software, if the engineer's code is inefficient, it just demands a prolonged execution. But if a hardware engineer's code synthesises into a design requiring 100,000 LUTs, it simply won't fit on an FPGA with any fewer - if the target FPGA has only 80,000 LUTs, it's back to the drawing board. Software is thus more forgiving than hardware. It's easy for the inefficiencies of a software algorithm to be hidden behind the guise of runtime.¹

To get the most out of the FPGA, a hardware engineer must get their hands dirty with the silicon. Hardware engineers tend to develop a fondness for the number 2 and its siblings (its powers) - an FPGA makes easy work of division by 2 or 4, but division by 3 will keep an astute hardware engineer up at night. Likewise, a software engineer would seldom examine the consequences of computing the square root of a 64-bit integer, yet a hardware engineer questions whether they really require 64 bits for that integer, and that's before they even think about computing its square root.
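
For the software-minded, that fondness for powers of two is easy to demonstrate: in binary, dividing a non-negative integer by 2^n is just a shift of the bits, whereas division by 3 needs a genuine divider. A hypothetical sketch:

x = 1964

# Dividing a non-negative integer by a power of two is just a bit shift -
# trivially cheap wiring on an FPGA, a single instruction on a CPU.
assert x >> 2 == x // 4

# Division by 3 has no such shortcut: hardware needs a real divider
# (or a multiply-by-reciprocal scheme) to compute it.
print(x // 3)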

Adopting a scarcity mentality

Bob works in a hardware engineering team implementing real-time systems onto FPGAs. The objective of Bob and his team's collective engineering efforts is to obtain the best system performance possible with the finite capacity of the FPGA. Floating-point signals are forbidden, non-linear functions are discretised and stored in memory, and designs are painstakingly examined with a fine-toothed comb to expose inefficiencies and flaws. Bob and his team have a scarcity mentality. Their mantra, "Sweat every LUT", keeps them on their engineering feet.

Alice is a data scientist implementing a machine learning model in Python. Fortunately for her, modern CPUs have the capability to efficiently perform non-linear functions on floating-point data. Furthermore, some high-level languages even hide things like data-type and algorithm implementations from the developer. This was fine while Alice was developing her system, but now she plans to deploy it at scale in the cloud, where her cloud expenses are proportional to runtime. What she really needs now is to get the most out of her silicon - so she reaches for that fine-toothed comb in Bob's top drawer.

She needs to transform a 3D array of logits into probabilities by applying a logistic transform she has defined:

logits.shape
>>> (256, 256, 32)
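
(The logits array itself isn't defined in these snippets. For anyone reproducing the timings, a hypothetical stand-in of the same shape will do - the real values come from Alice's model:)

import numpy as np

# Hypothetical stand-in for Alice's data: logits drawn uniformly from a
# plausible pre-sigmoid range.
logits = np.random.uniform(-6, 6, size=(256, 256, 32))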

import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))

%timeit a = sigmoid(logits)
25 ms ± 949 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Alice would normally trust that the np library has efficiently implemented her algorithm, but Bob has advised her to explore the possibility that she isn't getting the most out of her silicon. What can she accomplish with Bob's fine-toothed hardware engineering comb? A common practice in FPGA design is to reuse code that others have already painstakingly crafted. So, she might choose to import the expit function (the logistic sigmoid) from scipy.special:

from scipy.special import expit

%timeit b = expit(logits)
17.3 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's a sizeable improvement! So she applies another stroke of the comb.

To store the result in b, Python will take a moment to allocate some memory. But Alice no longer requires the logits data, so she decides to store the result in the same location. Thankfully, expit accepts an out argument to specify where the result should be written.

%timeit expit(logits, out=logits)
15.6 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What else can she do? np uses the np.float64 type by default. Alice decides that she doesn't require double precision, so she can save even more time by performing the calculation in single precision:

logits = logits.astype(np.float32)

%timeit expit(logits, out=logits)
12.9 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With a little bit of work, Alice just got twice as much out of her silicon! Most importantly, she didn't even have to leave the comforts of her high-level Python environment.

Now, what if these probabilities are being fed into an image processing pipeline, and Alice needs to represent the data as 32 256x256 8-bit greyscale images? She could start by multiplying the probabilities by 255, rounding, and casting to an integer:

%timeit probs = np.round(expit(logits)*255).astype(np.uint8)
29.4 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Alice recalls that a tactic commonly used by Bob's team is to precompute non-linear functions and store them in memory. She concludes that while this could result in a loss of precision, as long as the error introduced is within an acceptable tolerance, it could be a suitable approach. Her plan: quantise the logits down to a signed 8-bit range, and precompute the sigmoid of every possible quantised value - a look-up table.

from scipy.special import expit, logit

# Quantisation parameters: QSCALE maps logits onto the +/-127 range of a
# signed 8-bit integer; at +/-127 the sigmoid is within 1/QBINS of full
# saturation.
QMAX, QBINS = 127, 512
QSCALE = QMAX/-logit(1/QBINS)

# Precompute the 8-bit (0-255) sigmoid output for every quantised logit.
EXPIT_LUT = {
    INT8_LOGIT: np.multiply(
        (QBINS//2-1), expit(INT8_LOGIT/QSCALE)
    ).round().astype(np.uint8) for INT8_LOGIT in range(-QMAX, QMAX+1, 1)
}

# Reorder into an array so that indexing with a (possibly negative) int8
# value wraps around to the correct entry.
EXPIT_LUT = np.asarray(
    [EXPIT_LUT[i] for i in list(range(0, QMAX+1))+list(range(-QMAX, 0))]
)

%timeit lut_probs = EXPIT_LUT[np.round(logits*QSCALE).astype(np.int8)]
18.8 ms ± 469 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
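
The table pays off, but it's worth checking how much precision the shortcut gives up. A quick, hypothetical sanity check against the exact 8-bit result (names as defined above):

# Compare the LUT shortcut with the exact 8-bit computation and report
# the worst-case error, measured in greyscale levels.
lut_probs = EXPIT_LUT[np.round(logits*QSCALE).astype(np.int8)]
exact_probs = np.round(expit(logits)*255).astype(np.uint8)

max_err = np.abs(lut_probs.astype(np.int16) - exact_probs.astype(np.int16)).max()
print(f"worst-case error: {max_err} of 255 greyscale levels")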

To have your cake and eat (most of) it too

There are a number of ways in which engineers can pursue performance optimisation during development. Each approach will require varying degrees of effort, and each will yield varying magnitudes of improvement. Humans, by nature, take the path of least resistance. Just as engineers have over time gravitated towards higher-level languages for their ease of use, so too should their pursuit of efficiency commence down the path of least resistance.

As particular avenues are explored, evaluated, and hopefully implemented, there will come a point of diminishing returns - too much effort for not enough gain. In truth, the pursuit of extracting the most out of the silicon is a niche endeavour, seldom applicable to everyday engineering. However, by adopting a scarcity mentality, Alice realised substantial performance gains with relatively insignificant effort. She can now advance to engineering all the other parts of her system.

The purpose of a scarcity mentality isn't to banish lazy, high-level developers to a low-level purgatory. Instead, it can arm them with the foresight necessary to steer away from common performance pitfalls typical of high-level languages. Steering, as opposed to rerouting, so as not to get lost down the rabbit hole of performance optimisation - to develop continuously and efficiently.

Getting the most out of the silicon may be the domain of relentless low-level developers, but by developing with the silicon in mind, engineers (even those with the luxury of high-level abstractions) can have their cake and eat (most of) it too.


  1. Unfortunately for those who use embedded systems for real-time processing, both runtime and capacity are of concern!