Floating-point face-off

What makes a great runtime library different from a run-of-the-mill runtime library?
This article answers that question with hard data and technical insight.

What is the SEGGER Runtime Library?

Recently, SEGGER introduced the SEGGER Runtime Library. It can entirely replace the open-source libraries used with GCC or LLVM/Clang (typically newlib or newlib-nano) and delivers much better performance and smaller code. It is based on the highly optimized library we have been using for years in Embedded Studio, with almost two decades of continuous advancement. Seeing the difference in performance and size compared to open-source as well as commercial counterparts, we decided to make it available for other toolchains. The library is completely portable and can be used on any CPU.
In addition, an assembly-optimized version is available for ARM processors. More information is available on our Wiki.

The overall size and performance of a program depend on the compiler and, to almost the same extent, on the code in the underlying runtime library.

Let’s get right to it!

I compare two functions that show the difference between the SEGGER Runtime Library and the libraries used in two well-respected IDEs from Vendors A and B. If you are interested in how we manage to outperform competitors' products so comprehensively, read on past the results for the technical details.

Function                                   expf(1.0)                  sinhf(1.0)
Vendor                               SEGGER      A      B     SEGGER      A      B
Function code size                      324    628    416        436    688    288
+ Supporting code size (linked in)        0   1074   1308          0   1662   2064
= Total code size                       324   1702   1724        436   2350   2352
Execution cycles                        111    954    792        159   1823    968
Code size relative to SEGGER           1.00  ×5.25  ×5.32       1.00  ×5.39  ×5.39
Performance relative to SEGGER         1.00  ÷8.60  ÷7.14       1.00 ÷11.50  ÷6.09

SEGGER’s code is more than 6 times as fast as the best competitor’s code!

This is for a Cortex-M3 running from flash with no wait states; all benchmarks were conducted on the same evaluation board using identical clocking. In the spirit of fair benchmarking, SEGGER’s implementations of both expf() and sinhf() are fully conformant to IEEE-754 in round-to-nearest mode and produce results correct to 0.5 units in the last place (ulp, and better than 0.5 internally) over the entire floating-point domain, i.e. these are not synthetic, cut-down, reduced-range “benchmark specials”.

A much harder contest…

So, how about comparing to Vendors A and B’s hardware floating-point performance for these functions? That is, we are benchmarking the SEGGER fixed-point implementation against commercial-grade libraries that have the added advantage of hardware floating-point assistance. No contest, right?

Function                          expf(1.0)                  sinhf(1.0)
Vendor                      SEGGER      A      B     SEGGER      A      B
Execution cycles               111    137    141        159    242    218

A piece of SEGGER code written entirely in integer-only C outperforms other vendors’ code that uses the floating-point unit.
Simply incredible!

Some background

Implementing basic arithmetic in assembly language lacks any glamor: it’s a sudoku-like exercise in register management and instruction selection, rewarding when you come up with a great solution, but nobody really appreciates the effort that went into it. SEGGER’s floating-point emulation for the basics is written in assembly language and is tight, small, and fast on all ARM architectures. Hard, probably impossible, to beat.

Approximating functions such as expf() and sinhf() is a rather different matter. Here the task is to approximate the function, typically with a polynomial, in a way that is both accurate and fast.

There are many ways to implement an approximation, and the C standard allows you to choose your quality of implementation.

For these functions, SEGGER chooses to be compliant with the IEEE-754 standard and to deliver results correct to 0.5 ulp. This means that we have set the bar at the highest possible level; there is no “better quality” to be had.

An implementation can:

  • choose to be accurate and sacrifice outright performance to achieve that;
  • choose to be fast and, in the process, sacrifice some accuracy;
  • choose a restricted domain for typical inputs and be wildly inaccurate outside that domain.

SEGGER does not sacrifice speed, does not restrict domains, and does not sacrifice accuracy: We deliver all three without compromise.

Where does inefficiency creep in?

I will skip over the ways you can approximate elementary functions using floating-point arithmetic; this is well covered in textbooks, and HP’s book describing the implementation for the Itanium is particularly interesting.

SEGGER’s implementation of elementary functions uses fixed-point arithmetic and is self-contained, fast, small, and written in entirely portable C.

Vendors A and B both use approximations that run quickly on processors with floating-point units. But both vendors take those algorithms and recompile them unchanged for processors without a floating-point unit, incurring software emulation for the missing instructions. The result is slow code that drags in many support routines, as is evident from the table above.

The slow code from Vendors A and B is a consequence of the repeated packing and unpacking of floating-point values.

Each software floating-point operation unpacks the value into its constituent parts, checks whether special handling is required (zero, infinity, and not-a-number are special), operates on the normal values, and then repacks the result, only for the next operation to undo all that good work and immediately unpack all over again…
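
As a minimal sketch of what that unpacking step involves (the type and function names here are illustrative assumptions, not taken from any particular library), consider an IEEE-754 single:

#include <stdint.h>

/* Illustrative only: split an IEEE-754 single into its constituent parts. */
typedef struct {
  uint32_t Sign;         /* 0 = positive, 1 = negative                     */
  int32_t  Exponent;     /* biased exponent, 0..255                        */
  uint32_t Significand;  /* 23 fraction bits, implicit leading 1 not added */
} UNPACKED_FLOAT;

static UNPACKED_FLOAT UnpackFloat(uint32_t Bits) {
  UNPACKED_FLOAT F;
  F.Sign        = Bits >> 31;
  F.Exponent    = (int32_t)((Bits >> 23) & 0xFF);
  F.Significand = Bits & 0x007FFFFF;
  /* Exponent 0xFF flags infinity or NaN, exponent 0 flags zero or a
     subnormal; a general-purpose emulator must test for these cases on
     every operation before it can do any real work.                    */
  return F;
}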

The SEGGER library has special code for processors without floating-point units that still need small, fast elementary functions. And this is where we deliver our significant advantage: we know how to do this, and we don’t take an off-the-shelf solution.

SEGGER avoids repeated packing and unpacking of data between operations: data are unpacked and packed exactly once. It is impossible for the approximation to generate infinities or NaNs during computation, so special-case tests are never required.

So, the approximation “Simply works!” and life continues.

The nuts and bolts

I’ll skip the details of how I (and others) choose the ranges over which to fit a polynomial approximation; this may involve multiple polynomials over different ranges (which is the case for SEGGER’s implementation of logf(), for instance). I will also skip over how these ranges are stitched together to ensure that the overall approximation is continuous across them, and how to ensure correct rounding and monotonicity: you can find details in many books on the subject.

Instead, let’s focus on the integer capabilities of today’s 32-bit microcontrollers. Both the Arm and RISC-V architectures can multiply two 32-bit values to produce a 64-bit product, and do so quickly. We can turn this capability into a way to multiply fixed-point values quickly, with an arbitrary position for the radix point (requiring an optional shift after the multiply to realign the product).

We break a floating-point number, which has a combined sign, exponent, and significand, into its constituent pieces and operate on each separately. A multiply, for instance, would simply add the separate exponents, exclusive-or the signs, and use the fast integer multiply instruction to multiply the significands and realign the product.
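
A hedged sketch of that idea follows (this is not SEGGER’s code; rounding and the special cases are deliberately ignored, and the significands are held as unsigned Q1.31 values in [1.0, 2.0)):

#include <stdint.h>

/* Illustrative only: multiply two unpacked floats piece by piece. */
typedef struct {
  uint32_t Sign;         /* 0 or 1                        */
  int32_t  Exponent;     /* unbiased exponent             */
  uint32_t Significand;  /* unsigned Q1.31, in [1.0, 2.0) */
} UNPACKED;

static UNPACKED Multiply(UNPACKED a, UNPACKED b) {
  UNPACKED r;
  uint64_t Product;
  r.Sign     = a.Sign ^ b.Sign;                          /* exclusive-or the signs  */
  r.Exponent = a.Exponent + b.Exponent;                  /* add the exponents       */
  Product    = (uint64_t)a.Significand * b.Significand;  /* Q2.62 product in [1, 4) */
  if (Product >> 63) {                                   /* product >= 2.0?         */
    Product >>= 1;                                       /* realign the product...  */
    r.Exponent += 1;                                     /* ...adjust the exponent  */
  }
  r.Significand = (uint32_t)(Product >> 31);             /* back to Q1.31, truncated */
  return r;
}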

We can realize the multiplication of significands by creating a macro that wraps a 32×32 signed multiply and delivers only the high 32 bits of the product using pure C:

#define LIBC_SMULL_HI(x0, x1) \
  (long)(((long long)(long)(x0) * (long long)(long)(x1)) >> 32)

This provides some syntactic sugar when reading the code.  Let’s compile an application of this macro:

volatile long x, y, z;

void mult(void) {
  z = LIBC_SMULL_HI(x, y);
}

Here’s the code for Cortex-M:

    4B04        ldr r3, [pc, #x]
    681A        ldr r2, [r3]
    4B04        ldr r3, [pc, #y]
    681B        ldr r3, [r3]
    FB822303    smull r2, r3, r2, r3
    4A03        ldr r2, [pc, #z]
    6013        str r3, [r2]
    4770        bx lr

This produces just what we intended: a signed multiply of r2 by r3, placing the 64-bit result in the register pair r3:r2 with r3 containing the high 32 bits; the low 32 bits in r2 are never used.

On RISC-V, the story is the same…

    000007B7    lui a5, x
    0007A783    lw a5, 0(a5)
    00000737    lui a4, y
    00072703    lw a4, 0(a4)
    02E797B3    mulh a5, a5, a4
    00000737    lui a4, z
    00F72023    sw a5, 0(a4)
    8082        ret

…except there is a dedicated instruction, MULH, to deliver only the high 32 bits of the product, so we don’t need to burn a register for the low 32 bits. Nice!

This is the core of the fixed-point approximation that the SEGGER Runtime Library deploys. And it has a great advantage for us: the fractional part, the significand, is represented in Q1.31 or Q2.30 format (depending upon the approximation), with additional guard bits that the purely-float approximation does not have. The purely-float approximation is constrained to use only the precision of the underlying floating-point unit or emulator, i.e. it is limited to 24 bits.
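
As a quick, hedged sanity check of that 24-bit limit (this tiny program is illustrative only and not part of the library):

#include <float.h>
#include <stdio.h>

int main(void) {
  printf("float significand bits: %d\n", FLT_MANT_DIG);  /* 24 for IEEE-754 single precision */
  printf("Q1.31 fraction bits   : %d\n", 31);
  printf("Q2.30 fraction bits   : %d\n", 30);
  return 0;
}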

So, how do we find an appropriate approximation?

The Remez algorithm

If you’ve ever taken a course on numerical methods, you will probably have bumped into Taylor and Chebyshev, but possibly not Remez. The Remez algorithm enables us to minimize the maximum error between our polynomial approximation and the true value of the function. Implementing the Remez algorithm is somewhat tricky: it is iterative, roughly doubling the accuracy of the approximation at each step, but requires high-precision arithmetic to perform correctly.
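
In symbols, this is the textbook minimax problem (nothing here is specific to SEGGER’s library): for a function f on an interval [a, b] and a chosen polynomial degree n, find the polynomial p that achieves

    \min_{\deg p \le n} \; \max_{x \in [a, b]} \lvert f(x) - p(x) \rvert

The optimal p is characterized by its error f - p equioscillating at n + 2 points of [a, b], and the Remez iteration adjusts its sample points until that equioscillation is reached.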

Fortunately, both Sollya and Mathematica have implementations of the Remez algorithm that make it easy to play around and construct approximations with known error bounds, ensuring you can achieve 0.5 ulp (or better) accuracy.

Once you have chosen your approximation, implementation is really no more than converting it to Horner form and multiplying the significands, taking care to maintain the radix point (to retain precision and also to prevent fixed-point overflow).

For instance, here is the core of the SEGGER approximation used in expf():

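/* Horner evaluation in fixed point: at each step the previous partial result
   is shifted to realign the radix point, the next coefficient is added, and
   the sum is multiplied by the fixed-point argument i1, keeping only the high
   32 bits of the product. */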
i0 = LIBC_SMULL_HI(i1, CONST_EXPF_0);
i0 = LIBC_SMULL_HI(i1, CONST_EXPF_1 + ((long)i0 >> 2));
i0 = LIBC_SMULL_HI(i1, CONST_EXPF_2 + ((long)i0 >> 2));
i0 = LIBC_SMULL_HI(i1, CONST_EXPF_3 + ((long)i0 >> 2));
i0 = LIBC_SMULL_HI(i1, CONST_EXPF_4 + ((long)i0 >> 2));

It produces very compact code. Below is the resulting code on a Cortex-M3 (produced by GCC or LLVM/Clang):

    4A1F        ldr r2, [pc, #0x7C]
    FB832102    smull r2, r1, r3, r2
    4A1F        ldr r2, [pc, #0x7C]
    EB0202A1    add.w r2, r2, r1, asr #2
    FB832102    smull r2, r1, r3, r2
    4A1D        ldr r2, [pc, #0x74]
    EB0202A1    add.w r2, r2, r1, asr #2
    FB832102    smull r2, r1, r3, r2
    4A1C        ldr r2, [pc, #0x70]
    EB0202A1    add.w r2, r2, r1, asr #2
    FB832102    smull r2, r1, r3, r2
    4A1A        ldr r2, [pc, #0x68]
    EB0202A1    add.w r2, r2, r1, asr #2
    FB833202    smull r3, r2, r3, r2

In fact, the 32×32 to 64-bit multiply is used in many other places in the SEGGER Runtime Library, for instance to accurately range-reduce an input modulo 2π in extended precision for sinf(), or to divide by a constant such as ln(2) by multiplying by its reciprocal (again in extended precision in order to preserve accuracy).
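
As a hedged illustration of the reciprocal trick (the constant, the Q formats, and the function name below are illustrative assumptions, not SEGGER’s actual code, and the real library carries extra precision through this step), dividing a Q1.31 value by ln(2) reduces to a single high-word multiply:

/* The high-word multiply macro from earlier, repeated so this snippet is
   self-contained.                                                        */
#define LIBC_SMULL_HI(x0, x1) \
  (long)(((long long)(long)(x0) * (long long)(long)(x1)) >> 32)

/* 1/ln(2) ~ 1.442695, held as a Q2.30 constant computed at compile time. */
#define INV_LN2_Q2_30  ((long)(1.4426950408889634 * (1L << 30) + 0.5))

/* x is Q1.31; the full product is Q3.61, so its high 32 bits are Q3.29. */
long DivideByLn2(long xQ1_31) {
  return LIBC_SMULL_HI(xQ1_31, INV_LN2_Q2_30);
}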

Verifying our implementation

It never hurts to have additional confidence in your implementation. Although Sollya produces an approximation to our requirements with stated error bounds, it could be that we introduce an unintended defect when translating the approximation into fixed-point code. With today’s processors, exhaustive testing over all 2^32 possible inputs is within the reach of even a low-end laptop. It takes so little time to run the fixed-point implementation against a reference implementation, such as the Intel floating-point unit in double precision, that this can be done after any change to retain confidence in the code.

A simplified version of exhaustive testing, put together as an example, is something like this:

#include <math.h>
#include <stdio.h>
#include <time.h>

typedef union {        /* a 32-bit float overlaid with its bit pattern */
  float    f;
  unsigned l;
} __float32_t;

float SEGGER_expf(float x);   /* the implementation under test */

int main(void) {
  unsigned u;
  time_t   T0;
  time_t   T1;
  char     acBuf[64];

  time(&T0);
  u = 0x00000000;
  do {
    __float32_t xx;
    __float32_t r1;
    __float32_t r2;
    int cx;
    int cy;
    xx.l = u;
    r1.f = SEGGER_expf(xx.f);
    r2.f = (float)exp(xx.f);
    cx = fpclassify(r1.f);
    cy = fpclassify(r2.f);
    if (cy == FP_SUBNORMAL) {                         /* treat a subnormal reference result as zero */
      cy = FP_ZERO;
      r2.f = 0;
    }
    if (cx != cy) {
      printf("Divergence @ %08x %f :: classification conflict\n", xx.l, xx.f);
    } else if (cx == FP_NORMAL && r1.l-r2.l+1 > 1+1) {  /* bit patterns differ by more than 1 ulp */
      printf("Divergence @ %08x %f :: ", xx.l, xx.f);
      xx.f = SEGGER_expf(xx.f);
      printf("%08x %f ", xx.l, xx.f);
      xx.l = u;
      xx.f = (float)exp(xx.f);   /* recompute the reference value */
      printf(" : %08x %f \n", xx.l, xx.f);
    }
    if ((u & 0xFFFFFFF) == 0) {
      time(&T1);
      T1 = difftime(T1, T0);
      strftime(acBuf, sizeof(acBuf), "%X", gmtime(&T1));
      printf("%08x   %s\n", u, acBuf);
    }
    ++u;
  } while (u != 0);
  time(&T1);
  T1 = difftime(T1, T0);
  strftime(acBuf, sizeof(acBuf), "%X", gmtime(&T1));
  printf("Completed in %s\n", acBuf);
  getchar();
  return 0;
}

Running this against the SEGGER implementation of expf() on a MacBook Pro reveals no errors:

00000000   00:00:00
10000000   00:00:03
20000000   00:00:06
30000000   00:00:09
40000000   00:00:15
50000000   00:00:26
60000000   00:00:38
70000000   00:00:51
80000000   00:01:03
90000000   00:01:06
a0000000   00:01:09
b0000000   00:01:13
c0000000   00:01:18
d0000000   00:01:26
e0000000   00:01:33
f0000000   00:01:40
Completed: 00:01:48

Conclusion

Writing a fixed-point approximation that is general, small, and fast is no easy task, but the results show that it is worth it. Machines and software certainly assist in exploring and refining an approximation, but if this were an easy task overall, you would find every C library using this implementation technique.

I think the numbers speak for themselves. SEGGER makes smart choices and delivers outstanding results, making each code byte and instruction cycle count.

I have some more to write regarding mathematics on microprocessors, but that is the subject of a future article.