Floating-point face-off, part 2: Comparing performance

I used to think that floating-point was not for embedded systems: too slow, too much code overhead, and rounding is always a problem.
It turns out that while scaled integers still have a performance benefit, floating-point computations can be done with surprisingly high performance these days on modern embedded CPUs. This is true not only for CPUs with a floating-point unit (FPU), such as the Cortex-M4F, but also for CPUs which have to do this in software, such as a regular Cortex-M3 or M4 without FPU.

But first things first, let’s look at how things work:

Basic floating-point operations

Let’s look at a function multiplying 2 integers:

int Mul(int a, int b) {
  return a * b;
}

The generated code (in Thumb-2 instructions) can be as simple as this:

  FB01F000    mul r0, r1, r0
  4770        bx lr

For the multiplication of two floats the code is quite similar when the FPU is used:

float FMul(float a, float b) {
  return a * b;
}

  EE200A20    vmul.f32 s0, s0, s1
  4770        bx lr

The speed is similar, too: both mul and vmul.f32 take just a few cycles.

What happens without FPU?

The good news is: Nothing changes from an application programmer’s perspective. The C code does not need to be modified at all. We only tell the compiler that no FPU is available.

Let’s look at the output of FMul() again:

  B508        push   {r3, lr}
  F000FB20    bl     __aeabi_fmul
  BD08        pop    {r3, pc}

Without an FPU, the compiler cannot use the floating-point multiplication instruction. Instead we can see that it now adds a call to a function __aeabi_fmul to do the work.

Implicit floating-point functions

The ARM Run-time ABI defines implicit floating-point functions, 38 of which are in active use. The compiler can and typically will add calls to these functions whenever it is told to not use an FPU, i.e. the floating-point ABI type is set to “soft”, and the application program needs to perform floating-point operations, such as add, subtract, multiply, divide, or compare.
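As a rough illustration of where these calls come from, the sketch below annotates a few ordinary C expressions with the helper a compiler would typically call in a soft-float build. The function names Scale, IsLess and Widen are made up for this example. The __aeabi_ names are the ones listed in this article, but which calls are actually emitted depends on the compiler and its optimization settings.

```c
// Hypothetical functions, for illustration only.
// Each comment names the run-time ABI routine a compiler typically calls
// when building for a target without FPU (floating-point ABI "soft").

float Scale(float x, int n) {
  return x * (float)n;   // __aeabi_i2f for the conversion, __aeabi_fmul for the product
}

int IsLess(float a, float b) {
  return a < b;          // __aeabi_fcmplt
}

double Widen(float x) {
  return (double)x;      // __aeabi_f2d
}
```

On a CPU with FPU the same source compiles to FPU instructions instead, which is why the C code itself never has to change.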

These functions perform the floating-point operation in software, using the available integer-only instructions. An implementation needs to be provided to the tool-chain, usually as part of the runtime library.
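To give an idea of what "in software" means, here is a strongly simplified sketch of a single-precision multiplication that uses integer operations only. It is not the code of any of the libraries discussed here. The names SoftMulBits and SoftMul are hypothetical, and the sketch assumes normal, finite, non-zero inputs and a normal result, so it leaves out everything a real implementation must handle in addition: zero, subnormals, infinity, NaN, overflow and underflow.

```c
#include <stdint.h>
#include <string.h>

// Multiply two binary32 bit patterns using integer arithmetic only.
// Assumption: both inputs and the result are normal numbers.
static uint32_t SoftMulBits(uint32_t a, uint32_t b) {
  uint32_t Sign  = (a ^ b) & 0x80000000u;                     // Sign of the product
  int32_t  Exp   = (int32_t)((a >> 23) & 0xFF)
                 + (int32_t)((b >> 23) & 0xFF) - 127;         // Add exponents, remove one bias
  uint64_t ManA  = (a & 0x007FFFFFu) | 0x00800000u;           // 24-bit significand with implicit 1
  uint64_t ManB  = (b & 0x007FFFFFu) | 0x00800000u;
  uint64_t Prod  = ManA * ManB;                               // 48-bit product in [2^46, 2^48)
  int      Shift = 23;
  uint64_t Man;
  uint64_t Rem;
  uint64_t Half;

  if (Prod & (1ULL << 47)) {     // Product significand is >= 2.0: normalize
    Shift = 24;
    Exp++;
  }
  Man  = Prod >> Shift;                      // Upper 24 bits
  Rem  = Prod & ((1ULL << Shift) - 1);       // Discarded lower bits
  Half = 1ULL << (Shift - 1);
  if (Rem > Half || (Rem == Half && (Man & 1))) {   // Round to nearest, ties to even
    Man++;
    if (Man == (1ULL << 24)) {               // Rounding carried into a 25th bit
      Man >>= 1;
      Exp++;
    }
  }
  return Sign | ((uint32_t)Exp << 23) | ((uint32_t)Man & 0x007FFFFFu);
}

// Wrapper operating on float values; memcpy is used to access the bit patterns.
float SoftMul(float a, float b) {
  uint32_t ua;
  uint32_t ub;
  uint32_t ur;
  float    r;

  memcpy(&ua, &a, sizeof(ua));
  memcpy(&ub, &b, sizeof(ub));
  ur = SoftMulBits(ua, ub);
  memcpy(&r, &ur, sizeof(r));
  return r;
}
```

Even in this reduced form the work is visible: unpack, multiply, normalize, round, repack. An optimized assembly implementation does the same steps plus all special cases.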

Floating-point benchmark

We were curious to find out more about the performance of such software floating-point operations and compared some implementations:

  • SEGGER RunTime Library, ASM-implementation
  • Competitor A
  • Competitor B, fast library
  • Competitor B, small library (default)
  • SEGGER RunTime Library, C-implementation
  • GNU Arm Embedded, Libgcc

How we test

We called all 38 implicit functions multiple times with different parameters, to execute most, if not all, paths of the implementations. For each function we computed the average over all of its calls, then added up the average execution times of all functions to summarize the performance of a library in a single value. For the comparison, we have used the SEGGER RunTime Library ASM implementation as the 100% reference.
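The scoring described above can be sketched in a few lines. This is an illustration of the arithmetic only, not the evaluation code used for the results; the function names AverageCycles, TotalScore and RelativePercent are assumptions made for this example.

```c
#include <stddef.h>

// Average cycle count over all measured calls of one function.
double AverageCycles(const unsigned *paCycles, size_t NumCalls) {
  double Sum = 0.0;
  size_t i;

  for (i = 0; i < NumCalls; i++) {
    Sum += (double)paCycles[i];
  }
  return (NumCalls > 0) ? Sum / (double)NumCalls : 0.0;
}

// Sum of the per-function averages: one value summarizing a library.
double TotalScore(const double *paAverage, size_t NumFunctions) {
  double Sum = 0.0;
  size_t i;

  for (i = 0; i < NumFunctions; i++) {
    Sum += paAverage[i];
  }
  return Sum;
}

// Score of a library relative to the reference library, in percent.
// The reference itself yields 100%; larger values mean slower code.
double RelativePercent(double Score, double ReferenceScore) {
  return 100.0 * Score / ReferenceScore;
}
```

Because every function contributes its average with equal weight, an expensive function such as a double division influences the total more than a cheap compare.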

The code has been generated for Cortex-M4 (Arm architecture v7EM) and runs on an NXP Kinetis K66. All code is executed from RAM to eliminate degradation and variations in execution speed due to caching.

To accurately measure the execution time, the Cortex-M cycle counter, built into the CPU, has been used.
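For readers who want to reproduce such measurements, the following sketch shows one common way to enable and use the cycle counter on a Cortex-M3 or M4. The benchmark source in this article only reads DWT_CYCCNT, so the enable sequence here is an assumption about the setup, not a copy of it. The registers are passed as pointers so the logic can also be tried on a host; the function names are hypothetical.

```c
#include <stdint.h>

// Register addresses on Cortex-M3/M4 (Armv7-M). Only valid on the target.
#define DEMCR_ADDR       0xE000EDFCu   // Debug Exception and Monitor Control Register
#define DWT_CTRL_ADDR    0xE0001000u   // DWT control register
#define DWT_CYCCNT_ADDR  0xE0001004u   // DWT cycle counter

// Enable the cycle counter and reset it to zero.
void CycleCounterEnable(volatile uint32_t *pDEMCR,
                        volatile uint32_t *pCTRL,
                        volatile uint32_t *pCYCCNT) {
  *pDEMCR  |= (1u << 24);   // TRCENA: enable the trace and debug blocks, including DWT
  *pCYCCNT  = 0;            // Start counting from zero
  *pCTRL   |= (1u << 0);    // CYCCNTENA: let the counter run
}

// Cycles between two counter readings.
// Unsigned subtraction is modulo 2^32, so a single wrap-around is handled.
uint32_t CyclesElapsed(uint32_t Start, uint32_t End) {
  return End - Start;
}
```

On the target, the call would be CycleCounterEnable((volatile uint32_t *)DEMCR_ADDR, (volatile uint32_t *)DWT_CTRL_ADDR, (volatile uint32_t *)DWT_CYCCNT_ADDR), followed by one counter read before and one after the code under test.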

Performance of the SEGGER ASM implementation

Group               Function        Average cycles
Float, Math         __aeabi_fadd    31.0
                    __aeabi_fsub    39.9
                    __aeabi_frsub   39.9
                    __aeabi_fmul    26.0
                    __aeabi_fdiv    53.0
Float, Compare      __aeabi_fcmplt  13.0
                    __aeabi_fcmple  13.0
                    __aeabi_fcmpgt  13.0
                    __aeabi_fcmpge  13.0
                    __aeabi_fcmpeq  7.0
Double, Math        __aeabi_dadd    54.5
                    __aeabi_dsub    71.2
                    __aeabi_drsub   71.2
                    __aeabi_dmul    56.4
                    __aeabi_ddiv    134.0
Double, Compare     __aeabi_dcmplt  14.0
                    __aeabi_dcmple  14.0
                    __aeabi_dcmpgt  14.0
                    __aeabi_dcmpge  14.0
                    __aeabi_dcmpeq  14.0
Float, Conversion   __aeabi_f2iz    9.0
                    __aeabi_f2uiz   6.0
                    __aeabi_f2lz    13.5
                    __aeabi_f2ulz   12.0
                    __aeabi_i2f     10.5
                    __aeabi_ui2f    7.5
                    __aeabi_l2f     19.0
                    __aeabi_ul2f    13.8
                    __aeabi_f2d     9.0
Double, Conversion  __aeabi_d2iz    10.0
                    __aeabi_d2uiz   8.0
                    __aeabi_d2lz    16.5
                    __aeabi_d2ulz   13.5
                    __aeabi_i2d     12.0
                    __aeabi_ui2d    8.0
                    __aeabi_l2d     17.9
                    __aeabi_ul2d    12.9
                    __aeabi_d2f     11.0

Comparison with other implementations

                    SEGGER RT Lib  Competitor A  Competitor B  Competitor B  GNU Arm  SEGGER RT Lib
                    (ASM)                        (fast)        (small)       libgcc   (C)
Float, Math         100.0%         95.2%         95.2%         234.0%        179.9%   334.6%
Float, Compare      100.0%         144.1%        186.4%        81.4%         449.2%   142.4%
Double, Math        100.0%         95.6%         86.6%         375.5%        240.3%   410.7%
Double, Compare     100.0%         150.0%        157.1%        190.0%        394.3%   340.0%
Float, Conversion   100.0%         110.6%        121.1%        249.9%        688.2%   638.3%
Double, Conversion  100.0%         111.3%        137.0%        531.1%        714.3%   903.7%
Total               100.0%         106.3%        110.0%        318.0%        358.9%   456.2%

The SEGGER ASM library has been used as reference, making the values in the first column 100%.
Smaller values mean higher performance.

Detailed information on the results is available here: PDF.

Conclusions

It is surprising to see that it is possible to perform IEEE 754 compliant floating-point operations so efficiently in software. With only 26 cycles on average for a multiplication and 31 cycles for an addition, a Cortex-M4 can execute a floating-point computation in a fraction of a μs (for example, a 26-cycle multiplication takes 0.13 μs at 200 MHz). Modern embedded CPUs are capable of performing millions of floating-point operations per second, making floating-point “affordable” even on hardware without FPU. This is true for pretty much any system except maybe those whose primary purpose is number crunching, such as digital filter applications.
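The conversion from cycles to time behind these figures is simple enough to write down. The helper names below are made up for this example, and the sketch ignores everything except the measured cycle count (no wait states, no surrounding load and store instructions), so it gives a best-case estimate only.

```c
// Execution time in microseconds for a given cycle count and CPU clock in Hz.
double CyclesToMicroseconds(double Cycles, double ClockHz) {
  return Cycles * 1e6 / ClockHz;
}

// Upper bound for operations per second if the CPU executed nothing but
// this one operation back to back.
double OperationsPerSecond(double Cycles, double ClockHz) {
  return ClockHz / Cycles;
}
```

With 26 cycles at 200 MHz, CyclesToMicroseconds() yields 0.13 μs and OperationsPerSecond() about 7.7 million multiplications per second.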
However, there are differences in implementations.

All commercial toolchains (SEGGER Embedded Studio as well as competitors A and B) have put a lot of effort into their floating-point code. They all use highly optimized assembly code to perform these operations.
Competitor B also supplies a “code size optimized” library, which is actually default. Beware! While the code is somewhat more compact, the performance of the small library is surprisingly low. We might take a closer look at things in another part of this series.

Libgcc for GNU Arm Embedded is also written in assembler, but is far less optimized. Actually, its performance (about 30% of the performance of the commercial implementations) is quite disappointing.

The only library written in pure C in this test is the SEGGER RunTime Library C variant. Not surprisingly, it is about 4.5 times slower than its brother written in ASM. However, it is almost as fast as the assembly-coded Libgcc. That is quite impressive for a library that can basically be used on any CPU, not just ARM.
Good work, Mr. Curtis! 🙂

The SEGGER RunTime Library is the overall winner, but performance differences are not significant in most cases, simply because competitors A and B have also “done their homework” and produced good code as well.

Benchmark program

For reference, below is the code we used to benchmark the different libraries.

/*********************************************************************
*                   (c) SEGGER Microcontroller GmbH                  *
*                        The Embedded Experts                        *
*                           www.segger.com                           *
**********************************************************************

-------------------------- END-OF-HEADER -----------------------------

File        : bench.c
Purpose     : Benchmark Arm EABI implicit floating-point functions.

*/

/*********************************************************************
*
*       #include section
*
**********************************************************************
*/

#ifdef SEMIHOST
#include "SEGGER_SEMIHOST.h"
#endif
#if !defined (__clang__) || defined(__CC_ARM)
#include <string.h>
#include <stdio.h>
#endif
#include <stdarg.h>

/*********************************************************************
*
*       Defines, configurable
*
**********************************************************************
*/
#define SPECIAL(X)     // Set to X, if specials are required

/*********************************************************************
*
*       Defines, fixed
*
**********************************************************************
*/

#define COUNTOF(X) (sizeof(X) / sizeof(X[0]))

#define DWT_CYCCNT (*(volatile unsigned *)0xE0001004)

#if defined (__clang__) && !defined(__CC_ARM)
#define  memset _MEMSET
#endif

/*********************************************************************
*
*       Types, local
*
**********************************************************************
*/

typedef unsigned long long u64;
typedef unsigned long      u32;

typedef enum {
  MODE_INT_RETURN_FLOAT,
  MODE_INT_RETURN_DOUBLE,
  MODE_LLONG_RETURN_FLOAT,
  MODE_LLONG_RETURN_DOUBLE,
  MODE_FLOAT_RETURN_INT,
  MODE_FLOAT_RETURN_LLONG,
  MODE_FLOAT_RETURN_FLOAT,
  MODE_FLOAT_RETURN_DOUBLE,
  MODE_DOUBLE_RETURN_INT,
  MODE_DOUBLE_RETURN_LLONG,
  MODE_DOUBLE_RETURN_FLOAT,
  MODE_DOUBLE_RETURN_DOUBLE,
  //
  MODE_INT_INT_RETURN_INT,
  MODE_LLONG_LLONG_RETURN_LLONG,
  MODE_FLOAT_FLOAT_RETURN_INT,
  MODE_FLOAT_FLOAT_RETURN_FLOAT,
  MODE_DOUBLE_DOUBLE_RETURN_INT,
  MODE_DOUBLE_DOUBLE_RETURN_DOUBLE,
  MODE_MAX
} ExecMode;

typedef enum {
  SEQUENCE_END,
  SEQUENCE_SPECIAL_F32xF32,
  SEQUENCE_TYPICAL_F32xF32,
  SEQUENCE_SPECIAL_F64xF64,
  SEQUENCE_TYPICAL_F64xF64,
  SEQUENCE_31_INT,
  SEQUENCE_31_FLOAT,
  SEQUENCE_31_DOUBLE,
  SEQUENCE_63_LLONG,
  SEQUENCE_63_FLOAT,
  SEQUENCE_63_DOUBLE,
  SEQUENCE_SIGNED = 1<<8
} SEQUENCE;

typedef void (*VoidFunc)(void);

typedef struct {
  unsigned index;
  unsigned last;
  int      sign;
  SEQUENCE seq;
} ExecSequence;

typedef union {
  float     f;
  double    d;
  int       i;
  long long l;
} ExecValue;

typedef volatile union {
  void               (*pfVoidReturnVoid)           (void);
  float              (*pfIntReturnFloat)           (int);
  double             (*pfIntReturnDouble)          (int);
  float              (*pfLlongReturnFloat)         (long long);
  double             (*pfLlongReturnDouble)        (long long);
  float              (*pfUllongReturnFloat)        (unsigned long long);
  double             (*pfUllongReturnDouble)       (unsigned long long);
  int                (*pfIntIntReturnInt)          (int, int);
  long long          (*pfLlongLlongReturnLlong)    (long long, long long);
  //
  float              (*pfFloatFloatReturnFloat)    (float, float);
  int                (*pfFloatFloatReturnInt)      (float, float);
  //
  double             (*pfDoubleDoubleReturnDouble) (double, double);
  int                (*pfDoubleDoubleReturnInt)    (double, double);
  //
  int                (*pfFloatReturnInt)           (float);
  float              (*pfFloatReturnFloat)         (float);
  long long          (*pfFloatReturnLlong)         (float);
  double             (*pfFloatReturnDouble)        (float);
  //
  int                (*pfDoubleReturnInt)          (double);
  long long          (*pfDoubleReturnLlong)        (double);
  float              (*pfDoubleReturnFloat)        (double);
  double             (*pfDoubleReturnDouble)       (double);
} ExecFunction;

typedef struct {
  ExecMode     Mode;
  ExecFunction Function;
  ExecValue    v0;
  ExecValue    v1;
  int          i;
  int          j;
} ExecContext;

/*********************************************************************
*
*       Prototypes (of benchmarked runtime functions)
*
**********************************************************************
*/

// ARM EABI

int                __aeabi_idiv    (int, int);
long long          __aeabi_ldivmod (long long, long long);
float              __aeabi_fadd    (float, float);
float              __aeabi_fsub    (float, float);
float              __aeabi_frsub   (float, float);
float              __aeabi_fmul    (float, float);
float              __aeabi_fdiv    (float, float);
int                __aeabi_fcmplt  (float, float);
int                __aeabi_fcmple  (float, float);
int                __aeabi_fcmpgt  (float, float);
int                __aeabi_fcmpge  (float, float);
int                __aeabi_fcmpeq  (float, float);
double             __aeabi_dadd    (double, double);
double             __aeabi_dsub    (double, double);
double             __aeabi_drsub   (double, double);
double             __aeabi_dmul    (double, double);
double             __aeabi_ddiv    (double, double);
int                __aeabi_dcmplt  (double, double);
int                __aeabi_dcmple  (double, double);
int                __aeabi_dcmpgt  (double, double);
int                __aeabi_dcmpge  (double, double);
int                __aeabi_dcmpeq  (double, double);
int                __aeabi_f2iz    (float);
unsigned           __aeabi_f2uiz   (float);
long long          __aeabi_f2lz    (float);
unsigned long long __aeabi_f2ulz   (float);
float              __aeabi_i2f     (int);
float              __aeabi_ui2f    (unsigned);
float              __aeabi_l2f     (long long);
float              __aeabi_ul2f    (unsigned long long);
int                __aeabi_d2iz    (double);
long long          __aeabi_d2lz    (double);
unsigned           __aeabi_d2uiz   (double);
unsigned long long __aeabi_d2ulz   (double);
double             __aeabi_i2d     (int);
double             __aeabi_ui2d    (unsigned);
double             __aeabi_l2d     (long long);
double             __aeabi_ul2d    (unsigned long long);
double             __aeabi_f2d     (float);
float              __aeabi_d2f     (double);

// GNU API
float              __addsf3        (float, float);
float              __subsf3        (float, float);
float              __mulsf3        (float, float);
float              __divsf3        (float, float);
float              __ltsf2         (float, float);
float              __lesf2         (float, float);
float              __gtsf2         (float, float);
float              __gesf2         (float, float);
float              __eqsf2         (float, float);
float              __nesf2         (float, float);
double             __adddf3        (double, double);
double             __subdf3        (double, double);
double             __muldf3        (double, double);
double             __divdf3        (double, double);
double             __ltdf2         (double, double);
double             __ledf2         (double, double);
double             __gtdf2         (double, double);
double             __gedf2         (double, double);
double             __eqdf2         (double, double);
double             __nedf2         (double, double);

/*********************************************************************
*
*       Static data, const
*
**********************************************************************
*/

// binary32 special values
static u32 _aFloatSpecials[] = {
  0x00000000,  // +0
  0x80000000,  // -0
  0x7F800000,  // +Inf
  0xFF800000,  // -Inf
  0x7FC00000,  // NaN
  0xFFC00000,  // NaN
};

// Random floats derived from quantum randomness (https://qrng.anu.edu.au)
static float _aFloatRandomUniformDistribution1[] = {
  0.7885449723,
  0.9998094715,
  0.3876576724,
  0.8356841958,
  0.3148936939,
  0.9970710786,
  0.8235131486,
  0.3335833366,
  0.1948718644,
  0.8166663091,
  0.1650510733,
  0.3968966721,
  0.3638974189,
  0.9667957495,
  0.3121612214,
  0.9223421130,
  0.7188766282,
  0.2825422601,
  0.0383919030,
  0.5764071341,
  0.4114595256,
  0.4700649972,
  0.8002487955,
  0.3655678094,
  0.6008792749,
  0.4053804503,
  0.3819831959,
  0.7347183835,
  0.4479462250,
  0.3401285649,
  0.0707507148,
  0.4984719161,
  0.3409999091,
  0.8548396639,
  0.5045839402,
  0.7739178709,
  0.0983707712,
  0.5618592840,
  0.1426608492,
  0.5289642164,
  0.1578932915,
  0.9081336126,
  0.4058290755,
  0.8012231669,
  0.8389891772,
  0.0952707962,
  0.4920716871,
  0.3719829386,
  0.0144001994,
  0.7667299990,
  0.6203624231,
  0.7813631283,
  0.6673019642,
  0.7618224988,
  0.6041512158,
  0.8233172946,
  0.6591242263,
  0.6219177115,
  0.6990491696,
  0.5953941475,
  0.7233279722,
  0.3609917109,
  0.1769333638,
  0.1089936333
};

// Random floats derived from quantum randomness (https://qrng.anu.edu.au)
static float _aFloatRandomUniformDistribution2[] = {
  0.0422564714,
  0.7728131769,
  0.8620072105,
  0.8170243470,
  0.9945166426,
  0.1984626113,
  0.9276007395,
  0.5248677401,
  0.7048731442,
  0.7535610915,
  0.5463053182,
  0.6054137050,
  0.2593339109,
  0.3244756924,
  0.4028105685,
  0.2196438660,
  0.2756980496,
  0.4626345033,
  0.0841498048,
  0.4801435920,
  0.3151815446,
  0.5968274530,
  0.6534962360,
  0.6365893527,
  0.1284928145,
  0.6721899283,
  0.6016264597,
  0.7256847994,
  0.5143220404,
  0.8687852838,
  0.1344069993,
  0.4294689739,
  0.1108499650,
  0.8959778614,
  0.6813699648,
  0.7632335353,
  0.1046082104,
  0.3226924169,
  0.9592376359,
  0.8123961553,
  0.8210336750,
  0.5806060940,
  0.8104785465,
  0.3776035579,
  0.4898308927,
  0.3280951809,
  0.1899302640,
  0.7083087792,
  0.3979903829,
  0.0754221734,
  0.2727227594,
  0.0049476867,
  0.3373789961,
  0.3441676357,
  0.6555256263,
  0.8512584435,
  0.2644237446,
  0.2510367962,
  0.7095772772,
  0.6422276897,
  0.3595680716,
  0.7666331518,
  0.7823634977,
  0.9986928948
};

// Random doubles derived from quantum randomness (https://qrng.anu.edu.au)
static double _aDoubleRandomUniformDistribution1[] = {
  0.62598670877017040467,
  0.49248291507389259323,
  0.02726059443179837415,
  0.52383376114815388239,
  0.94881962108914826333,
  0.23945969797938011460,
  0.22132856465995987511,
  0.40164002160057182308,
  0.02558438713688386477,
  0.12523811179432791317,
  0.67056860624381301735,
  0.05494466839729881311,
  0.15128037511840960857,
  0.93290446929135529390,
  0.51819119451781587437,
  0.56565829405943493592,
  0.89639821540508221456,
  0.48541199928732648388,
  0.08836267574199602456,
  0.24251967550090505148,
  0.16586885359007352595,
  0.48961907867477528217,
  0.82618915609454883542,
  0.73600718852549053876,
  0.87066246033524769869,
  0.86020591848752893062,
  0.85699897202914135194,
  0.11452935695167901460,
  0.41303841463037521702,
  0.80951287799563916322,
  0.75378633773898919971,
  0.49633766682999376297,
  0.98545748812484449544,
  0.34260954016749222648,
  0.56915335626507813411,
  0.85065987223630355238,
  0.29075114535898746357,
  0.24604485121860453263,
  0.70681987573003776796,
  0.23564755356683848740,
  0.19445599747538142750,
  0.26612471807353255859,
  0.26043225424381303005,
  0.00087537885780199165,
  0.57611537016977388272,
  0.21274132250868999946,
  0.68576149410247520150,
  0.53597164019987906463,
  0.80723091306137133570,
  0.48431508160461319525,
  0.05117989159074980911,
  0.22820212900191732869,
  0.00323988328678565153,
  0.28633918134445096158,
  0.61724704767476312226,
  0.86797895493611381017,
  0.40851001880412455855,
  0.04568938942160463537,
  0.05128283614073092389,
  0.45920412605629752877,
  0.96756956301432592105,
  0.91365827487144381776,
  0.44010767338302752699,
  0.08153736749748720152
};

// Random doubles derived from quantum randomness (https://qrng.anu.edu.au)
static double _aDoubleRandomUniformDistribution2[] = {
  0.13791062199848394430,
  0.30853562390856629686,
  0.47255436807785749811,
  0.76494137047912110909,
  0.85737237712826384305,
  0.50837580073361976443,
  0.09879071225648270259,
  0.37142974335787939576,
  0.89382622662737115497,
  0.11034956939165642209,
  0.95260237469393842878,
  0.32369926555136014278,
  0.70240408851025394699,
  0.95193126496005132973,
  0.29833512067425684597,
  0.86891023377616471572,
  0.65753247170754614536,
  0.15021233235108470092,
  0.51993705151156395899,
  0.95605688170461269955,
  0.78399271749907696931,
  0.35253001866313723186,
  0.27301178262116164802,
  0.96813863725529664873,
  0.97590087719427336527,
  0.12411551533291666025,
  0.02730357846216959856,
  0.21329428836053378704,
  0.66554383622626407541,
  0.76125224975509662526,
  0.55864211173109079551,
  0.67051126395572986824,
  0.66246756407151419597,
  0.97890897008569255528,
  0.05455944378960619263,
  0.86547464045876951401,
  0.43622915074551305806,
  0.98726021620151813075,
  0.81792085362753240587,
  0.31793107375168283805,
  0.06057444961449573412,
  0.03432623241446031505,
  0.29130429676615284313,
  0.91094642214997097136,
  0.55970045530181872927,
  0.12220353062299554107,
  0.64201703938238601704,
  0.75006836925733483649,
  0.24034675143166997700,
  0.17493414521907372445,
  0.89125908767524068203,
  0.57105276386594203357,
  0.53693818935585752770,
  0.43389582086499787074,
  0.82302863999955747663,
  0.08636985155486539717,
  0.28425740151320795268,
  0.78755776097024850001,
  0.90255131206116015747,
  0.13030404250052839406,
  0.80136503668073546207,
  0.80110538175970802405,
  0.59736931252456376089,
  0.28568214281583553721
};

// binary64 special values
static u64 _aDoubleSpecials[] = {
  0x0000000000000000uLL,  // +0
  0x8000000000000000uLL,  // -0
  0x7FF0000000000000uLL,  // +Inf
  0xFFF0000000000000uLL,  // -Inf
  0x7FF8000000000000uLL,  // NaN
  0xFFF8000000000000uLL,  // NaN
};

// Random floats with magnitudes [1..2^31]
static float _aFloat31[] = {
  0.8620072105 * (1uLL<<1),
  0.8170243470 * (1uLL<<2),
  0.9945166426 * (1uLL<<3),
  0.9276007395 * (1uLL<<4),
  0.5248677401 * (1uLL<<5),
  0.7048731442 * (1uLL<<6),
  0.7535610915 * (1uLL<<7),
  0.5463053182 * (1uLL<<8),
  0.6054137050 * (1uLL<<9),
  0.5968274530 * (1uLL<<10),
  0.6534962360 * (1uLL<<11),
  0.6365893527 * (1uLL<<12),
  0.6721899283 * (1uLL<<13),
  0.6016264597 * (1uLL<<14),
  0.7256847994 * (1uLL<<15),
  0.5143220404 * (1uLL<<16),
  0.8687852838 * (1uLL<<17),
  0.8959778614 * (1uLL<<18),
  0.6813699648 * (1uLL<<19),
  0.7632335353 * (1uLL<<20),
  0.9592376359 * (1uLL<<21),
  0.8123961553 * (1uLL<<22),
  0.8210336750 * (1uLL<<23),
  0.5806060940 * (1uLL<<24),
  0.8104785465 * (1uLL<<25),
  0.7083087792 * (1uLL<<26),
  0.6555256263 * (1uLL<<27),
  0.8512584435 * (1uLL<<28),
  0.7095772772 * (1uLL<<29),
  0.6422276897 * (1uLL<<30),
  0.7666331518 * (1uLL<<31),
};

// Random floats with magnitudes [1..2^63]
static float _aFloat63[] = {
  +0.8620072105 * (1uLL<<1),
  +0.8170243470 * (1uLL<<2),
  +0.9945166426 * (1uLL<<3),
  +0.9276007395 * (1uLL<<4),
  +0.5248677401 * (1uLL<<5),
  +0.7048731442 * (1uLL<<6),
  +0.7535610915 * (1uLL<<7),
  +0.5463053182 * (1uLL<<8),
  +0.6054137050 * (1uLL<<9),
  +0.5968274530 * (1uLL<<10),
  +0.6534962360 * (1uLL<<11),
  +0.6365893527 * (1uLL<<12),
  +0.6721899283 * (1uLL<<13),
  +0.6016264597 * (1uLL<<14),
  +0.7256847994 * (1uLL<<15),
  +0.5143220404 * (1uLL<<16),
  +0.8687852838 * (1uLL<<17),
  +0.8959778614 * (1uLL<<18),
  +0.6813699648 * (1uLL<<19),
  +0.7632335353 * (1uLL<<20),
  +0.9592376359 * (1uLL<<21),
  +0.8123961553 * (1uLL<<22),
  +0.8210336750 * (1uLL<<23),
  +0.5806060940 * (1uLL<<24),
  +0.8104785465 * (1uLL<<25),
  +0.7083087792 * (1uLL<<26),
  +0.6555256263 * (1uLL<<27),
  +0.8512584435 * (1uLL<<28),
  +0.7095772772 * (1uLL<<29),
  +0.7728131769 * (1uLL<<30),
  +0.8620072105 * (1uLL<<31),
  +0.8170243470 * (1uLL<<32),
  +0.9945166426 * (1uLL<<33),
  +0.9276007395 * (1uLL<<34),
  +0.5248677401 * (1uLL<<35),
  +0.7048731442 * (1uLL<<36),
  +0.7535610915 * (1uLL<<37),
  +0.5463053182 * (1uLL<<38),
  +0.6054137050 * (1uLL<<39),
  +0.5968274530 * (1uLL<<40),
  +0.6534962360 * (1uLL<<41),
  +0.6365893527 * (1uLL<<42),
  +0.6721899283 * (1uLL<<43),
  +0.6016264597 * (1uLL<<44),
  +0.7256847994 * (1uLL<<45),
  +0.5143220404 * (1uLL<<46),
  +0.8687852838 * (1uLL<<47),
  +0.8959778614 * (1uLL<<48),
  +0.6813699648 * (1uLL<<49),
  +0.7632335353 * (1uLL<<50),
  +0.9592376359 * (1uLL<<51),
  +0.8123961553 * (1uLL<<52),
  +0.8210336750 * (1uLL<<53),
  +0.5806060940 * (1uLL<<54),
  +0.8104785465 * (1uLL<<55),
  +0.7083087792 * (1uLL<<56),
  +0.6555256263 * (1uLL<<57),
  +0.8512584435 * (1uLL<<58),
  +0.7095772772 * (1uLL<<59),
  +0.6422276897 * (1uLL<<60),
  +0.7666331518 * (1uLL<<61),
  +0.7728131769 * (1uLL<<62),
  +0.8620072105 * (1uLL<<63),
};

// Random integers with magnitudes [1..2^31]
static int _aInt31[] = {
  (int)(0.8620072105 * (1uLL<<1)),
  (int)(0.8170243470 * (1uLL<<2)),
  (int)(0.9945166426 * (1uLL<<3)),
  (int)(0.9276007395 * (1uLL<<4)),
  (int)(0.5248677401 * (1uLL<<5)),
  (int)(0.7048731442 * (1uLL<<6)),
  (int)(0.7535610915 * (1uLL<<7)),
  (int)(0.5463053182 * (1uLL<<8)),
  (int)(0.6054137050 * (1uLL<<9)),
  (int)(0.5968274530 * (1uLL<<10)),
  (int)(0.6534962360 * (1uLL<<11)),
  (int)(0.6365893527 * (1uLL<<12)),
  (int)(0.6721899283 * (1uLL<<13)),
  (int)(0.6016264597 * (1uLL<<14)),
  (int)(0.7256847994 * (1uLL<<15)),
  (int)(0.5143220404 * (1uLL<<16)),
  (int)(0.8687852838 * (1uLL<<17)),
  (int)(0.8959778614 * (1uLL<<18)),
  (int)(0.6813699648 * (1uLL<<19)),
  (int)(0.7632335353 * (1uLL<<20)),
  (int)(0.9592376359 * (1uLL<<21)),
  (int)(0.8123961553 * (1uLL<<22)),
  (int)(0.8210336750 * (1uLL<<23)),
  (int)(0.5806060940 * (1uLL<<24)),
  (int)(0.8104785465 * (1uLL<<25)),
  (int)(0.7083087792 * (1uLL<<26)),
  (int)(0.6555256263 * (1uLL<<27)),
  (int)(0.8512584435 * (1uLL<<28)),
  (int)(0.7095772772 * (1uLL<<29)),
  (int)(0.6422276897 * (1uLL<<30)),
  (int)(0.7666331518 * (1uLL<<31)),
};

// Random doubles derived from quantum randomness (https://qrng.anu.edu.au)
static double _aDouble31[] = {
  0.62598670877017040467 * (1uLL<<1),
  0.49248291507389259323 * (1uLL<<2),
  0.02726059443179837415 * (1uLL<<3),
  0.52383376114815388239 * (1uLL<<4),
  0.94881962108914826333 * (1uLL<<5),
  0.23945969797938011460 * (1uLL<<6),
  0.22132856465995987511 * (1uLL<<7),
  0.40164002160057182308 * (1uLL<<8),
  0.02558438713688386477 * (1uLL<<9),
  0.12523811179432791317 * (1uLL<<10),
  0.67056860624381301735 * (1uLL<<11),
  0.05494466839729881311 * (1uLL<<12),
  0.15128037511840960857 * (1uLL<<13),
  0.93290446929135529390 * (1uLL<<14),
  0.51819119451781587437 * (1uLL<<15),
  0.56565829405943493592 * (1uLL<<16),
  0.89639821540508221456 * (1uLL<<17),
  0.48541199928732648388 * (1uLL<<18),
  0.08836267574199602456 * (1uLL<<19),
  0.24251967550090505148 * (1uLL<<20),
  0.16586885359007352595 * (1uLL<<21),
  0.48961907867477528217 * (1uLL<<22),
  0.82618915609454883542 * (1uLL<<23),
  0.73600718852549053876 * (1uLL<<24),
  0.87066246033524769869 * (1uLL<<25),
  0.86020591848752893062 * (1uLL<<26),
  0.85699897202914135194 * (1uLL<<27),
  0.11452935695167901460 * (1uLL<<28),
  0.41303841463037521702 * (1uLL<<29),
  0.80951287799563916322 * (1uLL<<30),
  0.75378633773898919971 * (1uLL<<31),
};

// Random doubles derived from quantum randomness (https://qrng.anu.edu.au)
static double _aDouble63[] = {
  0.62598670877017040467 * (1uLL<<1),
  0.49248291507389259323 * (1uLL<<2),
  0.02726059443179837415 * (1uLL<<3),
  0.52383376114815388239 * (1uLL<<4),
  0.94881962108914826333 * (1uLL<<5),
  0.23945969797938011460 * (1uLL<<6),
  0.22132856465995987511 * (1uLL<<7),
  0.40164002160057182308 * (1uLL<<8),
  0.02558438713688386477 * (1uLL<<9),
  0.12523811179432791317 * (1uLL<<10),
  0.67056860624381301735 * (1uLL<<11),
  0.05494466839729881311 * (1uLL<<12),
  0.15128037511840960857 * (1uLL<<13),
  0.93290446929135529390 * (1uLL<<14),
  0.51819119451781587437 * (1uLL<<15),
  0.56565829405943493592 * (1uLL<<16),
  0.89639821540508221456 * (1uLL<<17),
  0.48541199928732648388 * (1uLL<<18),
  0.08836267574199602456 * (1uLL<<19),
  0.24251967550090505148 * (1uLL<<20),
  0.16586885359007352595 * (1uLL<<21),
  0.48961907867477528217 * (1uLL<<22),
  0.82618915609454883542 * (1uLL<<23),
  0.73600718852549053876 * (1uLL<<24),
  0.87066246033524769869 * (1uLL<<25),
  0.86020591848752893062 * (1uLL<<26),
  0.85699897202914135194 * (1uLL<<27),
  0.11452935695167901460 * (1uLL<<28),
  0.41303841463037521702 * (1uLL<<29),
  0.80951287799563916322 * (1uLL<<30),
  0.75378633773898919971 * (1uLL<<31),
  0.49633766682999376297 * (1uLL<<32),
  0.98545748812484449544 * (1uLL<<33),
  0.34260954016749222648 * (1uLL<<34),
  0.56915335626507813411 * (1uLL<<35),
  0.85065987223630355238 * (1uLL<<36),
  0.29075114535898746357 * (1uLL<<37),
  0.24604485121860453263 * (1uLL<<38),
  0.70681987573003776796 * (1uLL<<39),
  0.23564755356683848740 * (1uLL<<40),
  0.19445599747538142750 * (1uLL<<41),
  0.26612471807353255859 * (1uLL<<42),
  0.26043225424381303005 * (1uLL<<43),
  0.00087537885780199165 * (1uLL<<44),
  0.57611537016977388272 * (1uLL<<45),
  0.21274132250868999946 * (1uLL<<46),
  0.68576149410247520150 * (1uLL<<47),
  0.53597164019987906463 * (1uLL<<48),
  0.80723091306137133570 * (1uLL<<49),
  0.48431508160461319525 * (1uLL<<50),
  0.05117989159074980911 * (1uLL<<51),
  0.22820212900191732869 * (1uLL<<52),
  0.00323988328678565153 * (1uLL<<53),
  0.28633918134445096158 * (1uLL<<54),
  0.61724704767476312226 * (1uLL<<55),
  0.86797895493611381017 * (1uLL<<56),
  0.40851001880412455855 * (1uLL<<57),
  0.04568938942160463537 * (1uLL<<58),
  0.05128283614073092389 * (1uLL<<59),
  0.45920412605629752877 * (1uLL<<60),
  0.96756956301432592105 * (1uLL<<61),
  0.91365827487144381776 * (1uLL<<62),
  0.44010767338302752699 * (1uLL<<63),
};

static long long _aLlong63[] = {
  (long long)(0.59248291507389259323 * (1uLL<<1)),
  (long long)(0.72726059443179837415 * (1uLL<<2)),
  (long long)(0.52383376114815388239 * (1uLL<<3)),
  (long long)(0.94881962108914826333 * (1uLL<<4)),
  (long long)(0.23945969797938011460 * (1uLL<<5)),
  (long long)(0.22132856465995987511 * (1uLL<<6)),
  (long long)(0.40164002160057182308 * (1uLL<<7)),
  (long long)(0.02558438713688386477 * (1uLL<<8)),
  (long long)(0.12523811179432791317 * (1uLL<<9)),
  (long long)(0.67056860624381301735 * (1uLL<<10)),
  (long long)(0.05494466839729881311 * (1uLL<<11)),
  (long long)(0.15128037511840960857 * (1uLL<<12)),
  (long long)(0.93290446929135529390 * (1uLL<<13)),
  (long long)(0.51819119451781587437 * (1uLL<<14)),
  (long long)(0.56565829405943493592 * (1uLL<<15)),
  (long long)(0.89639821540508221456 * (1uLL<<16)),
  (long long)(0.48541199928732648388 * (1uLL<<17)),
  (long long)(0.08836267574199602456 * (1uLL<<18)),
  (long long)(0.24251967550090505148 * (1uLL<<19)),
  (long long)(0.16586885359007352595 * (1uLL<<20)),
  (long long)(0.48961907867477528217 * (1uLL<<21)),
  (long long)(0.82618915609454883542 * (1uLL<<22)),
  (long long)(0.73600718852549053876 * (1uLL<<23)),
  (long long)(0.87066246033524769869 * (1uLL<<24)),
  (long long)(0.86020591848752893062 * (1uLL<<25)),
  (long long)(0.85699897202914135194 * (1uLL<<26)),
  (long long)(0.11452935695167901460 * (1uLL<<27)),
  (long long)(0.41303841463037521702 * (1uLL<<28)),
  (long long)(0.80951287799563916322 * (1uLL<<29)),
  (long long)(0.75378633773898919971 * (1uLL<<30)),
  (long long)(0.49633766682999376297 * (1uLL<<31)),
  (long long)(0.98545748812484449544 * (1uLL<<32)),
  (long long)(0.34260954016749222648 * (1uLL<<33)),
  (long long)(0.56915335626507813411 * (1uLL<<34)),
  (long long)(0.85065987223630355238 * (1uLL<<35)),
  (long long)(0.29075114535898746357 * (1uLL<<36)),
  (long long)(0.24604485121860453263 * (1uLL<<37)),
  (long long)(0.70681987573003776796 * (1uLL<<38)),
  (long long)(0.23564755356683848740 * (1uLL<<39)),
  (long long)(0.19445599747538142750 * (1uLL<<40)),
  (long long)(0.26612471807353255859 * (1uLL<<41)),
  (long long)(0.26043225424381303005 * (1uLL<<42)),
  (long long)(0.00087537885780199165 * (1uLL<<43)),
  (long long)(0.57611537016977388272 * (1uLL<<44)),
  (long long)(0.21274132250868999946 * (1uLL<<45)),
  (long long)(0.68576149410247520150 * (1uLL<<46)),
  (long long)(0.53597164019987906463 * (1uLL<<47)),
  (long long)(0.80723091306137133570 * (1uLL<<48)),
  (long long)(0.48431508160461319525 * (1uLL<<49)),
  (long long)(0.05117989159074980911 * (1uLL<<50)),
  (long long)(0.22820212900191732869 * (1uLL<<51)),
  (long long)(0.00323988328678565153 * (1uLL<<52)),
  (long long)(0.28633918134445096158 * (1uLL<<53)),
  (long long)(0.61724704767476312226 * (1uLL<<54)),
  (long long)(0.86797895493611381017 * (1uLL<<55)),
  (long long)(0.40851001880412455855 * (1uLL<<56)),
  (long long)(0.04568938942160463537 * (1uLL<<57)),
  (long long)(0.05128283614073092389 * (1uLL<<58)),
  (long long)(0.45920412605629752877 * (1uLL<<59)),
  (long long)(0.96756956301432592105 * (1uLL<<60)),
  (long long)(0.91365827487144381776 * (1uLL<<61)),
  (long long)(0.44010767338302752699 * (1uLL<<62)),
  (long long)(0.08153736749748720152 * (1uLL<<63)),
};

/********************************************************************* 
* 
* Static data 
* 
********************************************************************** 
*/ 
static u64 _aOverhead[MODE_MAX]; // Overhead of calling a function that simply returns, per mode. 

/********************************************************************* 
* 
* Local functions 
* 
********************************************************************** 
*/ 
#if defined (__clang__) && !defined(__CC_ARM) 
static void* _MEMSET(void *str, int c, int n) { 
  unsigned char* p; 
  p = (unsigned char*)str; 
  while (n > 0) {
    *p++ = (unsigned char)c;
    n--;
  }

  return str;
}
#endif

/*********************************************************************
*
*       _Logf()
*
*  Function description
*    Log a formatted string via semihosting or toolchain-internal logging.
*/
static void _Logf(const char *sFormat, ...) {
  va_list ap;
  //
  va_start(ap, sFormat);
#ifdef SEMIHOST
  SEGGER_SEMIHOST_Writef(sFormat, &ap);
#else
  vprintf(sFormat, ap);
#endif
  va_end(ap);
}

/*********************************************************************
*
*       _GetTime()
*
*  Function description
*    Get the current time from a performance counter.
*/
static u32 _GetTime(void) {
  return DWT_CYCCNT;
}

/*********************************************************************
*
*       NullVoidReturnVoid()
*
*  Function description
*    Naked function to measure function call overhead.
*/
#if defined(__CC_ARM)
__asm void __attribute__((noinline)) NullVoidReturnVoid(void) {
  bx lr;
}
#elif defined (__ICCARM__) || defined (__SES_ARM) || defined(__GNUC__)
void __attribute__((naked, noinline, section(".fast"))) NullVoidReturnVoid(void) {
  __asm("bx lr");
}
#endif

/*********************************************************************
*
*       _Time()
*
*  Function description
*    Get the time of executing a function.
*/
static u32 __attribute__((noinline, section(".fast"))) _Time(ExecContext *pContext) {
  u32 t0;
  u32 t1;
  //
  t0 = _GetTime();
  switch (pContext->Mode) {
  case MODE_INT_RETURN_FLOAT:            pContext->Function.pfIntReturnFloat          (pContext->v0.i);                 break;
  case MODE_INT_RETURN_DOUBLE:           pContext->Function.pfIntReturnDouble         (pContext->v0.i);                 break;
  case MODE_LLONG_RETURN_FLOAT:          pContext->Function.pfLlongReturnFloat        (pContext->v0.l);                 break;
  case MODE_LLONG_RETURN_DOUBLE:         pContext->Function.pfLlongReturnDouble       (pContext->v0.l);                 break;
  case MODE_FLOAT_RETURN_INT:            pContext->Function.pfFloatReturnInt          (pContext->v0.f);                 break;
  case MODE_FLOAT_RETURN_LLONG:          pContext->Function.pfFloatReturnLlong        (pContext->v0.f);                 break;
  case MODE_FLOAT_RETURN_FLOAT:          pContext->Function.pfFloatReturnFloat        (pContext->v0.f);                 break;
  case MODE_FLOAT_RETURN_DOUBLE:         pContext->Function.pfFloatReturnDouble       (pContext->v0.f);                 break;
  case MODE_DOUBLE_RETURN_INT:           pContext->Function.pfDoubleReturnInt         (pContext->v0.d);                 break;
  case MODE_DOUBLE_RETURN_FLOAT:         pContext->Function.pfDoubleReturnFloat       (pContext->v0.d);                 break;
  case MODE_DOUBLE_RETURN_DOUBLE:        pContext->Function.pfDoubleReturnDouble      (pContext->v0.d);                 break;
  case MODE_DOUBLE_RETURN_LLONG:         pContext->Function.pfDoubleReturnLlong       (pContext->v0.d);                 break;
  case MODE_INT_INT_RETURN_INT:          pContext->Function.pfIntIntReturnInt         (pContext->v0.i, pContext->v1.i); break;
  case MODE_LLONG_LLONG_RETURN_LLONG:    pContext->Function.pfLlongLlongReturnLlong   (pContext->v0.l, pContext->v1.l); break;
  case MODE_FLOAT_FLOAT_RETURN_INT:      pContext->Function.pfFloatFloatReturnInt     (pContext->v0.f, pContext->v1.f); break;
  case MODE_FLOAT_FLOAT_RETURN_FLOAT:    pContext->Function.pfFloatFloatReturnFloat   (pContext->v0.f, pContext->v1.f); break;
  case MODE_DOUBLE_DOUBLE_RETURN_INT:    pContext->Function.pfDoubleDoubleReturnInt   (pContext->v0.d, pContext->v1.d); break;
  case MODE_DOUBLE_DOUBLE_RETURN_DOUBLE: pContext->Function.pfDoubleDoubleReturnDouble(pContext->v0.d, pContext->v1.d); break;
  case MODE_MAX: break;
  }
  t1 = _GetTime();
  return t1 - t0;
}

/*********************************************************************
*
*       _GetSeqName()
*
*  Function description
*    Get the name of a sequence by its id.
*/
static const char * _GetSeqName(unsigned Seq) {
  switch (Seq) {
  case SEQUENCE_SPECIAL_F32xF32:
  case SEQUENCE_SPECIAL_F64xF64:
    return "+-Inf, +-NaN, +-0";
  case SEQUENCE_TYPICAL_F32xF32:
  case SEQUENCE_TYPICAL_F64xF64:
    return "Random distribution over (0, 1), operands differ";
  case SEQUENCE_31_INT:
  case SEQUENCE_31_FLOAT:
  case SEQUENCE_31_DOUBLE:
    return "Random distribution with magnitudes (1..2^31)";
  case SEQUENCE_31_INT | SEQUENCE_SIGNED:
  case SEQUENCE_31_FLOAT | SEQUENCE_SIGNED:
  case SEQUENCE_31_DOUBLE | SEQUENCE_SIGNED:
    return "Random distribution with magnitudes (1..2^31), signed";
  case SEQUENCE_63_LLONG:
  case SEQUENCE_63_FLOAT:
  case SEQUENCE_63_DOUBLE:
    return "Random distribution with magnitudes (1..2^63)";
  case SEQUENCE_63_LLONG | SEQUENCE_SIGNED:
  case SEQUENCE_63_FLOAT | SEQUENCE_SIGNED:
  case SEQUENCE_63_DOUBLE | SEQUENCE_SIGNED:
    return "Random distribution with magnitudes (1..2^63), signed";
  default:
    return "<unknown>";
  }
}

/*********************************************************************
*
*       _InitSeq()
*
*  Function description
*    Initialize a sequence.
*/
static void _InitSeq(ExecSequence *pSeq, SEQUENCE seq) {
  pSeq->seq = seq;
  pSeq->index = 0;
  pSeq->last  = 0;
  pSeq->sign  = 1;
}

/*********************************************************************
*
*       _NextSeq()
*
*  Function description
*    Get and prepare the next sequence.
*/
static int _NextSeq(ExecSequence *pSeq, ExecContext *pCtx) {
  //
  if (pSeq->last) {
    if (pSeq->seq & SEQUENCE_SIGNED) {
      pSeq->sign = -pSeq->sign;
      pSeq->index = 0;
      pSeq->last  = 0;
      if (pSeq->sign == 1) {
        return 0;
      }
    } else {
      return 0;
    }
  }
  //
  switch (pSeq->seq & 0x1f) {
  case SEQUENCE_END:
    pSeq->last = 1;
    break;
    //
  case SEQUENCE_SPECIAL_F32xF32:
    pCtx->v0.i = _aFloatSpecials[pSeq->index % COUNTOF(_aFloatSpecials)];
    pCtx->v1.i = _aFloatSpecials[pSeq->index / COUNTOF(_aFloatSpecials)];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aFloatSpecials)*COUNTOF(_aFloatSpecials);
    break;
    //
  case SEQUENCE_TYPICAL_F32xF32:
    pCtx->v0.f = _aFloatRandomUniformDistribution1[pSeq->index % COUNTOF(_aFloatRandomUniformDistribution1)];
    pCtx->v1.f = _aFloatRandomUniformDistribution2[pSeq->index / COUNTOF(_aFloatRandomUniformDistribution1)];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aFloatRandomUniformDistribution1)*COUNTOF(_aFloatRandomUniformDistribution2);
    break;
    //
  case SEQUENCE_SPECIAL_F64xF64:
    pCtx->v0.l = _aDoubleSpecials[pSeq->index % COUNTOF(_aDoubleSpecials)];
    pCtx->v1.l = _aDoubleSpecials[pSeq->index / COUNTOF(_aDoubleSpecials)];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aDoubleSpecials)*COUNTOF(_aDoubleSpecials);
    break;
    //
  case SEQUENCE_TYPICAL_F64xF64:
    pCtx->v0.d = _aDoubleRandomUniformDistribution1[pSeq->index % COUNTOF(_aDoubleRandomUniformDistribution1)];
    pCtx->v1.d = _aDoubleRandomUniformDistribution2[pSeq->index / COUNTOF(_aDoubleRandomUniformDistribution1)];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aDoubleRandomUniformDistribution1)*COUNTOF(_aDoubleRandomUniformDistribution2);
    break;
    //
  case SEQUENCE_31_INT:
    pCtx->v0.i = pSeq->sign * _aInt31[pSeq->index];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aInt31);
    break;
    //
  case SEQUENCE_31_FLOAT:
    pCtx->v0.f = pSeq->sign * _aFloat31[pSeq->index];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aFloat31);
    break;
    //
  case SEQUENCE_31_DOUBLE:
    pCtx->v0.d = pSeq->sign * _aDouble31[pSeq->index];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aDouble31);
    break;
    //
  case SEQUENCE_63_LLONG:
    pCtx->v0.l = pSeq->sign * _aLlong63[pSeq->index];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aLlong63);
    break;
    //
  case SEQUENCE_63_FLOAT:
    pCtx->v0.f = pSeq->sign * _aFloat63[pSeq->index];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aFloat63);
    break;
    //
  case SEQUENCE_63_DOUBLE:
    pCtx->v0.d = pSeq->sign * _aDouble63[pSeq->index];
    ++pSeq->index;
    pSeq->last = pSeq->index >= COUNTOF(_aDouble63);
    break;
    //
  default:
    _Logf("Unknown sequence!\n");
    pSeq->last = 1;
    break;
  }
  //
  return 1;
}

/*********************************************************************
*
*       _Benchmark()
*
*  Function description
*    Run a benchmark function.
*/
static void _Benchmark(void *pFn, ExecMode Mode, const char *sLabel, ...) {
  va_list      ap;
  u32          Min;
  u32          Max;
  u32          Cnt;
  u32          Tot;
  u32          t;
  SEQUENCE     SelSeq;
  ExecContext  Ctx;
  ExecSequence Seq;
  //
  va_start(ap, sLabel);
  //
  Ctx.Function.pfVoidReturnVoid = (VoidFunc)pFn;
  Ctx.Mode = Mode;
  //
  SelSeq = (SEQUENCE)va_arg(ap, unsigned);
  while (SelSeq != SEQUENCE_END) {
    _InitSeq(&Seq, SelSeq);
    Min = ~0u;
    Max = 0u;
    Tot = 0;
    Cnt = 0;
    while (_NextSeq(&Seq, &Ctx)) {
      t = _Time(&Ctx);
      t -= _aOverhead[Ctx.Mode];
      Cnt += 1;
      Tot += t;
      if (t < Min) {
        Min = t;
      }
      if (t > Max) {
        Max = t;
      }
      if (t > 100) {
        t = _Time(&Ctx);
      }
    }
    _Logf("%-15s  %6u  %6u  %6.1f    %s\n", sLabel, Min, Max, (float)Tot / Cnt, _GetSeqName(SelSeq));
    SelSeq = (SEQUENCE)va_arg(ap, unsigned);
  }
  va_end(ap);
}

/*********************************************************************
*
*       _CalculateOverheads()
*
*  Function description
*    Get the overheads of calling functions.
*/
static void _CalculateOverheads(void) {
  ExecContext Context;
  //
  memset(&Context, 0, sizeof(Context));
  Context.Function.pfVoidReturnVoid = NullVoidReturnVoid;
  //
  Context.Mode = (ExecMode)0;
  while (Context.Mode < MODE_MAX) {
    _aOverhead[Context.Mode] = _Time(&Context);
    Context.Mode = (ExecMode)(Context.Mode+1);
  }
}

/*********************************************************************
*
*       Global functions
*
**********************************************************************
*/
/*********************************************************************
*
*       main()
*
*  Function description
*    Application entry point.
*/
int main(void) {
  _Logf("IEEE-754 Floating-point Library Benchmarks\n");
  _Logf("Copyright (c) 2018-2019 SEGGER Microcontroller GmbH.\n\n");
  //
  _Logf("Target: Cortex-M");
  _Logf("\n\n");
  //
  _Logf("Function            Min     Max     Avg    Description\n");
  _Logf("--------------   ------  ------  ------    -------------------------------\n");
  //
  _CalculateOverheads();
  //
  _Benchmark((void *)__aeabi_fadd,   MODE_FLOAT_FLOAT_RETURN_FLOAT,    "__aeabi_fadd",   SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fsub,   MODE_FLOAT_FLOAT_RETURN_FLOAT,    "__aeabi_fsub",   SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_frsub,  MODE_FLOAT_FLOAT_RETURN_FLOAT,    "__aeabi_frsub",  SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fmul,   MODE_FLOAT_FLOAT_RETURN_FLOAT,    "__aeabi_fmul",   SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fdiv,   MODE_FLOAT_FLOAT_RETURN_FLOAT,    "__aeabi_fdiv",   SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fcmplt, MODE_FLOAT_FLOAT_RETURN_INT,      "__aeabi_fcmplt", SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fcmple, MODE_FLOAT_FLOAT_RETURN_INT,      "__aeabi_fcmple", SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fcmpgt, MODE_FLOAT_FLOAT_RETURN_INT,      "__aeabi_fcmpgt", SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fcmpge, MODE_FLOAT_FLOAT_RETURN_INT,      "__aeabi_fcmpge", SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_fcmpeq, MODE_FLOAT_FLOAT_RETURN_INT,      "__aeabi_fcmpeq", SPECIAL(SEQUENCE_SPECIAL_F32xF32) SEQUENCE_TYPICAL_F32xF32, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dadd,   MODE_DOUBLE_DOUBLE_RETURN_DOUBLE, "__aeabi_dadd",   SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dsub,   MODE_DOUBLE_DOUBLE_RETURN_DOUBLE, "__aeabi_dsub",   SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_drsub,  MODE_DOUBLE_DOUBLE_RETURN_DOUBLE, "__aeabi_drsub",  SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dmul,   MODE_DOUBLE_DOUBLE_RETURN_DOUBLE, "__aeabi_dmul",   SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_ddiv,   MODE_DOUBLE_DOUBLE_RETURN_DOUBLE, "__aeabi_ddiv",   SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dcmplt, MODE_DOUBLE_DOUBLE_RETURN_INT,    "__aeabi_dcmplt", SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dcmple, MODE_DOUBLE_DOUBLE_RETURN_INT,    "__aeabi_dcmple", SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dcmpgt, MODE_DOUBLE_DOUBLE_RETURN_INT,    "__aeabi_dcmpgt", SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dcmpge, MODE_DOUBLE_DOUBLE_RETURN_INT,    "__aeabi_dcmpge", SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);
  _Benchmark((void *)__aeabi_dcmpeq, MODE_DOUBLE_DOUBLE_RETURN_INT,    "__aeabi_dcmpeq", SPECIAL(SEQUENCE_SPECIAL_F64xF64) SEQUENCE_TYPICAL_F64xF64, SEQUENCE_END);

  _Benchmark((void *)__aeabi_f2iz,   MODE_FLOAT_RETURN_INT,            "__aeabi_f2iz",   SEQUENCE_31_FLOAT | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_f2uiz,  MODE_FLOAT_RETURN_INT,            "__aeabi_f2uiz",  SEQUENCE_31_FLOAT, SEQUENCE_END);
  _Benchmark((void *)__aeabi_f2lz,   MODE_FLOAT_RETURN_LLONG,          "__aeabi_f2lz",   SEQUENCE_63_FLOAT | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_f2ulz,  MODE_FLOAT_RETURN_LLONG,          "__aeabi_f2ulz",  SEQUENCE_63_FLOAT, SEQUENCE_END);
  _Benchmark((void *)__aeabi_i2f,    MODE_INT_RETURN_FLOAT,            "__aeabi_i2f",    SEQUENCE_31_INT | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_ui2f,   MODE_INT_RETURN_FLOAT,            "__aeabi_ui2f",   SEQUENCE_31_INT, SEQUENCE_END);
  _Benchmark((void *)__aeabi_l2f,    MODE_LLONG_RETURN_FLOAT,          "__aeabi_l2f",    SEQUENCE_63_LLONG | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_ul2f,   MODE_LLONG_RETURN_FLOAT,          "__aeabi_ul2f",   SEQUENCE_63_LLONG, SEQUENCE_END);

  _Benchmark((void *)__aeabi_d2iz,   MODE_DOUBLE_RETURN_INT,           "__aeabi_d2iz",   SEQUENCE_31_DOUBLE | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_d2uiz,  MODE_DOUBLE_RETURN_INT,           "__aeabi_d2uiz",  SEQUENCE_31_DOUBLE, SEQUENCE_END);
  _Benchmark((void *)__aeabi_d2lz,   MODE_DOUBLE_RETURN_LLONG,         "__aeabi_d2lz",   SEQUENCE_63_DOUBLE | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_d2ulz,  MODE_DOUBLE_RETURN_LLONG,         "__aeabi_d2ulz",  SEQUENCE_63_DOUBLE, SEQUENCE_END);
  _Benchmark((void *)__aeabi_i2d,    MODE_INT_RETURN_DOUBLE,           "__aeabi_i2d",    SEQUENCE_31_INT | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_ui2d,   MODE_INT_RETURN_DOUBLE,           "__aeabi_ui2d",   SEQUENCE_31_INT, SEQUENCE_END);
  _Benchmark((void *)__aeabi_l2d,    MODE_LLONG_RETURN_DOUBLE,         "__aeabi_l2d",    SEQUENCE_63_LLONG | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_ul2d,   MODE_LLONG_RETURN_DOUBLE,         "__aeabi_ul2d",   SEQUENCE_63_LLONG, SEQUENCE_END);

  _Benchmark((void *)__aeabi_f2d,    MODE_FLOAT_RETURN_DOUBLE,         "__aeabi_f2d",    SEQUENCE_63_FLOAT | SEQUENCE_SIGNED, SEQUENCE_END);
  _Benchmark((void *)__aeabi_d2f,    MODE_DOUBLE_RETURN_FLOAT,         "__aeabi_d2f",    SEQUENCE_63_DOUBLE | SEQUENCE_SIGNED, SEQUENCE_END);
  //
  _Logf("\n");
  _Logf("STOP.\n");
  //
  return 0;
}