Why you should benchmark your embedded system

There are many reasons why an embedded system may not deliver the full performance of its CPU. The shortfall is not always easy to detect, so here is a way to check whether your system gives you the performance you expect.

What can go wrong?

Today’s embedded systems are complex computers. Microcontrollers are usually the easiest to bring up, but even here some things have to be done correctly:

  • PLL settings: Does the CPU actually run at the desired frequency?
  • Wait states: Does your memory (RAM / Flash) need wait states? If so, how many?
  • Memory areas: Some memory areas are faster than others. For example, on Cortex-M3/4 devices, code can be executed faster from a RAM located in the I/D (code) area, ranging from 0000_0000 – 1FFF_FFFF, than from the S (system) area, ranging from 2000_0000 – 3FFF_FFFF.
  • Flash accelerators and caches: Some flash memories need an accelerator (a simple form of caching) to be enabled; other CPUs come with full-blown caches.
  • External memory: On microprocessors (with external memory), data located in external memory can easily slow down the CPU by a factor of 10.
  • Caches and MMUs: Bigger systems often have caches and an MMU. The caching needs to be set up correctly to deliver maximum performance.

How do I check the performance?

This is not as easy as it sounds, because multiple factors are involved. The easiest way we have found is to run a benchmark program with exactly the same settings as the application.
We have written a small benchmark that finds prime numbers (Sieve of Eratosthenes). It uses some RAM, but little enough to fit into the RAM of almost every system. It can be called from main() by adding a call to the following function:

void SEGGER_MeasureCPUPerformance(void); // Call as last thing before OS_Start().

Make sure you add this prototype at the top of your application file.
It runs the prime finder repeatedly in a loop for 1 second, using the function OS_GetTime() to get the time in milliseconds. This means that if embOS is available, it runs without any further change. On another system, OS_GetTime() needs to be supplied to provide a time base (and OS_Delay(), which the listing uses to synchronize to the tick). It outputs the performance as the clock frequency of an equivalent Cortex-M4 CPU running in a zero-wait-state system.
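
On a system without embOS, the time base could look like the following minimal sketch. It assumes a timer interrupt that fires once per millisecond. The handler name SysTick_Handler follows the CMSIS convention for Cortex-M, and TickCnt is a name invented for this illustration. The return and parameter types are assumptions as well; the real declarations live in the embOS header RTOS.h, which would have to be replaced by your own header.

```c
// Sketch only: stand-ins for OS_GetTime() and OS_Delay() on a system
// without embOS. Assumes a 1 kHz timer interrupt has been configured.

static volatile unsigned int TickCnt;  // Milliseconds since start (hypothetical name)

// Timer interrupt handler, called once per millisecond.
// SysTick_Handler is the CMSIS name on Cortex-M; adapt it to your device.
void SysTick_Handler(void) {
  TickCnt++;
}

// Return the time in milliseconds since the timer was started.
int OS_GetTime(void) {
  return (int)TickCnt;
}

// Busy-wait for the given number of milliseconds.
void OS_Delay(int ms) {
  int End;

  End = OS_GetTime() + ms;
  while ((End - OS_GetTime()) > 0) {
    // Wait for the tick interrupt to advance the time
  }
}
```

The busy-wait in OS_Delay() only terminates if the timer interrupt is running, so the timer has to be started before the benchmark is called.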

Use case

Last week, a customer using our emSSL software library contacted us, suspecting poor performance of that software. emSSL is a SEGGER software library that enables secure connections across the Internet. We tried to reproduce the issue and ran benchmark tests with the SSL software, but found no issue with the performance at all. Since the SSL software did not cause the problem, we wanted to help the customer investigate the issue.

We sent the benchmark application to the customer and asked him to run it.

Output:
Loops/sec: 1147
Your target is running a speed equivalent to a Cortex-M3 or M4 device at 16 MHz.
This value is calculated in comparison to a reference Cortex-M4 running in iRAM with zero wait states and full compiler optimizations built on GCC V5.4.3 used in Segger Embedded Studio.

The benchmark revealed that his CPU performance was about 7.5 times lower than we would expect from his Cortex-M CPU running at 120 MHz (equivalent to 16 MHz instead of 120 MHz). With this result it was not difficult to identify the cause: the customer used external RAM for his application data. Many developers are unaware that an access to external RAM is much slower than an access to internal RAM. We asked the customer to repeat the performance test in internal RAM, and the result showed that the CPU ran at full speed:

Output:
Loops/sec: 8599
Your target is running a speed equivalent to a Cortex-M3 or M4 device at 120 MHz.
This value is calculated in comparison to a reference Cortex-M4 running in iRAM with zero wait states and full compiler optimizations built on GCC V5.4.3 used in Segger Embedded Studio.
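
The MHz figures in both outputs follow directly from the loop counts. The sketch below reproduces the arithmetic with the reference value 7166 from the listing further down. The function names EquivalentMHz and PrintUseCase are invented for this illustration. The sketch rounds to the nearest MHz, which is an assumption made here: the listing itself truncates the division and may therefore print a value 1 MHz lower.

```c
#include <stdio.h>

#define REFERENCE_LOOPS_SEC_MHZ  7166u  // Loops/sec of the reference CPU, scaled to 100 MHz

// Convert a measured loop count into an equivalent clock frequency in MHz,
// rounded to the nearest integer (hypothetical helper, not part of the benchmark).
unsigned int EquivalentMHz(unsigned int LoopsPerSec) {
  return (LoopsPerSec * 100u + REFERENCE_LOOPS_SEC_MHZ / 2u) / REFERENCE_LOOPS_SEC_MHZ;
}

// Print the equivalent frequency for both measurements from the use case.
void PrintUseCase(void) {
  printf("External RAM: %u MHz\n", EquivalentMHz(1147u));
  printf("Internal RAM: %u MHz\n", EquivalentMHz(8599u));
}
```

For the two measurements this gives 16 MHz and 120 MHz, a ratio of 7.5.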

The customer realized that the poor emSSL performance was caused by running the code in the very slow, uncached external RAM. Moving the time-critical portions into internal RAM solved the problem.
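
How data is moved into internal RAM depends on the toolchain and the linker script. As a rough sketch for GCC-compatible toolchains, a buffer can be assigned to a named section that the linker script maps to internal RAM. The section name .fast_ram, the macro FAST_RAM, and the names aWorkBuffer and FillAndSum are hypothetical; the section has to be defined in your own linker script, and the startup code may need to initialize it.

```c
#include <string.h>

// Place a variable into a named section on GCC-compatible ELF toolchains.
// ".fast_ram" is a hypothetical section name; the linker script has to map
// it to internal RAM. On other toolchains the macro expands to nothing.
#if defined(__GNUC__) && defined(__ELF__)
  #define FAST_RAM  __attribute__((section(".fast_ram")))
#else
  #define FAST_RAM
#endif

static unsigned char aWorkBuffer[256] FAST_RAM;  // Time-critical data (hypothetical)

// Example access: fill the buffer and return the sum of all bytes.
unsigned int FillAndSum(unsigned char Value) {
  unsigned int i;
  unsigned int Sum;

  memset(aWorkBuffer, Value, sizeof(aWorkBuffer));
  Sum = 0;
  for (i = 0; i < sizeof(aWorkBuffer); i++) {
    Sum += aWorkBuffer[i];
  }
  return Sum;
}
```

Running the benchmark once with its array in such a section and once without it is a quick way to see how much the placement matters on a given board.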

Conclusion

I would suggest that everyone run this performance test on their embedded system. This simple test takes only a few minutes and shows whether you are getting the full CPU performance.

Benchmark application

This is the complete sample application:


/*********************************************************************
*               (c) SEGGER Microcontroller GmbH & Co. KG             *
*                        The Embedded Experts                        *
*                           www.segger.com                           *
**********************************************************************
*                                                                    *
* All rights reserved.                                               *
*                                                                    *
* * This software may in its unmodified form be freely redistributed *
*   in source form.                                                  *
* * The source code may be modified, provided the source code        *
*   retains the above copyright notice, this list of conditions and  *
*   the following disclaimer.                                        *
*                                                                    *
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND             *
* CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,        *
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF           *
* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE           *
* DISCLAIMED. IN NO EVENT SHALL SEGGER Microcontroller BE LIABLE FOR *
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR           *
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT  *
* OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;    *
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF      *
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT          *
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE  *
* USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH   *
* DAMAGE.                                                            *
*                                                                    *
**********************************************************************
-------------------------- END-OF-HEADER -----------------------------

File    : SEGGER_MeasureCPU_Performance.c
Purpose : CPU performance measurement and comparison to a known
          reference hardware using embOS time base.

Additional information:
  The test routine SEGGER_MeasureCPUPerformance() has to be called
  directly before OS_Start() and uses printf() to output the result.
*/

#include "RTOS.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // memset()

/*********************************************************************
*
*       Defines, fixed
*
**********************************************************************
*/

//
// The following is our reference value measured on a
// Freescale Kinetis K66 Cortex-M4 running from iRAM at 168MHz.
//
#define REFERENCE_LOOPS_SEC_MHZ  7166  // Reference result from defined hardware, normalized to 100 MHz [Loops/sec].

/*********************************************************************
*
*       Static data
*
**********************************************************************
*/

static char         aIsPrime[1000];
static unsigned int NumPrimes;

/*********************************************************************
*
*       Local functions
*
**********************************************************************
*/

/*********************************************************************
*
*       _CalcPrimes()
*
*  Function description
*    This is the actual benchmark test function
*/
static void _CalcPrimes(unsigned int NumItems) {
  unsigned int i;
  unsigned int j;

  //
  // Mark all as potential prime numbers
  //
  memset(aIsPrime, 1, NumItems);
  //
  // 2 deserves a special treatment
  //
  for (i = 4; i < NumItems; i += 2) {
    aIsPrime[i] = 0;     // Cross it out: not a prime
  }
  //
  // Cross out multiples of every prime starting at 3. Crossing out starts at i^2.
  //
  for (i = 3; i * i < NumItems; i++) {
    if (aIsPrime[i]) {
      j = i * i;    // The square of this prime is the first we need to cross out
      do {
        aIsPrime[j] = 0;     // Cross it out: not a prime
        j += 2 * i;          // Skip even multiples (only 3*, 5*, 7* etc)
      } while (j < NumItems);
    }
  }
  //
  // Count prime numbers
  //
  NumPrimes = 0;
  for (i = 2; i < NumItems; i++) {
    if (aIsPrime[i]) {
      NumPrimes++;
    }
  }
}

/*********************************************************************
*
*       Global functions
*
**********************************************************************
*/

/********************************************************************
*
*       SEGGER_MeasureCPUPerformance()
*
*  Function description
*    This function measures your CPU performance for 1 second and
*    outputs the result compared to the reference value
*    REFERENCE_LOOPS_SEC_MHZ of a defined hardware running completely
*    from internal RAM.
*    This function has to be called directly before OS_Start() to ensure
*    an exact measurement.
*/
void SEGGER_MeasureCPUPerformance(void) {
  unsigned int Cnt;
  int          TestTime;

  while(1) {
    Cnt = 0;
    OS_Delay(1);  // Sync to tick
    TestTime = OS_GetTime() + 1000;
    while ((TestTime - OS_GetTime()) >= 0) {
      _CalcPrimes(sizeof(aIsPrime));
      Cnt++;
    }
    if (NumPrimes != 168) {  // There are 168 primes below 1000
      printf("Error\n");
    } else {
      printf("\nLoops/sec: %u\n", Cnt);
      printf("Your target is running a speed equivalent to a Cortex-M3 or M4 device at %u MHz.\n", (Cnt * 100) / REFERENCE_LOOPS_SEC_MHZ);
      printf("This value is calculated in comparison to a reference Cortex-M4 running in iRAM with zero wait states and full compiler optimizations built on GCC V5.4.3 used in Segger Embedded Studio.\n\n");
    }
  }
}