Profiling software in embedded systems

Good embedded software engineers strive to write software that is efficient, small, and maintainable. However, inefficiencies can appear in the most unexpected places, places where you would never think to look for a performance bottleneck or “sinkhole”. This short article describes different ways to profile an application and shows the path to identifying one such sinkhole in software: hidden from view, surprising, and revealed by SEGGER tools.
As a special treat, it contains a 12-second video of an actual, live profiling session!

What’s the goal?

The SEGGER Cryptographic Toolkit is a software library that is the foundation upon which emSecure, emSSL, and emSSH are built. It provides all the standard tools that you would expect from such a library, targeted to embedded devices short on memory and processing power. The library is fast, but we felt one particular algorithm could do better: modular exponentiation, the basis of RSA and Diffie-Hellman public key cryptography. Improving the performance of this primitive results in better response from secure web servers, plain and simple.

Statistical profiling

A good place to start is to profile the application, which pinpoints the functions that dominate execution time: candidates for optimization! This is an excellent approach because where we think the CPU is spending all its time is not necessarily in line with reality.

SEGGER Embedded Studio has a built-in statistical profiler that uses J-Link’s High Speed Sampling capability to rapidly sample the program counter as the application runs. With this sample data it’s possible to build an execution profile of the application. Setup is simple. Set the trace type to PC Sampling in your project:

Embedded Studio Trace Properties

And immediately after starting the application (a Barrett modular exponentiation benchmark) you see the result in the Execution Profile window:

Embedded Studio Profile Window

It clearly shows where the CPU spends most of its time: about 60% in the routine CRYPTO_MPI_Mul_Comba_Partial. This means that if we optimize this routine to be, say, twice as fast, it would account for 30% instead of 60% of the original run time, so the whole benchmark would take only 70% of the time it took previously. The resulting code would run faster by a factor of 100/70 = 1.43, so 43% faster.
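This estimate is an instance of Amdahl's law. As a worked check, assume the routine accounts for a fraction p = 0.6 of the run time and is made k = 2 times faster. The symbols p, k, and S (overall speedup) are introduced here for illustration only:

```latex
S = \frac{1}{(1 - p) + \frac{p}{k}}
  = \frac{1}{0.4 + \frac{0.6}{2}}
  = \frac{1}{0.7}
  \approx 1.43
```

Even an infinitely fast routine (k growing without bound) would cap the overall speedup at 1/0.4 = 2.5, which is why it pays to know the real distribution of time before optimizing.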

J-Link’s high-speed sampling typically takes several thousand samples per second. This is still orders of magnitude slower than the target’s execution frequency and provides only a coarse approximation of where time is spent. With this in mind, Embedded Studio does not resolve a statistical execution profile down to the line or instruction level, as the sampled data simply does not have sufficient resolution to point you to individual instructions. It does show the functions we need to look at, but not which parts of these functions used how much time, let alone which particular instructions. It still gives you a pretty good clue about what to do next: either optimize your program by restructuring it in a more efficient way, or look at the disassembly of your C code, that is, at the instructions the compiler generated.
But there is a better, more efficient way of doing this: getting precise profiling data with instruction granularity.
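For reference, the statistical approach described above boils down to a histogram: every sampled program counter value is attributed to the function whose address range contains it. The following is a minimal sketch of that idea, not Embedded Studio's implementation. The type func_entry_t and the function profile_accumulate are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol-table entry: a function occupies [start, start + size). */
typedef struct {
    const char *name;
    uint32_t    start;
    uint32_t    size;
    uint32_t    hits;   /* number of PC samples that landed in this function */
} func_entry_t;

/*
 * Attribute each sampled PC to the function containing it.
 * Returns the number of samples that matched no function.
 * A linear search keeps the sketch short; a real tool would
 * more likely sort the table and use a binary search.
 */
size_t profile_accumulate(func_entry_t *funcs, size_t nfuncs,
                          const uint32_t *pcs, size_t npcs)
{
    size_t unmatched = 0;

    assert(funcs != NULL || nfuncs == 0);
    assert(pcs != NULL || npcs == 0);

    for (size_t i = 0; i < npcs; ++i) {
        size_t j;
        for (j = 0; j < nfuncs; ++j) {
            if (pcs[i] >= funcs[j].start &&
                pcs[i] - funcs[j].start < funcs[j].size) {
                ++funcs[j].hits;
                break;
            }
        }
        if (j == nfuncs) {
            ++unmatched;
        }
    }
    return unmatched;
}
```

A function's share of the run time is then estimated as its hits divided by the total number of samples. With only a few thousand samples per second, a single instruction rarely collects enough hits to be statistically meaningful, which is why such a profile is reported per function.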

Tooling up for business

What would be useful is an x-ray of the application’s execution down to the instruction level, revealing the innermost secrets of code generation, the efficiency of the algorithm, and the optimization capability of the compiler.

For this we need two things: a target that can provide this trace data, and a set of tools that can reliably collect and analyze all trace data ejected from the target at blistering speed.

The target end in this case is easy: every SEGGER engineer has an emPower board with a 20-pin Cortex-M trace connector on his desk for software development and test (but any target with such a trace connector does equally well). The trace connector provides regular JTAG/SWD debug control and, critically, a 4-bit trace data port.

SEGGER have been providing quality debug and trace tools for years, but the latest enhancement to the J-Trace PRO and Ozone offers something different: live profiling and code coverage.

For investigating the modular exponentiation implementation, it’s exactly what we need: capture all instructions executed in our program flow and update execution profiles and instruction counts as the application runs in real time, eliminating any requirement to store the entire trace and load it for offline analysis later.

Don’t bore us, get to the chorus!

With Ozone and the J-Trace PRO set up, show the Code Profile window and run the application. Play the video to reveal what happens when observing the profile of the basic (not Barrett) modular exponentiation benchmark:


As you step through the video, taken as a live capture with the target running, you can see functions jostling for position. The user-written functions that take the most time are _MPI_MagDivModNormalized and CRYPTO_MPI_Mul_Comba_Partial. But topping the pile is the mysterious function __int64_udivmod taking around 60% of the time!

This function implements 64-bit by 64-bit unsigned division and is provided by the runtime library in Embedded Studio. This implementation hails from software written for the MSP430 in CrossWorks, Embedded Studio’s sister product. The MSP430 implementation is for a memory-limited microprocessor with 128 bytes of RAM and 2 KB of flash, so it uses the simple shift-subtract technique for division. But surely a modern Cortex-M4 would make light work of this? Seems not.
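To illustrate why shift-subtract division is compact but slow, here is a minimal sketch of the technique in C. It is not the Embedded Studio runtime library source. The names udivmod64_t and udivmod64_shift_subtract are hypothetical and used for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient and remainder of an unsigned 64-bit division. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} udivmod64_t;

/*
 * Shift-subtract (restoring) division: one quotient bit per iteration.
 * The loop always runs 64 times, however small the operands are.
 */
udivmod64_t udivmod64_shift_subtract(uint64_t num, uint64_t den)
{
    udivmod64_t r = { 0, 0 };

    assert(den != 0);

    for (int bit = 63; bit >= 0; --bit) {
        /* Bit that is about to be shifted out of the 64-bit remainder. */
        uint64_t carry = r.rem >> 63;

        /* Bring down the next dividend bit. */
        r.rem = (r.rem << 1) | ((num >> bit) & 1u);

        /* If the (65-bit) partial remainder is >= den, subtract and set the bit. */
        if (carry || r.rem >= den) {
            r.rem -= den;
            r.quot |= (uint64_t)1 << bit;
        }
    }
    return r;
}
```

For example, udivmod64_shift_subtract(100, 7) yields a quotient of 14 and a remainder of 2. On a 32-bit core, each of the 64 iterations needs multi-word shifts, a compare, and a conditional subtract, so the cost adds up quickly when the routine is called from an inner loop.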

The Cortex-M4 has no support for 64-bit division at all, so it must be done in software. However, the Cortex device does have a 32-bit by 32-bit unsigned divide instruction and a fast multiplier, which we can press into service. After a little tinkering, we arrived at a final version that reduces the processing overhead:
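SEGGER's final routine is not reproduced in this text. As an illustration only, the sketch below shows one simple way a 32-bit divide can shortcut a 64-bit division: operands that fit in 32 bits take a single hardware-friendly divide, and the general case only visits the significant bits of the dividend. It does not use the multiplier and is certainly less refined than the real implementation. All names (udivmod64_t, bit_length64, udivmod64_fast_path) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient and remainder of an unsigned 64-bit division. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} udivmod64_t;

/* Number of significant bits in x (0 for x == 0).
 * A Cortex-M port would more likely use the CLZ instruction. */
static int bit_length64(uint64_t x)
{
    int n = 0;
    while (x != 0) {
        ++n;
        x >>= 1;
    }
    return n;
}

udivmod64_t udivmod64_fast_path(uint64_t num, uint64_t den)
{
    udivmod64_t r = { 0, 0 };

    assert(den != 0);

    /* Trivial case: the quotient is zero. */
    if (num < den) {
        r.rem = num;
        return r;
    }

    /* Fast path: num fits in 32 bits, and den <= num, so den does too.
     * A compiler can typically map this to a 32-bit unsigned divide. */
    if ((num >> 32) == 0) {
        uint32_t n32 = (uint32_t)num;
        uint32_t d32 = (uint32_t)den;
        r.quot = n32 / d32;
        r.rem  = n32 % d32;
        return r;
    }

    /* General case: shift-subtract, skipping the dividend's leading zeros. */
    for (int bit = bit_length64(num) - 1; bit >= 0; --bit) {
        uint64_t carry = r.rem >> 63;   /* bit shifted out of the remainder */
        r.rem = (r.rem << 1) | ((num >> bit) & 1u);
        if (carry || r.rem >= den) {
            r.rem -= den;
            r.quot |= (uint64_t)1 << bit;
        }
    }
    return r;
}
```

How much a shortcut like this helps depends entirely on the operand mix, which is exactly what an instruction-level profile reveals.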



How does this affect performance? Well, here is the output from the modular exponentiation benchmark with the original division in place, in milliseconds, for 512-bit, 1024-bit, and 2048-bit keys:

| Algorithm                 |      512 |     1024 |     2048 |
| Binary, Basic             |   68.80  |  292.25  | 1563.00  |

And with the new division code in place:

| Algorithm                 |      512 |     1024 |     2048 |
| Binary, Basic             |   45.22  |  192.67  | 1168.00  |

We have achieved performance speedups of x1.34 to x1.52 for this benchmark, which translates directly into better performance for the secure web server and any other software that relies on big integer arithmetic.


This example demonstrates that however efficient your code, inefficiencies can creep in from places you least expect them. Using instruction-count-accurate profiling, we revealed an optimization opportunity using SEGGER’s debugger Ozone and SEGGER’s J-Trace PRO probe. Without this insight, no amount of tinkering with user-written code would have produced such a pronounced speed increase.


This optimization just wasn’t enough for us. We went further and implemented new algorithms for modular multiply, using J-Trace PRO and Ozone to profile and tune them. The new software is now part of the SEGGER Cryptographic Toolkit and delivers spectacular results:

| Algorithm                 |      512 |     1024 |     2048 |
| Configured                |   20.45  |   88.92  |  458.67  |

This is an across-the-board speedup of about x3.35 over the original figures, guided by SEGGER tools and targeted software optimization.