Performance tuning our software

As you may have noticed, SEGGER have introduced a cryptographic algorithm library, emCrypt. We released this product as existing and new customers wanted to use the “hidden” cryptographic capabilities of emSSL but didn’t need to run SSL/TLS as a protocol. Well, that is not entirely true, some customers already had licenses for emSSL but also wanted a fully-documented, tested and verified cryptographic library for other parts of their secure systems with additional algorithms that are not offered as part of emSSL.

But potential customers always ask questions, requesting comparisons, benchmarks, and resource use. This isn’t unreasonable for a cryptographic library as we know that public key cryptography involves expensive one-time operations for secret key agreement and that secret key cryptography needs to be fast for transporting bulk data.

About emCrypt

emCrypt is designed primarily for constrained embedded devices but can also scale to workstation-class machines as the algorithms are written in pure C. Most other libraries provided with popular SSL products are not designed this way: you select one of a number of configurations, compile the library, and then measure performance in your application — and you have no idea of the amount of memory that you will be consuming as typically the library uses “malloc” for its temporary data and fragmentation is an issue. All emCrypt algorithms can run in fixed memory and we provide benchmarks for these, both in terms of performance and memory utilized.

Modular exponentiation revisited

In this post I concentrate on the underlying modular exponentiation primitive that underpins RSA and Diffie-Hellman public key cryptosystems.

As I said, customers want hard numbers. And what they can do is simply get themselves a development license for emCrypt and run one of the canned benchmarks that is provided in the distribution, compiling on their platform with their toolset. In fact, we endevor to provide benchmark frameworks for every algorithm that we support in the emCrypt distribution. Internally, we run these on different hardware and use the results to tune the cryptographic library for processors and compilers, checking for performance regressions between releases of toolchains and releases of the underlying algorithms. We concentrate on the compilers that we use the most, and primarily on SEGGER Embedded Studio which is the defacto standard development environment for ARM devices in SEGGER.

So…let’s begin!

We start out with hard numbers for emCrypt on a NXP K66 device using SEGGER Embedded Studio 3.12 and the Clang compiler:

+----------------------+--------------------------------+
|              Modulus |                      2048 bits |
| Algorithm            |    Time      x   Memory      x |
+----------------------+--------------------------------+
| Basic, fast          |  943.00  1.00x     1340  1.00x |
| Basic, ladder        | 1475.00  0.64x     1608  1.20x |
| Montgomery, fast     |  515.00  1.83x     1340  1.00x |
+----------------------+--------------------------------+

This is an highly abbreviated version of the benchmark output so that we can see the wood and we concentrate on a small set of numbers. This shows that the Montgomery power ladder version (Basic, ladder) that prevents timing attacks is slower than the “fast” version where some multiplies are dropped for zero bits when scanning the exponent. It also shows that using Montgomery form for modular exponentiation is a big win over the basic algorithm. Additional algorithms are provided in emCrypt and for those interested the full output is shown at the end of this post for a 2048-bit modulus.

There’s more than one way to benchmark

After benchmarking with Embedded Studio, I then benchmarked the same application using another highly-respected compiler which produced the following:

+----------------------+--------------------------------+
|              Modulus |                      2048 bits |
| Algorithm            |    Time      x   Memory      x |
+----------------------+--------------------------------+
| Basic, fast          |  968.00  1.00x     1340  1.00x |
| Basic, ladder        | 1494.00  0.65x     1608  1.20x |
| Montgomery, fast     |  621.00  1.56x     1340  1.00x |
+----------------------+--------------------------------+

So, the numbers between the two compilers isn’t very much different for performance. The numbers for memory consumption are identical between the two, which is as expected as emCrypt does not use the C system memory allocator by default, it uses an abstract API that allows tuning. In fact, emCrypt can support “crypto memory” which you find on some highly secure microcontrollers using this “pluggable allocator” capability.

Iterated refinement

However, I thought we could do better still. Analyzing the algorithm, the compiler’s optimizer, and the code generator reveals opportunities to better deal with the way the cryptographic primitives work, utilizing the capabilities of the microprocessor much more efficiently. After some tinkering, profiling, thinking, and late night development, I enhanced the Embedded Studio toolset to deliver far better results:

+----------------------+--------------------------------+
|              Modulus |                      2048 bits |
| Algorithm            |    Time      x   Memory      x |
+----------------------+--------------------------------+
| Basic, fast          |  684.50  1.00x     1340  1.00x |
| Basic, ladder        | 1081.00  0.63x     1608  1.20x |
| Montgomery, fast     |  485.67  1.41x     1340  1.00x |
+----------------------+--------------------------------+

The relative performance between the two “basic” algorithms remains, but every single algorithm benefits, in absolute terms, from the enhancements made. The enhancement required no code changes to the application, as it is delivered by the toolset, and the new capabilities will appear in SEGGER Embedded Studio 3.20.

Conclusions

Gathering all the above analysis and data together, we can summarize as follows.

  • The code generated by Embedded Studio 3.12 outperforms the compiler I benchmarked against. From the fast Montgomery numbers, we are 20.6% faster than the benchmark compiler already.
  • After some investigation, we managed to improve performance significantly. Using the Montgomery numbers again, we achieve a performance increase of 6.2% over Embedded Studio 3.12 and a 27.9% increase in performance over the benchmark compiler.
  • Choosing the “basic, fast” algorithm, we deliver an astounding 38.2% performance advantage over the benchmark compiler.
  • Although I have not provided code sizes as this is not the theme of this post, all application code sizes are comparable and the performance of Embedded Studio 3.20 is attained without any explosion in code size. In fact, it’s slightly smaller than the 3.12 release!
  • All benchmarks were executed from flash memory with wait states, so execution will be faster if the time-critical functions are placed into fast RAM. However, performance is excellent even from flash memory.

I now believe SEGGER has the best toolset capable of delivering peak cryptographic performance!

Footnote

Why provide “fast” and “constant time” versions, isn’t “fast” always preferred?

Well, the answer is no, fast is not always preferred. It depends upon the context in which the modular exponentiation is used. When verifying a signature using a public key, all data is in the open and public, there is nothing to hide: the modulus and the encryption exponent are known, so it doesn’t matter whether you leak data through a side channel, that data is never “private” or “secret”. However, when signing or using the decryption exponent then you do not wish to leak any secret information — in this case the decryption exponent. Leaking information can happen by capturing timing information, by using power analysis, or by injecting faults into the system. emCrypt provides a number of constant-time implementations with different performance characteristics.

Here’s the unabridged output of the benchmark, for a 2048-bit modulus, using SEGGER Embedded Studio 3.20.

All numbers are correct at time of publication.

Modular Arithmetic Performance
==============================

CRT private key, exponent length = modulus length, all times in ms

+----------------------+--------------------------------+
|              Modulus |                      2048 bits |
| Algorithm            |    Time      x   Memory      x |
+----------------------+--------------------------------+
| Basic, fast          |  684.50  1.00x     1340  1.00x |
| Basic, ladder        | 1081.00  0.63x     1608  1.20x |
| Basic, 2b, FW        |  661.50  1.03x     2412  1.80x |
| Basic, 3b, FW        |  649.50  1.05x     3484  2.60x |
| Basic, 4b, FW        |  632.00  1.08x     5628  4.20x |
| Basic, 5b, FW        |  621.00  1.10x     9916  7.40x |
| Basic, 6b, FW        |  615.50  1.11x    18492 13.80x |
+----------------------+--------------------------------+
| Basic, 2b, RM        |  651.00  1.05x     2412  1.80x |
| Basic, 3b, RM        |  634.00  1.08x     2948  2.20x |
| Basic, 4b, RM        |  620.00  1.10x     4020  3.00x |
| Basic, 5b, RM        |  609.50  1.12x     6164  4.60x |
| Basic, 6b, RM        |  602.50  1.14x    10452  7.80x |
+----------------------+--------------------------------+
| Barrett, fast        |  668.50  1.02x     1876  1.40x |
| Barrett, ladder      | 1024.00  0.67x     2144  1.60x |
| Barrett, 2b, FW      |  647.50  1.06x     2948  2.20x |
| Barrett, 3b, FW      |  622.50  1.10x     4020  3.00x |
| Barrett, 4b, FW      |  601.00  1.14x     6164  4.60x |
| Barrett, 5b, FW      |  589.50  1.16x    10452  7.80x |
| Barrett, 6b, FW      |  585.00  1.17x    19028 14.20x |
+----------------------+--------------------------------+
| Barrett, 2b, RM      |  630.00  1.09x     2948  2.20x |
| Barrett, 3b, RM      |  607.50  1.13x     3484  2.60x |
| Barrett, 4b, RM      |  590.00  1.16x     4556  3.40x |
| Barrett, 5b, RM      |  579.00  1.18x     6700  5.00x |
| Barrett, 6b, RM      |  571.50  1.20x    10988  8.20x |
+----------------------+--------------------------------+
| Montgomery, fast     |  485.67  1.41x     1340  1.00x |
| Montgomery, 2b, FW   |  484.00  1.41x     2412  1.80x |
| Montgomery, 3b, FW   |  432.00  1.58x     3484  2.60x |
| Montgomery, 4b, FW   |  408.00  1.68x     5628  4.20x |
| Montgomery, 5b, FW   |  397.33  1.72x     9916  7.40x |
| Montgomery, 6b, FW   |  397.33  1.72x    18492 13.80x |
+----------------------+--------------------------------+
| Montgomery, 2b, RM   |  445.33  1.54x     2412  1.80x |
| Montgomery, 3b, RM   |  418.00  1.64x     2948  2.20x |
| Montgomery, 4b, RM   |  400.33  1.71x     4020  3.00x |
| Montgomery, 5b, RM   |  391.33  1.75x     6164  4.60x |
| Montgomery, 6b, RM   |  386.33  1.77x    10452  7.80x |
+----------------------+--------------------------------+