CPU Design at SEGGER

CPU design is not normally what we do. But we have actually designed two CPUs that are in use: an 8-bit and a 32-bit CPU.

In this article we look at our 32-bit CPU, or rather at how we are creating an enhanced version of it with very high code density.


Designing a CPU is an interesting challenge. How to design it depends very much on what should be achieved. As with car design, there is no “one design fits all.” For some systems, 8 bits are sufficient; other applications need 64 bits. Some goals are shared by all designs: smaller is better, lower power consumption is better, higher code density is better, and compact instruction sets are better than inflated ones.

Speedy — an ultra-fast and simple 8-bit CPU

In our Flasher Programmers, we have an 8-bit interface processor (“Speedy”), which is designed to run inside an FPGA (soft core), using very few resources (both power and gates) and running at a high speed with predictable timing.

We implemented it in Verilog, and it runs without timing violations at 200MHz and more, with single-cycle instruction execution, which is really good for a soft core.

The source code is only about 500 lines of Verilog and is very easy to understand. We also wrote a simulator for it. Actually, the simulator is based on the output of Verilator, a Verilog-to-C++ translator. Once we know the test program in the simulator passes, we know the design is OK.

Programs for Speedy are written in assembly language. At 200MHz, Speedy allows us to read / set / toggle signal pins in a guaranteed 5ns frame. If we need a UART, I2C, JTAG, cJTAG, SWD, … Speedy can do it. This is very convenient, as we do not have to write new HDL code for new interfaces.

I’ll write more about Speedy in another post.

S32 — an optimized 32-bit CPU

A bit of background first…

Both J-Link and Flasher need a way to download and execute a program at run time, to execute some special time-critical function such as a sequence that must be output right after target reset release, or an entire flash loader.

For this purpose, we have a “virtual CPU” in these devices. The program is interpreted one instruction at a time. With a highly efficient instruction set and a fast executor, this is more efficient than one would think. We achieve execution rates of about 25M virtual processor instructions per second on a 600MHz host processor, more than sufficient for what we need.

We have our own compiler for this CPU, which Paul wrote and which generates surprisingly good code, typically as good as what can be produced by hand-crafted assembly. In any case, we looked at it and saw some options for improvement, to achieve a higher code density, which in our case also leads to higher performance.

S32E — a 32-bit CPU with super high code density

We created new instructions and Paul modified the compiler to use them. We repeated the process until we did not see any (significant) room for improvement.

The result is stunning. It seems we are achieving a code density better than RISC-V and ARM!

While we originally optimized things by looking at typical flash loaders, we have now switched to test code and benchmarks that we can easily compile on any platform.

Here’s the code used for this comparison:

// expect: success
// title:  sieve benchmark, a larger test of the code generator

#define SIZE 8190

char flags[SIZE+1];

int main(void) {
  int i, prime, k, count, iter; 
  //
  for (iter = 1; iter <= 10; iter ++) {
    count = 0;
    for (i = 0; i <= SIZE; i++)
      flags[i] = 1;
    for (i = 0; i <= SIZE; i++) {
      if (flags[i]) {
        prime = i + i + 3;
        k = i + prime;
        while (k <= SIZE) {
          flags[k] = 0;
          k += prime;
        }
        count = count + 1;
      }
    }
  }
  //
  return count == 1899 ? 0 : count;
}

The code stems from a test for the code generator and is somewhat similar to the Sieve of Eratosthenes, used to find prime numbers.

We just use it to check code density here.

Code density comparison

Results are interesting:

RISC-V: 96 bytes (RV32IMAC instruction set)
ARM: 100 bytes (Thumb-2, ARMv7M instruction set)
S32E: 70 bytes

We achieve a code density which is 43% better than that of ARM (here: (100-70)/70)!

We will look at S32E’s code density on a lot more software. If we can keep code density 20% better than ARM’s v7M, that would already be great for a CPU design which is actually quite simple and can be implemented in software, in an FPGA (soft core) or as a hard core.

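RISC-V code density is typically below ARM’s, so beating ARM usually means beating RISC-V as well. In this benchmark RISC-V is slightly ahead of ARM (96 vs. 100 bytes), but still well behind S32E.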

Using single byte instructions?

The current version of S32 has instruction sizes which are multiples of 2 bytes, so 2-, 4-, and 6-byte instructions. If we allow other sizes and introduce single-byte and three-byte instructions, we can further reduce the footprint.

Byte granularity for instructions has one downside: with the same number of offset bits, relative jumps can only cover half the distance. This has a negative impact on code density (as in some cases wider jumps or even two jumps need to be used), so there is a price to pay.

In this case, introducing 5 single-byte instructions can save 12 bytes, bringing us down from 70 to 58! However, we first want to compare S32E on more code, and plan to introduce byte-size instructions as a final step.

Further plans

We are planning to do more code-size benchmarks, and will publish the results once available. Potentially we will publish the instruction set of the S32E and make it available, or generate an HDL version of the core.
Code density is important, and it is not a strong point of RISC-V (even though RISC-V is looking quite good in this benchmark).

We shall see…  Stay tuned for more!

Generated output on S32E

             main:
    000000 xxxxxxxx          MOVW  R9, #8190
    000004 xxxx              MOV   R10, #flags               ; '&flags'

       11:  for (iter = 1; iter <= 10; iter ++) {

    000006 xxxx              MOV   R4, #1

             .L5:

       12:  count = 0;

    000008  0006             MOV   R5, #0

       13:  for (i = 0; i <= 8190; i++)   // Init statement

    00000A  0006             MOV   R8, #0

       14:  flags[i] = 1;
    00000C                   MOV   R0, #1       // move outside of the loop for speed improvement
            .L7:   
    00000E                   STRB  R0, [R10, R8]
    000010  F80C             ADD   R8, #1
    000012  09B0             CGE   R9, R8
    000014  xxxx             BNE   .L7

       15:  for (i = 0; i <= 8190; i++) {

             .L6:
    000016  0006             MOV   R8, #0
 
             .L9:

       16:  if (flags[i]) {

    000018  0088             LDRB  R0, [R10,R8]                 ; dereference pointer, compare setting Z=1 if equal
    00001A  1832             BEQ   R0, .L13                 ; start true arm of 'if' statement

       17:  prime = i + i + 3;

    00001C  F80C             MOV   R7, R8
    00001E  1001             LSL   R7, #1
    000020  3011             ADD   R7, #3

       18:  k = i + prime;

    000022  F80C             MOV   R6, R8
    000024  1001             ADD   R6, R7

       20:  flags[k] = 0;
    000026  0006             MOV   R0, #0
                   .L11:
    000028  1008             STRB  R0, [R10, R6]

       21:  k += prime;
    00002A  E819             ADD   R6, R7

       19:  while (k <= 8190) {
    00002C  08B0             CGE   R9, R6
    00002E  08B0             BNE   .L11

       23:  count = count + 1;

               .L10:
    000030  xxxx             ADD   R5, #1

       15:  for (i = 0; i <= 8190; i++) {

            .L13:                                ; end of 'then' arm of 'if' statement
    000032  F80C             ADD   R8, #1
    000034  xxxx             CGE   R9, R8
    000036  xxxx             BNE   .L9

       11:  for (iter = 1; iter <= 10; iter ++) {

           .L8:
    000038  xxxx             ADD   R4, #1
    00003A  0146             MOV   R0, #10
    00003C  3CB0             BNE   R0, R4, .L5

       28:  return count == 1899 ? 0 : count;

    00003E  100F 076B        MOVW  R1, #0x076B
    000040  1021             SUB   R1, R5                   ; compare setting Z=1 if equal
    000042  xxxx             MOVZ  R0, #0, R5 
    000044  000B             RET                            ; return from function
    000046

ARM v7M, compiled to optimize size

    ;==============================================================================================
    ; .text.main
    ;==============================================================================================
    ;  Module:     test-stdlib.o
    ;  Attributes: 0x00000006: read-only, executable (SHF_EXECINSTR), allocatable (SHF_ALLOC), %progbits
    ;  Size:       100 (0x64) bytes
    ;  Align:      4 bytes
    ;
    ;  Uses:
    ;    0x20000040  flags
    ;
    ;  Used by:
    ;    0x20002ECC  _start ()

    main:
      0x20002E40:  E92D 41F0    PUSH.W     {R4-R8, LR}
      0x20002E44:  F04F 0E01    MOV.W      LR, #1
      0x20002E48:  4915         LDR        R1, =flags               ; [PC, #84] [0x20002EA0] =0x20000040
      0x20002E4A:  F641 78FF    MOVW       R8, #0x1FFF
      0x20002E4E:  F641 7CFE    MOVW       R12, #0x1FFE
      0x20002E52:  2400         MOVS       R4, #0
      0x20002E54:  230B         MOVS       R3, #11

    .L1:
      0x20002E56:  3B01         SUBS       R3, #1
      0x20002E58:  D01A         BEQ        .L8                      ; 0x20002E90
      0x20002E5A:  F44F 5000    MOV.W      R0, #0x2000
      0x20002E5E:  460A         MOV        R2, R1

    .L2:
      0x20002E60:  3801         SUBS       R0, #1
      0x20002E62:  D002         BEQ        .L3                      ; 0x20002E6A
      0x20002E64:  F802 EB01    STRB       LR, [R2], #1
      0x20002E68:  E7FA         B          .L2                      ; 0x20002E60

    .L3:
      0x20002E6A:  2700         MOVS       R7, #0
      0x20002E6C:  2503         MOVS       R5, #3
      0x20002E6E:  2203         MOVS       R2, #3
      0x20002E70:  2600         MOVS       R6, #0

    .L4:
      0x20002E72:  4547         CMP        R7, R8
      0x20002E74:  D0EF         BEQ        .L1                      ; 0x20002E56
      0x20002E76:  5DC8         LDRB       R0, [R1, R7]
      0x20002E78:  B130         CBZ        R0, .L7                  ; 0x20002E88
      0x20002E7A:  4610         MOV        R0, R2

    .L5:
      0x20002E7C:  4560         CMP        R0, R12
      0x20002E7E:  D802         BHI        .L6                      ; 0x20002E86
      0x20002E80:  540C         STRB       R4, [R1, R0]
      0x20002E82:  4428         ADD        R0, R5
      0x20002E84:  E7FA         B          .L5                      ; 0x20002E7C

    .L6:
      0x20002E86:  3601         ADDS       R6, #1

    .L7:
      0x20002E88:  3502         ADDS       R5, #2
      0x20002E8A:  3203         ADDS       R2, #3
      0x20002E8C:  3701         ADDS       R7, #1
      0x20002E8E:  E7F0         B          .L4                      ; 0x20002E72

    .L8:
      0x20002E90:  F240 706B    MOVW       R0, #0x076B
      0x20002E94:  1A30         SUBS       R0, R6, R0
      0x20002E96:  BF18         IT         NE
      0x20002E98:  4630         MOVNE      R0, R6
      0x20002E9A:  E8BD 81F0    POP.W      {R4-R8, PC}
      0x20002E9E:  BF00         NOP
      0x20002EA0:  20000040     DC.W       flags

RISC-V IMAC

    ;==============================================================================================
    ; .text.startup.main
    ;==============================================================================================
    ;  Module:     test-stdlib.o
    ;  Attributes: 0x00000006: read-only, executable (SHF_EXECINSTR), allocatable (SHF_ALLOC), %progbits
    ;  Size:       96 (0x60) bytes
    ;  Align:      2 bytes
    ;
    ;  Uses:
    ;    0x00800000  flags
    ;
    ;  Used by:
    ;    0x00400000  _start ()

    main:
      0x00400044:  6709         LI         a4, 0x2000
      0x00400046:  45A9         LI         a1, 10                   ; 0x00800000 = flags
      0x00400048:  80018793     ADDI       a5, gp, -0x0800          ; 0x00800000 = flags
      0x0040004C:  4305         LI         t1, 1
      0x0040004E:  FFF70E13     ADDI       t3, a4, -1
      0x00400052:  FFE70693     ADDI       a3, a4, -2
      0x00400056:  6E99         LI         t4, 0x6000

    .L1:
      0x00400058:  4701         LI         a4, 0

    .L2:
      0x0040005A:  00F70633     ADD        a2, a4, a5
      0x0040005E:  00660023     SB         t1, 0(a2)
      0x00400062:  0705         ADDI       a4, a4, 1
      0x00400064:  FFC71BE3     BNE        a4, t3, .L2              ; 0x0040005A
      0x00400068:  88BE         MV         a7, a5
      0x0040006A:  480D         LI         a6, 3
      0x0040006C:  470D         LI         a4, 3
      0x0040006E:  4501         LI         a0, 0

    .L3:
      0x00400070:  0008C603     LBU        a2, 0(a7)
      0x00400074:  C609         BEQZ       a2, .L5                  ; 0x0040007E
      0x00400076:  863A         MV         a2, a4

    .L4:
      0x00400078:  02C6D063     BLE        a2, a3, .L7              ; 0x00400098
      0x0040007C:  0505         ADDI       a0, a0, 1

    .L5:
      0x0040007E:  070D         ADDI       a4, a4, 3
      0x00400080:  0885         ADDI       a7, a7, 1
      0x00400082:  0809         ADDI       a6, a6, 2
      0x00400084:  FFD716E3     BNE        a4, t4, .L3              ; 0x00400070
      0x00400088:  15FD         ADDI       a1, a1, -1
      0x0040008A:  F5F9         BNEZ       a1, .L1                  ; 0x00400058
      0x0040008C:  76B00793     LI         a5, 0x076B
      0x00400090:  00F51363     BNE        a0, a5, .L6              ; 0x00400096
      0x00400094:  4501         LI         a0, 0

    .L6:
      0x00400096:  8082         RET

    .L7:
      0x00400098:  00C78F33     ADD        t5, a5, a2
      0x0040009C:  000F0023     SB         zero, 0(t5)
      0x004000A0:  9642         ADD        a2, a2, a6
      0x004000A2:  BFD9         J          .L4                      ; 0x00400078