CPU design is not normally what we do. But we actually have two CPUs designed and in use: an 8-bit and a 32-bit CPU.
In this article we look at our 32-bit CPU, or rather at how we are creating an enhanced version of it with very high code density.
Designing a CPU is an interesting challenge. How to design it depends very much on what should be achieved. Similar to car design, there is no “one design fits all.” For some systems, 8 bits are sufficient; other applications need 64 bits. Some goals are shared by all of them: Smaller is better, lower power consumption is better, higher code density is better, and compact instruction sets are better than inflated ones.
Speedy — an ultra-fast and simple 8-bit CPU
In our Flasher Programmers, we have an 8-bit interface processor (“Speedy”), which is designed to run inside an FPGA (soft-core),
using very few resources (both power and gates) and running at a high speed with predictable timing.
We implemented it in Verilog, and it runs without timing violations at 200 MHz and more, with single-cycle instruction execution, which is really good for a soft core.
The source code is only about 500 lines of Verilog and is very easy to understand. We also wrote a simulator for it. Actually, the simulator is based on the output of Verilator, a Verilog-to-C++ translator. Once we know the test program in the simulator passes, we know the design is OK.
Programs for Speedy are written in assembly language. At 200 MHz, Speedy allows us to read / set / toggle signal pins in a guaranteed 5 ns frame. If we need a UART, I2C, JTAG, cJTAG, SWD, … Speedy can do it. This is very convenient, as we do not have to write new HDL code for new interfaces.
I’ll write more about Speedy in another post.
S32 — an optimized 32-bit CPU
A bit of background first…
Both J-Link and Flasher need a way to download and execute a program at run time, to execute some special time-critical function such as a sequence that must be output right after target reset release, or an entire flash loader.
For this purpose, we have a “virtual CPU” in these devices. The program is interpreted one instruction at a time. With a highly efficient instruction set and a fast executor, this is more efficient than one would think. We achieve execution rates of about 25M virtual processor instructions per second on a 600 MHz host processor, more than sufficient for what we need.
We have our own compiler for this CPU, which Paul wrote and which generates surprisingly good code, typically as good as what can be produced by hand-crafted assembly. In any case, we looked at it and saw some options for improvement, to achieve a higher code density, which in our case also leads to higher performance.
S32E — a 32-bit CPU with super high code density
We created new instructions and Paul modified the compiler to use them. We repeated the process until we did not see any (significant) room for improvement.
The result is stunning. It seems we are achieving a code density better than RISC-V and ARM!
While we originally optimized things by looking at typical flash loaders, we have now switched to test code and benchmarks that we can easily compile on any platform.
Here’s the code used for this comparison:
// expect: success
// title: sieve benchmark, a larger test of the code generator

#define SIZE 8190

char flags[SIZE+1];

int main(void) {
  int i, prime, k, count, iter;
  //
  for (iter = 1; iter <= 10; iter ++) {
    count = 0;
    for (i = 0; i <= SIZE; i++)
      flags[i] = 1;
    for (i = 0; i <= SIZE; i++) {
      if (flags[i]) {
        prime = i + i + 3;
        k = i + prime;
        while (k <= SIZE) {
          flags[k] = 0;
          k += prime;
        }
        count = count + 1;
      }
    }
  }
  //
  return count == 1899 ? 0 : count;
}
The code stems from a test for the code generator and is somewhat similar to the Sieve of Eratosthenes, used to find prime numbers.
We just use it to check code density here.
Code density comparison
Results are interesting:
RISC-V: 96 bytes (RV32IMAC instruction set)
ARM: 100 bytes (Thumb-2, ARMv7M instruction set)
S32E: 70 bytes
We achieve a code density which is 43 % better than that of ARM (here: (100 - 70) / 70 ≈ 0.43)!
We will look at S32E’s code density on a lot more software. If we can keep code density 20 % better than ARM’s v7M, that would already be great for a CPU design which is actually quite simple and can be implemented in software, in an FPGA (soft core) or as a hard core.
RISC-V code density is typically below ARM’s (even though it is slightly ahead in this particular benchmark), so a design that beats ARM automatically beats RISC-V as well.
Using single byte instructions?
The current version of S32 has instruction sizes which are multiples of 2 bytes, i.e. 2-, 4- and 6-byte instructions. If we allow other sizes and introduce single-byte and three-byte instructions, we can further reduce the footprint.
Byte-granularity for instructions has one downside: Relative jumps can now only cover half the distance, which has a negative impact on code density (as in some cases wider jumps or even two jumps need to be used), so there is a price to pay.
In this case, introducing 5 single-byte instructions can save 12 bytes, bringing us down from 70 to 58 bytes! However, we first want to compare S32E on more code, and plan to introduce byte-size instructions as a final step.
Further plans
We are planning to do more code-size benchmarks, and will continue to publish them once available. Potentially we will publish the instruction set of the S32E and will make it available, or generate an HDL version of the core.
Code density is important, and code density is not a strong side of RISC-V (even though it is looking quite good in this benchmark).
We shall see… Stay tuned for more!
Generated output on S32E
main:
000000  xxxxxxxx   MOVW  R9, #8190
000004  xxxx       MOV   R10, #flags           ; '&flags'
11:   for (iter = 1; iter <= 10; iter ++) {
000006  xxxx       MOV   R4, #1
.L5:
12:     count = 0;
000008  0006       MOV   R5, #0
13:     for (i = 0; i <= 8190; i++)            // Init statement
00000A  0006       MOV   R8, #0
14:       flags[i] = 1;
00000C             MOV   R0, #1                // move outside of the loop for speed improvement
.L7:
00000E             STRB  R0, [R10, R8]
000010  F80C       ADD   R8, #1
000012  09B0       CGE   R9, R8
000014  xxxx       BNE   .L7
15:     for (i = 0; i <= 8190; i++) {
.L6:
000016  0006       MOV   R8, #0
.L9:
16:       if (flags[i]) {
000018  0088       LDRB  R0, [R10,R8]          ; dereference pointer, compare setting Z=1 if equal
00001A  1832       BEQ   R0, .L13              ; start true arm of 'if' statement
17:         prime = i + i + 3;
00001C  F80C       MOV   R7, R8
00001E  1001       LSL   R7, #1
000020  3011       ADD   R7, #3
18:         k = i + prime;
000022  F80C       MOV   R6, R8
000024  1001       ADD   R6, R7
20:           flags[k] = 0;
000026  0006       MOV   R0, #0
.L11:
000028  1008       STRB  R0, [R10, R6]
21:           k += prime;
00002A  E819       ADD   R6, R7
19:         while (k <= 8190) {
00002C  08B0       CGE   R9, R6
00002E  08B0       BNE   .L11
23:         count = count + 1;
.L10:
000030  xxxx       ADD   R5, #1
15:     for (i = 0; i <= 8190; i++) {
.L13:                                          ; end of 'then' arm of 'if' statement
000032  F80C       ADD   R8, #1
000034  xxxx       CGE   R9, R8
000036  xxxx       BNE   .L9
11:   for (iter = 1; iter <= 10; iter ++) {
.L8:
000038  xxxx       ADD   R4, #1
00003A  0146       MOV   R0, #10
00003C  3CB0       BNE   R0, R4, .L5
28:   return count == 1899 ? 0 : count;
00003E  100F 076B  MOVW  R1, #0x076B
000040  1021       SUB   R1, R5                ; compare setting Z=1 if equal
000042  xxxx       MOVZ  R0, #0, R5
000044  000B       RET                         ; return from function
000046
ARM v7M, compiled to optimize size
;==============================================================================================
; .text.main
;==============================================================================================
; Module: test-stdlib.o
; Attributes: 0x00000006: read-only, executable (SHF_EXECINSTR), allocatable (SHF_ALLOC), %progbits
; Size: 100 (0x64) bytes
; Align: 4 bytes
;
; Uses:
; 0x20000040 flags
;
; Used by:
; 0x20002ECC _start ()
main:
0x20002E40:  E92D 41F0  PUSH.W  {R4-R8, LR}
0x20002E44:  F04F 0E01  MOV.W   LR, #1
0x20002E48:  4915       LDR     R1, =flags        ; [PC, #84] [0x20002EA0] =0x20000040
0x20002E4A:  F641 78FF  MOVW    R8, #0x1FFF
0x20002E4E:  F641 7CFE  MOVW    R12, #0x1FFE
0x20002E52:  2400       MOVS    R4, #0
0x20002E54:  230B       MOVS    R3, #11
.L1:
0x20002E56:  3B01       SUBS    R3, #1
0x20002E58:  D01A       BEQ     .L8               ; 0x20002E90
0x20002E5A:  F44F 5000  MOV.W   R0, #0x2000
0x20002E5E:  460A       MOV     R2, R1
.L2:
0x20002E60:  3801       SUBS    R0, #1
0x20002E62:  D002       BEQ     .L3               ; 0x20002E6A
0x20002E64:  F802 EB01  STRB    LR, [R2], #1
0x20002E68:  E7FA       B       .L2               ; 0x20002E60
.L3:
0x20002E6A:  2700       MOVS    R7, #0
0x20002E6C:  2503       MOVS    R5, #3
0x20002E6E:  2203       MOVS    R2, #3
0x20002E70:  2600       MOVS    R6, #0
.L4:
0x20002E72:  4547       CMP     R7, R8
0x20002E74:  D0EF       BEQ     .L1               ; 0x20002E56
0x20002E76:  5DC8       LDRB    R0, [R1, R7]
0x20002E78:  B130       CBZ     R0, .L7           ; 0x20002E88
0x20002E7A:  4610       MOV     R0, R2
.L5:
0x20002E7C:  4560       CMP     R0, R12
0x20002E7E:  D802       BHI     .L6               ; 0x20002E86
0x20002E80:  540C       STRB    R4, [R1, R0]
0x20002E82:  4428       ADD     R0, R5
0x20002E84:  E7FA       B       .L5               ; 0x20002E7C
.L6:
0x20002E86:  3601       ADDS    R6, #1
.L7:
0x20002E88:  3502       ADDS    R5, #2
0x20002E8A:  3203       ADDS    R2, #3
0x20002E8C:  3701       ADDS    R7, #1
0x20002E8E:  E7F0       B       .L4               ; 0x20002E72
.L8:
0x20002E90:  F240 706B  MOVW    R0, #0x076B
0x20002E94:  1A30       SUBS    R0, R6, R0
0x20002E96:  BF18       IT      NE
0x20002E98:  4630       MOVNE   R0, R6
0x20002E9A:  E8BD 81F0  POP.W   {R4-R8, PC}
0x20002E9E:  BF00       NOP
0x20002EA0:  20000040   DC.W    flags
RISC-V IMAC
;==============================================================================================
; .text.startup.main
;==============================================================================================
; Module: test-stdlib.o
; Attributes: 0x00000006: read-only, executable (SHF_EXECINSTR), allocatable (SHF_ALLOC), %progbits
; Size: 96 (0x60) bytes
; Align: 2 bytes
;
; Uses:
; 0x00800000 flags
;
; Used by:
; 0x00400000 _start ()
main:
0x00400044:  6709      LI    a4, 0x2000
0x00400046:  45A9      LI    a1, 10               ; 0x00800000 = flags
0x00400048:  80018793  ADDI  a5, gp, -0x0800      ; 0x00800000 = flags
0x0040004C:  4305      LI    t1, 1
0x0040004E:  FFF70E13  ADDI  t3, a4, -1
0x00400052:  FFE70693  ADDI  a3, a4, -2
0x00400056:  6E99      LI    t4, 0x6000
.L1:
0x00400058:  4701      LI    a4, 0
.L2:
0x0040005A:  00F70633  ADD   a2, a4, a5
0x0040005E:  00660023  SB    t1, 0(a2)
0x00400062:  0705      ADDI  a4, a4, 1
0x00400064:  FFC71BE3  BNE   a4, t3, .L2          ; 0x0040005A
0x00400068:  88BE      MV    a7, a5
0x0040006A:  480D      LI    a6, 3
0x0040006C:  470D      LI    a4, 3
0x0040006E:  4501      LI    a0, 0
.L3:
0x00400070:  0008C603  LBU   a2, 0(a7)
0x00400074:  C609      BEQZ  a2, .L5              ; 0x0040007E
0x00400076:  863A      MV    a2, a4
.L4:
0x00400078:  02C6D063  BLE   a2, a3, .L7          ; 0x00400098
0x0040007C:  0505      ADDI  a0, a0, 1
.L5:
0x0040007E:  070D      ADDI  a4, a4, 3
0x00400080:  0885      ADDI  a7, a7, 1
0x00400082:  0809      ADDI  a6, a6, 2
0x00400084:  FFD716E3  BNE   a4, t4, .L3          ; 0x00400070
0x00400088:  15FD      ADDI  a1, a1, -1
0x0040008A:  F5F9      BNEZ  a1, .L1              ; 0x00400058
0x0040008C:  76B00793  LI    a5, 0x076B
0x00400090:  00F51363  BNE   a0, a5, .L6          ; 0x00400096
0x00400094:  4501      LI    a0, 0
.L6:
0x00400096:  8082      RET
.L7:
0x00400098:  00C78F33  ADD   t5, a5, a2
0x0040009C:  000F0023  SB    zero, 0(t5)
0x004000A0:  9642      ADD   a2, a2, a6
0x004000A2:  BFD9      J     .L4                  ; 0x00400078