In previous blog posts I have described the division algorithms SEGGER implemented in emRun. However, which algorithm is best (in terms of code size, execution speed, or power efficiency) depends heavily on the target instruction set architecture (ISA) and the way that ISA is implemented in silicon. This article explains how we go about selecting the best division algorithm for a given target.
An ideal RISC-V CPU
For this article, we will assume execution on a hypothetical 32-bit RISC-V core with the following characteristics:
- Simple arithmetic and logical instructions execute in a single cycle.
- Loads from memory always hit the cache and execute in a single cycle.
- Branches are never mispredicted and, whether taken or not, execute in a single cycle.
- Multiply instructions complete in three cycles, which seems common for many ISA implementations.
- Division instructions deliver 1, 2, or 4 bits of quotient per cycle, the exact number configured by the chip designer.
With these in place, we can turn to benchmarking division algorithms on hardware where the division instruction has different performance characteristics.
The benchmark framework
SEGGER have internal benchmarks to determine the performance of emRun. These benchmarks run on Embedded Studio’s simulator and measure the performance of division algorithms, both 32-bit and 64-bit, over a spread of inputs so that the figures represent averaged behavior rather than a single favorable case.
I will focus on 64-bit floating-point division executed on a 32-bit processor without an FPU.
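In essence, such a benchmark times the compiler's division helper over many input pairs and records the minimum, maximum, and average cycle counts, which is how the tables below are produced. The following sketch is only an outline of the idea, not SEGGER's benchmark code; the read_cycle_counter() routine and the bench_divdf3 helper are hypothetical stand-ins for whatever cycle source and harness the simulator or target provides:

```c
#include <stdint.h>

/* Hypothetical cycle source, e.g. a wrapper around the RISC-V cycle CSR. */
extern uint32_t read_cycle_counter(void);

static volatile double sink;  /* keeps the division from being optimized away */

/* Time double-precision division (a __divdf3 call on a core without an FPU)
   over a spread of inputs; measurement overhead is ignored for simplicity. */
void bench_divdf3(const double *num, const double *den, unsigned n,
                  uint32_t *min, uint32_t *max, uint32_t *avg) {
  uint32_t lo = UINT32_MAX, hi = 0;
  uint64_t sum = 0;
  for (unsigned i = 0; i < n; ++i) {
    uint32_t t0 = read_cycle_counter();
    sink = num[i] / den[i];                  /* calls __divdf3 */
    uint32_t dt = read_cycle_counter() - t0;
    if (dt < lo) lo = dt;
    if (dt > hi) hi = dt;
    sum += dt;
  }
  *min = lo;
  *max = hi;
  *avg = (uint32_t)(sum / n);  /* integer average; the tables below keep fractions */
}
```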
emRun offers a number of configurations for division, from clockwork bit-by-bit algorithms to high-speed multiply-by-reciprocal algorithms. I’ll explore these in the following sections.
Clockwork division
If you have no hardware multiplier and no hardware divider, there is no sensible choice other than bit-by-bit division. With the SEGGER Embedded Studio simulator and the SEGGER emRun benchmark, the performance is:
Function     Min     Max     Avg     Description
-----------  ------  ------  ------  ------------------
__divdf3     648     843     750.2   Clockwork division
There is little opportunity to optimize this beyond unrolling the inner loop, which makes the function much larger.
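For illustration, the core of such a routine is a restoring shift-subtract loop that produces one quotient bit per iteration. The sketch below shows the technique for 32-bit operands; it is a minimal example, not emRun's actual implementation:

```c
#include <stdint.h>

/* Restoring division: one quotient bit per iteration.
   Computes n / d and n % d for d != 0; a sketch of the technique only. */
static uint32_t udiv32_bitwise(uint32_t n, uint32_t d, uint32_t *r) {
  uint32_t q   = 0;  /* quotient, assembled bit by bit */
  uint32_t rem = 0;  /* running partial remainder      */
  for (int i = 31; i >= 0; --i) {
    rem = (rem << 1) | ((n >> i) & 1u);  /* bring down the next dividend bit */
    q <<= 1;
    if (rem >= d) {                      /* does the trial subtraction fit?  */
      rem -= d;
      q |= 1u;                           /* yes: record a 1 in the quotient  */
    }
  }
  *r = rem;
  return q;
}
```

The double-precision case applies the same loop to the 53-bit significands, so every quotient bit costs several instructions, which is why the cycle counts above are so high.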
Using the M extension
The RISC-V M extension provides both 32-bit multiply and divide instructions, and these can be used effectively to implement 64-bit division. emRun provides a number of implementations for this situation, depending on how your core is configured.
The DIVU (unsigned divide) and REMU (unsigned remainder) instructions are pressed into service for division and remainder. One implementation of 53-bit by 53-bit division (used to divide double-precision significands) uses the schoolbook algorithm, delivering 10 or 11 quotient bits per step, along with a final rounding bit.
Using DIVU and REMU in this way results in the following performance, depending upon the implementation of the hardware divider:
Function     Min     Max     Avg     Description
-----------  ------  ------  ------  -------------------------
__divdf3     437     452     441.4   1 quotient bit per cycle
__divdf3     277     292     281.4   2 quotient bits per cycle
__divdf3     197     212     201.4   4 quotient bits per cycle
Typically, DIVU and REMU are not fused: they do not execute together, but one after the other. Without fusing, calculating a quotient and a remainder takes twice as long as it should, because the division is effectively performed twice, wasting both time and energy.
Rather than using the REMU instruction to calculate a remainder, it is faster to use the multiplier and calculate the remainder manually. Using this technique improves performance at the expense of one additional instruction per division step, so five instructions in total:
Function     Min     Max     Avg     Description
-----------  ------  ------  ------  -------------------------
__divdf3     297     312     301.4   1 quotient bit per cycle
__divdf3     217     232     221.4   2 quotient bits per cycle
__divdf3     177     192     181.4   4 quotient bits per cycle
If performance matters, clearly using a fast multiplier to calculate the remainder after division is a win.
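In C, the technique boils down to recovering the remainder with a multiply and a subtract instead of a second pass through the divider. A minimal sketch of the idea (the helper name udivmod32 is purely illustrative):

```c
#include <stdint.h>

/* Quotient and remainder from a single hardware division (d must be non-zero).
   The remainder comes from MUL and SUB on the quotient rather than REMU,
   so the divider is used only once.
   Note: emRun does this in hand-written assembly; a C compiler is free to
   fold the subtraction back into a remainder instruction. */
static void udivmod32(uint32_t n, uint32_t d, uint32_t *q, uint32_t *r) {
  uint32_t quot = n / d;   /* the only DIVU             */
  *q = quot;
  *r = n - quot * d;       /* MUL + SUB instead of REMU */
}
```

With a fast multiplier this costs a few extra cycles per step, but it avoids running the comparatively slow divider twice, which is exactly the effect visible in the table above.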
Using the Zmmul extension
The RISC-V Zmmul extension introduces multiplication instructions only: no divider. Before Zmmul was ratified, some devices shipped in a “non-conforming” multiplier-only configuration. Now that Zmmul is ratified, those devices and this multiply-only ISA extension deserve special treatment.
In previous articles I discussed how to divide using a multiply-by-reciprocal technique. emRun offers speedy division for multiplier-only cores, and it performs exceptionally well:
Function     Min     Max     Avg     Description
-----------  ------  ------  ------  -------------------
__divdf3     246     250     249.2   3-cycle multiplier
__divdf3     391     395     394.2   8-cycle multiplier
__divdf3     623     627     626.2   16-cycle multiplier
Taking the case of a 3-cycle multiplier, emRun can divide two double-precision values in 249.2 cycles on average. Using a 16-cycle (2-quotient-bits-per-cycle) divider, emRun can divide two double-precision values in 221.4 cycles on average. The performance difference between the two is only about 12.5%. With this knowledge, it is possible to ship a core without a divider and, using emRun, still come close to the performance of a 2-quotient-bits-per-cycle divider!
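As a refresher on the technique from the earlier articles, the heart of multiply-by-reciprocal division is a Newton-Raphson refinement of a rough reciprocal estimate, carried out entirely with multiplications. The fixed-point sketch below is illustrative only; the helper recip_q31 and its Q-number formats are assumptions for this example, not emRun's tuned code:

```c
#include <stdint.h>

/* Newton-Raphson reciprocal in fixed point: x <- x * (2 - d * x).
   d_q32 holds a normalized divisor d in (0.5, 1) as a Q0.32 value;
   the result approximates 1/d in Q1.31. Illustrative sketch only. */
static uint32_t recip_q31(uint32_t d_q32) {
  uint32_t x = 0u - (d_q32 >> 1);  /* initial estimate x0 = 2 - d (Q1.31)  */
  for (int i = 0; i < 4; ++i) {    /* accuracy roughly doubles each step   */
    uint32_t t = (uint32_t)(((uint64_t)d_q32 * x) >> 32);  /* d * x        */
    uint32_t c = 0u - t;                                   /* 2 - d * x    */
    x = (uint32_t)(((uint64_t)x * c) >> 31);               /* x * (2-d*x)  */
  }
  return x;  /* multiplying the dividend by this yields the quotient */
}
```

Each 64-bit product here lowers to a MUL/MULHU pair on RV32, so the multiplier latency dominates, which is exactly what the 3-, 8-, and 16-cycle rows in the table above reflect. The production code must of course also deliver exact IEEE 754 rounding, which is where much of the engineering effort goes.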
Integer division and beyond!
emRun offers these specialisations for integer arithmetic too. But it goes further, utilizing the P extension for even faster multiplication and division along with more compact code. The multiple variants of the B extension are exploited in the same way, again delivering faster and more compact code.
Wrapping up
Reading this, I hope you have some appreciation of the effort that went into designing and implementing many of these algorithms. We have generic C implementations of all of them, plus optimized assembly-language versions for every Arm and RISC-V architecture.
And to our friends designing RISC-V cores: Division is not used as frequently as multiplication. As division can be done efficiently using multiply-by-reciprocal, it really is not worth putting in a hardware divider, unless the speed of division is important and the divider is very fast (or the cost of the additional silicon is irrelevant).
We have paid attention to the details in emRun. It’s just what SEGGER’s Embedded Experts do. And it simply works!