RISC-V: Dividing efficiently across different hardware

In previous blog posts I have described the division algorithms SEGGER implemented in emRun. However, which algorithm is best (in terms of code size, execution speed, or power efficiency) is very dependent on the target instruction set architectue (ISA) and the way the ISA is implemented in silicon. This article explains how we help to select the best division algorithm for use.

An ideal RISC-V CPU

For this article, we will assume execution on a hypothetical 32-bit RISC-V core with the following characteristics:

  • Simple arithmetic and logical instructions execute in a single cycle.
  • Loads from memory always hit the cache and execute in a single cycle.
  • Branches are never mispredicted and, whether taken or not, execute in a single cycle.
  • Multiply instructions complete in three cycles, which seems common for many ISA implementations.
  • Division instructions deliver 1, 2, or 4 bits of quotient per cycle, the exact number configured by the chip designer.

With these in place, we can turn to benchmarking division algorithms on hardware where the division instruction has different performance characteristics.

The benchmark framework

SEGGER have internal benchmarks to determine the performance of emRun. These benchmarks run on Embedded Studio’s simulator and measure performance of division algorithms, both 32-bit and 64-bit, using a spread of inputs to ensure performance is amortized.

I will focus on 64-bit floating-point division executed on a 32-bit processor without FPU.

emRun offers a number of configurations for division, from clockwork bit-by-bit algorithms to high-speed multiply-by-reciprocal algorithms.  I’ll explore these in the following sections.

Clockwork division

If you have no hardware multiplier and no hardware divider, there is no other sensible choice for division other than bit by bit.  With the SEGGER Embedded Studio simulator and the SEGGER emRun benchmark, the performance is:

Function        Min     Max     Avg    Description
-----------  ------  ------  ------    ------------------
__divdf3        648     843   750.2    Clockwork division

There isn’t really any chance to optimize this other than unrolling the inner loop, making the function much larger.

Using the M extension

The RISC-V M extension provides both 32-bit multiply and divide instructions and these can be effectively used to implement 64-bit division. emRun provides a number of implementations for this situation according to how your core is configured.

Division and remainder are pressed into use by the DIVU (unsigned divide) and REMU (unsigned remainder) instructions. One implementation of 53-bit by 53-bit division (used to divide double-precision significands) uses the schoolbook algorithm, delivering 10 or 11 bits of quotient per step, along with a final rounding bit.

Using DIVU and REMU in this way results in the following performance, depending upon the implementation of the hardware divider:

Function        Min     Max     Avg    Description
-----------  ------  ------  ------    -------------------------
__divdf3        437     452   441.4    1 quotient bit per cycle
__divdf3        277     292   281.4    2 quotient bits per cycle
__divdf3        197     212   201.4    4 quotient bits per cycle

Typically, DIVU and REMU are not fused and do not execute together, they execute sequentially. Without fusing, calculating a quotient and a remainder takes twice as long as it should because the division is performed twice, wasting time and energy.

Rather than using the REMU instruction to calculate a remainder, it is faster to use the multiplier and calculate the remainder manually. Using this technique improves performance at the expense of one additional instruction per division step, so five instructions in total:

Function        Min     Max     Avg    Description
-----------  ------  ------  ------    -------------------------
__divdf3        297     312   301.4    1 quotient bit per cycle
__divdf3        217     232   221.4    2 quotient bits per cycle
__divdf3        177     192   181.4    4 quotient bits per cycle

If performance matters, clearly using a fast multiplier to calculate the remainder after division is a win.

Using the Zmmul extension

The RISC-V Zmmul extension introduces multiplication instructions only: No divider. Before Zmmul was ratified, there were some devices shipped in “non-conforming” multiplier-only configuration. But now that Zmmul is ratified, the non-conforming devices and this multiply-only ISA extension deserve special treatment.

In previous articles I discussed how to divide using a multiply-by-reciprocal technique. emRun offers speedy division for multiplier-only cores, and it performs exceptionally well:

Function        Min     Max     Avg    Description
-----------  ------  ------  ------    -------------------
__divdf3        246     250   249.2    3-cycle multiplier
__divdf3        391     395   394.2    8-cycle multiplier
__divdf3        623     627   626.2    16-cycle multiplier

Taking the case of a 3-cycle multiplier, emRun can divide two double-precision values, on average, in 249.2 cycles. Using a 16-cycle (2-quotient-bits-per-cycle) divider, emRun can divide two double-precision values, on average, in 221.4 cycles. The performance difference between these two is about 12.5%. With this knowledge, it is possible to ship a core without a divider and still attain the performance of a 2-quotient-bits-per-cycle divider using emRun!

Integer division and beyond!

emRun offers these specialisations for integer arithmetic too. But it also takes it further, utilising the P extension for even faster multiplication and division along with more compact code. And the multiple variants of the B extension are deployed in the same way, again delivering faster and more compact code.

Wrapping up

Reading this, I hope you have some appreciation of the effort that went into designing and implementing many of these algorithms. We have them in generic C, and optimized assembly language versions of all of them for every Arm and RISC-V architecture.

And to our friends designing RISC-V cores: Division is not used frequently as multiplication. As division can be done efficiently using multiply-by-reciprocal, it really is not worth putting in a hardware divider, unless speed of division is important and the divider is very fast (or cost of the additional silicon is irrelevant).

We have paid attention to the details in emRun. It’s just what SEGGER’s Embedded Experts do. And it simply works!