Algorithms for division – part 1

This is the first in a series of articles that attempt to dispel the mystique around division algorithms. We do this by presenting coded algorithms that correctly divide, and describe how variations of these algorithms are used in emRun and emFloat, our C runtime and floating-point libraries.

Plain and simple division

Let’s start with some code. Most texts on division start with how we did it as kids using long division and how this is transformed into a program with loops, if-tests, termination conditions, and puzzlement over a troublesome zero.

That’s pretty dull when all you want is to see some working code. So here is that unvarnished code, dividing a 16-bit value u by a 16-bit value v, no loops, no termination conditions, no bullshit:

uint16_t Div_U16_U16(uint16_t u, uint16_t v) {
  uint16_t q;
  //
  q = 0;
  if (v <= (u >> 15)) { u -= v << 15; q += 1 << 15; }
  if (v <= (u >> 14)) { u -= v << 14; q += 1 << 14; }
  if (v <= (u >> 13)) { u -= v << 13; q += 1 << 13; }
  if (v <= (u >> 12)) { u -= v << 12; q += 1 << 12; }
  if (v <= (u >> 11)) { u -= v << 11; q += 1 << 11; }
  if (v <= (u >> 10)) { u -= v << 10; q += 1 << 10; }
  if (v <= (u >>  9)) { u -= v <<  9; q += 1 <<  9; }
  if (v <= (u >>  8)) { u -= v <<  8; q += 1 <<  8; }
  if (v <= (u >>  7)) { u -= v <<  7; q += 1 <<  7; }
  if (v <= (u >>  6)) { u -= v <<  6; q += 1 <<  6; }
  if (v <= (u >>  5)) { u -= v <<  5; q += 1 <<  5; }
  if (v <= (u >>  4)) { u -= v <<  4; q += 1 <<  4; }
  if (v <= (u >>  3)) { u -= v <<  3; q += 1 <<  3; }
  if (v <= (u >>  2)) { u -= v <<  2; q += 1 <<  2; }
  if (v <= (u >>  1)) { u -= v <<  1; q += 1 <<  1; }
  if (v <= (u >>  0)) { u -= v <<  0; q += 1 <<  0; }
  //
  return q;
}

Even an engineer who has never contemplated division algorithms before can see the pattern and scale it up to 32/32 and 64/64 division, or down to 8/8.
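
As an illustration, here is the same pattern scaled down to 8/8 division (a minimal sketch; the 32/32 and 64/64 versions follow the same shape, just with more steps):

uint8_t Div_U8_U8(uint8_t u, uint8_t v) {
  uint8_t q;
  //
  q = 0;
  if (v <= (u >> 7)) { u -= v << 7; q += 1 << 7; }
  if (v <= (u >> 6)) { u -= v << 6; q += 1 << 6; }
  if (v <= (u >> 5)) { u -= v << 5; q += 1 << 5; }
  if (v <= (u >> 4)) { u -= v << 4; q += 1 << 4; }
  if (v <= (u >> 3)) { u -= v << 3; q += 1 << 3; }
  if (v <= (u >> 2)) { u -= v << 2; q += 1 << 2; }
  if (v <= (u >> 1)) { u -= v << 1; q += 1 << 1; }
  if (v <= (u >> 0)) { u -= v << 0; q += 1 << 0; }
  //
  return q;
}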

If you want to consider the workings of this, be my guest. Can you see what happens when dividing by zero? And does that align with what you might expect?  Try tracing the algorithm dividing 0x8000 by 3, and then 6 by 3 — do you notice something fundamental about the algorithm, an opportunity for improvement?
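
If you would rather experiment than dry-run it on paper, a throwaway harness along these lines (a sketch, assuming Div_U16_U16() from above is in the same file) lets you single-step in a debugger and watch u and q develop:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  printf("0x8000 / 3 = %u\n", (unsigned)Div_U16_U16(0x8000, 3));
  printf("     6 / 3 = %u\n", (unsigned)Div_U16_U16(6, 3));
  return 0;
}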

This open-coded algorithm is incredibly useful in more-complex division algorithms, and variants of it appear in both emRun and emFloat.

What’s more, it is realized by very tight code for Arm processors: each step is just three instructions. If you know Arm assembly language, the code writes itself, the first step (of 16) being:

cmp   r1, r0, lsr #15          // r0 holds u, r1 holds v
subls r0, r0, r1, lsl #15      // Conditionally subtract
addls r2, r2, #1<<15           // Conditionally add, r2 is the developing quotient

How elegant is that? I am always amazed at what you can do with conditional execution and the shifter hanging off the end of an instruction: both are powerful tools provided by the architecture.

For 16 bits, the division algorithm on Arm requires 16 steps of three instructions each, so 48 instructions in all; initializing q, moving it into the return register, and the function return add a further three instructions, bringing the total to 51. (Note that this is hand-written assembly: supplying the C code to a compiler results in a function that is a little larger than 51 instructions.)

Arm executes each of these instructions in a single cycle, so 51 cycles for a 16/16 division. To be clear, that’s a constant 51 cycles for any value of u and any value of v we divide. Is there a change we can make that would improve efficiency overall? If you’ve been following along, it turns out there is.

Eliminating unproductive work

If you examine the algorithm above and carry out an old-school dry run of the code, or execute it under a debugger, you soon realize that for certain pairs of values much of the code makes no forward progress toward the quotient: all the action happens towards the end of the function.

This isn’t a coincidence. Dividing u, an m-bit number, by v, an n-bit number, produces a quotient q that has no more than m−n+1 bits; call this bound k. For example, a 12-bit value such as 2903 divided by a 4-bit value such as 15 results in a quotient of at most 12−4+1 = 9 bits; indeed, 2903/15 = 193, a value that fits in 8 bits.

In the simple division routine, code at the head of the function dealing with quotient bit positions k and above contributes nothing to the quotient and can therefore be skipped. What can we do, or use, to detect this condition?

Many processors have a wonderfully useful instruction to count the number of leading zeros in a register. There’s no standard C function to do this for you, but let’s assume that a call to clz() does.
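
A minimal sketch of such a helper, assuming a GCC- or Clang-style compiler: __builtin_clz() counts the leading zeros of a (here 32-bit) unsigned int and is undefined for zero, so that case is handled explicitly.

static int clz(uint16_t x) {
  if (x == 0) {
    return 16;                   // Treat clz(0) as the full 16-bit width
  }
  return __builtin_clz(x) - 16;  // Built-in counts over 32 bits; adjust down to 16
}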

The number of bits required to represent a nonzero value x in a processor with 16-bit registers is 16−clz(x). With this we can calculate m and n from u and v and therefore derive k:

m = 16 - clz(u);
n = 16 - clz(v);
k = m - n + 1;

The calculation can be further simplified by algebra:

(16 - clz(u)) - (16 - clz(v)) + 1  =  16 - clz(u) - 16 + clz(v) + 1
                                   =  -clz(u) + clz(v) + 1
                                   =  clz(v) - clz(u) + 1

If n is greater than m then the divisor v is greater than u and k drops to zero or below: a simple test that tells us the quotient is zero!

How can we turn this new knowledge into code? Use a switch statement! The revised code is:

uint16_t Div_U16_U16_Switched(uint16_t u, uint16_t v) {
  uint16_t q;
  int      k;
  //
  q = 0;
  k = clz(v) - clz(u);  // Note: don't add 1, adjust case values
  //
  switch (k) {
  case 15: if (v <= (u >> 15)) { u -= v << 15; q += 1 << 15; }
  case 14: if (v <= (u >> 14)) { u -= v << 14; q += 1 << 14; }
  case 13: if (v <= (u >> 13)) { u -= v << 13; q += 1 << 13; }
  case 12: if (v <= (u >> 12)) { u -= v << 12; q += 1 << 12; }
  case 11: if (v <= (u >> 11)) { u -= v << 11; q += 1 << 11; }
  case 10: if (v <= (u >> 10)) { u -= v << 10; q += 1 << 10; }
  case  9: if (v <= (u >>  9)) { u -= v <<  9; q += 1 <<  9; }
  case  8: if (v <= (u >>  8)) { u -= v <<  8; q += 1 <<  8; }
  case  7: if (v <= (u >>  7)) { u -= v <<  7; q += 1 <<  7; }
  case  6: if (v <= (u >>  6)) { u -= v <<  6; q += 1 <<  6; }
  case  5: if (v <= (u >>  5)) { u -= v <<  5; q += 1 <<  5; }
  case  4: if (v <= (u >>  4)) { u -= v <<  4; q += 1 <<  4; }
  case  3: if (v <= (u >>  3)) { u -= v <<  3; q += 1 <<  3; }
  case  2: if (v <= (u >>  2)) { u -= v <<  2; q += 1 <<  2; }
  case  1: if (v <= (u >>  1)) { u -= v <<  1; q += 1 <<  1; }
  case  0: if (v <= (u >>  0)) { u -= v <<  0; q += 1 <<  0; }
  default: break;
  }
  //
  return q;
}

Note that I eliminated the +1 and adjusted the case values accordingly. Also worth mentioning is that there are no break statements at the end of each case: all cases fall through. Some believe that missing breaks in case clauses are evil, but I think the code here is pretty elegant. In the same vein, the default case doesn’t need to be there, as the function is well defined without it, but compilers and source code analysis tools will warn that you probably forgot it, just as they warn about the missing break statements.
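
If you do want to keep those tools quiet, an explicit marker usually suffices: GCC’s -Wimplicit-fallthrough recognizes a comment of the form shown below, and C23 adds a [[fallthrough]] attribute for the same purpose. A sketch of a single case:

case 15: if (v <= (u >> 15)) { u -= v << 15; q += 1 << 15; }
         /* fall through */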

This compiles with only a few more instructions at the head of the function, but the improvement is well worth having: over the entire range of magnitudes of u and v, it can provide an average speedup of 2x over the unswitched code.
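
Checking correctness against the compiler’s own division operator is straightforward; an exhaustive (if slow) test might look like this sketch, which assumes the functions above are in the same file:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t u, v;
  //
  for (u = 0; u <= 0xFFFF; ++u) {
    for (v = 1; v <= 0xFFFF; ++v) {    // Skip v == 0, division by zero
      if (Div_U16_U16_Switched((uint16_t)u, (uint16_t)v) != u / v) {
        printf("Mismatch at %u / %u\n", (unsigned)u, (unsigned)v);
        return 1;
      }
    }
  }
  printf("All quotients match.\n");
  return 0;
}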

What’s next?

These functions translate very efficiently to the Arm instruction set, but other architectures are not so fortunate.

RISC-V processors that do not have a divide instruction (or are configured without one) compare poorly to Arm: RISC-V has no predicated execution and no shifter operand, leading to five or six instructions (depending on the step) per developed quotient bit. And one of those instructions is a branch. Recall that Arm required three instructions and no branch.

When developing a quotient bit, the branch is taken roughly 50% of the time, thereby defeating any branch predictor. To compound the disadvantage, the base RV32I instruction set and all ratified instruction set extensions (at the time of writing) offer no CLZ instruction, so our switched divide algorithm can’t deploy CLZ to improve performance. Devastating!
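
To get a feel for what synthesizing it would cost, here is a minimal sketch of a portable software clz() of the kind a compiler or library must fall back on; every call is a string of tests and shifts, overhead that eats into whatever the switch saves:

static int clz16_soft(uint16_t x) {    // Software fallback when there is no CLZ instruction
  int n;
  //
  if (x == 0) {
    return 16;
  }
  n = 0;
  if ((x & 0xFF00u) == 0) { n += 8; x <<= 8; }
  if ((x & 0xF000u) == 0) { n += 4; x <<= 4; }
  if ((x & 0xC000u) == 0) { n += 2; x <<= 2; }
  if ((x & 0x8000u) == 0) { n += 1; }
  return n;
}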

Architectures that have small register widths, such as 8-bit AVR, need multiple instructions to handle 16-bit data, making the unrolled code larger still on devices with constrained program sizes.

Therefore we need to address better efficiency for RISC-V and smaller code size for architectures such as AVR. The next article will consider how we can achieve these goals. With code, of course!

Addendum

For those curious to compare RISC-V to Arm, here is the first division step coded for RV32I, with u in a0, v in a1, and the quotient q in a3:

srli  a2, a0, 15        // a2 = u >> 15
bgtu  a1, a2, .Lskip    // Skip the step if v > (u >> 15)
slli  a2, a1, 15        //   a2 = v << 15
sub   a0, a0, a2        //   u -= v << 15
li    a2, 1<<15         //   a2 = 1 << 15
add   a3, a3, a2        //   q += 1 << 15
.Lskip:

If you don’t want to incur branch penalties, the branch-free implementation is:

srli  t0, a0, 15        // t0 = u >> 15
sltu  t0, t0, a1        // t0 = (u >> 15) < v [t0 is now 0 or 1]
addi  t0, t0, -1        // t0 = -1 if v <= (u >> 15), else 0 [the condition mask]
slli  a2, a1, 15        // a2 = v << 15
and   a2, a2, t0        // a2 = v << 15 or 0, depending on condition
sub   a0, a0, a2        // u -= v << 15, if condition true
li    a2, 1<<15         // a2 = 1 << 15
and   a2, a2, t0        // a2 = 1 << 15 or 0, depending on condition
add   a3, a3, a2        // q += 1 << 15, if condition true
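
For reference, the same masking trick can be expressed in C. Replacing the body of the bit-15 step of Div_U16_U16 with something like this sketch removes the branch at that step too:

uint32_t mask;
//
mask = 0u - (v <= (u >> 15));    // All ones if v <= (u >> 15), zero otherwise
u   -= (v << 15) & mask;         // Conditional subtract, no branch
q   += (1u << 15) & mask;        // Conditional add, no branch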