Algorithms for division – part 2 – Classics

This article will explore the classic division algorithms and how they can be implemented efficiently, in terms of code space and execution time, by exploiting machine features. Don’t expect anything astounding here, the algorithms are classic for a reason: this path is well-trodden.

Division by clockwork

The classic division algorithm for binary arithmetic is the shift-subtract algorithm. This algorithm continually subtracts multiples of the divisor from the dividend to calculate the quotient. It does this by shifting the divisor and testing whether it can be subtracted from the dividend, recording quotient digits as it goes.

What follows is that classic algorithm for division of values that fit into the unsigned type. For 32-bit processors such as Arm, this will be 32 bits, and for 16-bit and 8-bit processors, this will be 16 bits. There are complications for 64-bit processors where an unsigned might be 32 bits or 64 bits, best not to make assumptions.

Here’s the code:

unsigned Div_Restoring(unsigned u, unsigned v) {
  unsigned q;
  int      k;
  //
  // Early-out for zero quotient.
  //
  if (u < v) {
    return 0;
  }
  //
  // Set up for division loop.
  //
  k   = clz(v) - clz(u);  // Calculate number of quotient digits (minus 1)
  v <<= k;                // Normalize divisor
  q   = 0;                // Initialize developing quotient
  //
  // Iterate k+1 times, each iteration developing one quotient bit.
  //
  do {
    u -= v;               // Trial subtraction
    if ((int)u >= 0) {    // u didn't turn negative from subtraction
      q <<= 1;            // Record '1' quotient digit
      q += 1;
    } else {
      u += v;             // u turned negative, restore u by adding back v
      q <<= 1;            // Record '0' quotient digit
    }
    v >>= 1;
  } while (--k >= 0);
  //
  return q;
}

This implementation always subtracts v from u and:

  • If the result turns negative, v can’t be subtracted from u, so record a zero quotient digit and restore v by adding back u.
  • If the result is positive, v is successfully subtracted from u, so record a one quotient digit.

At the point the quotient is returned, the remainder after division can be found in u, should you need this.

Experienced C programmers will immediately spot opportunities for change, making the code a little more elegant. Rather than always run a trial subtraction, the loop can be rewritten to be nondestructive:

unsigned Div_Restoring(unsigned u, unsigned v) {
  unsigned q;
  int      k;
  //
  // Early-out for zero quotient.
  //
  if (u < v) {
    return 0;
  }
  //
  // Set up for division loop.
  //
  k   = clz(v) - clz(u);   // Calculate number of quotient digits (minus 1)
  v <<= k;                 // Normalize divisor
  q   = 0;                 // Initialize developing quotient
  //
  // Iterate k+1 times, each iteration developing one quotient bit.
  //
  do {
    q <<= 1;               // Record preliminary '0' quotient digit
    if (u >= v) {          // Subtraction will succeed...
      u -= v;              // ...so do it
      q += 1;              // Turn preliminary '0' quotient digit to '1'
    }
    v >>= 1;
  } while (--k >= 0);
  //
  return q;
}

Translating this to assembly language is no real hardship and for engineers familiar with low-level details of their target machine: shifts, subtracts, and adds fall neatly into line using the carry flag output.

Here is a translation of this for Arm instruction set, extended to 32/32 division as this is the natural word size of the processor; on entry, r0=u, r1=v; the quotient is developed into r3 and finally returned in r0:

    clz   r2, r1        // k = clz(v) - clz(u)
    clz   r3, r0
    cmp   r0, r1
    movlo r0, #0        // if (u < v) return 0;
    bxlo  lr
    lsl   r1, r1, r2    // v <<= k
    mov   r3, #0        // q = 0
.Loop:
    cmp   r0, r1        // if (u >= v)
    subcs r0, r0, r1    // ... u -= v
    adc   r3, r3, r3    // q <<= 1; q += 1/0 if u >= v true/false
    lsr   r1, r1, #1    // v >>= 1
    subs  r2, r2, #1    // while (--k >= 0)
    bpl   .Loop
    mov   r0, r3        // return q
    bx    lr

The notable features of this translation are that the quotient digit is delivered in the carry flag by the compare, which is then shifted into the quotient using an add with carry (ADC) instruction, and that the subtraction of v from u is conditional, again based on the carry flag: the SUBCS instruction will only subtract if the carry is set. The Arm translation is a nicely compact 15 instructions. Compared to the unrolled versions of 51 instructions, it is surely a saving in code size.

Avoiding branches

Arm has predicated instructions that can be used in the place of a branch, but some architectures don’t have this capability. A branch-free, word-size-independent version of the algorithm can be coded in C like this:

#define BPU  (sizeof(unsigned)*8)   // Number of bits per unsigned

unsigned Div_Restoring(unsigned u, unsigned v) {
  unsigned q;
  int      k;
  //
  // Early-out for zero quotient.
  //
  if (u < v) {
    return 0;
  }
  //
  // Set up for division loop.
  //
  k   = clz(v) - clz(u);         // Calculate number of quotient digits (minus 1)
  v <<= k;                       // Normalize divisor
  q   = 0;                       // Initialize developing quotient
  //
  // Iterate k+1 times, each iteration developing one quotient bit.
  //
  do {
    q <<= 1;                     // Record preliminary '0' quotient digit
    f = (int)(u - v) >> (BPU-1); // f is -1/0 if u >= v is true/false
    q -= f;                      // Adjust recording for correct quotient bit
    u -= v & f;                  // Conditionally subtract v from u
    v >>= 1;
  } while (--k >= 0);
  //
  return q;
}

Assuming that RV32I offers a CLZ instruction (provided by the B and P extensions), the corresponding 32-bit division code for RISC-V would be:

    bltu  a0, a1, .Zero   // if (u < v) return 0
    clz   a2, a1          // k = clz(v) - clz(u)
    clz   a3, a0        
    sub   a2, a2, a3    
    sll   a1, a1, a2      // v <<= k
    li    a3, 0           // q = 0
.Loop:                  
    slli  a3, a3, 1       // q <<= 1
    sub   t0, a0, a1      // t0 is 0xFFFFFFFF/0 if u >= v is true/false
    srai  t0, t0, 31
    sub   a3, a3, t0      // q -= f
    and   t0, t0, a1      // u -= v & f
    sub   a0, a0, t0
    srli  a1, a1, 1       // v >>= 1
    addi  a2, a2, -1      // while (--k >= 0)
    bgez  a2, .Loop     
    mv    a0, a3        
    ret
.Zero:                  
    li    a0, 0         
    ret                 

Doing away with the count

It’s possible to eliminate the count, k, when the shift-subtract division loop is iterated. Rather than count down, the quotient shift register is initialized in such a way that when a quotient bit is shfted in, the bit shifted out is used as a condition to terminate the loop. For Arm, and other processors that have condition codes and a carry bit, the implementation of this is very natural.

Here’s the C code that implements the division:

#define BPU  (sizeof(unsigned)*8)   // Number of bits per unsigned
#define MSB  (1u << (BPU-1))        // Most-significant-bit mask

unsigned Div_RestoringFlag(unsigned u, unsigned v) {
  unsigned q, done;
  int      k;
  //
  // Early-out for zero quotient.
  //
  if (u < v) {
    return 0;
  }
  //
  // Set up for division loop.
  //
  k   = clz(v) - clz(u);   // Calculate number of quotient digits (minus 1)
  v <<= k;               // Normalize divisor
  q   = MSB >> k;        // Set sentinel bit in quotient register
  //
  // Iterate k+1 times, each iteration developing one quotient bit.
  //
  do {
    done = q & MSB;      // Test sentinel bit
    q <<= 1;             // Record preliminary '0' quotient digit
    if (u >= v) {        // subtraction will succeed...
      u -= v;            // ...so do it
      q += 1;            // Turn preliminary '0' quotient digit to '1'
    }
    v >>= 1;
  } while (done == 0);
  //
  return q;
}

It doesn’t look so great in C with the termination condition extracted at the head of the loop and tested at the end of the loop. It is improved by peeling away the last iteration of the existing loop and handling the final quotient digit separately:

#define BPU  (sizeof(unsigned)*8)   // Number of bits per unsigned
#define MSB  (1u << (BPU-1))        // Most-significant-bit mask

unsigned Div_RestoringFlag(unsigned u, unsigned v) {
  unsigned q, done;
  int      k;
  //
  // Early-out for zero quotient.
  //
  if (u < v) {
    return 0;
  }
  //
  // Set up for division loop.
  //
  k   = clz(v) - clz(u);     // Calculate number of quotient digits (minus 1)
  v <<= k;                   // Normalize divisor
  q   = MSB >> k;            // Set sentinel bit in quotient register
  //
  // Iterate k+1 times, each iteration developing one quotient bit.
  //
  while ((q & MSB) == 0) {   // Test sentinel bit
    q <<= 1;                 // Record preliminary '0' quotient digit
    if (u >= v) {            // subtraction will succeed...
      u -= v;                // ...so do it
      q += 1;                // Turn preliminary '0' quotient digit to '1'
    }
    v >>= 1;
  }
  //
  // Record final quotient digit.
  //
  q <<= 1;
  q += u >= v;
  //
  return q;
}

The original version, when translated to Arm code for 32-bit division), decreaes the number of instructions per quotient bit developed, improving performance:

    clz   r2, r1        // k = clz(v) - clz(u)
    clz   r3, r0
    subs  r2, r2, r3
    movmi r0, #0        // if (k < 0) return 0;
    bxmi  lr
    lsl   r1, r1, r2    // v <<= k
    mov   r3, #1<<31    // q = 0x80000000 >> k
    lsr   r3, r3, r2
.Loop:
    cmp   r0, r1        // if (u >= v)
    subcs r0, r0, r1    // ... u -= v
    adcs  r3, r3, r3    // q <<= 1; q += 1/0 if u >= v true/false and...
                        // ...sentinel bit for 'done' is shfited into carry!
    lsr   r1, r1, #1    // v >>= 1
    bcc   .Loop         // while (done == 0)
    mov   r0, r3        // return q
    bx    lr

Clearly, when optimizing for size, this isn’t such a bad implementation to have, and you can choose it as one of the algorithms in emFloat and emRun when optimizing for minimal code size.

Nonrestoring division

The following algorithm is called nonrestoring because the subtraction of the divisor from the dividend is always executed and, if the subtraction underflows, the dividend is never restored. For some processors, nonrestoring division is faster than the restoring version, but for more recent architectures such as Arm and RISC-V, nonrestoring division is no more efficient than well-coded restoring division. It happens that v6M benefits from non-restoring division and is selectable in emRun.

Here is the algorithm, for completeness:

#define BPU  (sizeof(unsigned)*8)   // Number of bits per unsigned
#define MSB  (1u << (BPU-1))        // Most-significant-bit mask

unsigned Div_Nonrestoring(unsigned u, unsigned v) {
  unsigned q;
  int      k;
  //
  // Early-out for zero quotient.
  //
  if (u < v) {
    return 0;
  }
  //
  // Set up for division loop.
  //
  k   = clz(v) - clz(u);           // Calculate number of quotient digits (minus 1)
  v <<= k;                         // Normalize divisor
  q   = 0;                         // Initialize developing quotient
  //
  // Enter with positive dividend for first trial subtraction.
  //
  for (;;) {
    u -= v;                        // Trial subtraction
    if ((u & MSB) != 0) goto add; // If dividend turned negative, go to -ve dividend loop
sub:
    q <<= 1;                       // Record '1' quotient digit
    q  += 1;
    if (--k < 0) return q;        // Termination when all iterations run
    v >>= 1;
  }
  for (;;) {
    u += v;                        // Trial addition.
    if ((u & MSB) == 0) goto sub; // If dividend turned positive, go to +ve dividend loop
add:
    q <<= 1;                       // Record '0' quotient digit
    if (--k < 0) return q;        // Termination when all iterations run
    v >>= 1;
  }
}

You see two loops: one loop keeps subtracting whilst the dividend remains positive and, when it turns negative, it enters a second loop that keeps adding back whilst the dividend remains negative, at which point control is transferred to the loop dealing with positive dividends.

Execution bounces between these two loops, with quotient digits being developed and values tested. Writing this code without using goto is possible, but the symmetry of the two loops is lost.

Conclusion and what’s next…

The algorithms presented here are not revolutionary, they’re classic, and they all share the same DNA: shift and subtract. They can be translated into machine instructions in many ways with opportunities to optimize using facets of the target archtecture that are not exposed at the C level — for example, the rotation of a register through the carry flag, Arm’s ADCS instruction.

But no matter what amazing low-level architectural bit-trickery is employed, all algorithms presented so far are bit-serial and limited in performance: developing a quotient one bit at a time is painfully slow.

The next article will show how it’s possible to divide faster, and things turn far more intersting from here!