Algorithms for division – part 3 – Using multiplication

We’ve explored simple algorithms that develop a quotient one digit at a time: reliable, understandable, but ultimately quite slow. Now it’s time to change up a couple of gears and turbocharge division for speed!

For those of you just interested in the code, and not the background or explanation, jump right to the end and take what you need.

What’s the problem with dividing?

Cheap low-end microcontrollers typically offer some form of multiplication ability (by instruction or by a peripheral), but don’t offer a complementary divide capability. Why is that?

Division in hardware can be slow, using the shift-subtract algorithms we’ve seen. And a multi-cycle uninterruptible divide increases interrupt latency. This type of implementation really does exist which might be a deal-breaker for some applications.

Division in hardware can also be complex, requiring significant die space dedicated to efficient algorithms which require a lot of computation. This die space and complexity isn’t included for many microcontrollers simply because division is infrequently used and doesn’t justify the cost to include it. Besides, all this complexity can be offloaded to software to sort out!

Multiplication in hardware is cheap compared to division, and most multipliers are fast, delivering their result in a few cycles, or even a single cycle. This fact is the foundation of many complex numerical algorithms in emFloat, and is leveraged in emRun to provide fast integer division on processors that don’t have division instructions.

Dividing by multiplying

To improve division speed our aim is to produce more than “one bit per step”.

We start by rewriting the division u / v as a multiplication, u × (1 / v). On the face of it, this really doesn’t look like progress, there’s still a division in there, 1 / v. But we’re going to make this work.

Let’s take, by way of example, a division of 6 by 3. On any calculator, if you enter “6 ÷ 3 =” you would expect to see “2“. But from calculators of days long-gone, we intuitively know that entering the algebraically-identical “1 ÷ 3 × 6 =” does not display “2“. The reason for this is clear, 1/3 is a recuring decimal and the calculator truncates the true value of 1/3 (which is 0.333333…) after 8 or 10 decimal places, working with limited precision.

It would seem that even if we can calculate a reciprocal 1/3, the endeavor to find a quotient is still futile because the inaccurate reciprocal, when multiplied by 3, will result in a number that is not 2. That is entirely true, but it turns out that we can tolerate not-entirely-accurate reciprocals, use some software artistry and calculate an entirely accurate quotient. So let’s do it!

Calculating a reciprocal by looking it up

Let’s take a 16/16 unsigned division as a way to pare down the problem to something that is tractable. We are going to work in fixed-point arithmetic using scaled integers. This might sound scary, but all it means is that each value we work with is scaled by a factor 2n, i.e. is shifted right by n bits, which we must track explicitly or implicitly.

Rather than show the mathematics of this, it far easier to show by example.

Let’s take our candidate division, u/v. Rather than calculate the reciprocal of v, I’m just going to look it up, it’s fast and it’s compact: I’ll use the leading eight bits of v to index into a table of pre-calculated reciprocals. First off, I need to normalize v and record the number of bits of normalization which I will need later:

n = clz16(v);  // Count leading zeros of a nonzero 16-bit value (0..15)
v <<= n;

The table lookup uses the leading eight bits of as an index. But I know that the most-significant bit of v is one after normalization, so those eight bits only take values 0x80 through 0xFF and therefore the table requires 128 entries:

uint16_t _aRecip_7i_16o[128] = { ... };

The reciprocal lookup is therefore:

n = clz16(v);
r = _aRecip_7i_16o[(v << n >> 8) - 0x80];

The value from the table is an approximation of the reciprocal of v to 16 bits.

Let’s deal with what’s stored in that reciprocal table.

Populating the lookup table

The leadng eight bits of the normalised value of v, call it , can be thought of as a value between 1 and 2 by placing a binary point immediately after the leading one bit. This is equivalent to scaling by 2-7. Using this, 0x80 represents 128/128 = 1.0 and 0xFF represents 255/128 ≈ 1.99. For those familiar with fixed-point types, this is simply unsigned Q1.7 format.

The reciprocal of is a value between 0.5 and 1. The reciprocal stored in the lookup table is naturally represented as a 16-bit value scaled by 2-16, i.e. unsigned Q16 format.

But here lies a problem. The reciprocal of 1 is 1, and 1 isn’t representable in Q16 format. As the reciprocal is only an approximation anyway, each reciprocal entry is adjusted to be 0xFFFF / . The reciprocal of 1.0 is therefore stored as 0xFFFF and is 65535/65536 ≈ 0.9999, close enough because all we require is an approximation (which is corrected for later).

Calculating this using a program, the reciprocal table is as follows:

static const uint16_t _aRecip_7i_16o[128] = {
  0xFFFF, 0xFE03, 0xFC0F, 0xFA23, 0xF83E, 0xF660, 0xF489, 0xF2B9,
  0xF0F0, 0xEF2E, 0xED73, 0xEBBD, 0xEA0E, 0xE865, 0xE6C2, 0xE525,
  0xE38E, 0xE1FC, 0xE070, 0xDEE9, 0xDD67, 0xDBEB, 0xDA74, 0xD901,
  0xD794, 0xD62B, 0xD4C7, 0xD368, 0xD20D, 0xD0B6, 0xCF64, 0xCE16,
  0xCCCC, 0xCB87, 0xCA45, 0xC907, 0xC7CE, 0xC698, 0xC565, 0xC437,
  0xC30C, 0xC1E4, 0xC0C0, 0xBFA0, 0xBE82, 0xBD69, 0xBC52, 0xBB3E,
  0xBA2E, 0xB921, 0xB817, 0xB70F, 0xB60B, 0xB509, 0xB40B, 0xB30F,
  0xB216, 0xB11F, 0xB02C, 0xAF3A, 0xAE4C, 0xAD60, 0xAC76, 0xAB8F,
  0xAAAA, 0xA9C8, 0xA8E8, 0xA80A, 0xA72F, 0xA655, 0xA57E, 0xA4A9,
  0xA3D7, 0xA306, 0xA237, 0xA16B, 0xA0A0, 0x9FD8, 0x9F11, 0x9E4C,
  0x9D89, 0x9CC8, 0x9C09, 0x9B4C, 0x9A90, 0x99D7, 0x991F, 0x9868,
  0x97B4, 0x9701, 0x964F, 0x95A0, 0x94F2, 0x9445, 0x939A, 0x92F1,
  0x9249, 0x91A2, 0x90FD, 0x905A, 0x8FB8, 0x8F17, 0x8E78, 0x8DDA,
  0x8D3D, 0x8CA2, 0x8C08, 0x8B70, 0x8AD8, 0x8A42, 0x89AE, 0x891A,
  0x8888, 0x87F7, 0x8767, 0x86D9, 0x864B, 0x85BF, 0x8534, 0x84A9,
  0x8421, 0x8399, 0x8312, 0x828C, 0x8208, 0x8184, 0x8102, 0x8080
};

Approximating the quotient

The quotient q = u / v is calculated by multiplying the reciprocal estimate by the dividend, u. Our reciprocal is in Q16 format, u is a 16-bit integer, and the product of this, the quotient, is a Q16.16 value:

uint32_t qx;
uint16_t q;

qx = (uint32_t)u * r;           // Q16.16 = U16 * Q16
q  = (uint16_t)(qx >> 16);      // U16 = trunc(Q16.16)
q  = q >> (15-n);               // Undo normalization of v

The capability to multiply of two 16-bit values to a 32-bit value is generally available in microcontrollers, even small ones: The 16-bit MSP430 has a 16×16 to 32 multiplier peripheral, for instance. However, we’re discarding the low 16 bits of the product, so a processor with a “multiply high” instruction would be perfect for this.

Here, the quotient is not exact as the reciprocal estimate isn’t exact. It needs correction.

Correcting the quotient

To correct the quotient we use the division theorem: there is a unique solution to q×u + r = v with 0 ≤ |r| < v. If q and r are calculated incorrectly, they can be corrected.

Our estimate of q may be too high (e.g. 0x201 / 0x101 gives an estimate q=2 but the true value is q=1), or may be too low.

Calculation of the remainder is, at first glance, straightforward:

u -= q * v;

But we must be careful as the calculation q × v may overflow 16 bits; and in fact it does with u=65137, v=1111 estimating q=59 and then 59*1111 = 65549.

We proceed by assuming that the quotient is always too high and correct it later if we find it’s too low:

if (q > 0) {  // Assume too high, avoid overflow by underestimating quotient
  --q;
}
u -= q * v;  // Can never overflow

If q and u don’t satisfy the quotient theorem, they are adjusted until they do:

while (u >= v) {  // too low
  q += 1;
  u -= v;
}

How many times does this need to be done? In this case, a maximum of two times (either induced by correcting for possible overflow, or by inherent inaccuracy in the calculated quotient), so the loop can be rewritten as straightline code:

if (u >= v) {
  q += 1;
  u -= v;
  if (u >= v) {
    q += 1;
  }
}

And that’s all there is too it. Now, bring it all together…

The final code

Here is the entire function, minus the lookup table which is found above:

unsigned Div_U16_U16_Reciprocal_Plain(unsigned u, unsigned v) {
  uint16_t q, r;
  uint32_t qx;
  unsigned n;
  //
  // Calculate estimate of quotient using lookup table.
  //
  n  = clz16(v);                  // Count leading zeros of a nonzero 16-bit value (0..15)
  r  = _aRecip_7i_16o[(v << n >> 8) - 0x80];
  qx = (uint32_t)u * r;           // Q16.16 = U16 * Q16
  q  = (uint16_t)(qx >> 16);      // U16 = trunc(Q16.16)
  q  = q >> (15-n);               // Undo normalization of v
  //
  // Prevent potential overflow in calculation of remainder;
  // also compensates for quotient "one too high."
  // 
  if (q > 0) {
    --q;
  }
  //
  // Calculate remainder to u.
  //
  u -= q * v;
  //
  // Final correction of quotient to satisfy division theorem.
  //
  if (u >= v) {                   // Failed to satisfy division theorem?
    q += 1;                       // Adjust quotient and remainder.
    u -= v;
    if (u >= v) {                 // Failed again?
      q += 1;                     // Remainder u not updated...
    }
  }
  //
  // Return a fully accurate quotient.
  //
  return q;
}

Measuring up

How does this algorithm translate on real hardware? If we target the Arm v7A architecture (implementations of which do not have a division instruction, for example Cortex-A9), the output is very clean, all straight-line code:

  0x000000FA  FAB1 F281    CLZ        R2, R1
  0x000000FE  FA01 F302    LSL.W      R3, R1, R2
  0x00000102  0E1B         LSRS       R3, R3, #24
  0x00000104  F640 5C10    MOVW       R12, #0x0D10             ; 0x00000D10 = _aRecip_7i_16o
  0x00000108  F2C0 0C00    MOVT       R12, #0                  ; 0x00000D10 = _aRecip_7i_16o
  0x0000010C  EB0C 0343    ADD.W      R3, R12, R3, LSL #1
  0x00000110  F06F 0CFF    MVN        R12, #255
  0x00000114  F1C2 021F    RSB        R2, R2, #31
  0x00000118  F833 300C    LDRH.W     R3, [R3, R12]
  0x0000011C  4343         MULS       R3, R0, R3
  0x0000011E  FA33 F202    LSRS.W     R2, R3, R2
  0x00000122  BF18         IT         NE
  0x00000124  3A01         SUBNE      R2, #1
  0x00000126  FB02 0011    MLS        R0, R2, R1, R0
  0x0000012A  1A43         SUBS       R3, R0, R1
  0x0000012C  428B         CMP        R3, R1
  0x0000012E  419B         SBCS       R3, R3
  0x00000130  3302         ADDS       R3, #2
  0x00000132  4288         CMP        R0, R1
  0x00000134  BF28         IT         CS
  0x00000136  441A         ADDCS      R2, R3
  0x00000138  4610         MOV        R0, R2
  0x0000013A  4770         BX         LR

If we target the v6M architecture, this specialist division algorithm outperforms the general division algorithm supplied in the SEGGER C library: 52 instructions executed versus 70 for the library. But of course, the library version is a general 32/32 diviside, not a highly-tuned 16/16 division.

But there’s even more to be gained here. The entire function can be inlined where it is used, eliminating the overhead of a call to the runtime division function! In a time-critical loop the payback can be immense: no call, no register save-restore, just straightline code with predictable timing, delivering hardware-level performance in software.

Processors without a division instruction?!

I think many would be surprised that a processor such as the Cortex-A9 has a floating-point divide instruction but no integer divide instruction. And this is not uncommon for RISC-V either, customers request cores from vendors that have a multiplier but no divider (a nonstandard configuration, but supported by RISC-V compilation tools nonetheless).

This method for calculating the quotient also scales to 32/32 and 64/64 division, but is more complex and is the subject of another article in the series. As you might expect, this style of division is offered in both emRun and emFloat, for Arm and for RISC-V.

Conclusion

If you know you’re only dividing 16-bit data, this algorithm can improve performance if dropped into your application. If it is inlined, it performs even better as it avoids a call/return to the runtime division function.

What’s next?

Spending 128 16-bit words (0.25 KB) for the LUT might be expensive in a microcontroller application. The next part of this series will explore how we can reduce the size of the LUT and still divide quickly.