Algorithms for division – part 4 – Using Newton’s method

In the previous article I presented an algorithm for division by calculating a reciprocal. This article presents a way to calculate the reciprocal, rather than looking it up, trading size of lookup table against speed of calculation.

Newton’s method

In your school mathematics class, you were most likely introduced to Newton’s method for finding roots of a real-valued function f(x). The idea is that for a well-behaved function, you can use successive approximations to make your estimate more accurate, doubling the accuracy after each stage.

I won’t present the theory here, it is adequately explained in many articles, such as this one. Applying Newton’s method using floating-point arithmetic is easy: the function and its derivative, along with the approximation framework, are written in plain C code without any regard to the intermediate values in the calculation: it simply works. But for fixed-point arithmetic, things are very different, you must know the range of values you are dealing with, how those values are represented in a machine word or C data type, and carefully manage change of representation, when needed, yourself.

Let’s start!

Calculating a reciprocal in floating point

Excluding division, I will illustrate here the Wikipedia article on calculating the reciprocal of a value a with an initial approximation of its reciprocal x. It’s straightforward:

float Reciprocal_F32(float a, float x) {
  //
  // Apply Newton iterations.
  //
  x = x * (2 - a*x);
  x = x * (2 - a*x);
  x = x * (2 - a*x);
  x = x * (2 - a*x);
  //
  return x;
}

This turns out to be pretty good, for instance calculating the reciprocal of 3 with an initial approximation of 0.5, as…

int main(void) {
  printf("Approximation to 1/3 is %f", Reciprocal_F32(3, 0.5));
}

…prints “0.333328”. If we use six steps, it prints “0.333333”.

The accuracy of the calculated reciprocal depends on both the number of Newton steps and the quality of the initial reciprocal estimate. If we change the initial estimate above from 0.5 to 1.0, say, then the same five steps print “-1431655808.000000”. Hmm.

So, how do we find a good estimate, and how many steps will we need?

It turns out that microcontrollers and desktop processors typically provide an instruction to deliver an approximation to the reciprocal. For an Arm device, there exists VRECPE or FRECPE, depending on architecture. Clearly we’re not first to the party in recognizing that division by reciprocal multiplication is viable.

In integer-focused software, we typically look up the reciprocal estimate from a table, indexing by a number of leading nonzero bits taken from the divisor. This is what the previous article did. The more accurate the estimate, the fewer iterations you need.  If you have an accuracy of n bits, each Newton iteration will double the accuracy to 2n bits (quadratic convergence).

We’re going to take this as far as we can and implement a 16/16 divide using a lookup table of only eight bytes!

Converting the Newton step to fixed-point arithmetic

If you haven’t read the previous article, read it now, you’ll need to before continuing.

A quick recap. In the previous article, the reciprocal was taken from a lookup table (LUT) as a Q16 value. What I will do is replace that lookup, along with the 128-entry 16-bit table (0.25 KB), with a very compact LUT and some calculations, to generate a Q16 reciprocal.

First, let’s break out the calculation of the approximation like this:

float NewtonStep(float a, float x) {
  float t;
  //
  t = a * x;
  t = 2 - t;
  t = t * x;
  //
  return x;
}

float Reciprocal_F32(float a, float x) {
  //
  // Apply Newton iterations.
  //
  x = NewtonStep(a, x);
  x = NewtonStep(a, x);
  x = NewtonStep(a, x);
  x = NewtonStep(a, x);
  x = NewtonStep(a, x);
  //
  return x;
}

I will constrain the inputs a and x such that 1 <= a < 2 and 0.5 <= x < 1 (as the reciprocal of a lies within these bounds). With these bounds it’s possible to annotate the calculation like this:

float NewtonStep(float a,     // 1 &lt;= a &lt; 2
                 float x) {   // 0.5 &lt;= x &lt; 1 
  float t;
  //
  t = a * x;   // 0.5 &lt;= ax &lt;  2
  t = 2 - t;   // 0 &lt; 2-ax &lt;;= 1.5
  t = t * x;   // 0.5 &lt;= x(2-ax) &lt; 1
  //
  return t;
}

But this is worst case, and I will be providing a more accurate initial reciprocal estimate, so stricter bounds apply, the newly calculated reciprocal approximation bounded by the range [0.5, 1).

  t = a * x;   // 1 &lt; ax &lt; 1.5
  t = 2 - t;   // 0.5 &lt; 2-ax &lt; 1
  t = t * x;   // 0.5 &lt;= x(2-ax) &lt; 1

This is carefully converted to fixed-point format, assuming a function MulHi_U16_U16(a, b) which delivers the high 16 bits of a × b where a and b are 16-bit unsigned integers:

uint16_t NewtonStep(uint16_t a,     // Q1.15  1 &lt;= a &lt; 2
                    uint16_t x) {   // Q16    0.5 &lt;= x &lt; 1 
  uint16_t t;
  //
  t = MulHi_U16_U16(a, x);   // Q1.15 = Q1.15 * Q16      1 &lt;= ax &lt; 1.5
  t = 0u - t;                // Q1.15 = Q2.15 - Q1.15    0.5 &lt; 2-ax &lt; 1
  t = MulHi_U16_U16(t, x);   // Q1.15 = Q1.15 * Q16      0.5 &lt;= x(2-ax) &lt; 1
  t = t &lt;&lt; 1;                // Q16   = Q1.15
  //
  return t;
}

This is relatively straightforward, but two lines deserve explanation.

First, the subtraction from 2. One of the nice things about this algorithm is if the inputs are correctly constrained, the subtraction from 2 always generates a value that can be represented in Q1.15 without fear of overflow (examine the commented bounds). As 2 in Q2.15 format is 0x10000, and the subtrahend and result are in Q1.15 format, C’s unsigned modulo arithmetic means we can recast (0x10000-t) as simply (0-t). So we do.

Secondly, consider the conversion of Q1.15 format to Q16 format using “t = t << 1”. As t lies in the range [0,5, 1), this can be converted to Q16 format by shifting the 15 fraction bits of the Q1.15 value and discarding the most significant bit which, by definition, is zero here.

The reciprocal estimate

I said we’d be using a much smaller LUT for the reciprocal estimate, and this is it:

static const uint8_t _aRecip_3i_8o[] = {
  0xFF, 0xE3, 0xCC, 0xBA, 0xAA, 0x9D, 0x92, 0x88
};

As you can see, it has fewer entries, just eight, and each entry is eight bits wide, not 16. Modifying the previous article’s code, we have:

static unsigned Div_U16_U16_Reciprocal_3i_8o_LUT(uint16_t u, uint16_t v) {
  uint16_t i;
  int      n;
  //
  // Look up initial estimate of reciprocal using leading 3 nonzero bits of v.
  //
  n = clz16(v);
  i = v &lt;&lt; n &gt;&gt; 12;
  r = _aRecip_3i_8o[i - 8];  // Q8
  r &lt;&lt;= 8;                   // Q16

This retrieves the initial reciprocal estimate in Q16 format. We know how to apply Newton steps in fixed point, so we’re ready to construct the calculation of the reciprocal, the quotient estimate, and thereby the true quotient.

Calculating the quotient

Each Newton step doubles the accuracy of the quotient. The 8-entry table provides a mere four bits of accuracy for the reciprocal estimate (as we drop so many bits from the divisor when looking it up). Two iterations should generate 16 bits of accuracy., and in a perfect world it would, but we sacrifice one bit of precision when calculating the reciprocal as intermediate values are only held to 15-bit precision in Q1.15 format and we require a Q16 reciprocal.

This drop is best illustrated by tracing through the reciprocal calculation for 17:

  • The value 1/17 in Q16 format, normalized, is 1/17 × 65536 × 16 ≈ 61680.94.  Rounding this results in 61681, or 0xF0F1 in hexadecimal.
  • The initial approximation of the reciprocal uses the leading four nonzero bits of 17. 17 is 10001 in binary, the leading four nonzero bits of which are 1000. Indexing into the table after eliminating the leading one bit delivers an initial approximation to the reciprocal of 0xFF in Q8 format and 0xFF00 in Q16 — a long way from 0xF0F1, only the most significant four bits are correct.
  • The first Newton step refines the reciprocal estimate to 0xF01E, now 8 bits are correct.
  • The second Newton step refines the reciprocal estimate to 0xF0F0, but only 15 bits are correct. Additional steps will never improve this as intermediate results are calculated in Q1.15 format, sacrificing one bit of precision.

You may well wonder if it is possible to calculate the reciprocal estimate to Q16 precision. Of course it is, we can calculate in Q1.16 precision, but this requires 17 bits, high-word multiplication requires at least 33 bits, and it just does not fit well, it generates more code and would be inefficient. Rather than calculate to 16 bits of precision, we tolerate the loss of one bit of precision to calculate everything using 16-bit quantities and extend the quotient correction from two stages to three to compensate for that loss.

Putting it all together

Here is the final code, cleaned up, building on the code from the previous article using the Newton steps presented above and a three-stage quotient correction.

static const uint8_t _aRecip_3i_8o[] = {
  0xFF, 0xE3, 0xCC, 0xBA, 0xAA, 0x9D, 0x92, 0x88
};

static uint16_t MulHi_U16_U16(uint16_t a, uint16_t b) {
  return (uint16_t)(((uint32_t)a * b) &gt;&gt; 16);
}

static unsigned Div_U16_U16_Reciprocal_3i_8o_LUT(uint16_t u, uint16_t v) {
  uint16_t i;
  uint16_t q;
  uint16_t r;
  int      n;
  //
  n = clz16(v);
  //
  // Look up initial estimate of reciprocal using leading 3 nonzero bits of v.
  //
  i = v &lt;&lt; n &gt;&gt; 12;
  r = _aRecip_3i_8o[i - 8];  // Q8
  r &lt;&lt;= 8;                   // Q16
  v &lt;&lt;= n;
  //
  // Newton iterations to refine reciprocal.
  //
  r = MulHi_U16_U16(0u - MulHi_U16_U16(r, v), r) &lt;&lt; 1;
  r = MulHi_U16_U16(0u - MulHi_U16_U16(r, v), r) &lt;&lt; 1;
  //
  // Calculate quotient estimate and undo normalization.
  //
  q = MulHi_U16_U16(u, r);
  q &gt;&gt;= 15-n;
  v &gt;&gt;= n;
  //
  // Prevent potential overflow in calculation of remainder;
  // also compensates for quotient "one too high."
  // 
  if (q &gt; 0) {
    --q;
  }
  //
  // Calculate remainder to u.
  //
  u -= q*v;
  //
  // Final correction of quotient to satisfy division theorem.
  //
  if (u &gt;= v) {
    q += 1;
    u -= v;
    if (u &gt;= v) {
      q += 1;
      u -= v;
      if (u &gt;= v) {
        q += 1;
      }
    }
  }
  //
  // Return a fully accurate quotient.
  //
  return q;
}

Comparing architectures

It’s intriguing to compare how compilers generate code for different architectures when processing this source. Here’s the code for 32-bit Arm and RISC-V, and 16-bit dsPIC30.

Arm

The code generated for a Cortex-A9 using the SEGGER C compiler is 124 bytes in size. The code is ultra-lean, but the correction steps have room for improvement: the compiler could have recognized that both instances of CMP R12, R1…SUBW R12, R12, R1 can be replaced by an unconditional SUB R12, R12, R1 before the carry is tested, reducing code size and shaving cycles.

Div_U16_U16_Reciprocal_3i_8o_LUT:
  0x0000008C  B500         PUSH       {LR}
  0x0000008E  4684         MOV        R12, R0
  0x00000090  FAB1 F081    CLZ        R0, R1
  0x00000094  F1A0 0E10    SUB.W      LR, R0, #16
  0x00000098  FA01 F20E    LSL.W      R2, R1, LR
  0x0000009C  F240 1148    MOVW       R1, #0x0148              ; 0x00000148 = _aRecip_3i_8o
  0x000000A0  F2C0 0100    MOVT       R1, #0                   ; 0x00000148 = _aRecip_3i_8o
  0x000000A4  EB01 3112    ADD.W      R1, R1, R2, LSR #12
  0x000000A8  F1C0 001F    RSB        R0, R0, #31
  0x000000AC  F811 1C08    LDRB       R1, [R1, #-8]
  0x000000B0  0209         LSLS       R1, R1, #8
  0x000000B2  FB01 F302    MUL        R3, R1, R2
  0x000000B6  0C1B         LSRS       R3, R3, #16
  0x000000B8  4359         MULS       R1, R3, R1
  0x000000BA  424B         RSBS       R3, R1, #0
  0x000000BC  FA22 F10E    LSR.W      R1, R2, LR
  0x000000C0  F06F 0E01    MVN        LR, #1
  0x000000C4  EA0E 33D3    AND.W      R3, LR, R3, LSR #15
  0x000000C8  435A         MULS       R2, R3, R2
  0x000000CA  0C12         LSRS       R2, R2, #16
  0x000000CC  435A         MULS       R2, R3, R2
  0x000000CE  4252         RSBS       R2, R2, #0
  0x000000D0  EA0E 3ED2    AND.W      LR, LR, R2, LSR #15
  0x000000D4  FB0E FE0C    MUL        LR, LR, R12
  0x000000D8  EA4F 4E1E    LSR.W      LR, LR, #16
  0x000000DC  FA3E F000    LSRS.W     R0, LR, R0
  0x000000E0  BF18         IT         NE
  0x000000E2  3801         SUBNE      R0, #1
  0x000000E4  FB00 CC11    MLS        R12, R0, R1, R12
  0x000000E8  458C         CMP        R12, R1
  0x000000EA  BF38         IT         CC
  0x000000EC  BD00         POPCC      {PC}
  0x000000EE  EBAC 0C01    SUB.W      R12, R12, R1
  0x000000F2  458C         CMP        R12, R1
  0x000000F4  BF3C         ITT        CC
  0x000000F6  3001         ADDCC      R0, #1
  0x000000F8  BD00         POPCC      {PC}
  0x000000FA  EBAC 0C01    SUB.W      R12, R12, R1
  0x000000FE  458C         CMP        R12, R1
  0x00000100  4189         SBCS       R1, R1
  0x00000102  3103         ADDS       R1, #3
  0x00000104  4408         ADD        R0, R1
  0x00000106  BD00         POP        {PC}

RISC-V

Using the newly introduced SEGGER compiler for RISC-V results in 184 bytes of code. Unfortunately, RV32IMAC doesn’t have a count-leading-zeros instruction and this hurts code size and performance. As an aside, compiling with the Packed SIMD instruction set active (the RISC-V P extension), code size shrinks to 132 bytes using the GNU compiler.

Div_U16_U16_Reciprocal_3i_8o_LUT:
  0x00400060  862A         MV         a2, a0
  0x00400062  4501         LI         a0, 0
  0x00400064  66C1         LI         a3, 0x010000
  0x00400066  F0068693     ADDI       a3, a3, -0x0100
  0x0040006A  00D5F7B3     AND        a5, a1, a3
  0x0040006E  0017B713     SEQZ       a4, a5
  0x00400072  46E1         LI         a3, 24
  0x00400074  C391         BEQZ       a5, .L1                  ; 0x00400078
  0x00400076  46C1         LI         a3, 16

.L1:
  0x00400078  00D596B3     SLL        a3, a1, a3
  0x0040007C  01C6D793     SRLI       a5, a3, 28
  0x00400080  070E         SLLI       a4, a4, 3
  0x00400082  00A79563     BNE        a5, a0, .L2              ; 0x0040008C
  0x00400086  00476713     ORI        a4, a4, 4
  0x0040008A  0692         SLLI       a3, a3, 4

.L2:
  0x0040008C  01E6D793     SRLI       a5, a3, 30
  0x00400090  00A79563     BNE        a5, a0, .L3              ; 0x0040009A
  0x00400094  00276713     ORI        a4, a4, 2
  0x00400098  068A         SLLI       a3, a3, 2

.L3:
  0x0040009A  FFF6C693     NOT        a3, a3
  0x0040009E  82FD         SRLI       a3, a3, 31
  0x004000A0  8F55         OR         a4, a4, a3
  0x004000A2  00E595B3     SLL        a1, a1, a4
  0x004000A6  00C5D693     SRLI       a3, a1, 12
  0x004000AA  004007B7     LI         a5, 0x400000             ; 0x00400046 = _aRecip_3i_8o
  0x004000AE  04678793     ADDI       a5, a5, 70
  0x004000B2  96BE         ADD        a3, a3, a5
  0x004000B4  FF86C683     LBU        a3, -8(a3)
  0x004000B8  06A2         SLLI       a3, a3, 8
  0x004000BA  02B687B3     MUL        a5, a3, a1
  0x004000BE  83C1         SRLI       a5, a5, 16
  0x004000C0  02F686B3     MUL        a3, a3, a5
  0x004000C4  40D506B3     SUB        a3, a0, a3
  0x004000C8  82BD         SRLI       a3, a3, 15
  0x004000CA  9AF9         ANDI       a3, a3, -2
  0x004000CC  02B687B3     MUL        a5, a3, a1
  0x004000D0  83C1         SRLI       a5, a5, 16
  0x004000D2  02F686B3     MUL        a3, a3, a5
  0x004000D6  40D506B3     SUB        a3, a0, a3
  0x004000DA  82BD         SRLI       a3, a3, 15
  0x004000DC  9AF9         ANDI       a3, a3, -2
  0x004000DE  02C686B3     MUL        a3, a3, a2
  0x004000E2  82C1         SRLI       a3, a3, 16
  0x004000E4  00F74793     XORI       a5, a4, 15
  0x004000E8  00F6D6B3     SRL        a3, a3, a5
  0x004000EC  00E5D5B3     SRL        a1, a1, a4
  0x004000F0  00A68463     BEQ        a3, a0, .L4              ; 0x004000F8
  0x004000F4  FFF68513     ADDI       a0, a3, -1

.L4:
  0x004000F8  02B506B3     MUL        a3, a0, a1
  0x004000FC  8E15         SUB        a2, a2, a3
  0x004000FE  00B66C63     BLTU       a2, a1, .L6              ; 0x00400116
  0x00400102  8E0D         SUB        a2, a2, a1
  0x00400104  00B67463     BGEU       a2, a1, .L5              ; 0x0040010C
  0x00400108  0505         ADDI       a0, a0, 1
  0x0040010A  8082         RET

.L5:
  0x0040010C  8E0D         SUB        a2, a2, a1
  0x0040010E  00B635B3     SLTU       a1, a2, a1
  0x00400112  8D0D         SUB        a0, a0, a1
  0x00400114  050D         ADDI       a0, a0, 3

.L6:
  0x00400116  8082         RET

dsPIC30

Compiling for a dsPIC30 using the Microchip XC16 compiler, the code is highly respectable: 45 instructions total, that’s 135 code bytes. There are opportunities for improvement: the compiler didn’t code the Newton step the best way possible (the subtract with borrow can be eliminated by rearranging), but it’s not disastrous. The compiler also missed the fact the quotient correction’s “compare and subtract” can be combined, same as the Arm compiler missed it.

000001d6 &lt;_Div_U16_U16_Reciprocal_3i_8o_LUT&gt;:
 1d6:   80 01 eb        clr.w     w3
 1d8:   01 00 e0        cp0.w     w1
 1da:   02 00 35        bra       LT, 0x1e0 &lt;.L2&gt;
 1dc:   81 01 df        fbcl      w1, w3
 1de:   e1 81 11        subr.w    w3, #0x1, w3

000001e0 &lt;.L2&gt;:
 1e0:   83 08 dd        sl.w      w1, w3, w1
 1e2:   4c 09 de        lsr.w     w1, #0xc, w2
 1e4:   68 01 51        sub.w     w2, #0x8, w2
 1e6:   e4 1c 28        mov.w     #0x81ce, w4
 1e8:   64 42 79        mov.b     [w4+w2], w4
 1ea:   48 22 dd        sl.w      w4, #0x8, w4
 1ec:   01 23 b8        mul.uu    w4, w1, w6
 1ee:   07 a2 b9        mul.ss    w4, w7, w4
 1f0:   80 02 eb        clr.w     w5
 1f2:   60 02 12        subr.w    w4, #0x0, w4
 1f4:   e0 82 1a        subbr.w   w5, #0x0, w5
 1f6:   05 82 42        add.w     w5, w5, w4
 1f8:   01 23 b8        mul.uu    w4, w1, w6
 1fa:   07 a2 b9        mul.ss    w4, w7, w4
 1fc:   80 02 eb        clr.w     w5
 1fe:   60 02 12        subr.w    w4, #0x0, w4
 200:   e0 82 1a        subbr.w   w5, #0x0, w5
 202:   05 82 42        add.w     w5, w5, w4
 204:   04 02 b8        mul.uu    w0, w4, w4
 206:   6f 81 11        subr.w    w3, #0xf, w2
 208:   02 29 de        lsr.w     w5, w2, w2
 20a:   83 08 de        lsr.w     w1, w3, w1
 20c:   02 00 e0        cp0.w     w2
 20e:   01 00 32        bra       Z, 0x212 &lt;.L3&gt;
 210:   02 01 e9        dec.w     w2, w2

00000212 &lt;.L3&gt;:
 212:   01 92 b9        mul.ss    w2, w1, w4
 214:   04 00 50        sub.w     w0, w4, w0
 216:   81 0f 50        sub.w     w0, w1, [w15]
 218:   09 00 39        bra       NC, 0x22c &lt;.L4&gt;
 21a:   02 01 e8        inc.w     w2, w2
 21c:   01 00 50        sub.w     w0, w1, w0
 21e:   81 0f 50        sub.w     w0, w1, [w15]
 220:   05 00 39        bra       NC, 0x22c &lt;.L4&gt;
 222:   02 01 e8        inc.w     w2, w2
 224:   01 00 50        sub.w     w0, w1, w0
 226:   81 0f 50        sub.w     w0, w1, [w15]
 228:   01 00 39        bra       NC, 0x22c &lt;.L4&gt;
 22a:   02 01 e8        inc.w     w2, w2

0000022c &lt;.L4&gt;:
 22c:   02 00 78        mov.w     w2, w0
 22e:   00 00 06        return

Conclusion

Presented here is an algorithm that builds on the previous article’s division code and reduces the size of the lookup table whilst still performing well. The previous algorithm for Arm’s v7A required 66 code bytes and 256 data types, totaling 322 bytes. The algorithm here requires 124 code bytes and 8 data bytes, a total of 132 bytes…less than half the space whilst still providing excellent performance.

This style of division, using a compact lookup table and Newton iterations is deployed in emRun and emFloat, providing excellent performance with compact code size.

 

This is the final part of a four-part blog series. Read the first three installments here: