In the previous article I presented an algorithm for division by calculating a reciprocal. This article presents a way to calculate the reciprocal, rather than looking it up, trading size of lookup table against speed of calculation.
Newton’s method
In your school mathematics class, you were most likely introduced to Newton’s method for finding roots of a real-valued function f(x). The idea is that for a well-behaved function, you can use successive approximations to make your estimate more accurate, doubling the accuracy after each stage.
I won’t present the theory here, it is adequately explained in many articles, such as this one. Applying Newton’s method using floating-point arithmetic is easy: the function and its derivative, along with the approximation framework, are written in plain C code without any regard to the intermediate values in the calculation: it simply works. But for fixed-point arithmetic, things are very different, you must know the range of values you are dealing with, how those values are represented in a machine word or C data type, and carefully manage change of representation, when needed, yourself.
Let’s start!
Calculating a reciprocal in floating point
Excluding division, I will illustrate here the Wikipedia article on calculating the reciprocal of a value a with an initial approximation of its reciprocal x. It’s straightforward:
1 2 3 4 5 6 7 8 9 10 11 | float Reciprocal_F32( float a, float x) { // // Apply Newton iterations. // x = x * (2 - a*x); x = x * (2 - a*x); x = x * (2 - a*x); x = x * (2 - a*x); // return x; } |
This turns out to be pretty good, for instance calculating the reciprocal of 3 with an initial approximation of 0.5, as…
1 2 3 | int main( void ) { printf ( "Approximation to 1/3 is %f" , Reciprocal_F32(3, 0.5)); } |
…prints “0.333328”. If we use six steps, it prints “0.333333”.
The accuracy of the calculated reciprocal depends on both the number of Newton steps and the quality of the initial reciprocal estimate. If we change the initial estimate above from 0.5 to 1.0, say, then the same five steps print “-1431655808.000000”. Hmm.
So, how do we find a good estimate, and how many steps will we need?
It turns out that microcontrollers and desktop processors typically provide an instruction to deliver an approximation to the reciprocal. For an Arm device, there exists VRECPE or FRECPE, depending on architecture. Clearly we’re not first to the party in recognizing that division by reciprocal multiplication is viable.
In integer-focused software, we typically look up the reciprocal estimate from a table, indexing by a number of leading nonzero bits taken from the divisor. This is what the previous article did. The more accurate the estimate, the fewer iterations you need. If you have an accuracy of n bits, each Newton iteration will double the accuracy to 2n bits (quadratic convergence).
We’re going to take this as far as we can and implement a 16/16 divide using a lookup table of only eight bytes!
Converting the Newton step to fixed-point arithmetic
If you haven’t read the previous article, read it now, you’ll need to before continuing.
A quick recap. In the previous article, the reciprocal was taken from a lookup table (LUT) as a Q16 value. What I will do is replace that lookup, along with the 128-entry 16-bit table (0.25 KB), with a very compact LUT and some calculations, to generate a Q16 reciprocal.
First, let’s break out the calculation of the approximation like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | float NewtonStep( float a, float x) { float t; // t = a * x; t = 2 - t; t = t * x; // return x; } float Reciprocal_F32( float a, float x) { // // Apply Newton iterations. // x = NewtonStep(a, x); x = NewtonStep(a, x); x = NewtonStep(a, x); x = NewtonStep(a, x); x = NewtonStep(a, x); // return x; } |
I will constrain the inputs a and x such that 1 <= a < 2 and 0.5 <= x < 1 (as the reciprocal of a lies within these bounds). With these bounds it’s possible to annotate the calculation like this:
1 2 3 4 5 6 7 8 9 10 | float NewtonStep( float a, // 1 <= a < 2 float x) { // 0.5 <= x < 1 float t; // t = a * x; // 0.5 <= ax < 2 t = 2 - t; // 0 < 2-ax <;= 1.5 t = t * x; // 0.5 <= x(2-ax) < 1 // return t; } |
But this is worst case, and I will be providing a more accurate initial reciprocal estimate, so stricter bounds apply, the newly calculated reciprocal approximation bounded by the range [0.5, 1).
1 2 3 | t = a * x; // 1 < ax < 1.5 t = 2 - t; // 0.5 < 2-ax < 1 t = t * x; // 0.5 <= x(2-ax) < 1 |
This is carefully converted to fixed-point format, assuming a function MulHi_U16_U16(a, b) which delivers the high 16 bits of a × b where a and b are 16-bit unsigned integers:
1 2 3 4 5 6 7 8 9 10 11 | uint16_t NewtonStep( uint16_t a, // Q1.15 1 <= a < 2 uint16_t x) { // Q16 0.5 <= x < 1 uint16_t t; // t = MulHi_U16_U16(a, x); // Q1.15 = Q1.15 * Q16 1 <= ax < 1.5 t = 0u - t; // Q1.15 = Q2.15 - Q1.15 0.5 < 2-ax < 1 t = MulHi_U16_U16(t, x); // Q1.15 = Q1.15 * Q16 0.5 <= x(2-ax) < 1 t = t << 1; // Q16 = Q1.15 // return t; } |
This is relatively straightforward, but two lines deserve explanation.
First, the subtraction from 2. One of the nice things about this algorithm is if the inputs are correctly constrained, the subtraction from 2 always generates a value that can be represented in Q1.15 without fear of overflow (examine the commented bounds). As 2 in Q2.15 format is 0x10000, and the subtrahend and result are in Q1.15 format, C’s unsigned modulo arithmetic means we can recast (0x10000-t) as simply (0-t). So we do.
Secondly, consider the conversion of Q1.15 format to Q16 format using “t = t << 1”. As t lies in the range [0,5, 1), this can be converted to Q16 format by shifting the 15 fraction bits of the Q1.15 value and discarding the most significant bit which, by definition, is zero here.
The reciprocal estimate
I said we’d be using a much smaller LUT for the reciprocal estimate, and this is it:
1 2 3 | static const uint8_t _aRecip_3i_8o[] = { 0xFF, 0xE3, 0xCC, 0xBA, 0xAA, 0x9D, 0x92, 0x88 }; |
As you can see, it has fewer entries, just eight, and each entry is eight bits wide, not 16. Modifying the previous article’s code, we have:
1 2 3 4 5 6 7 8 9 10 | static unsigned Div_U16_U16_Reciprocal_3i_8o_LUT( uint16_t u, uint16_t v) { uint16_t i; int n; // // Look up initial estimate of reciprocal using leading 3 nonzero bits of v. // n = clz16(v); i = v << n >> 12; r = _aRecip_3i_8o[i - 8]; // Q8 r <<= 8; // Q16 |
This retrieves the initial reciprocal estimate in Q16 format. We know how to apply Newton steps in fixed point, so we’re ready to construct the calculation of the reciprocal, the quotient estimate, and thereby the true quotient.
Calculating the quotient
Each Newton step doubles the accuracy of the quotient. The 8-entry table provides a mere four bits of accuracy for the reciprocal estimate (as we drop so many bits from the divisor when looking it up). Two iterations should generate 16 bits of accuracy., and in a perfect world it would, but we sacrifice one bit of precision when calculating the reciprocal as intermediate values are only held to 15-bit precision in Q1.15 format and we require a Q16 reciprocal.
This drop is best illustrated by tracing through the reciprocal calculation for 17:
- The value 1/17 in Q16 format, normalized, is 1/17 × 65536 × 16 ≈ 61680.94. Rounding this results in 61681, or 0xF0F1 in hexadecimal.
- The initial approximation of the reciprocal uses the leading four nonzero bits of 17. 17 is 10001 in binary, the leading four nonzero bits of which are 1000. Indexing into the table after eliminating the leading one bit delivers an initial approximation to the reciprocal of 0xFF in Q8 format and 0xFF00 in Q16 — a long way from 0xF0F1, only the most significant four bits are correct.
- The first Newton step refines the reciprocal estimate to 0xF01E, now 8 bits are correct.
- The second Newton step refines the reciprocal estimate to 0xF0F0, but only 15 bits are correct. Additional steps will never improve this as intermediate results are calculated in Q1.15 format, sacrificing one bit of precision.
You may well wonder if it is possible to calculate the reciprocal estimate to Q16 precision. Of course it is, we can calculate in Q1.16 precision, but this requires 17 bits, high-word multiplication requires at least 33 bits, and it just does not fit well, it generates more code and would be inefficient. Rather than calculate to 16 bits of precision, we tolerate the loss of one bit of precision to calculate everything using 16-bit quantities and extend the quotient correction from two stages to three to compensate for that loss.
Putting it all together
Here is the final code, cleaned up, building on the code from the previous article using the Newton steps presented above and a three-stage quotient correction.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | static const uint8_t _aRecip_3i_8o[] = { 0xFF, 0xE3, 0xCC, 0xBA, 0xAA, 0x9D, 0x92, 0x88 }; static uint16_t MulHi_U16_U16( uint16_t a, uint16_t b) { return ( uint16_t )((( uint32_t )a * b) >> 16); } static unsigned Div_U16_U16_Reciprocal_3i_8o_LUT( uint16_t u, uint16_t v) { uint16_t i; uint16_t q; uint16_t r; int n; // n = clz16(v); // // Look up initial estimate of reciprocal using leading 3 nonzero bits of v. // i = v << n >> 12; r = _aRecip_3i_8o[i - 8]; // Q8 r <<= 8; // Q16 v <<= n; // // Newton iterations to refine reciprocal. // r = MulHi_U16_U16(0u - MulHi_U16_U16(r, v), r) << 1; r = MulHi_U16_U16(0u - MulHi_U16_U16(r, v), r) << 1; // // Calculate quotient estimate and undo normalization. // q = MulHi_U16_U16(u, r); q >>= 15-n; v >>= n; // // Prevent potential overflow in calculation of remainder; // also compensates for quotient "one too high." // if (q > 0) { --q; } // // Calculate remainder to u. // u -= q*v; // // Final correction of quotient to satisfy division theorem. // if (u >= v) { q += 1; u -= v; if (u >= v) { q += 1; u -= v; if (u >= v) { q += 1; } } } // // Return a fully accurate quotient. // return q; } |
Comparing architectures
It’s intriguing to compare how compilers generate code for different architectures when processing this source. Here’s the code for 32-bit Arm and RISC-V, and 16-bit dsPIC30.
Arm
The code generated for a Cortex-A9 using the SEGGER C compiler is 124 bytes in size. The code is ultra-lean, but the correction steps have room for improvement: the compiler could have recognized that both instances of CMP R12, R1…SUBW R12, R12, R1 can be replaced by an unconditional SUB R12, R12, R1 before the carry is tested, reducing code size and shaving cycles.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | Div_U16_U16_Reciprocal_3i_8o_LUT: 0x0000008C B500 PUSH {LR} 0x0000008E 4684 MOV R12, R0 0x00000090 FAB1 F081 CLZ R0, R1 0x00000094 F1A0 0E10 SUB.W LR, R0, #16 0x00000098 FA01 F20E LSL.W R2, R1, LR 0x0000009C F240 1148 MOVW R1, #0x0148 ; 0x00000148 = _aRecip_3i_8o 0x000000A0 F2C0 0100 MOVT R1, #0 ; 0x00000148 = _aRecip_3i_8o 0x000000A4 EB01 3112 ADD.W R1, R1, R2, LSR #12 0x000000A8 F1C0 001F RSB R0, R0, #31 0x000000AC F811 1C08 LDRB R1, [R1, #-8] 0x000000B0 0209 LSLS R1, R1, #8 0x000000B2 FB01 F302 MUL R3, R1, R2 0x000000B6 0C1B LSRS R3, R3, #16 0x000000B8 4359 MULS R1, R3, R1 0x000000BA 424B RSBS R3, R1, #0 0x000000BC FA22 F10E LSR.W R1, R2, LR 0x000000C0 F06F 0E01 MVN LR, #1 0x000000C4 EA0E 33D3 AND.W R3, LR, R3, LSR #15 0x000000C8 435A MULS R2, R3, R2 0x000000CA 0C12 LSRS R2, R2, #16 0x000000CC 435A MULS R2, R3, R2 0x000000CE 4252 RSBS R2, R2, #0 0x000000D0 EA0E 3ED2 AND.W LR, LR, R2, LSR #15 0x000000D4 FB0E FE0C MUL LR, LR, R12 0x000000D8 EA4F 4E1E LSR.W LR, LR, #16 0x000000DC FA3E F000 LSRS.W R0, LR, R0 0x000000E0 BF18 IT NE 0x000000E2 3801 SUBNE R0, #1 0x000000E4 FB00 CC11 MLS R12, R0, R1, R12 0x000000E8 458C CMP R12, R1 0x000000EA BF38 IT CC 0x000000EC BD00 POPCC {PC} 0x000000EE EBAC 0C01 SUB.W R12, R12, R1 0x000000F2 458C CMP R12, R1 0x000000F4 BF3C ITT CC 0x000000F6 3001 ADDCC R0, #1 0x000000F8 BD00 POPCC {PC} 0x000000FA EBAC 0C01 SUB.W R12, R12, R1 0x000000FE 458C CMP R12, R1 0x00000100 4189 SBCS R1, R1 0x00000102 3103 ADDS R1, #3 0x00000104 4408 ADD R0, R1 0x00000106 BD00 POP {PC} |
RISC-V
Using the newly introduced SEGGER compiler for RISC-V results in 184 bytes of code. Unfortunately, RV32IMAC doesn’t have a count-leading-zeros instruction and this hurts code size and performance. As an aside, compiling with the Packed SIMD instruction set active (the RISC-V P extension), code size shrinks to 132 bytes using the GNU compiler.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | Div_U16_U16_Reciprocal_3i_8o_LUT: 0x00400060 862A MV a2, a0 0x00400062 4501 LI a0, 0 0x00400064 66C1 LI a3, 0x010000 0x00400066 F0068693 ADDI a3, a3, -0x0100 0x0040006A 00D5F7B3 AND a5, a1, a3 0x0040006E 0017B713 SEQZ a4, a5 0x00400072 46E1 LI a3, 24 0x00400074 C391 BEQZ a5, .L1 ; 0x00400078 0x00400076 46C1 LI a3, 16 .L1: 0x00400078 00D596B3 SLL a3, a1, a3 0x0040007C 01C6D793 SRLI a5, a3, 28 0x00400080 070E SLLI a4, a4, 3 0x00400082 00A79563 BNE a5, a0, .L2 ; 0x0040008C 0x00400086 00476713 ORI a4, a4, 4 0x0040008A 0692 SLLI a3, a3, 4 .L2: 0x0040008C 01E6D793 SRLI a5, a3, 30 0x00400090 00A79563 BNE a5, a0, .L3 ; 0x0040009A 0x00400094 00276713 ORI a4, a4, 2 0x00400098 068A SLLI a3, a3, 2 .L3: 0x0040009A FFF6C693 NOT a3, a3 0x0040009E 82FD SRLI a3, a3, 31 0x004000A0 8F55 OR a4, a4, a3 0x004000A2 00E595B3 SLL a1, a1, a4 0x004000A6 00C5D693 SRLI a3, a1, 12 0x004000AA 004007B7 LI a5, 0x400000 ; 0x00400046 = _aRecip_3i_8o 0x004000AE 04678793 ADDI a5, a5, 70 0x004000B2 96BE ADD a3, a3, a5 0x004000B4 FF86C683 LBU a3, -8(a3) 0x004000B8 06A2 SLLI a3, a3, 8 0x004000BA 02B687B3 MUL a5, a3, a1 0x004000BE 83C1 SRLI a5, a5, 16 0x004000C0 02F686B3 MUL a3, a3, a5 0x004000C4 40D506B3 SUB a3, a0, a3 0x004000C8 82BD SRLI a3, a3, 15 0x004000CA 9AF9 ANDI a3, a3, -2 0x004000CC 02B687B3 MUL a5, a3, a1 0x004000D0 83C1 SRLI a5, a5, 16 0x004000D2 02F686B3 MUL a3, a3, a5 0x004000D6 40D506B3 SUB a3, a0, a3 0x004000DA 82BD SRLI a3, a3, 15 0x004000DC 9AF9 ANDI a3, a3, -2 0x004000DE 02C686B3 MUL a3, a3, a2 0x004000E2 82C1 SRLI a3, a3, 16 0x004000E4 00F74793 XORI a5, a4, 15 0x004000E8 00F6D6B3 SRL a3, a3, a5 0x004000EC 00E5D5B3 SRL a1, a1, a4 0x004000F0 00A68463 BEQ a3, a0, .L4 ; 0x004000F8 0x004000F4 FFF68513 ADDI a0, a3, -1 .L4: 0x004000F8 02B506B3 MUL a3, a0, a1 0x004000FC 8E15 SUB a2, a2, a3 0x004000FE 00B66C63 BLTU a2, a1, .L6 ; 0x00400116 0x00400102 8E0D SUB a2, a2, a1 0x00400104 00B67463 BGEU a2, a1, .L5 ; 0x0040010C 0x00400108 0505 ADDI a0, a0, 1 0x0040010A 8082 RET .L5: 0x0040010C 8E0D SUB a2, a2, a1 0x0040010E 00B635B3 SLTU a1, a2, a1 0x00400112 8D0D SUB a0, a0, a1 0x00400114 050D ADDI a0, a0, 3 .L6: 0x00400116 8082 RET |
dsPIC30
Compiling for a dsPIC30 using the Microchip XC16 compiler, the code is highly respectable: 45 instructions total, that’s 135 code bytes. There are opportunities for improvement: the compiler didn’t code the Newton step the best way possible (the subtract with borrow can be eliminated by rearranging), but it’s not disastrous. The compiler also missed the fact the quotient correction’s “compare and subtract” can be combined, same as the Arm compiler missed it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | 000001d6 <_Div_U16_U16_Reciprocal_3i_8o_LUT>: 1d6: 80 01 eb clr.w w3 1d8: 01 00 e0 cp0.w w1 1da: 02 00 35 bra LT, 0x1e0 <.L2> 1dc: 81 01 df fbcl w1, w3 1de: e1 81 11 subr.w w3, #0x1, w3 000001e0 <.L2>: 1e0: 83 08 dd sl.w w1, w3, w1 1e2: 4c 09 de lsr.w w1, #0xc, w2 1e4: 68 01 51 sub.w w2, #0x8, w2 1e6: e4 1c 28 mov.w #0x81ce, w4 1e8: 64 42 79 mov.b [w4+w2], w4 1ea: 48 22 dd sl.w w4, #0x8, w4 1ec: 01 23 b8 mul.uu w4, w1, w6 1ee: 07 a2 b9 mul.ss w4, w7, w4 1f0: 80 02 eb clr.w w5 1f2: 60 02 12 subr.w w4, #0x0, w4 1f4: e0 82 1a subbr.w w5, #0x0, w5 1f6: 05 82 42 add.w w5, w5, w4 1f8: 01 23 b8 mul.uu w4, w1, w6 1fa: 07 a2 b9 mul.ss w4, w7, w4 1fc: 80 02 eb clr.w w5 1fe: 60 02 12 subr.w w4, #0x0, w4 200: e0 82 1a subbr.w w5, #0x0, w5 202: 05 82 42 add.w w5, w5, w4 204: 04 02 b8 mul.uu w0, w4, w4 206: 6f 81 11 subr.w w3, #0xf, w2 208: 02 29 de lsr.w w5, w2, w2 20a: 83 08 de lsr.w w1, w3, w1 20c: 02 00 e0 cp0.w w2 20e: 01 00 32 bra Z, 0x212 <.L3> 210: 02 01 e9 dec.w w2, w2 00000212 <.L3>: 212: 01 92 b9 mul.ss w2, w1, w4 214: 04 00 50 sub.w w0, w4, w0 216: 81 0f 50 sub.w w0, w1, [w15] 218: 09 00 39 bra NC, 0x22c <.L4> 21a: 02 01 e8 inc.w w2, w2 21c: 01 00 50 sub.w w0, w1, w0 21e: 81 0f 50 sub.w w0, w1, [w15] 220: 05 00 39 bra NC, 0x22c <.L4> 222: 02 01 e8 inc.w w2, w2 224: 01 00 50 sub.w w0, w1, w0 226: 81 0f 50 sub.w w0, w1, [w15] 228: 01 00 39 bra NC, 0x22c <.L4> 22a: 02 01 e8 inc.w w2, w2 0000022c <.L4>: 22c: 02 00 78 mov.w w2, w0 22e: 00 00 06 return |
Conclusion
Presented here is an algorithm that builds on the previous article’s division code and reduces the size of the lookup table whilst still performing well. The previous algorithm for Arm’s v7A required 66 code bytes and 256 data types, totaling 322 bytes. The algorithm here requires 124 code bytes and 8 data bytes, a total of 132 bytes…less than half the space whilst still providing excellent performance.
This style of division, using a compact lookup table and Newton iterations is deployed in emRun and emFloat, providing excellent performance with compact code size.
This is the final part of a four-part blog series. Read the first three installments here: