Looking for an efficient integer square root algorithm for ARM Thumb2

I am looking for a fast, integer only algorithm to find the square root (integer part thereof) of an unsigned integer. The code must have excellent performance on ARM Thumb 2 processors. It could be assembly language or C code.

Any hints welcome.


Integer Square Roots by Jack W. Crenshaw could be useful as another reference.

The C Snippets Archive also has an integer square root implementation. This one goes beyond just the integer result, and calculates extra fractional (fixed-point) bits of the answer. (Update: unfortunately, the C snippets archive is now defunct. The link points to the web archive of the page.) Here is the code from the C Snippets Archive:

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt {
    unsigned sqrt, frac;
};

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)
{
      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      {
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                  r -= e;
                  a++;
            }
      }
      memcpy(q, &a, sizeof(long));
}

I settled on the following code. It's essentially from the Wikipedia article on square-root computing methods. But it has been changed to use stdint.h types uint32_t etc. Strictly speaking, the return type could be changed to uint16_t.

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }
    return res;
}

The nice thing, I discovered, is that a fairly simple modification can return the "rounded" answer. I found this useful in a certain application for greater accuracy. Note that in this case, the return type must be uint32_t because the rounded square root of 232 - 1 is 216.

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    {
        res++;
    }

    return res;
}

If exact accuracy isn't required, I have a fast approximation for you, that uses 260 bytes of RAM (you could halve that, but don't).

int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};

int fisqrt(int val)
{
    int cnt=0;
    int t=val;
    while (t) {cnt++;t>>=1;}
    if (6>=cnt)    t=(val<<(6-cnt));
    else           t=(val>>(cnt-6));

    return (ftbl[cnt]*ftbl2[t&31])>>15;
}

Here's the code to generate the tables:

ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");

for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");

Over the range 1 → 220, the maximum error is 11, and over the range 1 → 230, it's about 256. You could use larger tables and minimise this. It's worth mentioning that the error will always be negative - i.e. when it's wrong, the value will be LESS than the correct value.

You might do well to follow this with a refining stage.

The idea is simple enough: (ab)0.5 = a0.b × b0.5.

So, we take the input X = A×B where A = 2N and 1 ≤ B < 2

Then we have a lookup table for sqrt(2N), and a lookup table for sqrt(1 ≤ B < 2). We store the lookup table for sqrt(2N) as integer, which might be a mistake (testing shows no ill effects), and we store the lookup table for sqrt(1 ≤ B < 2) as 15-bit fixed-point.

We know that 1 ≤ sqrt(2N) < 65536, so that's 16-bit, and we know that we can really only multiply 16-bit × 15-bit on an ARM, without fear of reprisal, so that's what we do.

In terms of implementation, the while(t) {cnt++;t>>=1;} is effectively a count-leading-bits instruction (CLB), so if your version of the chipset has that, you're winning! Also, the shift instruction would be easy to implement with a bidirectional shifter, if you have one?

There's a Lg[N] algorithm for counting the highest set bit here.

In terms of magic numbers, for changing table sizes, THE magic number for ftbl2 is 32, though note that 6 (Lg[32]+1) is used for the shifting.


One common approach is bisection.

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid

Something like that should work reasonably well. It makes log2(number) tests, doing log2(number) multiplies and divides. Since the divide is a divide by 2, you can replace it with a >>.

The terminating condition may not be spot on, so be sure to test a variety of integers to be sure that the division by 2 doesn't incorrectly oscillate between two even values; they would differ by more than 1.


It's not fast but it's small and simple:

int isqrt(int n)
{
  int b = 0;

  while(n >= 0)
  {
    n = n - b;
    b = b + 1;
    n = n - b;
  }

  return b - 1;
}

I find that most algorithms are based on simple ideas, but are implemented in a way more complicated manner than necessary. I've taken the idea from here: http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (by Ross M. Fosler) and made it into a very short C-function:

uint16_t int_sqrt32(uint32_t x)
{
    uint16_t res=0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        uint16_t temp=res | add;
        uint32_t g2=temp*temp;      
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

This compiles to 5 cycles/bit on my blackfin. I believe your compiled code will in general be faster if you use for loops instead of while loops, and you get the added benefit of deterministic time (although that to some extent depends on how your compiler optimizes the if statement.)