93 KiB
Bit Twiddling Hacks
A salute to https://graphics.stanford.edu/~seander/bithacks.html.
I do NOT own any of this. The following contents comes directly from the above link. I would also love to transfer the ownership to the original author, Sean Eron Anderson.
By Sean Eron Anderson seander@cs.stanford.edu |
---|
0abc defg h00a bcde fgh0 0abc defg h00a bcde fgh0 & 0000 1000 1000 0100 0100 0010 0010 0001 0001 0000 (0x0884422110) |
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
0000 0001 0000 0001 0000 0001 0000 0001 0000 0001 (0x0101010101)
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000
0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba hgfe 0cba 0gfe 00ba 00fe 000a 000e 0000 >> 32 ————————————————————————————————- 0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba
& 1111 1111 ————————————————————————————————- hgfe dcba
</font>
Note that the last two steps can be combined on some processors because
the registers can be accessed as bytes;
just multiply so that a register stores the upper 32 bits of the result
and the take the low byte. Thus, it may take only 6 operations.
<p>
Devised by Sean Anderson, July 13, 2001.
</p><p>
</p><hr>
<h3>
<a name="ReverseByteWith32Bits">
Reverse the bits in a byte with 7 operations (no 64-bit):
</a>
</h3>
```c
b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;
Make sure you assign or cast the result to an unsigned char to remove
garbage in the higher bits.Devised by Sean Anderson, July 13, 2001. Typo spotted and correction supplied by Mike Keith, January 3, 2002.
Reverse an N-bit quantity in parallel in 5 * lg(N) operations:
unsigned int v; // 32-bit word to reverse bit order
// swap odd and even bits
= ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1);
v // swap consecutive pairs
= ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2);
v // swap nibbles ...
= ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4);
v // swap bytes
= ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8);
v // swap 2-byte long pairs
= ( v >> 16 ) | ( v << 16); v
The following variation is also O(lg(N)), however it requires more operations to reverse v. Its virtue is in taking less slightly memory by computing the constants on the fly.
unsigned int s = sizeof(v) * CHAR_BIT; // bit size; must be power of 2
unsigned int mask = ~0;
while ((s >>= 1) > 0)
{
^= (mask << s);
mask = ((v >> s) & mask) | ((v << s) & ~mask);
v }
See Dr. Dobb’s Journal 1983, Edwin Freed’s article on Binary Magic Numbers for more information. The second variation was suggested by Ken Raeburn on September 13, 2005. Veldmeijer mentioned that the first version could do without ANDS in the last line on March 19, 2006.
Compute modulus division by 1 << s without a division operator
const unsigned int n; // numerator
const unsigned int s;
const unsigned int d = 1U << s; // So d will be one of: 1, 2, 4, 8, 16, 32, ...
unsigned int m; // m will be n % d
= n & (d - 1); m
Compute modulus division by (1 << s) - 1 without a division operator
unsigned int n; // numerator
const unsigned int s; // s > 0
const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).
unsigned int m; // n % d goes here.
for (m = n; n > d; n = m)
{
for (m = 0; n; n >>= s)
{
+= n & d;
m }
}
// Now m is a value from 0 to d, but since with modulus division
// we want m to be 0 when it is d.
= m == d ? 0 : m; m
This method of modulus division by an integer that is one less than a power of 2 takes at most 5 + (4 + 5 * ceil(N / s)) * ceil(lg(N / s)) operations, where N is the number of bits in the numerator. In other words, it takes at most O(N * lg(N)) time.
Devised by Sean Anderson, August 15, 2001. Before Sean A. Irvine
corrected me on June 17, 2004, I mistakenly commented that we could
alternatively assign m = ((m + 1) & d) - 1;
at the end.
Michael Miller spotted a typo in the code April 25, 2005.
Compute modulus division by (1 << s) - 1 in parallel without a division operator
// The following is for a word size of 32 bits!
static const unsigned int M[] =
{
0x00000000, 0x55555555, 0x33333333, 0xc71c71c7,
0x0f0f0f0f, 0xc1f07c1f, 0x3f03f03f, 0xf01fc07f,
0x00ff00ff, 0x07fc01ff, 0x3ff003ff, 0xffc007ff,
0xff000fff, 0xfc001fff, 0xf0003fff, 0xc0007fff,
0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff,
0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff,
0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff,
0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff
};
static const unsigned int Q[][6] =
{
{ 0, 0, 0, 0, 0, 0}, {16, 8, 4, 2, 1, 1}, {16, 8, 4, 2, 2, 2},
{15, 6, 3, 3, 3, 3}, {16, 8, 4, 4, 4, 4}, {15, 5, 5, 5, 5, 5},
{12, 6, 6, 6 , 6, 6}, {14, 7, 7, 7, 7, 7}, {16, 8, 8, 8, 8, 8},
{ 9, 9, 9, 9, 9, 9}, {10, 10, 10, 10, 10, 10}, {11, 11, 11, 11, 11, 11},
{12, 12, 12, 12, 12, 12}, {13, 13, 13, 13, 13, 13}, {14, 14, 14, 14, 14, 14},
{15, 15, 15, 15, 15, 15}, {16, 16, 16, 16, 16, 16}, {17, 17, 17, 17, 17, 17},
{18, 18, 18, 18, 18, 18}, {19, 19, 19, 19, 19, 19}, {20, 20, 20, 20, 20, 20},
{21, 21, 21, 21, 21, 21}, {22, 22, 22, 22, 22, 22}, {23, 23, 23, 23, 23, 23},
{24, 24, 24, 24, 24, 24}, {25, 25, 25, 25, 25, 25}, {26, 26, 26, 26, 26, 26},
{27, 27, 27, 27, 27, 27}, {28, 28, 28, 28, 28, 28}, {29, 29, 29, 29, 29, 29},
{30, 30, 30, 30, 30, 30}, {31, 31, 31, 31, 31, 31}
};
static const unsigned int R[][6] =
{
{0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000},
{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000001, 0x00000001},
{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000003, 0x00000003},
{0x00007fff, 0x0000003f, 0x00000007, 0x00000007, 0x00000007, 0x00000007},
{0x0000ffff, 0x000000ff, 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f},
{0x00007fff, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f},
{0x00000fff, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f},
{0x00003fff, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f},
{0x0000ffff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff},
{0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff},
{0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff},
{0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff},
{0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff},
{0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff},
{0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff},
{0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff},
{0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff},
{0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff},
{0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff},
{0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff},
{0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff},
{0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff},
{0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff},
{0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff},
{0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff},
{0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff},
{0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff},
{0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff},
{0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff},
{0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff},
{0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff},
{0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff}
};
unsigned int n; // numerator
const unsigned int s; // s > 0
const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).
unsigned int m; // n % d goes here.
= (n & M[s]) + ((n >> s) & M[s]);
m
for (const unsigned int * q = &Q[s][0], * r = &R[s][0]; m > d; q++, r++)
{
= (m >> *q) + (m & *r);
m }
= m == d ? 0 : m; // OR, less portably: m = m & -((signed)(m - d) >> s); m
This method of finding modulus division by an integer that is one
less than a power of 2 takes at most O(lg(N)) time, where N is the
number of bits in the numerator (32 bits, for the code above).
The number of operations is at most 12 + 9 * ceil(lg(N)).
It finds the result by summing the values in base (1 << s) in parallel. First every other base (1 << s) value is added to the previous one. Imagine that the result is written on a piece of paper. Cut the paper in half, so that half the values are on each cut piece. Align the values and sum them onto a new piece of paper. Repeat by cutting this paper in half (which will be a quarter of the size of the previous one) and summing, until you cannot cut further. After performing lg(N/s/2) cuts, we cut no more; just continue to add the values and put the result onto a new piece of paper as before, while there are at least two s-bit values.
Devised by Sean Anderson, August 20, 2001. A typo was spotted by Randy
E. Bryant on May 3, 2005 (after pasting the code, I had later added
“unsinged” to a variable declaration). As in the previous hack, I
mistakenly commented that we could alternatively assign m = ((m +
1) & d) - 1;
at the end, and Don Knuth corrected me on April
19, 2006 and suggested m = m & -((signed)(m - d) >>
s)
.
On June 18, 2009 Sean Irvine proposed a change that used ((n
>> s) & M[s])
instead of ((n & ~M[s])
>> s)
, which typically requires fewer operations because
the M[s] constant is already loaded.
Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)
unsigned int v; // 32-bit word to find the log base 2 of
unsigned int r = 0; // r will be lg(v)
while (v >>= 1) // unroll for more speed...
{
++;
r}
Find the integer log base 2 of an integer with an 64-bit IEEE float
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
t= (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF; r
From this newly minted double, 252 (expressed as a double) is subtracted, which sets the resulting exponent to the log base 2 of the input value, v. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias, 0x3FF (which is 1023 decimal). This technique only takes 5 operations, but many CPUs are slow at manipulating doubles, and the endianess of the architecture must be accommodated.
Eric Cole sent me this on January 15, 2006. Evan Felix pointed out a typo on April 4, 2006. Vincent Lefèvre told me on July 9, 2008 to change the endian check to use the float’s endian, which could differ from the integer’s endian.
Find the log base 2 of an integer with a lookup table
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
LT};
unsigned int v; // 32-bit word to find the log of
unsigned r; // r will be lg(v)
register unsigned int t, tt; // temporaries
if (tt = v >> 16)
{
= (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
r }
else
{
= (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
r }
The code above is tuned to uniformly distributed output
values.
If your inputs are evenly distributed across all 32-bit values,
then consider using the following:
if (tt = v >> 24)
{
= 24 + LogTable256[tt];
r }
else if (tt = v >> 16)
{
= 16 + LogTable256[tt];
r }
else if (tt = v >> 8)
{
= 8 + LogTable256[tt];
r }
else
{
= LogTable256[v];
r }
To initially generate the log table algorithmically:
[0] = LogTable256[1] = 0;
LogTable256for (int i = 2; i < 256; i++)
{
[i] = 1 + LogTable256[i / 2];
LogTable256}
[0] = -1; // if you want log(0) to return -1 LogTable256
Find the log base 2 of an N-bit integer in O(lg(N)) operations
unsigned int v; // 32-bit value to find the log2 of
const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};
const unsigned int S[] = {1, 2, 4, 8, 16};
int i;
register unsigned int r = 0; // result of log2(v) will go here
for (i = 4; i >= 0; i--) // unroll for speed...
{
if (v & b[i])
{
>>= S[i];
v |= S[i];
r }
}
// OR (IF YOUR CPU BRANCHES SLOWLY):
unsigned int v; // 32-bit value to find the log2 of
register unsigned int r; // result of log2(v) will go here
register unsigned int shift;
= (v > 0xFFFF) << 4; v >>= r;
r = (v > 0xFF ) << 3; v >>= shift; r |= shift;
shift = (v > 0xF ) << 2; v >>= shift; r |= shift;
shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;
shift |= (v >> 1);
r
// OR (IF YOU KNOW v IS A POWER OF 2):
unsigned int v; // 32-bit value to find the log2 of
static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
0xFF00FF00, 0xFFFF0000};
register unsigned int r = (v & b[0]) != 0;
for (i = 4; i > 0; i--) // unroll for speed...
{
|= ((v & b[i]) != 0) << i;
r }
The second version was sent to me by Eric Cole on January 7, 2006. Andrew Shapira subsequently trimmed a few operations off of it and sent me his variation (above) on Sept. 1, 2007. The third variation was suggested to me by John Owens on April 24, 2002; it’s faster, but it is only suitable when the input is known to be a power of 2. On May 25, 2003, Ken Raeburn suggested improving the general case by using smaller numbers for b[], which load faster on some architectures (for instance if the word size is 16 bits, then only one load instruction may be needed). These values work for the general version, but not for the special-case version below it, where v is a power of 2; Glenn Slayden brought this oversight to my attention on December 12, 2003.
Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup
uint32_t v; // find the log base 2 of 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
|= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v
= MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27]; r
If you know that v is a power of 2, then you only need the following:
static const int MultiplyDeBruijnBitPosition2[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
= MultiplyDeBruijnBitPosition2[(uint32_t)(v * 0x077CB531U) >> 27]; r
Eric Cole devised this January 8, 2006 after reading about the entry below to round up to a power of 2 and the method below for computing the number of trailing bits with a multiply and lookup using a DeBruijn sequence. On December 10, 2009, Mark Dickinson shaved off a couple operations by requiring v be rounded up to one less than the next power of 2 rather than the power of 2.
Find integer log base 10 of an integer
unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of
int r; // result goes here
int t; // temporary
static unsigned int const PowersOf10[] =
{1, 10, 100, 1000, 10000, 100000,
1000000, 10000000, 100000000, 1000000000};
= (IntegerLogBase2(v) + 1) * 1233 >> 12; // (use a lg2 method from above)
t = t - (v < PowersOf10[t]); r
This method takes 6 more operations than IntegerLogBase2. It may be sped up (on machines with fast memory access) by modifying the log base 2 table-lookup method above so that the entries hold what is computed for t (that is, pre-add, -mulitply, and -shift). Doing so would require a total of only 9 operations to find the log base 10, assuming 4 tables were used (one for each byte of v).
Eric Cole suggested I add a version of this on January 7, 2006.
Find integer log base 10 of an integer the obvious way
unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of
int r; // result goes here
= (v >= 1000000000) ? 9 : (v >= 100000000) ? 8 : (v >= 10000000) ? 7 :
r (v >= 1000000) ? 6 : (v >= 100000) ? 5 : (v >= 10000) ? 4 :
(v >= 1000) ? 3 : (v >= 100) ? 2 : (v >= 10) ? 1 : 0;
As a result, less than 2.6 operations are needed on average.
On April 18, 2007, Emanuel Hoogeveen suggested a variation on this where the conditions used divisions, which were not as fast as simple comparisons.
Find integer log base 2 of a 32-bit IEEE float
const float v; // find int(log2(v)), where v > 0.0 && finite(v) && isnormal(v)
int c; // 32-bit int c gets the result;
= *(const int *) &v; // OR, for portability: memcpy(&c, &v, sizeof c);
c = (c >> 23) - 127; c
The above is fast, but IEEE 754-compliant architectures utilize subnormal (also called denormal) floating point numbers. These have the exponent bits set to zero (signifying pow(2,-127)), and the mantissa is not normalized, so it contains leading zeros and thus the log2 must be computed from the mantissa. To accomodate for subnormal numbers, use the following:
const float v; // find int(log2(v)), where v > 0.0 && finite(v)
int c; // 32-bit int c gets the result;
int x = *(const int *) &v; // OR, for portability: memcpy(&x, &v, sizeof x);
= x >> 23;
c
if (c)
{
-= 127;
c }
else
{ // subnormal, so recompute using mantissa: c = intlog2(x) - 149;
register unsigned int t; // temporary
// Note that LogTable256 was defined <a href="http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup">earlier</a>
if (t = x >> 16)
{
= LogTable256[t] - 133;
c }
else
{
= (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149;
c }
}
He proposed using memcpy for maximum portability or a union with a float and an int for better code generation than memcpy on some compilers.
Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)
const int r;
const float v; // find int(log2(pow((double) v, 1. / pow(2, r)))),
// where isnormal(v) and v > 0
int c; // 32-bit int c gets the result;
= *(const int *) &v; // OR, for portability: memcpy(&c, &v, sizeof c);
c = ((((c - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127; c
If r is 1, then we have c = int(log2(sqrt((double) v))). If r is 2, then we have c = int(log2(pow((double) v, 1./4))).
On June 11, 2005, Falk Hüffner pointed out that ISO C99 6.5/7 left the type punning idiom (int )& undefined, and he suggested using memcpy.
Count the consecutive zero bits (trailing) on the right linearly
unsigned int v; // input to count trailing zero bits
int c; // output: c will count v's trailing zero bits,
// so if v is 1101000 (base 2), then c will be 3
if (v)
{
= (v ^ (v - 1)) >> 1; // Set v's trailing 0s to 1s and zero rest
v for (c = 0; v; c++)
{
>>= 1;
v }
}
else
{
= CHAR_BIT * sizeof(v);
c }
Jim Cole suggested I add a linear-time method for counting the trailing zeros on August 15, 2007. On October 22, 2007, Jason Cunningham pointed out that I had neglected to paste the unsigned modifier for v.
Count the consecutive zero bits (trailing) on the right in parallel
unsigned int v; // 32-bit word input to count zero bits on right
unsigned int c = 32; // c will be the number of zero bits on the right
&= -signed(v);
v if (v) c--;
if (v & 0x0000FFFF) c -= 16;
if (v & 0x00FF00FF) c -= 8;
if (v & 0x0F0F0F0F) c -= 4;
if (v & 0x33333333) c -= 2;
if (v & 0x55555555) c -= 1;
The number of operations is at most 3 * lg(N) + 4, roughly, for N bit words.
Bill Burdick suggested an optimization, reducing the time from 4 * lg(N) on February 4, 2011.
Count the consecutive zero bits (trailing) on the right by binary search
unsigned int v; // 32-bit word input to count zero bits on right
unsigned int c; // c will be the number of zero bits on the right,
// so if v is 1101000 (base 2), then c will be 3
// NOTE: if 0 == v, then c = 31.
if (v & 0x1)
{
// special case for odd v (assumed to happen half of the time)
= 0;
c }
else
{
= 1;
c if ((v & 0xffff) == 0)
{
>>= 16;
v += 16;
c }
if ((v & 0xff) == 0)
{
>>= 8;
v += 8;
c }
if ((v & 0xf) == 0)
{
>>= 4;
v += 4;
c }
if ((v & 0x3) == 0)
{
>>= 2;
v += 2;
c }
-= v & 0x1;
c }
Matt Whitlock suggested this on January 25, 2006. Andrew Shapira shaved a couple operations off on Sept. 5, 2007 (by setting c=1 and unconditionally subtracting at the end).
Count the consecutive zero bits (trailing) on the right by casting to a float
unsigned int v; // find the number of trailing zeros in v
int r; // the result goes here
float f = (float)(v & -v); // cast the least significant bit in v to a float
= (*(uint32_t *)&f >> 23) - 0x7f; r
The exponent of the 32-bit IEEE floating point representation is shifted down, and the bias is subtracted to give the position of the least significant 1 bit set in v. If v is zero, then the result is -127.
Count the consecutive zero bits (trailing) on the right with modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
= Mod37BitPosition[(-v & v) % 37]; r
It uses only 4 operations, however indexing into a table and performing modulus division may make it unsuitable for some situations. I came up with this independently and then searched for a subsequence of the table values, and found it was invented earlier by Reiser, according to Hacker’s Delight.
Count the consecutive zero bits (trailing) on the right with multiply and lookup
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
= MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27]; r
It requires one more operation than the earlier one involving modulus division, but the multiply may be faster. The expression (v & -v) extracts the least significant 1 bit from v. The constant 0x077CB531UL is a de Bruijn sequence, which produces a unique pattern of bits into the high 5 bits for each possible bit position that it is multiplied against. When there are no bits set, it returns 0.
More information can be found by reading the paper Using de Bruijn Sequences to Index 1 in a Computer Word by Charles E. Leiserson, Harald Prokof, and Keith H. Randall.
On October 8, 2005 Andrew Shapira suggested I add this. Dustin Spicuzza asked me on April 14, 2009 to cast the result of the multiply to a 32-bit type so it would work when compiled with 64-bit ints.
Round up to the next highest power of 2 by float casting
unsigned int const v; // Round this 32-bit value to the next highest power of 2
unsigned int r; // Put the result here. (So v=3 -> r=4; v=8 -> r=8)
if (v > 1)
{
float f = (float)v;
unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
= t << (t < v);
r }
else
{
= 1;
r }
Quick and dirty version, for domain of 1 < v < (1<<25):
float f = (float)(v - 1);
= 1U << ((*(unsigned int*)(&f) >> 23) - 126); r
On September 27, 2005 Andi Smithers suggested I include a technique for casting to floats to find the lg of a number for rounding up to a power of 2. Similar to the quick and dirty version here, his version worked with values less than (1<<25), due to mantissa rounding, but it used one more operation.
Round up to the next highest power of 2
unsigned int v; // compute the next highest power of 2 of 32-bit v
--;
v|= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v ++; v
Note that in the edge case where v is 0, it returns 0, which isn’t a power of 2; you might append the expression v += (v == 0) to remedy this if it matters. It would be faster by 2 operations to use the formula and the log base 2 method that uses a lookup table, but in some situations, lookup tables are not suitable, so the above code may be best. (On a Athlon™ XP 2100+ I’ve found the above shift-left and then OR code is as fast as using a single BSR assembly language instruction, which scans in reverse to find the highest set bit.) It works by copying the highest set bit to all of the lower bits, and then adding one, which results in carries that set all of the lower bits to 0 and one bit beyond the highest set bit to 1. If the original number was a power of 2, then the decrement will reduce it to one less, so that we round up to the same original value.
You might alternatively compute the next higher power of 2 in only 8 or 9 operations using a lookup table for floor(lg(v)) and then evaluating 1<<(1+floor(lg(v))); Atul Divekar suggested I mention this on September 5, 2010.
Devised by Sean Anderson, Sepember 14, 2001.
Pete Hart pointed me to
a
couple newsgroup posts by him and William Lewis in February of 1997,
where they arrive at the same algorithm.
Interleave bits the obvious way
unsigned short x; // Interleave bits of x and y, so that all of the
unsigned short y; // bits of x are in the even positions and y in the odd;
unsigned int z = 0; // z gets the resulting Morton Number.
for (int i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed...
{
|= (x & 1U << i) << i | (y & 1U << i) << (i + 1);
z }
Interleave bits by table lookup
static const unsigned short MortonTable256[256] =
{
0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155,
0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415,
0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455,
0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515,
0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555,
0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015,
0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055,
0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115,
0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155,
0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415,
0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455,
0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515,
0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555,
0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015,
0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055,
0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115,
0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155,
0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415,
0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455,
0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515,
0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555,
0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015,
0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055,
0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115,
0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155,
0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415,
0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455,
0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515,
0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
};
unsigned short x; // Interleave bits of x and y, so that all of the
unsigned short y; // bits of x are in the even positions and y in the odd;
unsigned int z; // z gets the resulting 32-bit Morton Number.
= MortonTable256[y >> 8] << 17 |
z [x >> 8] << 16 |
MortonTable256[y & 0xFF] << 1 |
MortonTable256[x & 0xFF]; MortonTable256
Extending this same idea, four tables could be used, with two of them pre-shifted by 16 to the left of the previous two, so that we would only need 11 operations total.
Interleave bits with 64-bit multiply
In 11 operations, this version interleaves bits of two bytes (rather than shorts, as in the other versions), but many of the operations are 64-bit multiplies so it isn’t appropriate for all machines. The input parameters, x and y, should be less than 256.
unsigned char x; // Interleave bits of (8-bit) x and y, so that all of the
unsigned char y; // bits of x are in the even positions and y in the odd;
unsigned short z; // z gets the resulting 16-bit Morton Number.
= ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *
z 0x0102040810204081ULL >> 49) & 0x5555 |
((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *
0x0102040810204081ULL >> 48) & 0xAAAA;
Interleave bits by Binary Magic Numbers
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};
static const unsigned int S[] = {1, 2, 4, 8};
unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x
unsigned int y; // are in the even positions and bits from y in the odd;
unsigned int z; // z gets the resulting 32-bit Morton Number.
// x and y must initially be less than 65536.
= (x | (x << S[3])) & B[3];
x = (x | (x << S[2])) & B[2];
x = (x | (x << S[1])) & B[1];
x = (x | (x << S[0])) & B[0];
x
= (y | (y << S[3])) & B[3];
y = (y | (y << S[2])) & B[2];
y = (y | (y << S[1])) & B[1];
y = (y | (y << S[0])) & B[0];
y
= x | (y << 1); z
Determine if a word has a zero byte
// Fewer operations:
unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0
bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
The code above may be useful when doing a fast string copy in which a
word is copied at a time; it uses 5 operations.
On the other hand, testing for a null byte in the obvious ways (which
follow) have at least 7 operations (when counted in the most sparing
way), and at most 12.
// More operations:
bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000))
// OR:
unsigned char * p = (unsigned char *) &v;
bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);
Next the high bits of the original word are ORed with these values; thus, the high bit of a byte is set iff any bit in the byte was set.
Finally, we determine if any of these high bits are zero by ORing with ones everywhere except the high bits and inverting the result.
Extending to 64 bits is trivial; simply increase the constants to be 0x7F7F7F7F7F7F7F7F.
For an additional improvement, a fast pretest that requires only 4
operations may be performed to determine if the word may have a
zero byte.
The test also returns true if the high byte is 0x80, so there are
occasional false positives, but the slower and more reliable version
above may then be used on candidates for an overall increase in speed
with correct output.
bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
if (hasZeroByte) // or may just have 0x80 in the high byte
{
= ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
hasZeroByte }
There is yet a faster method — use
hasless
(v,
1), which is defined below; it works in 4 operations and requires no
subsquent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)```
(v - 0x01010101UL), evaluates to a high bit set in any
The subexpression 0x80.
byte whenever the corresponding byte in v is zero or greater than -expression ~v & 0x80808080UL
The sub
evaluates to high bits set in bytes where the byte of v doesn't have its high (so the byte was less than 0x80). Finally, by ANDing these two
bit set -expressions the result is the high bits set where the bytes in v
sub, since the high bits set due to a value greater than 0x80
were zero-expression are masked off by the second.
in the first sub<p>
2, 2004.
Paul Messmer suggested the fast pretest improvement on October <code>hasless(v, 1)</code>
Juha Järvi later suggested 6, 2005, which
on April <a href="http://www.azillionmonkeys.com/qed/asmexample.html">Paul
he found on </a>; previously it was written in a newsgroup post
Hsieh's Assembly Lab27, 1987 by Alan Mycroft.
on April </p><p>
</p><hr>
<h3>
<a name="#ValueInWord">Determine if a word has a byte equal to n
</a>
</h3>
if any byte in a word has a specific value. To do so,
We may want to know
we can XOR the value to test with a word that has been filled with the . Because XORing a value with itself
byte values in which we're interested, we can pass the result to
results in a zero byte and nonzero otherwise<code>haszero</code>.
```c#define hasvalue(x,n) \
(haszero((x) ^ (~0UL/255 * (n))))
Stephen M Bennet suggested this on December 13, 2009 after reading the
entry for haszero
.
Determine if a word has a byte less than n
Test if a word x contains an unsigned byte with value < n. Specifically for n=1, it can be used to find a 0-byte by examining one long at a time, or any byte by XORing x with a mask first. Uses 4 arithmetic/logical operations when n is constant.
Requirements: x>=0; 0<=n<=128
#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)
To count the number of bytes in x that are less than n in 7 operations, use
#define countless(x,n) \
(((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)
Juha Järvi sent this clever technique to me on April 6, 2005. The
countless
macro was added by Sean Anderson on April 10,
2005, inspired by Juha’s countmore
, below.
Determine if a word has a byte greater than n
Test if a word x contains an unsigned byte with value > n.Uses 3 arithmetic/logical operations when n is constant.
Requirements: x>=0; 0<=n<=127
#define hasmore(x,n) (((x)+~0UL/255*(127-(n))|(x))&~0UL/255*128)
To count the number of bytes in x that are more than n in 6 operations, use:
#define countmore(x,n) \
(((((x)&~0UL/255*127)+~0UL/255*(127-(n))|(x))&~0UL/255*128)/128%255)
The macro hasmore
was suggested by Juha Järvi on April 6,
2005, and he added countmore
on April 8, 2005.
Determine if a word has a byte between m and n
When m < n, this technique tests if a word x contains an unsigned byte value, such that m < value < n.It uses 7 arithmetic/logical operations when n and m are constant.
Note: Bytes that equal n can be reported by
likelyhasbetween
as false positives, so this should be
checked by character if a certain result is needed.
Requirements: x>=0; 0<=m<=127; 0<=n<=128
#define likelyhasbetween(x,m,n) \
((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
This technique would be suitable for a fast pretest. A variation that takes one more operation (8 total for constant m and n) but provides the exact answer is:
#define hasbetween(x,m,n) \
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
To count the number of bytes in x that are between m and n (exclusive) in 10 operations, use:
#define countbetween(x,m,n) (hasbetween(x,m,n)/128%255)
Juha Järvi suggested likelyhasbetween
on April 6,
2005.
From there, Sean Anderson created hasbetween
and
countbetween
on April 10, 2005.
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we
want the next permutation of N 1 bits in a lexicographical sense.
For example, if N is 3 and the bit pattern is 00010011, the next
patterns would be 00010101, 00010110, 00011001,00011010, 00011100,
00100011, and so forth. The following is a fast way to compute the next
permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
= (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1)); w
Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
= t | ((((t & -t) / (v & -v)) >> 1) - 1); w
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
A Belorussian translation (provided by Webhostingrating) is available.