Making a virtue out of sin

Jul 31st, 2017

Signal processing often requires us to perform a sin(). Indeed, we often have to perform a cos(), too! In fact, we often need to do a lot of them, and do them quickly! In case this wasn’t clear, we’re talking about those wavy things:

[figure: plot of the sin() wave]

The C Math Library of course contains perfectly fine sin() and cos() functions. However, accepting arbitrary arguments and performing lots of error checking makes them kinda slow.

The quest was on to come up with something better. One could replace sin() and cos() with a big lookup table, and interpolate between entries. This works, if you don’t mind the vast quantities of memory thus wasted, or the algorithmic complexity of reducing the argument to the right quadrants, and so on.

Fortunately, better ways are possible with small polynomials, particularly if we can limit the argument a little bit. Since sin() and cos() repeat endlessly, restricting the arguments to the repetition interval seems convenient.

How accurate should it be? Well, that depends. For double precision, the least significant bit falls at about 1E-16, so an algorithm that gets down to that level should be good enough for most purposes.

For the sin() function, we get this:

double mysin(double x){
  const double A = -7.28638965935448382375e-18;
  const double B =  2.79164354009975374566e-15;
  const double C = -7.64479307785677023759e-13;
  const double D =  1.60588695928966278105e-10;
  const double E = -2.50521003012188316353e-08;
  const double F =  2.75573189892671884365e-06;
  const double G = -1.98412698371840334929e-04;
  const double H =  8.33333333329438515047e-03;
  const double I = -1.66666666666649732329e-01;
  const double J =  9.99999999999997848557e-01;
  register double x2 = x*x;
  return (((((((((A*x2+B)*x2+C)*x2+D)*x2+E)*x2+F)*x2+G)*x2+H)*x2+I)*x2+J)*x;
  }

For the cos() function, we get:

double mycos(double x){
  const double c1  =  3.68396216222400477886e-19;
  const double c2  = -1.55289318377801496607e-16;
  const double c3  =  4.77840439714556611532e-14;
  const double c4  = -1.14706678499029860238e-11;
  const double c5  =  2.08767534780769871595e-09;
  const double c6  = -2.75573191273279748439e-07;
  const double c7  =  2.48015873000796780048e-05;
  const double c8  = -1.38888888888779804960e-03;
  const double c9  =  4.16666666666665603386e-02;
  const double c10 = -5.00000000000000154115e-01;
  const double c11 =  1.00000000000000001607e+00;
  register double x2=x*x;
  return (((((((((c1*x2+c2)*x2+c3)*x2+c4)*x2+c5)*x2+c6)*x2+c7)*x2+c8)*x2+c9)*x2+c10)*x2+c11;
  }

Of course, often we want both sin() and cos(), and it is possible to get both, for the price of one, thanks to SSE2:

void mysincos(double x,double* psin,double* pcos){
  const __m128d c1 =_mm_set_pd( 3.68396216222400477886e-19,-7.28638965935448382375e-18);
  const __m128d c2 =_mm_set_pd(-1.55289318377801496607e-16, 2.79164354009975374566e-15);
  const __m128d c3 =_mm_set_pd( 4.77840439714556611532e-14,-7.64479307785677023759e-13);
  const __m128d c4 =_mm_set_pd(-1.14706678499029860238e-11, 1.60588695928966278105e-10);
  const __m128d c5 =_mm_set_pd( 2.08767534780769871595e-09,-2.50521003012188316353e-08);
  const __m128d c6 =_mm_set_pd(-2.75573191273279748439e-07, 2.75573189892671884365e-06);
  const __m128d c7 =_mm_set_pd( 2.48015873000796780048e-05,-1.98412698371840334929e-04);
  const __m128d c8 =_mm_set_pd(-1.38888888888779804960e-03, 8.33333333329438515047e-03);
  const __m128d c9 =_mm_set_pd( 4.16666666666665603386e-02,-1.66666666666649732329e-01);
  const __m128d c10=_mm_set_pd(-5.00000000000000154115e-01, 9.99999999999997848557e-01);
  const __m128d c11=_mm_set_pd( 1.00000000000000001607e+00, 0.00000000000000000000e+00);
  register __m128d x1x1=_mm_set1_pd(x);
  register __m128d x2x2=_mm_mul_pd(x1x1,x1x1);
  register __m128d x1x2=_mm_unpacklo_pd(x1x1,x2x2);
  register __m128d rr=c1;
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c2);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c3);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c4);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c5);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c6);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c7);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c8);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c9);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c10);
  rr=_mm_add_pd(_mm_mul_pd(rr,x1x2),c11);
  _mm_storeh_pd(pcos,rr);
  _mm_storel_pd(psin,rr);
  }

This uses Intel SSE2 intrinsics, which are supported by a variety of C/C++ compilers. Of course, it goes without saying that if your code is to be remotely portable, you will need to surround this code with the appropriate preprocessor incantations so as to fall back on the vanilla implementation should the processor not be an x86 CPU.

So, just how close are these functions to the C Math Library version? Well, GNUPlot to the rescue once more. Plotting the difference between the math library version and the polynomial version on the -pi to pi interval for the cos() and sin(), respectively:
[figures: cosdiff and sindiff, the difference between the library versions and mycos()/mysin() over the -pi to pi interval]
Yes, they look blocky! This is not a bug! They’re close enough that the error is in the last few significant bits of the result, so subtracting these numbers will be very, very close to zero.

Of course, these sin() and cos() replacements only work if the argument is in the -pi to pi interval. If you can ensure your argument is within this range, you’re done. If it isn’t, however, the following little inline will help:

inline double wrap(double x){ 
  return x-nearbyint(x*0.159154943091895335768883763373)*6.28318530717958647692528676656; 
  }

A call to nearbyint() compiles to a single instruction on x86 CPUs, and so it’s pretty fast.

The magic constants come from a nice paper on the topic, “Fast Polynomial Approximations to Sine and Cosine,” by Charles K. Garrett, Feb 2012.


Fast, Vectorizable Arc Tangent

Jul 30th, 2017

Arctangent(x) (or atan(x)) is an often-used function to obtain the angle given the slope. Closely related is atan2(y,x) which usually returns the full angle.
While the library implementation is usually not horrible in terms of performance, it can still be beneficial to have a special implementation so vectorization is possible.
I’m not going to give the vector version here, but a really fast scalar version. When inlined, chances are good your compiler may auto-vectorize it. However, an implementation using Intel intrinsics should be straightforward if your compiler doesn’t auto-vectorize.
Thus, the version presented here may become 4x faster with SSE, 8x faster with AVX, and 16x faster with the new AVX512 ISA extension.
Regardless, the versions presented here are quite a bit faster than the library functions, even before being inlined.

The trick to obtaining high performance is to eliminate conditional branches from the code; thus, don’t expect proper error handling for NaN’s and so on. If that is important to you, stick with the library version. However, if you need speed, read on!

Let’s start with atan(x). To get good accuracy, we use the following identities to reduce the argument, run a polynomial approximation, then reverse the effects of the argument reduction prior to delivering the final result.
First, a recap of some of the common identities:

  • atan(-x) = -atan(x)
  • atan(1/x) = 0.5*pi-atan(x), for x>0

The first one is easy:

double z=fabs(x);

Next, we use a nice trick to reduce the argument z down to the 0..1 range:

double a=fmin(z,1.0)/fmax(z,1.0);

This expression yields a=z if z<1, and a=1/z if z>1. It does this without branches, and as we know from previous articles, this is important as conditional branches don’t vectorize.

In the next step, we run our a through a polynomial (more about that later).

double s=a*a;
double p=a+a*s*(c3+s*(c5+...c39));    // More details to follow

Now we need to reverse the argument reduction (2nd identity), for which we recall the handy blend() function from the previous article:

p=blend(1.0,z,0.5*pi-p,p);         // p = 1<z ? pi/2 - p : p

The final step is to reverse the sign removal (1st identity). We use the standard copysign() function for this:

result=copysign(p,x);              // result = 0<x ? p : -p

And we’re done!

Of course, we swept something under the rug: how to get the polynomial approximation for atan(x) on the limited interval x=0…1.

The best mathematical technique is a so-called minimax polynomial, closely related to Chebyshev polynomials. These polynomials have the nice property of minimizing the maximum error between the true function and the approximation over the entire interval. Thus, they’re a lot better than Taylor approximations, which are only accurate near a particular value of x.

However, due to numerical errors, the mathematically pure Chebyshev polynomial may not give the most accurate answer. Thankfully, some brilliant people at Inria developed a nice piece of software (http://sollya.gforge.inria.fr/) to calculate the coefficients for us, given a desired numerical precision.

For double precision, we want accuracy down to about 1E-16 (as we have only 16 digits). For single precision we should shoot for the 1E-6 or 1E-7 ballpark. Beyond that, we would include more polynomial terms but not actually improve numerical results.

So, without further ado, I present here the double precision version:

// Fast arctan(x) algorithm
// Error in the order of ~2.22E-16
double fastatan(double x){
  const double c3 =-3.3333333333331788272957396657147910445928573608398E-1;
  const double c5 = 1.9999999999746972956238266760919941589236259460449E-1;
  const double c7 =-1.4285714270985122587021010076568927615880966186523E-1;
  const double c9 = 1.1111110670649392007103273272150545381009578704834E-1;
  const double c11=-9.0909011195370925673131523581105284392833709716797e-2;
  const double c13= 7.6922118180920429075797528639668598771095275878906e-2;
  const double c15=-6.6658528038443493057840782967105042189359664916992e-2;
  const double c17= 5.8772701139760089028563072588440263643860816955566e-2;
  const double c19=-5.2390921524556287314222657869322574697434902191162e-2;
  const double c21= 4.6735478230248365949517364015264320187270641326904e-2;
  const double c23=-4.0917561483705074121264289033206296153366565704346e-2;
  const double c25= 3.4052860223616393531287371843063738197088241577148e-2;
  const double c27=-2.5807287359851206060001871378517535049468278884888e-2;
  const double c29= 1.6958625295544118433133107259891403373330831527710e-2;
  const double c31=-9.1701096131817233514382792236574459820985794067383e-3;
  const double c33= 3.8481862661788874928336934289063719916157424449921e-3;
  const double c35=-1.1612018409582773505878128261770143581088632345199e-3;
  const double c37= 2.2237585554124372289389044432539321860531345009804e-4;
  const double c39=-2.0191399088793571194207221441985211640712805092335e-5;
  double z=fabs(x);                                     // z=|x|
  double a=fmin(z,1.0)/fmax(z,1.0);                     // a = 1<z ? 1/z : z
  double s=a*a;                                         // a^2
  double p=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*(c15+s*(c17+s*(c19+s*(c21+s*(c23+s*(c25+s*(c27+s*(c29+s*(c31+s*(c33+s*(c35+s*(c37+s*c39))))))))))))))))));
  p=blend(1.0,z,1.57079632679489661923132-p,p);         // r = 1<z ? pi/2 - p : p
  return copysign(p,x);
  }

Since single precision has much less accuracy, we may implement that version with many fewer terms, and yet still match the library version quite closely.

Behold the single-precision version:

// Fast arctan(x) algorithm
// Error in the order of ~2.4E-7
float fastatanf(float x){
  const float c3 =-3.33319157361984252929687E-1f;
  const float c5 = 1.99689209461212158203125E-1f;
  const float c7 =-1.40156880021095275878906E-1f;
  const float c9 = 9.90508347749710083007812E-2f;
  const float c11=-5.93664981424808502197265E-2f;
  const float c13= 2.41728331893682479858398E-2f;
  const float c15=-4.67213569208979606628417E-3f;
  float z=fabsf(x);                                     // z=|x|
  float a=fminf(z,1.0f)/fmaxf(z,1.0f);                  // a = 1<z ? 1/z : z
  float s=a*a;                                          // a^2
  float p=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*c15))))));
  p=blend(1.0f,z,1.57079632679489661923132f-p,p);       // r = 1<z ? pi/2 - p : p
  return copysignf(p,x);
  }

So, just how fast are these functions compared to the c-library equivalents?
Keeping in mind that the measurements often influence the result, I’ve clocked the non-inlined versions of these functions at around 25 cycles (on a 4 GHz Skylake processor). The library version of atan() was clocked around 64 cycles, while the library version of atanf() was a surprisingly good 33 cycles.
But keep in mind, with inlining and vectorization the new functions may become substantially faster still, depending on the vector length. Thus, for AVX we should be able to expect 8x speed improvement for the single-precision atanf(), or around 3.125 cycles per argument.

The other interesting function we can now implement, with very minor alterations, is atan2(y,x). It’s my favorite since you get the full angle.

The code is very similar, but we need a bit more work to reverse the argument reduction. Fortunately, blend() can be used for all of these.

The full code for the double precision version is below:

// Fast atan2(y,x) algorithm
// Error in the order of ~4.44E-16
double fastatan2(double y,double x){
  const double c3 =-3.3333333333331788272957396657147910445928573608398E-1;
  const double c5 = 1.9999999999746972956238266760919941589236259460449E-1;
  const double c7 =-1.4285714270985122587021010076568927615880966186523E-1;
  const double c9 = 1.1111110670649392007103273272150545381009578704834E-1;
  const double c11=-9.0909011195370925673131523581105284392833709716797e-2;
  const double c13= 7.6922118180920429075797528639668598771095275878906e-2;
  const double c15=-6.6658528038443493057840782967105042189359664916992e-2;
  const double c17= 5.8772701139760089028563072588440263643860816955566e-2;
  const double c19=-5.2390921524556287314222657869322574697434902191162e-2;
  const double c21= 4.6735478230248365949517364015264320187270641326904e-2;
  const double c23=-4.0917561483705074121264289033206296153366565704346e-2;
  const double c25= 3.4052860223616393531287371843063738197088241577148e-2;
  const double c27=-2.5807287359851206060001871378517535049468278884888e-2;
  const double c29= 1.6958625295544118433133107259891403373330831527710e-2;
  const double c31=-9.1701096131817233514382792236574459820985794067383e-3;
  const double c33= 3.8481862661788874928336934289063719916157424449921e-3;
  const double c35=-1.1612018409582773505878128261770143581088632345199e-3;
  const double c37= 2.2237585554124372289389044432539321860531345009804e-4;
  const double c39=-2.0191399088793571194207221441985211640712805092335e-5;
  double absx=fabs(x);                                  // |x|
  double absy=fabs(y);                                  // |y|
  double a=fmin(absx,absy)/fmax(absx,absy);             // a = |x|<|y| ? |x|/|y| : |y|/|x|
  double s=a*a;
  double r;
  r=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*(c15+s*(c17+s*(c19+s*(c21+s*(c23+s*(c25+s*(c27+s*(c29+s*(c31+s*(c33+s*(c35+s*(c37+s*c39))))))))))))))))));
  r=blend(absy,absx,r,1.57079632679489661923132-r);
  r=blend(0.0,x,r,3.1415926535897932384626433832795028841971693993751-r);
  r=blend(0.0,y,r,-r);
  return r;
  }

The single-precision version atan2f(y,x) is almost identical, except for the number of terms in the polynomial:

// Fast atan2f(y,x) algorithm
// Error in the order of ~4.8E-7
float fastatan2f(float y,float x){
  const float c3 =-3.33319157361984252929687E-1f;
  const float c5 = 1.99689209461212158203125E-1f;
  const float c7 =-1.40156880021095275878906E-1f;
  const float c9 = 9.90508347749710083007812E-2f;
  const float c11=-5.93664981424808502197265E-2f;
  const float c13= 2.41728331893682479858398E-2f;
  const float c15=-4.67213569208979606628417E-3f;
  float absx=fabsf(x);                                  // |x|
  float absy=fabsf(y);                                  // |y|
  float a=fminf(absx,absy)/fmaxf(absx,absy);            // a = |x|<|y| ? |x|/|y| : |y|/|x|
  float s=a*a;
  float r;
  r=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*c15))))));
  r=blend(absy,absx,r,1.57079632679489661923132f-r);
  r=blend(0.0f,x,r,3.1415926535897932384626433832795028841971693993751f-r);
  r=blend(0.0f,y,r,-r);
  return r;
  }

The new atan2() fares very well against the standard library implementation: 25 cycles vs 138 cycles. The single-precision version clocks in at 26 cycles vs 71 cycles for the standard library. All with the non-inlined versions.

For those for whom these functions are still not fast enough, I have a couple of suggestions:

  • Use inlined versions of these functions.
  • Make sure your compiler’s target-architecture setting includes the SSE4.1 instruction-set extension (for blend), or better yet, AVX and AVX2. If you’re blessed with a brand-new AVX512-capable machine, make sure its ISA extension is enabled when you compile.
  • Rearranging the polynomial evaluation using e.g. an Estrin scheme may yield additional improvements.

The error w.r.t. the standard c-library is very minimal, but keep in mind the code above does not handle floating point oddities like NaN’s and Infinity and such.

Just feed them nice numbers and you should be OK.


Branch-Free Blend()

Jul 30th, 2017

Often, we want to evaluate simple expressions like:

result=(a<b) ? x : y;

Clearly, this implies performing a conditional branch. This is not a huge problem, unless the code fragment absolutely needs to be as fast as possible.
A branch has several performance-impacting problems:

  • Due to the uncertain outcome of the condition, the processor will often mispredict the path through the code, since it must guess before the condition has been evaluated.
  • Automatic vectorization will be difficult, as different vector columns will have different condition-outcomes.

If we descend into the Intel assembly realm, we find that a comparison of floating point numbers leaves an all-zeros or all-ones mask in the SSE or AVX register. Thus, we can make this code branch-free by combining the comparison with some logical masking:

__m128d result,c,a,b,x,y;
c=_mm_cmplt_sd(a,b);
result=_mm_or_pd(_mm_and_pd(c,x),_mm_andnot_pd(c,y));

Later Intel processors supporting SSE4.1 include a new _mm_blendv_pd() operation which combines these three instructions into a single one:

__m128d result,c,a,b,x,y;
c=_mm_cmplt_sd(a,b);
result=_mm_blendv_pd(y,x,c);

To make this a little bit more palatable for the casual user who isn’t too familiar with Intel intrinsic programming, I’ve wrapped these into some more accessible form:

// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline double blend(double a,double b,double x,double y){
#if defined(__SSE4_1__)
  return _mm_cvtsd_f64(_mm_blendv_pd(_mm_set_sd(y),_mm_set_sd(x),_mm_cmplt_sd(_mm_set_sd(a),_mm_set_sd(b))));
#elif defined(__SSE2__)
  __m128d cc=_mm_cmplt_sd(_mm_set_sd(a),_mm_set_sd(b));
  return _mm_cvtsd_f64(_mm_or_pd(_mm_and_pd(cc,_mm_set_sd(x)),_mm_andnot_pd(cc,_mm_set_sd(y))));
#else
  return a<b ? x : y;
#endif
  }

And an equivalent version for single precision math:

// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline float blend(float a,float b,float x,float y){
#if defined(__SSE4_1__)
  return _mm_cvtss_f32(_mm_blendv_ps(_mm_set_ss(y),_mm_set_ss(x),_mm_cmplt_ps(_mm_set_ss(a),_mm_set_ss(b))));
#elif defined(__SSE__)
  __m128 cc=_mm_cmplt_ss(_mm_set_ss(a),_mm_set_ss(b));
  return _mm_cvtss_f32(_mm_or_ps(_mm_and_ps(cc,_mm_set_ss(x)),_mm_andnot_ps(cc,_mm_set_ss(y))));
#else
  return a<b ? x : y;
#endif
  }

When compiled with optimization, these little routines perform very well indeed. In addition, GCC is clever enough to auto-vectorize when these routines are in a loop.
Of course, for this to work, make sure these functions are defined in a header file, so the compiler can see their bodies at every call site.
On machines where _mm_blendv_pd() and _mm_blendv_ps() are available, the resulting output is only two instructions: the compare and the blend.


Handling vector loop left-overs with masked loads and stores

Dec 4th, 2013

In the previous example we used a nice little trick which was not available until AVX came on the scene: masked loads and stores.
In this note we’ll go a little deeper into the use of masked loads and stores, and how it can greatly help in handling left-overs after vector loops, as well as dealing with data structures that are simply not a whole multiple of the natural vector size.

We start with a simple problem of adding two 3D vectors:

// Define a mask for double precision 3d-vector
#define MMM _mm256_set_epi64x(0,~0,~0,~0)
 
// We want to do c=a+b vector-sum
FXVec3d a,b,c;
 
// Set a and b somehow
 
// Use AVX intrinsics
_mm256_maskstore_pd(&c[0],MMM,_mm256_add_pd(_mm256_maskload_pd(&a[0],MMM),
                                            _mm256_maskload_pd(&b[0],MMM)));

This was pretty easy, right? Note that Intel defined these masked loads and stores in such a way that the masked-off locations are not touched; i.e. they’re not simply loaded and written back with the original values, but never accessed at all. This is important as you don’t want to incur segmentation violations when your vector happens to be the last thing in a memory page!

Next, we move on to a slightly more sophisticated use: the wrap-up of the left-overs of a vector loop. Note that with masked loads and stores, we can typically perform the last operation in vector mode as well; we don’t have to resort to plain scalar code like you had to with SSE.

void add_vectors(float* result,const float* src1,const float* src2,int n){
  register __m256i mmmm;
  register __m256 aaaa,bbbb,rrrr;
  register int i=0;
 
  // Vector loop adds 8 pairs of floats at a time
  while(i<n-8){
    aaaa=_mm256_loadu_ps(&src1[i]);
    bbbb=_mm256_loadu_ps(&src2[i]);
    rrrr=_mm256_add_ps(aaaa,bbbb);
    _mm256_storeu_ps(&result[i],rrrr);
    i+=8;
    }
 
  // Load the mask at index n-i; this should be in the range 0...8.
  mmmm=_mm256_castps_si256(_mm256_load_ps((const float*)mask8i[n-i]));
 
  // Use masked loads
  aaaa=_mm256_maskload_ps(&src1[i],mmmm);
  bbbb=_mm256_maskload_ps(&src2[i],mmmm);
 
  // Same vector operation as main loop
  rrrr=_mm256_add_ps(aaaa,bbbb);
 
  // Use masked store
  _mm256_maskstore_ps(&result[i],mmmm,rrrr);
  }

Note that the loop goes one vector short if n is a multiple of eight: since the mop-up code is executed unconditionally, we’d rather do this with actual payload, not with all data masked out.
Also note that we don’t have a special case for n==0. In the rare case that this happens, we will just execute the mop-up code with an all-zeroes mask!

Left to do is build a little array with mask values; due to the observation above, this will have 9 entries, not 8!

static __align(32) const int  mask8i[9][8]={
  { 0, 0, 0, 0, 0, 0, 0, 0},
  {-1, 0, 0, 0, 0, 0, 0, 0},
  {-1,-1, 0, 0, 0, 0, 0, 0},
  {-1,-1,-1, 0, 0, 0, 0, 0},
  {-1,-1,-1,-1, 0, 0, 0, 0},
  {-1,-1,-1,-1,-1, 0, 0, 0},
  {-1,-1,-1,-1,-1,-1, 0, 0},
  {-1,-1,-1,-1,-1,-1,-1, 0},
  {-1,-1,-1,-1,-1,-1,-1,-1}
  };

In the above, the __align() macro is fleshed out differently depending on your compiler; however it should ensure that the array is aligned to a multiple of 32 bytes (the size of an AVX vector).

Bottom line: the new AVX masked loads and stores solve a real problem that was always a bit awkward to solve before; they allow the mop-up code to be vectorized the same as the main loop, which is important as you may want to ensure the last couple of numbers get “the same treatment” as the rest of them.


FUN With AVX

Jul 24th, 2013

Over the last few years, Intel and AMD have added 256-bit vector-support to their processors. The support for these wider vectors is commonly known as AVX (Advanced Vector eXtension).

Since wider vectors also introduce more processor state, in order to use these features it’s not enough to have a CPU capable of these AVX vectors; your operating system and compiler need to be aware of it as well.

For maximum portability, I recommend using the Intel intrinsics. These are supported by GCC, LLVM, as well as late-model Microsoft and Intel compilers. The advantages of using intrinsics are:

  1. They are easier for the developer to work with, since you can embed them in your regular C/C++ code.
  2. The “smarts” of the compiler regarding register assignments, common subexpression elimination, and other data-flow analysis goodies are at your disposal.
  3. The compiler’s target-architecture setting will automatically select the new VEX instruction encoding, even for code originally written with SSE in mind.

The matrix classes in FOX were originally vectorized for SSE (actually, SSE2/SSE3). Compiling with -mavx will automatically kick in the new VEX encoding for these same SSE intrinsics. This is nice because AVX supports three-operand instructions (of the form A = B op C) rather than the old two-operand instructions (A = A op B). This means you can typically make do with fewer registers, and quite possibly eliminate useless mov instructions. Your code will be correspondingly smaller and faster, with no work at all!

However, this of course does not fully exploit the new goodies AVX brings to the table.  The most obvious benefit is the wider vectors, which of course means you can work with twice as much data as before.

For example, given 4×4 double precision matrices a[4][4] and b[4][4], we can now add a and b into a result r[4][4] like:

  __m256d aa,bb,cc;
  aa=_mm256_loadu_pd(a[0]);
  bb=_mm256_loadu_pd(b[0]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[0],cc);
  aa=_mm256_loadu_pd(a[1]);
  bb=_mm256_loadu_pd(b[1]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[1],cc);
  aa=_mm256_loadu_pd(a[2]);
  bb=_mm256_loadu_pd(b[2]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[2],cc);
  aa=_mm256_loadu_pd(a[3]);
  bb=_mm256_loadu_pd(b[3]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[3],cc);

This little code fragment performs 16 double precision adds in just 4 vector instructions!
Note that unlike the old SSE code, you now declare vector variables as __m256 (float), __m256i (integer), or __m256d (double).
The penalty for accessing unaligned memory addresses is much less for AVX, and thus we can use unaligned loads and stores, at a very modest (and usually not measurable) speed penalty. If you really want to go all out, however, remember to align things to 32 bytes now, not 16 bytes like you did for SSE!

For FOX’s matrix classes, compatibility with existing end-user code requires that variables can not be relied upon to be aligned, and thus unaligned accesses are used throughout. This obviates the need for end-user code to be updated for alignment restrictions.

Many of the usual suspects from SSE have extended equivalents in the AVX world: _mm_add_ps() becomes _mm256_add_ps(), _mm_addsub_ps() becomes _mm256_addsub_ps(), and so on.

Detection of AVX on your CPU.

Detection of AVX can be done using the CPUID instruction. However, unlike SSE3 and SSE4.x, the introduction of AVX not only added new instructions to the processor, it also added new state, due to the wider vector registers. Consequently, just knowing that your CPU can do AVX isn’t enough.

You also need the operating system to support the extension, because the extra state in the processor must be saved and restored when the Operating System preempts your process. Consequently, executing AVX instructions on an Operating System which does not support it will likely result in an “Illegal Instruction” exception. To put it bluntly, your program will core-dump!

Fortunately, Operating System support for AVX is now also available through CPUID. There are three steps involved:

  1. Check AVX support, using CPUID function code #1.  The ECX and EDX registers are used to return a number of feature bits, various extensions to the core x86 instruction set.  The one we’re looking for in this case is ECX bit #28. If on, we’ve got AVX in the hardware.
  2. Next, Intel recommends checking ECX bit #27. This feature bit represents the OSXSAVE feature: the Operating System has enabled XSAVE, a faster way to save processor state. If XSAVE is not enabled by the O.S., then AVX is likely not available.
  3. Finally, a new register is available in the CPU indicating the Operating System has enabled state-saving of the full AVX state. Just like the processor tick counter, this register can be obtained using a new magic instruction: XGETBV. XGETBV populates the EDX:EAX register pair with feature flags indicating which processor state the Operating System is aware of. At this time, x86 processors support three processor-state subsets: x87 FPU state, SSE state, and AVX state. This information is represented by three bits in the EAX register. For AVX, bit #2 indicates the Operating System indeed saves AVX state and has enabled AVX instructions to be available.

All this sounds pretty complicated, unless you’re an assembly-language programmer.  So, to make life a bit easier, the FOX CPUID API’s have been updated to do some of this hard work for you.

To perform simple feature tests, use the new fxCPUFeatures() API. It returns bit-flags for most instruction-sets added on top of plain x86. In the case of AVX, it simply disables AVX, AVX2, FMA, XOP, and FMA4 if the operating system does not support the extended state.

More on AVX in subsequent posts.

 


Fast Power of Two Test

Mar 6th, 2013

A quick and nifty check to see if a number is a power of two is to note that, when you subtract 1 from a power of two, you get a number with all ones below the original single set bit. For example:

1000 
  -1
0111

Note that the resulting number has no bits in common with the original number; this property only holds for powers of two. For any other number, some bits are left standing.
Thus, we can test for power-of-two-ness by:

inline bool ispowof2(int x){ 
  return (x&(x-1))==0; 
}

Which is probably the fastest possible way to do this.


The Case for Negative Semaphore Counts

Feb 26th, 2013

The new FXParallel feature being worked on in FOX has an atomic variable for tracking completed tasks of a task-group. This kind of works as follows:

if(atomicAdd(&counter,-1)==1){
  completion.post();
  }

In other words, when the specified number of tasks has completed, the counter reaches zero (atomicAdd() returns the previous value of the variable) and then we ping the semaphore.

Of course, behind the scenes, a semaphore is maintaining an atomic counter already! Maintaining our own atomic variable therefore would seem redundant. Wouldn’t it be great if we could just initialize a semaphore to a negative value, and post() it until it reaches 1, releasing the waiting thread from its call to wait()?

Sadly, you can’t initialize semaphores to a negative value. But it would be great if you could, and (at least on Linux) it could be done with minimal effort. In fact, simply changing the semaphore variable to a signed int and allowing it to be initialized to a negative value is really all it takes.


Dell XPS15 Surgery

Feb 22nd, 2013

So I got this Dell XPS15 laptop, and decided to bump up the specs a bit.  A quick visit to NewEgg and I received a new 1TB laptop drive, two RAM sticks, and a nice 120GB SSD; the latter was especially nice since I needed some serious space to host both Fedora Linux and Windows 7 on this machine.

Now, this XPS15 is a great little machine, but service friendly it is not! Whereas my old clunker laptop had nice access panels for memory, drive, and batteries, the XPS15 needs deep surgery to get at these things.

Fortunately, a service manual exists that details the steps (and detailed they are!).  First, to get at the gizzards you will need a mini Torx #T5 screwdriver.  This is of course not something the average Joe has lying around; fortunately, there are a few places that sell these.

After the Torx screwdrivers arrived, I managed to pry the bottom off. Along the way, I discovered a secret button on the bottom, hidden under the rubber sheet, that apparently exists to diagnose battery charge levels. You’ll find this button under the front-left corner.

Once the bottom was off, another surprise awaited: in order to get to the second DIMM slot, you will need to remove the motherboard! And the motherboard won’t come off without removing the fan, the processor heat-pipe, the battery, the hard drive, the SSD, and detaching the miniscule connectors that hold all this together.

If this were a car, it would be French!

At any rate, once I got this far, of course I was committed! So off the system fan came, and the heat-pipe came, and the battery, and the hard drive, and the SSD, and all the miniscule connectors.

Finally, the hidden DIMM slot revealed itself, and the RAM went in.  Now the challenge was to put humpty-dumpty together again. Needless to say, I was especially concerned about reinstalling the processor heat-sink. Some solvent was brought into action to clean off the remaining heat-conducting compound, and a fresh dollop of silver heat paste was applied.

The remaining dominoes quickly fell into place, except for one mysteriously missing screw near the middle of the case.

Typical!  Usually, I have parts left over ;-)

After a few fretful moments, it dawned on me that one of the case screws for the bottom doubled as fastener for the motherboard itself! After a few more sweaty moments the bottom was back on, and the moment of truth was upon me.

Press the power button…. And it fired right up!

Moral of the story: modern hardware is not really designed for upgradability. So, unless you have nerves of steel, just buy a loaded version right off the bat and don’t bother upgrading it!


WordPress Pressing Problems

Feb 22nd, 2013

Apparently, you cannot load WordPress into your document root, set stuff up, and then change the domain name of your machine.

For some unfathomable reason, the original URL (containing your domain name!) gets hardwired into the database in lots of different places.

Using the MySQL admin tool, I tried to track some of them down, and was partially successful.  Then I tried dumping the entire WP database to a file, performing a search and replace on the URL, and loading it back in.

You guessed it, no dice.

In sharp contrast, something like the SMF forum just works like a charm.  No hardwired anything; all resources are relative to the base domain name, as it should be.

Why are domain names hardwired into the WordPress database during install? And why is there no transparent way to change this (if there are indeed convincing reasons it’s necessary)?

Anyway, it turns out you can switch to a new theme, and then at least the URLs in the new theme point to the correct place.  Heaven knows what will befall me should my site ever change domain names…


Conversion of Unsigned Int to Float

Aug 17th, 2011

Conversion of unsigned int to float appears to be difficult for the x86-64 processor, whereas conversion of signed int to float is directly supported in hardware.

The GCC compiler is clever enough to use the hardware instruction CVTSI2SS. However, in the case of unsigned int to float, it must treat numbers that would be negative in two’s complement notation differently, so the generated code contains a branch!

Doing this with random numbers means the branch can go either way, with 50% probability.  No amount of hardware branch prediction can help with this!

Rewriting the code to use signed integers instead of unsigned ones (basically, dropping the most significant bit) sped up a critical piece of code by 10%.
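The workaround can be sketched as follows (the function names are illustrative, not from the actual code). Masking off the most significant bit keeps every value in the signed range, so the compiler can emit a single CVTSI2SS with no branch, at the cost of losing that top bit of the input:

```cpp
#include <cstdint>

// Unsigned path: the compiler must special-case values >= 2^31,
// so the generated code contains a branch.
float slow_u2f(uint32_t u){ return (float)u; }

// Signed path: maps directly onto CVTSI2SS.
float s2f(int32_t s){ return (float)s; }

// Dropping the top bit restricts values to the signed range;
// the conversion then compiles to a branchless CVTSI2SS.
float fast_u2f(uint32_t u){ return (float)(int32_t)(u & 0x7fffffffu); }
```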

A very noticeable improvement!
