## Making a virtue out of sin

Signal processing often requires us to perform a sin(). Indeed, we often have to perform a cos(), too! In fact, we often need to do a lot of them, and do them quickly too! In case this wasn’t clear, we’re talking about those wavy things: sine and cosine waves.

The C Math Library of course contains perfectly fine sin() and cos() functions. However, because they accept arbitrary arguments and perform lots of error checking, they are kinda slow.

The quest was on to come up with something better. One could replace sin() and cos() with a big lookup table, and interpolate between entries. This works, if you don’t mind the vast quantities of memory thus wasted, or the algorithmic complexity of reducing the argument to the right quadrants, and so on.

Fortunately, better ways are possible with small polynomials, particularly if we can limit the argument a little bit. Since sin() and cos() repeat endlessly, restricting the argument to a single period seems convenient.

How accurate should it be? Well, that depends. For double precision, the least significant bit falls at about 1E-16, so an algorithm that gets down to that level should be good enough for most purposes.

For the sin() function, we get this:

```c
double mysin(double x)
{
    const double A = -7.28638965935448382375e-18;
    const double B =  2.79164354009975374566e-15;
    const double C = -7.64479307785677023759e-13;
    const double D =  1.60588695928966278105e-10;
    const double E = -2.50521003012188316353e-08;
    const double F =  2.75573189892671884365e-06;
    const double G = -1.98412698371840334929e-04;
    const double H =  8.33333333329438515047e-03;
    const double I = -1.66666666666649732329e-01;
    const double J =  9.99999999999997848557e-01;
    register double x2 = x*x;
    return (((((((((A*x2+B)*x2+C)*x2+D)*x2+E)*x2+F)*x2+G)*x2+H)*x2+I)*x2+J)*x;
}
```

For the cos() function, we get:

```c
double mycos(double x)
{
    const double c1  =  3.68396216222400477886e-19;
    const double c2  = -1.55289318377801496607e-16;
    const double c3  =  4.77840439714556611532e-14;
    const double c4  = -1.14706678499029860238e-11;
    const double c5  =  2.08767534780769871595e-09;
    const double c6  = -2.75573191273279748439e-07;
    const double c7  =  2.48015873000796780048e-05;
    const double c8  = -1.38888888888779804960e-03;
    const double c9  =  4.16666666666665603386e-02;
    const double c10 = -5.00000000000000154115e-01;
    const double c11 =  1.00000000000000001607e+00;
    register double x2=x*x;
    return (((((((((c1*x2+c2)*x2+c3)*x2+c4)*x2+c5)*x2+c6)*x2+c7)*x2+c8)*x2+c9)*x2+c10)*x2+c11;
}
```

Of course, often we want both sin() and cos(), and it is possible to get both, for the price of one, thanks to SSE2:

```c
void mysincos(double x, double* psin, double* pcos)
{
    const __m128d c1 =_mm_set_pd( 3.68396216222400477886e-19,-7.28638965935448382375e-18);
    const __m128d c2 =_mm_set_pd(-1.55289318377801496607e-16, 2.79164354009975374566e-15);
    const __m128d c3 =_mm_set_pd( 4.77840439714556611532e-14,-7.64479307785677023759e-13);
    const __m128d c4 =_mm_set_pd(-1.14706678499029860238e-11, 1.60588695928966278105e-10);
    const __m128d c5 =_mm_set_pd( 2.08767534780769871595e-09,-2.50521003012188316353e-08);
    const __m128d c6 =_mm_set_pd(-2.75573191273279748439e-07, 2.75573189892671884365e-06);
    const __m128d c7 =_mm_set_pd( 2.48015873000796780048e-05,-1.98412698371840334929e-04);
    const __m128d c8 =_mm_set_pd(-1.38888888888779804960e-03, 8.33333333329438515047e-03);
    const __m128d c9 =_mm_set_pd( 4.16666666666665603386e-02,-1.66666666666649732329e-01);
    const __m128d c10=_mm_set_pd(-5.00000000000000154115e-01, 9.99999999999997848557e-01);
    const __m128d c11=_mm_set_pd( 1.00000000000000001607e+00, 0.00000000000000000000e+00);
    register __m128d x1x1=_mm_set1_pd(x);
    register __m128d x2x2=_mm_mul_pd(x1x1,x1x1);
    register __m128d x1x2=_mm_unpacklo_pd(x1x1,x2x2);
    register __m128d rr=c1;
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c2);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c3);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c4);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c5);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c6);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c7);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c8);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c9);
    rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c10);
    rr=_mm_add_pd(_mm_mul_pd(rr,x1x2),c11);
    _mm_storeh_pd(pcos,rr);
    _mm_storel_pd(psin,rr);
}
```

This uses Intel's SSE2 intrinsics (declared in emmintrin.h), which are supported by a variety of C/C++ compilers. Of course, it goes without saying that if your code is to be remotely portable, you will need to surround this code with the appropriate preprocessor incantations so as to fall back on the vanilla implementation should the processor not be an x86 CPU but something else.
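One way those incantations might look is sketched below. It assumes the standard predefined macros (__SSE2__ on GCC/Clang, _M_X64 and _M_IX86_FP on MSVC); the scalar polynomial bodies are repeated from the listings above only so the fragment stands on its own:

```c
#include <math.h>

/* Scalar polynomial versions, coefficients as in the listings above. */
double mysin(double x)
{
    const double A = -7.28638965935448382375e-18, B =  2.79164354009975374566e-15,
                 C = -7.64479307785677023759e-13, D =  1.60588695928966278105e-10,
                 E = -2.50521003012188316353e-08, F =  2.75573189892671884365e-06,
                 G = -1.98412698371840334929e-04, H =  8.33333333329438515047e-03,
                 I = -1.66666666666649732329e-01, J =  9.99999999999997848557e-01;
    const double x2 = x*x;
    return (((((((((A*x2+B)*x2+C)*x2+D)*x2+E)*x2+F)*x2+G)*x2+H)*x2+I)*x2+J)*x;
}

double mycos(double x)
{
    const double c1  =  3.68396216222400477886e-19, c2  = -1.55289318377801496607e-16,
                 c3  =  4.77840439714556611532e-14, c4  = -1.14706678499029860238e-11,
                 c5  =  2.08767534780769871595e-09, c6  = -2.75573191273279748439e-07,
                 c7  =  2.48015873000796780048e-05, c8  = -1.38888888888779804960e-03,
                 c9  =  4.16666666666665603386e-02, c10 = -5.00000000000000154115e-01,
                 c11 =  1.00000000000000001607e+00;
    const double x2 = x*x;
    return (((((((((c1*x2+c2)*x2+c3)*x2+c4)*x2+c5)*x2+c6)*x2+c7)*x2+c8)*x2+c9)*x2+c10)*x2+c11;
}

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
void mysincos(double x, double* psin, double* pcos)
{
    /* High lane carries the cos coefficients, low lane the sin coefficients. */
    const __m128d k1 =_mm_set_pd( 3.68396216222400477886e-19,-7.28638965935448382375e-18);
    const __m128d k2 =_mm_set_pd(-1.55289318377801496607e-16, 2.79164354009975374566e-15);
    const __m128d k3 =_mm_set_pd( 4.77840439714556611532e-14,-7.64479307785677023759e-13);
    const __m128d k4 =_mm_set_pd(-1.14706678499029860238e-11, 1.60588695928966278105e-10);
    const __m128d k5 =_mm_set_pd( 2.08767534780769871595e-09,-2.50521003012188316353e-08);
    const __m128d k6 =_mm_set_pd(-2.75573191273279748439e-07, 2.75573189892671884365e-06);
    const __m128d k7 =_mm_set_pd( 2.48015873000796780048e-05,-1.98412698371840334929e-04);
    const __m128d k8 =_mm_set_pd(-1.38888888888779804960e-03, 8.33333333329438515047e-03);
    const __m128d k9 =_mm_set_pd( 4.16666666666665603386e-02,-1.66666666666649732329e-01);
    const __m128d k10=_mm_set_pd(-5.00000000000000154115e-01, 9.99999999999997848557e-01);
    const __m128d k11=_mm_set_pd( 1.00000000000000001607e+00, 0.00000000000000000000e+00);
    const __m128d x1x1 = _mm_set1_pd(x);
    const __m128d x2x2 = _mm_mul_pd(x1x1, x1x1);
    const __m128d x1x2 = _mm_unpacklo_pd(x1x1, x2x2); /* low = x, high = x*x */
    __m128d rr = k1;
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k2);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k3);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k4);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k5);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k6);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k7);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k8);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k9);
    rr = _mm_add_pd(_mm_mul_pd(rr, x2x2), k10);
    rr = _mm_add_pd(_mm_mul_pd(rr, x1x2), k11); /* final step: low *= x, high *= x*x */
    _mm_storeh_pd(pcos, rr);
    _mm_storel_pd(psin, rr);
}
#else
void mysincos(double x, double* psin, double* pcos)
{
    /* Vanilla fallback for non-x86 targets: two scalar evaluations. */
    *psin = mysin(x);
    *pcos = mycos(x);
}
#endif
```

Either branch exposes the same mysincos() signature, so callers never need to know which path was compiled in.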

So, just how close are these functions to the C Math Library versions? Well, gnuplot to the rescue once more. Plotting the difference between the math library version and the polynomial version on the -pi to pi interval for cos() and sin(), respectively:

Yes, they look blocky! This is not a bug! They’re close enough that the error is in the last few significant bits of the result, so subtracting one from the other yields values very, very close to zero.

Of course, these sin() and cos() replacements only work if the argument is in the -pi to pi interval. If you can ensure your argument is within this range, you’re done. If it isn’t, however, the following little inline will help:

```c
inline double wrap(double x)
{
    /* The constants are 1/(2*pi) and 2*pi. */
    return x - nearbyint(x*0.159154943091895335768883763373)*6.28318530717958647692528676656;
}
```

A call to nearbyint() compiles to a single instruction on the x86 CPU, and so it's pretty fast.

The magic constants come from a nice paper on the topic: “Fast Polynomial Approximations to Sine and Cosine” by Charles K. Garrett, February 2012.