Handling vector loop left-overs with masked loads and stores

Dec 4th, 2013

In the previous example we used a nice little trick that was not available until AVX came on the scene: masked loads and stores.
In this note we'll go a little deeper into the use of masked loads and stores, and how they can greatly help in handling left-overs after vector loops, as well as in dealing with data structures that are simply not a whole multiple of the natural vector size.

We start with a simple problem of adding two 3D vectors:

// Define a mask for double precision 3d-vector
#define MMM _mm256_set_epi64x(0,~0,~0,~0)
 
// We want to do c=a+b vector-sum
FXVec3d a,b,c;
 
// Set a and b somehow
 
// Use AVX intrinsics
_mm256_maskstore_pd(&c[0],MMM,_mm256_add_pd(_mm256_maskload_pd(&a[0],MMM),
                                            _mm256_maskload_pd(&b[0],MMM)));

This was pretty easy, right? Note that Intel defined these masked loads and stores in such a way that the masked-off locations are not touched at all; they are not simply loaded and written back with the original values, but never accessed. This is important, as you don't want to incur segmentation violations when your vector happens to be the last thing in a memory page!
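To make these semantics concrete, here is a scalar model of what a masked store does; the function name and layout are mine for illustration, not an Intel API. Only lanes whose mask sign bit is set are written; the other memory locations are never accessed.

```c
#include <assert.h>

/* Scalar model of _mm256_maskstore_pd: write src[i] to dst[i] only when
   the sign bit (bit 63) of mask[i] is set; masked-off slots are untouched. */
static void masked_store_model(double* dst, const long long* mask,
                               const double* src, int n) {
  for (int i = 0; i < n; i++) {
    if (mask[i] < 0) {      /* sign bit set: lane is active */
      dst[i] = src[i];
    }                       /* else: dst[i] is never read or written */
  }
}
```

With the 3D-vector mask from above (three active lanes, one inactive), the fourth destination slot is left exactly as it was.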

Next, we move on to a little more sophisticated use, the wrap-up of left-overs of a vector loop; note that with the masked load and store, we can typically perform the last operation in vector mode as well; we don’t have to resort to plain scalar code like you had to do with SSE.

void add_vectors(float* result,const float* src1,const float* src2,int n){
  register __m256i mmmm;
  register __m256 aaaa,bbbb,rrrr;
  register int i=0;
 
  // Vector loop adds 8 pairs of floats at a time
  while(i<n-8){
    aaaa=_mm256_loadu_ps(&src1[i]);
    bbbb=_mm256_loadu_ps(&src2[i]);
    rrrr=_mm256_add_ps(aaaa,bbbb);
    _mm256_storeu_ps(&result[i],rrrr);
    i+=8;
    }
 
  // Load the mask at index n-i; this should be in the range 0...8.
  mmmm=_mm256_castps_si256(_mm256_load_ps((const float*)mask8i[n-i]));
 
  // Use masked loads
  aaaa=_mm256_maskload_ps(&src1[i],mmmm);
  bbbb=_mm256_maskload_ps(&src2[i],mmmm);
 
  // Same vector operation as main loop
  rrrr=_mm256_add_ps(aaaa,bbbb);
 
  // Use masked store
  _mm256_maskstore_ps(&result[i],mmmm,rrrr);
  }

Note that the loop stops one vector short when n is a multiple of eight: since the mop-up code is executed unconditionally, we'd rather do this with an actual payload, not with all data masked out.
Also note that we don’t have a special case for n==0. In the rare case that this happens, we will just execute the mop-up code with an all-zeroes mask!
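The claim about the leftover count can be checked with a throwaway scalar experiment (plain C, no intrinsics; the helper name is mine): for every n, the leftover count n-i lands in the range 0...8, and is 0 only when n itself is 0.

```c
#include <assert.h>

/* Mimic the vector loop's index arithmetic: step i by 8 while i < n-8,
   then report how many leftovers n-i remain for the masked mop-up. */
static int leftovers(int n) {
  int i = 0;
  while (i < n - 8) i += 8;
  return n - i;
}
```

Note in particular that leftovers(8) and leftovers(16) are both 8, never 0: a multiple of eight leaves a full vector for the mop-up, which is exactly why the mask table needs 9 entries.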

Left to do is build a little array with mask values; due to the observation above, this will have 9 entries, not 8!

static __align(32) const int  mask8i[9][8]={
  { 0, 0, 0, 0, 0, 0, 0, 0},
  {-1, 0, 0, 0, 0, 0, 0, 0},
  {-1,-1, 0, 0, 0, 0, 0, 0},
  {-1,-1,-1, 0, 0, 0, 0, 0},
  {-1,-1,-1,-1, 0, 0, 0, 0},
  {-1,-1,-1,-1,-1, 0, 0, 0},
  {-1,-1,-1,-1,-1,-1, 0, 0},
  {-1,-1,-1,-1,-1,-1,-1, 0},
  {-1,-1,-1,-1,-1,-1,-1,-1}
  };

In the above, the __align() macro is fleshed out differently depending on your compiler; however it should ensure that the array is aligned to a multiple of 32 bytes (the size of an AVX vector).
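For example, one possible way to flesh out __align() for the common compilers; this is a sketch, adjust to your toolchain:

```c
#include <assert.h>
#include <stdint.h>

/* Possible definition of the __align() macro: MSVC vs. GCC/Clang. */
#if defined(_MSC_VER)
#define __align(n) __declspec(align(n))
#else
#define __align(n) __attribute__((aligned(n)))
#endif

/* A small demonstration array, aligned to 32 bytes like mask8i. */
static __align(32) const int mask_demo[8] = {-1,-1,-1,-1, 0, 0, 0, 0};
```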

Bottom line: the new AVX masked loads and stores solve a real problem that was always a bit awkward to handle before; they allow the mop-up code to be vectorized the same as the main loop, which is important if you want to ensure the last couple of numbers get “the same treatment” as the rest of them.


Compressed Sparse Row Matrices with AVX2

Dec 3rd, 2013

A Compressed Sparse Row (CSR) matrix is a (typically very large) matrix containing large numbers of irregularly-located zeros. Thus they are a bit nastier to deal with than band-matrices.

A CSR matrix M of Nr x Nc is usually represented as three arrays, (E,C,S). The first array E[] contains the non-zero elements of the matrix M. The second array C[] is the same length as E[] and contains the column number of each element. The last array S[] is of length Nr+1; S[i] contains the position in E[] of the first element of row i, and the last entry in S[] is simply equal to the total number of elements in E[].

Now take for example this little matrix:

(0 1 0 0)
(0 2 3 0)
(4 0 5 6)
(0 0 0 7)

This can be represented in the CSR matrix form as the following three arrays:

E: (1 2 3 4 5 6 7)
C: (1 1 2 0 2 3 3)
S: (0 1 3 6 7)
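A useful sanity check is to expand the (E,C,S) triple back into the dense matrix; a throwaway snippet, with the arrays exactly as above:

```c
#include <assert.h>

static const float E[7] = {1,2,3,4,5,6,7};
static const int   C[7] = {1,1,2,0,2,3,3};
static const int   S[5] = {0,1,3,6,7};

/* Expand the CSR triple back into a dense 4x4 matrix. */
static void csr_to_dense(float out[4][4]) {
  for (int r = 0; r < 4; r++)
    for (int c = 0; c < 4; c++)
      out[r][c] = 0.0f;
  for (int r = 0; r < 4; r++)
    for (int k = S[r]; k < S[r+1]; k++)
      out[r][C[k]] = E[k];
}
```

Running this reproduces exactly the little matrix we started from.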

Clearly, for small matrices the CSR representation doesn’t gain much; for very large sparse matrices, however, the CSR representation can be the difference between fitting in memory and being too large to handle.

The CSR representation lends itself to a fairly efficient loop for a matrix-vector multiply:

  for(r=0; r<Nr; r++){
    acc=0.0f;
    for(start=S[r],end=S[r+1]; start<end; ++start){
      acc+=E[start]*V[C[start]];
      }
    output[r]=acc;
    }

As you see, the array C[i] is used to determine which element of the vector to pluck out of V for multiplication with element E[i].
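Fleshed out into a self-contained routine (still scalar, with names as in the text), the multiply looks like this. With the example matrix above and V = (1,1,1,1), the output is just the row sums: (1, 5, 15, 7).

```c
#include <assert.h>

/* Scalar CSR matrix-vector multiply: output = M * V, with M as (E,C,S). */
static void csr_matvec(const float* E, const int* C, const int* S,
                       int Nr, const float* V, float* output) {
  for (int r = 0; r < Nr; r++) {
    float acc = 0.0f;
    for (int k = S[r]; k < S[r+1]; k++)
      acc += E[k] * V[C[k]];       /* pluck V element via column index */
    output[r] = acc;
  }
}
```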

With the introduction of AVX2, we now have VGATHER vector instructions. VGATHER can be used to fetch either 8 single-precision reals from 8 different addresses, or 4 double-precision reals from 4 different addresses. At the same time, a mask can be provided to omit loading designated elements of the vector. The indexes can be either 32-bit or 64-bit, and can be scaled by a small constant. The elements that are masked out are filled with a “fallback” value; in our case we’re adding, so we just use 0.0 for the masked-out values.
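In scalar terms, a masked gather behaves like the sketch below; the helper is my own model for illustration, not an Intel API. Active lanes fetch base[idx[i]]; masked-off lanes receive the fallback value and never touch memory.

```c
#include <assert.h>

/* Scalar model of a masked gather: lanes with mask[i]<0 load base[idx[i]],
   the rest receive the fallback value (0.0f when summing). */
static void masked_gather_model(const float* base, const int* idx,
                                const int* mask, float fallback,
                                float* out, int lanes) {
  for (int i = 0; i < lanes; i++)
    out[i] = (mask[i] < 0) ? base[idx[i]] : fallback;
}
```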

The VGATHER instruction is available directly from the C compiler through the use of intrinsics. This is the preferable way to use it, as it embeds these special instructions into the compiler’s data-flow graph, and allows for possible loop unrolling and whatnot.

The CSR Matrix and vector multiply can take full advantage of the new GATHER instruction, as we shall see below.

The approach is to iterate over the row in chunks of 8 floats at a time (if you’re more into double precision then its chunks of 4, but the logic will be very similar). At the end of the inner loop we then use a number of horizontal adds to compute the final value to be written back to the output vector.

  for(r=0; r<Nr; r++){
    accu=_mm256_set1_ps(0.0f);
 
    // Start the vectorized inner loop
    for(start=S[r],end=S[r+1]; start<end-8; start+=8){
 
      // Grab 8 floats from E
      aaaa=_mm256_loadu_ps(&E[start]);
 
      // Grab 8 column indexes from C
      iiii=_mm256_loadu_si256((const __m256i*)&C[start]);
 
      // Gather 8 floats V[C[start]], V[C[start+1]], ... V[C[start+7]]
      bbbb=_mm256_i32gather_ps(V,iiii,4);
 
      // Multiply them and add to result accu
      accu=_mm256_add_ps(_mm256_mul_ps(aaaa,bbbb),accu);
      }
 
    // We have up to 8 left-overs (end-start)<=8; load a mask based on the
    // number of left-overs and then proceed, this time using masked-loads instead
    mmmm=_mm256_loadu_si256((const __m256i*)mask8i[end-start]);
 
    // Load with mask; locations masked off are not touched
    aaaa=_mm256_maskload_ps(&E[start],mmmm);
 
    // Masked load of indexes; locations masked off are not touched
    iiii=_mm256_maskload_epi32(&C[start],mmmm);
 
    // Most complex form of VGATHER: replace masked values with 0, with mask mmmm
    bbbb=_mm256_mask_i32gather_ps(_mm256_set1_ps(0.0f),V,iiii,_mm256_castsi256_ps(mmmm),4);
 
    // Multiply and add as before
    accu=_mm256_add_ps(_mm256_mul_ps(aaaa,bbbb),accu);
 
    // Now we split the upper and lower halves of the AVX register, and do a number of horizontal adds
    rr=_mm_hadd_ps(_mm256_castps256_ps128(accu),_mm256_extractf128_ps(accu,1));
    rr=_mm_hadd_ps(rr,rr);
    rr=_mm_hadd_ps(rr,rr);
 
    // And write the result
    output[r]=_mm_cvtss_f32(rr);
    }

This code was quite a bit faster than the plain C code; not bad, not bad at all!

I’m curious if someone has further improvements and suggestions for this routine!


FUN With AVX

Jul 24th, 2013

Over the last few years, Intel and AMD have added 256-bit vector-support to their processors. The support for these wider vectors is commonly known as AVX (Advanced Vector eXtension).

Since wider vectors also introduce more processor state, in order to use these features it’s not enough to have a CPU capable of executing AVX instructions; your operating system and compiler need to be aware of it as well.

For maximum portability, I recommend using the Intel intrinsics. These are supported by GCC, LLVM, as well as late-model Microsoft and Intel compilers. The advantages of using intrinsics are:

  1. It’s easier for the developer to work with, since you can embed these in your regular C/C++ code.
  2. The “smarts” of the compiler regarding register-assignments, common subexpression elimination, and other data-flow analysis goodies are at your disposal.
  3. Target architecture setting of the compiler will automatically use the new VEX instruction encoding, even for code originally written with SSE in mind.

The matrix classes in FOX were originally vectorized for SSE (actually, SSE2/SSE3). Compiling with -mavx will automatically kick in the new VEX encoding for these same SSE intrinsics.  This is nice because AVX supports three-operand instructions (of the form A = B op C) rather than the old two-operand instructions (A = A op B).  This means you can typically make do with fewer registers, and quite possibly eliminate useless mov instructions. Your code will be correspondingly smaller and faster, with no work at all!

However, this of course does not fully exploit the new goodies AVX brings to the table.  The most obvious benefit is the wider vectors, which of course means you can work with twice as much data as before.

For example, given 4×4 double precision matrices a[4][4] and b[4][4], we can now compute the sum r = a + b like:

  __m256d aa,bb,cc;
  aa=_mm256_loadu_pd(a[0]);
  bb=_mm256_loadu_pd(b[0]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[0],cc);
  aa=_mm256_loadu_pd(a[1]);
  bb=_mm256_loadu_pd(b[1]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[1],cc);
  aa=_mm256_loadu_pd(a[2]);
  bb=_mm256_loadu_pd(b[2]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[2],cc);
  aa=_mm256_loadu_pd(a[3]);
  bb=_mm256_loadu_pd(b[3]);
  cc=_mm256_add_pd(aa,bb);
  _mm256_storeu_pd(r[3],cc);

This little code fragment performs 16 double precision adds in just 4 vector add instructions!
Note that unlike the old SSE code, you now declare vector variables as __m256 (float), __m256i (integer), or __m256d (double).
The penalty for accessing unaligned memory addresses is much less for AVX, and thus we can use unaligned loads and stores, at a very modest (and usually not measurable) speed penalty. If you really want to go all out, however, remember to align things to 32 bytes now, not 16 bytes like you did for SSE!

For FOX’s matrix classes, compatibility with existing end-user code requires that variables can not be relied upon to be aligned, and thus unaligned accesses are used throughout. This obviates the need for end-user code to be updated for alignment restrictions.

Many of the usual suspects from SSE have extended equivalents in the AVX world: _mm_add_ps() becomes _mm256_add_ps(), _mm_addsub_ps() becomes _mm256_addsub_ps(), and so on.

Detection of AVX on your CPU.

Detection of AVX can be done using the CPUID instruction. However, unlike SSE3 and SSE4x, the introduction of AVX not only added new instructions to the processor.  It also added new state, due to the wider vector registers. Consequently, just knowing that your CPU can do AVX isn’t enough.

You also need the operating system to support the extension, because the extra state in the processor must be saved and restored when the Operating System preempts your process. Consequently, executing AVX instructions on an Operating System which does not support it will likely result in an “Illegal Instruction” exception.  To put it bluntly, your program will core-dump!

Fortunately, Operating System support for AVX is now also available through CPUID. There are three steps involved:

  1. Check AVX support, using CPUID function code #1.  The ECX and EDX registers are used to return a number of feature bits, various extensions to the core x86 instruction set.  The one we’re looking for in this case is ECX bit #28. If on, we’ve got AVX in the hardware.
  2. Next, Intel recommends checking ECX bit #27. This is the OSXSAVE feature bit: it indicates the Operating System has enabled the XSAVE facility for saving extended processor state. If it is not set, AVX will likely not be available.
  3. Finally, a new register is available in the CPU indicating the Operating System has enabled state-saving the full AVX state. Just like the processor tick counter, this register can be obtained using a new magic instruction: XGETBV. The XGETBV populates the EAX:EDX register pair with feature flags indicating processor state the Operating System is aware of. At this time, x86 processors support three processor-state subsets: x87 FPU state, SSE state, and AVX state.  This information is represented by three bits in the EAX register.  For AVX, bit #2 indicates the Operating System indeed saves AVX state and has enabled AVX instructions to be available.
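The three steps above can be sketched with GCC/Clang inline assembly as follows; this is a sketch for x86 targets only, and on other compilers you would use their CPUID/XGETBV intrinsics instead:

```c
#include <assert.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
static void cpuid(unsigned leaf, unsigned regs[4]) {
  __asm__ __volatile__("cpuid"
                       : "=a"(regs[0]), "=b"(regs[1]), "=c"(regs[2]), "=d"(regs[3])
                       : "a"(leaf), "c"(0));
}

static unsigned long long xgetbv0(void) {
  unsigned eax, edx;
  __asm__ __volatile__("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
  return ((unsigned long long)edx << 32) | eax;
}

/* Returns 1 if both the CPU and the OS support AVX, 0 otherwise. */
static int avx_available(void) {
  unsigned regs[4];
  cpuid(1, regs);
  if (!(regs[2] & (1u << 28))) return 0;   /* step 1: AVX feature bit  */
  if (!(regs[2] & (1u << 27))) return 0;   /* step 2: OSXSAVE bit      */
  return (xgetbv0() & 6) == 6;             /* step 3: XMM+YMM state on */
}
#else
static int avx_available(void) { return 0; }
#endif
```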

All this sounds pretty complicated, unless you’re an assembly-language programmer.  So, to make life a bit easier, the FOX CPUID APIs have been updated to do some of this hard work for you.

To perform simple feature tests, use the new fxCPUFeatures() API. It returns bit-flags for most instruction sets added on top of plain x86. In the case of AVX, it simply clears the AVX, AVX2, FMA, XOP, and FMA4 flags if the operating system does not support the extended state.

More on AVX in subsequent posts.

 


Making a virtue out of sin

Mar 20th, 2013

Signal processing often requires us to perform a sin(). Indeed, we often have to perform a cos(), too!  In fact, we often need to do a lot of them, and do them quickly too! In case this wasn’t clear, we’re talking about those wavy things:

[figure: plot of sin(x)]

The C Math Library of course contains perfectly fine sin() and cos() functions.  However, accepting arbitrary arguments and performing lots of error checking makes them kinda slow.

The quest was on to come up with something better. One could replace sin() and cos() with a big lookup table, and interpolate between entries.  This works, if you don’t mind the vast quantities of memory thus wasted, or the algorithmic complexity of reducing the argument to the right quadrants, and so on.

Fortunately, better ways are possible with small polynomials, particularly if we can limit the argument a little bit. Since sin() and cos() are repeating endlessly, restricting the arguments to the repetition interval seems convenient.

How accurate should it be? Well, that depends. For double precision, the least significant bit falls at about 1E-16, so an algorithm that gets down to that level should be good enough even for most stuff.

For the sin() function, we get this:

double mysin(double x){
  const double A = -7.28638965935448382375e-18;
  const double B =  2.79164354009975374566e-15;
  const double C = -7.64479307785677023759e-13;
  const double D =  1.60588695928966278105e-10;
  const double E = -2.50521003012188316353e-08;
  const double F =  2.75573189892671884365e-06;
  const double G = -1.98412698371840334929e-04;
  const double H =  8.33333333329438515047e-03;
  const double I = -1.66666666666649732329e-01;
  const double J =  9.99999999999997848557e-01;
  register double x2 = x*x;
  return (((((((((A*x2+B)*x2+C)*x2+D)*x2+E)*x2+F)*x2+G)*x2+H)*x2+I)*x2+J)*x;
  }

For the cos() function, we get:

double mycos(double x){
  const double c1  =  3.68396216222400477886e-19;
  const double c2  = -1.55289318377801496607e-16;
  const double c3  =  4.77840439714556611532e-14;
  const double c4  = -1.14706678499029860238e-11;
  const double c5  =  2.08767534780769871595e-09;
  const double c6  = -2.75573191273279748439e-07;
  const double c7  =  2.48015873000796780048e-05;
  const double c8  = -1.38888888888779804960e-03;
  const double c9  =  4.16666666666665603386e-02;
  const double c10 = -5.00000000000000154115e-01;
  const double c11 =  1.00000000000000001607e+00;
  register double x2=x*x;
  return (((((((((c1*x2+c2)*x2+c3)*x2+c4)*x2+c5)*x2+c6)*x2+c7)*x2+c8)*x2+c9)*x2+c10)*x2+c11;
  }

Of course, often we want both sin() and cos(), and it is possible to get both, for the price of one, thanks to SSE2:

void mysincos(double x,double* psin,double* pcos){
  const __m128d c1 =_mm_set_pd( 3.68396216222400477886e-19,-7.28638965935448382375e-18);
  const __m128d c2 =_mm_set_pd(-1.55289318377801496607e-16, 2.79164354009975374566e-15);
  const __m128d c3 =_mm_set_pd( 4.77840439714556611532e-14,-7.64479307785677023759e-13);
  const __m128d c4 =_mm_set_pd(-1.14706678499029860238e-11, 1.60588695928966278105e-10);
  const __m128d c5 =_mm_set_pd( 2.08767534780769871595e-09,-2.50521003012188316353e-08);
  const __m128d c6 =_mm_set_pd(-2.75573191273279748439e-07, 2.75573189892671884365e-06);
  const __m128d c7 =_mm_set_pd( 2.48015873000796780048e-05,-1.98412698371840334929e-04);
  const __m128d c8 =_mm_set_pd(-1.38888888888779804960e-03, 8.33333333329438515047e-03);
  const __m128d c9 =_mm_set_pd( 4.16666666666665603386e-02,-1.66666666666649732329e-01);
  const __m128d c10=_mm_set_pd(-5.00000000000000154115e-01, 9.99999999999997848557e-01);
  const __m128d c11=_mm_set_pd( 1.00000000000000001607e+00, 0.00000000000000000000e+00);
  register __m128d x1x1=_mm_set1_pd(x);
  register __m128d x2x2=_mm_mul_pd(x1x1,x1x1);
  register __m128d x1x2=_mm_unpacklo_pd(x1x1,x2x2);
  register __m128d rr=c1;
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c2);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c3);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c4);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c5);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c6);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c7);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c8);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c9);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c10);
  rr=_mm_add_pd(_mm_mul_pd(rr,x1x2),c11);
  _mm_storeh_pd(pcos,rr);
  _mm_storel_pd(psin,rr);
  }

This uses Intel intrinsics, which should work on a variety of C/C++ compilers. Of course, it goes without saying that if your code is to be remotely portable, you will need to surround it with the appropriate preprocessor incantations so as to fall back on the vanilla implementation should the processor not be an x86 CPU but something else.

So, just how close are these functions to the C Math Library version? Well, GNUPlot to the rescue once more.  Plotting the difference between the math library version and the polynomial version on the -pi to pi interval for the cos() and sin(), respectively:

[figures: cosdiff and sindiff, the difference plots described above]

Yes, they look blocky! This is not a bug! They’re close enough that the error is in the last few significant bits of the result, so subtracting these numbers will be very, very close to zero.

Of course, these sin() and cos() replacements only work if the argument is in the -pi to pi interval. If you can ensure your argument is within this range, you’re done.  If it isn’t, however, the following little inline will help:

inline double wrap(double x){ 
  return x-nearbyint(x*0.159154943091895335768883763373)*6.28318530717958647692528676656; 
  }

A call to nearbyint() compiles to a single instruction on the x86 CPU, so it’s pretty fast.

The magic constants come from a nice paper on the topic, “Fast Polynomial Approximations to Sine and Cosine,” by Charles K Garrett, Feb 2012.


Fast Power of Two Test

Mar 6th, 2013

A quick and nifty check to see if a number is a power of two is to note that subtracting 1 from a power of two yields a number that is all ones below the original set bit. For example:

1000 
  -1
0111

Note that the resulting number has no bits in common with the original number; this property holds only for powers of two. For any other number, some bits are left standing.
Thus, we can test for power-of-two-ness by:

inline bool ispowof2(int x){ 
  return (x&(x-1))==0;     // note: also true for x==0; test x!=0 separately if needed
}

Which is probably the fastest possible way to do this.
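For instance, a few quick checks (note the zero caveat: 0 also passes the test, so guard for it if your inputs can be zero):

```c
#include <assert.h>
#include <stdbool.h>

/* Power-of-two test from the post: clears the lowest set bit via x&(x-1). */
static bool ispowof2(int x) {
  return (x & (x - 1)) == 0;
}
```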


The Case for Negative Semaphore Counts

Feb 26th, 2013

The new FXParallel feature being worked on in FOX has an atomic variable for tracking the completed tasks of a task-group.  This works roughly as follows:

if(atomicAdd(&counter,-1)==1){
  completion.post();
  }

In other words, when the specified number of tasks has completed, the counter reaches zero (atomicAdd() returns the previous value of the variable) and then we ping the semaphore.

Of course, behind the scenes, a semaphore maintains an atomic counter already! Maintaining our own atomic variable therefore seems redundant.  Wouldn’t it be great if we could just initialize a semaphore to a negative value, post() it until it reaches 1, and only then release the waiting thread from its call to wait()?

Sadly, you can’t initialize semaphores to a negative value.  But it would be great if you could, and (at least on Linux) it could be done with minimal effort. In fact, simply changing the semaphore variable to signed int and allowing it to be initialized to a negative value is really all it takes.
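A minimal single-threaded model of the countdown idea, using C11 atomics; the type and function names are mine, and completion.post() is stood in for by a plain flag:

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
  atomic_int counter;   /* number of tasks still outstanding */
  int        posted;    /* stands in for completion.post()   */
} TaskGroup;

/* Called by each task as it finishes; the last task "posts" the semaphore.
   atomic_fetch_add returns the previous value, so the task that sees 1
   is the one that dropped the counter to zero. */
static void task_done(TaskGroup* g) {
  if (atomic_fetch_add(&g->counter, -1) == 1) {
    g->posted = 1;
  }
}
```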


Dell XPS15 Surgery

Feb 22nd, 2013

So I got this Dell XPS15 laptop, and decided to bump up the specs a bit.  A quick visit to NewEgg and I received a new 1TB laptop drive, two RAM sticks, and a nice 120GB SSD drive; the latter was especially nice since I needed some serious space to host both Fedora Linux and Windows 7 on this.

Now, this XPS15 is a great little machine, but service friendly it is not! Whereas my old clunker laptop had nice access panels for memory, drive, and batteries, the XPS15 needs deep surgery to get at these things.

Fortunately, a service manual exists that details the steps (and detailed they are!).  First, to get at the gizzards you will need a mini Torx #T5 screwdriver.  This is of course not something the average Joe has lying around; fortunately, there are a few places that sell these.

After the Torx screwdrivers arrived, I managed to pry the bottom off. Along the way, I discovered a secret button on the bottom, hidden under the rubber sheet, that apparently exists to diagnose battery charge levels. You’ll find this button under the front-left corner.

Once the bottom was off, another surprise awaited: in order to get to the second DIMM slot, you will need to remove the motherboard! And the motherboard won’t come off without removing the fan, the processor heat-pipe, the battery, the hard drive, the SSD, and detaching the miniscule connectors that hold all this together.

If this were a car, it would be French!

At any rate, once I got this far, of course I was committed! So off the system fan came, and the heat-pipe came, and the battery, and the hard drive, and the SSD, and all the miniscule connectors.

Finally, the hidden DIMM slot revealed itself, and the RAM was put in.  Now the challenge was to put humpty-dumpty together again. Needless to say, I was especially concerned about reinstalling the processor heat-sink. Some solvent was brought into action to clean off the remaining heat-conducting compound, and a fresh dollop of silver heat paste was applied.

The remaining dominoes quickly fell into place, except for one mysteriously missing screw near the middle of the case.

Typical!  Usually, I have parts left over ;-)

After a few fretful moments, it dawned on me that one of the case screws for the bottom doubled as fastener for the motherboard itself! After a few more sweaty moments the bottom was back on, and the moment of truth was upon me.

Press the power button…. And it fired right up!

Moral of the story: modern hardware is not really designed for upgradability. So, unless you have nerves of steel, just buy a loaded version right off the bat and don’t bother upgrading it!


WordPress Pressing Problems

Feb 22nd, 2013

Apparently, you can not load WordPress into your document root, set stuff up, and then change the domain name of your machine.

For some unfathomable reason, the original URL (containing your domain name!) gets hardwired into the database in lots of different places.

Using mysql admin tool, I’ve tried to track some of them down, and was partially successful.  Then I tried dumping the entire WP database to file, perform search and replace on the url, and load it back in.

You guessed it, no dice.

In sharp contrast, something like SMF forum just works like a charm.  No hardwired anything, all resources are relative to the base domain name, like it should be.

Why are domain names hardwired into the WordPress database during install? Why is there no transparent way to change this (if there are indeed convincing reasons why it’s necessary)?

Anyway, it turns out you can switch to a new theme, and then at least the URLs in the new theme point to the correct place.  Heaven knows what will befall me should my site ever change domain names…


Conversion of Unsigned Int to Float

Aug 17th, 2011

Conversion of unsigned int to float appears to be difficult for the x86-64 processor. Signed int to float, however, is directly supported in the hardware.

The GCC compiler is clever enough to use the hardware instruction CVTSI2SS. However, in the case of unsigned int to float, it needs to treat numbers which would be negative in two’s complement notation differently, so the generated code contains a branch!

Doing this with random numbers means the branch can go either way, with 50% probability.  No amount of hardware branch prediction can help with this!

Rewriting the code to use signed integers instead of unsigned ones (basically, dropping the most significant bit) sped up a critical piece of code by 10%.

A very noticeable improvement!
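The workaround relies on the fact that for values below 2^31 the signed and unsigned conversions agree, so once the most significant bit is known to be clear, the fast signed path is safe. A small illustration (u2f_signed is a name of my own for this sketch):

```c
#include <assert.h>

/* Convert via the fast signed path (CVTSI2SS); valid only when x < 2^31,
   i.e. when the most significant bit is clear. */
static float u2f_signed(unsigned x) {
  return (float)(int)x;
}
```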


The Forum is Up!

Aug 17th, 2011

After scouring the net to find some decent-looking PHP-based Forum software, I eventually came across Simple Machines Forum (SMF).

So I messed with it all last night, and this night, and got it more or less functional.  The first few users have registered as of late this evening!

There are still unresolved issues: how to push articles from the forum into the mailing list, and perhaps some way to pull activity from the mailing list into the forum [less the spam, of course!].

But there are still a few other pressing problems to address before we get around to looking at this, such as getting email notifications to work [it would be nice to get article notifications straight to the smart phone, for instance!].

Plus, we’re still scouting for some nicer themes.
