SSE instructions

For the past few weeks I have been experimenting with Streaming SIMD Extension instructions. It takes a while to write at this lower level, but I've been able to get my code to run 3 times faster!

Most processors since about 1999 contain eight 128 bit registers that can be used to do multiple computations in one instruction. 128 bits can hold 4 32-bit floating point numbers, and so there could be a theoretical 4x improvement in speed, as long as the same operation is done to each. For example, if you have two arrays a and b , and want to sum them, instead of:
float a[4]; float b[4]; float c[4];
c[0] = a[0]+b[0];
c[1] = a[1]+b[1];
c[2] = a[3]+b[2];
c[3] = a[3]+b[3];

, this can be expressed as:
__m128 a, b, c;
// ...
c = _mm_add_ps (a, b);
The _mm_add_ps "instrinsic" becomes 1 instruction on the cpu (addps) instead of 4. Similar code can be written with 32-bit integers, and a few other data types. (Technology like CUDA is faster, though). In case you are interested, here is an example of optimizing with SSE instructions.

The following code will draw a chaotic map (c1 and c2 are parameters). For this particular map, we know that initial points or "seeds" close to (0.0,0.0) are good starting points. Let's say, for fun, that we want this to run as fast as possible.
const int nseeds=4;
float xseeds[nseeds] = {0.0f, 0.0f, 0.0f, 0.0f};
float yseeds[nseeds] = {0.000001f, 0.000002f,-0.0000011f,-0.0000011f};
float x, y, x_, y_;
for (int seed=0; seed<nseeds; seed++)
  x=xseeds[seed]; y=yseeds[seed];
  // ... settling code ...
  for (int i=0; i<PixelsToDrawPerSeed; i++)
    x_ = c1*x - y*y; 
    y_ = c2*y + x*y;
    if (ISTOOBIG(x)||ISTOOBIG(y)) break; 
    //if escaping to infinity, there won't be more points drawn, so quit early.
    //quitting early is important, far slower without this.

    //scale to pixel coordinates, and plot point if it is within bounds. X1 is greatest x, and so on.
    int px = (int)(SIZE * ((x - X0) / (X1 - X0)));
    int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)));
    if (py >= 0 && py < SIZE && px>=0 && px<SIZE)
      arr[px+py*SIZE] = 1; //plot point at px, py.
Runtime for 1,310,720 different parameters is Test 1: 14,747ms, Test 2: 25,026ms. First, I added threading to to take advantage of the two cpu cores. I made sure to balance the work so that each thread is kept busy. This improved time to Test 1:7,452ms Test 2:12,617ms.

At first glance, it's not obvious how this could be optimized. I was a bit concerned about the (int) casts becoming a slow _ftol call, but looking at the generated assembly showed that this wasn't the case. (Read here,here for why casts can be slow, although with modern compilers and instruction sets this doesn't seem to be an issue).

In visual studio, I noticed a setting for Streaming SIMD Extensions 2 (/arch:SSE2). With this enabled, the compiler was able to vectorize some of the code (I verified in the asm that it was using the xmm registers), which resulted in an ok speed increase. Test 1:5,981ms Test 2:10,987ms

Can I do better?

Here is my first attempt at a vectorized version, moving all 4 points at once.
__m128 x128 = _mm_setr_ps( 0.0f, 0.0f, 0.0f, 0.0f);
__m128 y128 = _mm_setr_ps( 0.000001f, 0.000002f,-0.0000011f,-0.0000011f);
__m128 mConst_a = _mm_set1_ps(c1); //sets all 4 fields to c1
__m128 mConst_b = _mm_set1_ps(c2);
__m128 nextx128, tempx, tempy;

__m128i const_size = _mm_set1_epi32(SIZE), const_zero = _mm_set1_epi32(0);
// ... settling code ...
for (int i=0; i<PixelsToDrawPerSeed; i++)
  nextx128 = _mm_sub_ps(_mm_mul_ps(mConst_a, x128), _mm_mul_ps(y128,y128));
  y128 = _mm_mul_ps(y128, _mm_add_ps(x128, mConst_b)); 
  x128 = nextx128;

  if (ISTOOBIG(x128.m128_f32[0]) ||ISTOOBIG(x128.m128_f32[1]) ||
    ISTOOBIG(x128.m128_f32[2]) ||ISTOOBIG(x128.m128_f32[3]) ||
    ISTOOBIG(y128.m128_f32[0]) ||ISTOOBIG(y128.m128_f32[1]) ||
    ISTOOBIG(y128.m128_f32[2]) ||ISTOOBIG(y128.m128_f32[3] ))
  } //assume that if one is too big, all four will be too big, nearly always valid for this map

  //scale to pixel coordinates. assume that pixels are square, that X1-X0==Y1-Y0.
  //the following computes px = (int)(SIZE * ((x - X0) / (X1 - X0)));
  __m128 mConst_scale = _mm_set1_ps(SIZE / (X1 - X0));
  tempx = _mm_sub_ps(x128, _mm_set1_ps(X0));
  tempx = _mm_mul_ps(tempx, mConst_scale); 
  tempy = _mm_sub_ps(y128, _mm_set1_ps(Y0));
  tempy = _mm_mul_ps(tempy, mConst_scale); 
  __m128i xPt = _mm_cvttps_epi32(tempx); //cast to int32
  __m128i yPt = _mm_cvttps_epi32(tempy); 
  __m128i xInBounds =  _mm_and_si128(_mm_cmpgt_epi32(xPt, const_zero), _mm_cmplt_epi32(xPt, const_size));
  __m128i yInBounds =  _mm_and_si128(_mm_cmpgt_epi32(yPt, const_zero), _mm_cmplt_epi32(yPt, const_size));
  __m128i inBounds = _mm_and_si128(xInBounds, yInBounds);
  if (inBounds.m128i_i32[3] != 0)
    arr[xPt.m128i_i32[3] +yPt.m128i_i32[3]*SIZE] = 1; //plot point
  if (inBounds.m128i_i32[2] != 0)
    arr[xPt.m128i_i32[2] +yPt.m128i_i32[2]*SIZE] = 1; //plot point
  if (inBounds.m128i_i32[1] != 0)
    arr[xPt.m128i_i32[1] +yPt.m128i_i32[1]*SIZE] = 1; //plot point
  if (inBounds.m128i_i32[0] != 0)
    arr[xPt.m128i_i32[0] +yPt.m128i_i32[0]*SIZE] = 1; //plot point
It is convinient that conditions can be expressed in SSE with the and instruction and comparison instruction. If a comparison is true, the field is filled with 0xffffffff, and if false, 0x00000000. This version was faster. Test 1: 4,650ms Test 2: 7,293ms

It was clear from the generated assembly that the ISTOOBIG(x128.m128_f32[0]) ||ISTOOBIG(x128.m128_f32[1] was inefficient, as it had to move data out of the xmm registers before comparing. I then learned about _mm_movemask_ps, which does exactly what I want. It condenses the 128 bits into 4 bits of an integer in one step, much faster than moving out data four times, and so I could write
//instead of ISTOOBIG(x128.m128_f32[0])||ISTOOBIG(x128.m128_f32[1]) and so on
__m128 estimateMag = _mm_add_ps(x128,y128); //estimate magnitude as (x+y)(x+y)
estimateMag = _mm_mul_ps(estimateMag,estimateMag);
__m128 isTooBig = _mm_cmpgt_ps(estimateMag, constThreshold);
if (_mm_movemask_ps(isTooBig) != 0) //if any of the four fields are true
  return 0;

This increased speed dramatically! The score is now Test 1: 1,956ms, Test 2: 3,323ms! I later saw on this site assembly for a quick absolute value hack. I used the idea and rewrote as
__m128 estimateMag = _mm_add_ps(x128,y128); //estimate magnitude as abs(x+y)
estimateMag = _mm_andnot_ps(_mm_set1_ps(-0.0f), estimateMag); //credit: Concatenative language wiki
__m128 isTooBig = _mm_cmpgt_ps(estimateMag, constToobig);
if (_mm_movemask_ps(isTooBig) != 0) //if any of the four fields are true
  return 0;

A further optimization was to vectorize finding the array pixel location. Instead of "if (inBounds.m128i_i32[3] != 0)...", I could write:
//Make size a power of 2 so multiplication is a shift left. Example: SIZE=128,LOG2SIZE=7
__m128i pixel = _mm_add_epi32(xPt, _mm_slli_epi32(yPt, LOG2SIZE)); //pixel=xPt+yPt*size 
pixel = _mm_and_si128(inBounds, pixel); //draw out-of-bounds pixels at 0,0
arr[pixel.m128i_i32[3]] = 1; //plot point
arr[pixel.m128i_i32[2]] = 1; //plot point
arr[pixel.m128i_i32[1]] = 1; //plot point
arr[pixel.m128i_i32[0]] = 1; //plot point

Note that I've eliminated some conditionals. The trick is to mask out the point if inBounds is false with the line "pixel = _mm_and_si128(inBounds, pixel);". False is represented by 0x00000000, and so if inBounds is false, pixel is set to 0. All out-of-bounds pixels are drawn onto (0,0), but this pixel can be cleared later. Speed is improved to Test 1: 1,726ms, Test 2: 2,991ms.

If it were likely that a pixel would be out-of-bounds, I could change the order and check inbounds much earlier, before casting to int, and do things like checking if x is in bounds before calculating y. Because this case not occur often, it is not worth attention.

I notice some redundancy in "int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)))" and "pixel=x+y*size." In both, y is multiplied by a constant. I could eliminate a multiply with something like:
int py_times_size = (int)(SIZE*SIZE* ((y - Y0) / (Y1 - Y0)))
if py_times_size>0 and py_times_size<SIZE*SIZE and x>0...

This didn't work in sse vectors because of overflow with _mm_cvttps_epi32.

My final step is the least elegant: loop unrolling. Repeating the loop body 4 times gave the fastest results. (Faster because less overhead and maybe better scheduling, there is a trade off because of cache misses as size increases). There are additional savings because I only need one "isTooBig test" per every four iterations, and so I deleted all but one of the tests from the loop. It doesn't matter if the point is too big for a few iterations as long as we realize it within a few iterations. This final step also made a significant difference.

Our final times are Test 1: 1,346ms, Test 2: 2,285ms. This amount of optimization, by about a factor of 4, can be the difference in a real-time program. The code in this example, by the way, makes it possible to quickly draw this rather striking figure: