Most x86 processors since about 1999 contain eight 128-bit registers that can perform multiple computations in one instruction. A 128-bit register holds four 32-bit floating-point numbers, so there is a theoretical 4x speedup, as long as the same operation is applied to each. For example, if you have two arrays a and b and want to sum them, instead of:

```
float a[4]; float b[4]; float c[4];
//...
c[0] = a[0]+b[0];
c[1] = a[1]+b[1];
c[2] = a[2]+b[2];
c[3] = a[3]+b[3];
```

this can be expressed as:

```
__m128 a, b, c;
// ...
c = _mm_add_ps (a, b);
```

The _mm_add_ps "intrinsic" compiles to one CPU instruction (addps) instead of four. Similar code can be written with 32-bit integers and a few other data types. (Technology like CUDA can be faster still.) In case you are interested, here is an example of optimizing with SSE instructions.
The following code will draw a chaotic map (c1 and c2 are parameters). For this particular map, we know that initial points or "seeds" close to (0.0,0.0) are good starting points. Let's say, for fun, that we want this to run as fast as possible.

```
const int nseeds = 4;
float xseeds[nseeds] = {0.0f, 0.0f, 0.0f, 0.0f};
float yseeds[nseeds] = {0.000001f, 0.000002f, -0.0000011f, -0.0000011f};
float x, y, x_, y_;
for (int seed = 0; seed < nseeds; seed++)
{
    x = xseeds[seed]; y = yseeds[seed];
    // ... settling code ...
    for (int i = 0; i < PixelsToDrawPerSeed; i++)
    {
        x_ = c1*x - y*y;
        y_ = c2*y + x*y;
        x = x_;
        y = y_;
        if (ISTOOBIG(x) || ISTOOBIG(y)) break;
        // If escaping to infinity, no more points will be drawn, so quit early.
        // Quitting early is important; this is far slower without it.
        // Scale to pixel coordinates, and plot the point if it is within bounds.
        // X1 is the greatest x, and so on.
        int px = (int)(SIZE * ((x - X0) / (X1 - X0)));
        int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)));
        if (py >= 0 && py < SIZE && px >= 0 && px < SIZE)
            arr[px + py*SIZE] = 1; // plot point at (px, py)
    }
}
```

Runtime for 1,310,720 different parameters is Test 1: 14,747ms, Test 2: 25,026ms.
First, I added threading to take advantage of the two CPU cores, making sure to balance the work so that each thread stays busy.
This improved the times to Test 1: 7,452ms, Test 2: 12,617ms.
At first glance, it's not obvious how this could be optimized further. I was a bit concerned about the (int) casts becoming a slow _ftol call, but the generated assembly showed that this wasn't the case. (Read here and here for why casts can be slow, although with modern compilers and instruction sets this doesn't seem to be an issue.)

In Visual Studio, I noticed a setting for Streaming SIMD Extensions 2 (/arch:SSE2). With this enabled, the compiler was able to vectorize some of the code (I verified in the assembly that it was using the xmm registers), which resulted in a decent speed increase. Test 1: 5,981ms, Test 2: 10,987ms.

Can I do better?

Here is my first attempt at a vectorized version, moving all 4 points at once.

```
__m128 x128 = _mm_setr_ps(0.0f, 0.0f, 0.0f, 0.0f);
__m128 y128 = _mm_setr_ps(0.000001f, 0.000002f, -0.0000011f, -0.0000011f);
__m128 mConst_a = _mm_set1_ps(c1); // sets all 4 fields to c1
__m128 mConst_b = _mm_set1_ps(c2);
__m128 nextx128, tempx, tempy;
__m128i const_size = _mm_set1_epi32(SIZE), const_zero = _mm_set1_epi32(0);
// ... settling code ...
for (int i = 0; i < PixelsToDrawPerSeed; i++)
{
    nextx128 = _mm_sub_ps(_mm_mul_ps(mConst_a, x128), _mm_mul_ps(y128, y128));
    y128 = _mm_mul_ps(y128, _mm_add_ps(x128, mConst_b));
    x128 = nextx128;
    if (ISTOOBIG(x128.m128_f32[0]) || ISTOOBIG(x128.m128_f32[1]) ||
        ISTOOBIG(x128.m128_f32[2]) || ISTOOBIG(x128.m128_f32[3]) ||
        ISTOOBIG(y128.m128_f32[0]) || ISTOOBIG(y128.m128_f32[1]) ||
        ISTOOBIG(y128.m128_f32[2]) || ISTOOBIG(y128.m128_f32[3]))
    {
        break;
    } // assume that if one is too big, all four will be; nearly always valid for this map
    // Scale to pixel coordinates. Assume pixels are square, i.e. X1-X0 == Y1-Y0.
    // The following computes px = (int)(SIZE * ((x - X0) / (X1 - X0))).
    __m128 mConst_scale = _mm_set1_ps(SIZE / (X1 - X0));
    tempx = _mm_sub_ps(x128, _mm_set1_ps(X0));
    tempx = _mm_mul_ps(tempx, mConst_scale);
    tempy = _mm_sub_ps(y128, _mm_set1_ps(Y0));
    tempy = _mm_mul_ps(tempy, mConst_scale);
    __m128i xPt = _mm_cvttps_epi32(tempx); // truncating cast to int32
    __m128i yPt = _mm_cvttps_epi32(tempy);
    // Note: cmpgt tests > 0, so row/column 0 counts as out of bounds here,
    // a slight difference from the scalar version's >= 0.
    __m128i xInBounds = _mm_and_si128(_mm_cmpgt_epi32(xPt, const_zero), _mm_cmplt_epi32(xPt, const_size));
    __m128i yInBounds = _mm_and_si128(_mm_cmpgt_epi32(yPt, const_zero), _mm_cmplt_epi32(yPt, const_size));
    __m128i inBounds = _mm_and_si128(xInBounds, yInBounds);
    if (inBounds.m128i_i32[3] != 0)
        arr[xPt.m128i_i32[3] + yPt.m128i_i32[3]*SIZE] = 1; // plot point
    if (inBounds.m128i_i32[2] != 0)
        arr[xPt.m128i_i32[2] + yPt.m128i_i32[2]*SIZE] = 1; // plot point
    if (inBounds.m128i_i32[1] != 0)
        arr[xPt.m128i_i32[1] + yPt.m128i_i32[1]*SIZE] = 1; // plot point
    if (inBounds.m128i_i32[0] != 0)
        arr[xPt.m128i_i32[0] + yPt.m128i_i32[0]*SIZE] = 1; // plot point
}
```

It is convenient that conditions can be expressed in SSE with the and instruction and the comparison instructions.
If a comparison is true, the field is filled with 0xffffffff; if false, 0x00000000.
This version was faster: Test 1: 4,650ms, Test 2: 7,293ms.
It was clear from the generated assembly that the ISTOOBIG(x128.m128_f32[0]) || ISTOOBIG(x128.m128_f32[1]) || ... test was inefficient, as it had to move data out of the xmm registers before comparing. I then learned about _mm_movemask_ps, which does exactly what I want: it condenses the sign bit of each of the four fields into 4 bits of an integer in one step, much faster than moving data out four times, and so I could write

```
// instead of ISTOOBIG(x128.m128_f32[0]) || ISTOOBIG(x128.m128_f32[1]) and so on
__m128 estimateMag = _mm_add_ps(x128, y128); // estimate magnitude as (x+y)(x+y)
estimateMag = _mm_mul_ps(estimateMag, estimateMag);
__m128 isTooBig = _mm_cmpgt_ps(estimateMag, constThreshold); // constThreshold: escape threshold in all 4 fields
if (_mm_movemask_ps(isTooBig) != 0) // if any of the four fields is true
    return 0;
```

This increased speed dramatically! The score is now Test 1: 1,956ms, Test 2: 3,323ms! I later saw, on this site, assembly for a quick absolute-value hack. I used the idea and rewrote the test as

```
__m128 estimateMag = _mm_add_ps(x128, y128); // estimate magnitude as abs(x+y)
estimateMag = _mm_andnot_ps(_mm_set1_ps(-0.0f), estimateMag); // clears the sign bit; credit: Concatenative language wiki
__m128 isTooBig = _mm_cmpgt_ps(estimateMag, constToobig);
if (_mm_movemask_ps(isTooBig) != 0) // if any of the four fields is true
    return 0;
```

A further optimization was to vectorize finding the array pixel location. Instead of "if (inBounds.m128i_i32[3] != 0)...", I could write:

```
//Make size a power of 2 so multiplication is a shift left. Example: SIZE=128,LOG2SIZE=7
__m128i pixel = _mm_add_epi32(xPt, _mm_slli_epi32(yPt, LOG2SIZE)); //pixel=xPt+yPt*size
pixel = _mm_and_si128(inBounds, pixel); //draw out-of-bounds pixels at 0,0
arr[pixel.m128i_i32[3]] = 1; //plot point
arr[pixel.m128i_i32[2]] = 1; //plot point
arr[pixel.m128i_i32[1]] = 1; //plot point
arr[pixel.m128i_i32[0]] = 1; //plot point
```

Note that I've eliminated some conditionals. The trick is to mask out the point if inBounds is false with the line "pixel = _mm_and_si128(inBounds, pixel);". False is represented by 0x00000000, and so if inBounds is false, pixel is set to 0. All out-of-bounds pixels are drawn onto (0,0), but this pixel can be cleared later. Speed is improved to Test 1: 1,726ms, Test 2: 2,991ms.

If it were likely that a pixel would be out of bounds, I could change the order and check bounds much earlier, before casting to int, and do things like checking whether x is in bounds before calculating y. Because this case does not occur often, it is not worth the attention.

I noticed some redundancy between "int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)))" and "pixel = x + y*SIZE": in both, y is multiplied by a constant. I could eliminate a multiply with something like:

```
int py_times_size = (int)(SIZE*SIZE * ((y - Y0) / (Y1 - Y0)));
if (py_times_size >= 0 && py_times_size < SIZE*SIZE && px >= 0 /* ... */)
    pixel = px + py_times_size;
```

This didn't work with SSE vectors because of overflow: after pre-scaling by SIZE*SIZE, the values for escaping points exceed the int32 range that _mm_cvttps_epi32 can represent.

My final step is the least elegant: loop unrolling. Repeating the loop body four times gave the fastest results. (Unrolling is faster because of less loop overhead and perhaps better scheduling; there is a trade-off, since code size grows and instruction-cache misses increase.) There are additional savings because only one "isTooBig" test is needed per four iterations, so I deleted all but one of the tests from the unrolled body. It doesn't matter if a point is too big for a few extra iterations, as long as we catch it soon after. This final step also made a significant difference.

Our final times are Test 1: 1,346ms, Test 2: 2,285ms. This amount of optimization, by about a factor of 4, can be the difference in a real-time program. The code in this example, by the way, makes it possible to quickly draw this rather striking figure:
