Saturday, 28 February 2009

Shear Warp Optimising

I tried out an optimisation I suggested in my last post, storing all the threads image buffers interlaced in memory rather than in seperate blocks. So the memory for two threads would look like this:

[thread 1 (0,0)] [thread 2 (0,0)] [thread 1 (1,0)] [thread 2 (1,0)] [thread 1 (2,0)] etc

In an test image where most voxels were empty, rendering time on 8 threads was reduced from 0.141 to 0.123. On a second test with most voxels being drawn, render time increased from 0.234 to 0.281. I believe this is due to the threads writing to the same cache line, as discussed here under "false sharing": http://www.ddj.com/embedded/212400410

On an image with more empty voxels false sharing has less effect, allowing the advantages of sharing cache lines for reading purposes to speed the rendering up.

I have also fixed multithreaded run length encoding, which on 8 threads reduced test skull render from 0.125 to 0.109. I will do more detailed test results in the future.

No comments: