Saturday 28 February 2009

Shear Warp Optimising

I tried out an optimisation I suggested in my last post: storing all the threads' image buffers interlaced in memory rather than in separate blocks. So the memory for two threads would look like this:

[thread 1 (0,0)] [thread 2 (0,0)] [thread 1 (1,0)] [thread 2 (1,0)] [thread 1 (2,0)] etc
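
To make the layout concrete, here is a sketch of the index arithmetic I mean (hypothetical helper names; the real buffers hold colour values rather than indices):

```cpp
#include <cstddef>

// With separate buffers, each thread owns one contiguous block.
std::size_t separateIndex(std::size_t t, std::size_t x, std::size_t y,
                          std::size_t width, std::size_t height)
{
    return t * width * height + y * width + x;
}

// With the interlaced layout, the same pixel position for every thread
// is adjacent in memory: pixel-major, thread-minor.
std::size_t interlacedIndex(std::size_t t, std::size_t x, std::size_t y,
                            std::size_t width, std::size_t numThreads)
{
    return (y * width + x) * numThreads + t;
}
```

With 8 threads, thread 0 and thread 1's copies of pixel (0,0) sit in neighbouring elements, which is what makes the final composite walk straight through memory.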

In a test image where most voxels were empty, rendering time on 8 threads was reduced from 0.141 to 0.123 seconds. In a second test with most voxels being drawn, render time increased from 0.234 to 0.281 seconds. I believe this is due to the threads writing to the same cache line, as discussed under "false sharing" here: http://www.ddj.com/embedded/212400410

On an image with more empty voxels, false sharing has less effect, allowing the advantages of sharing cache lines for reading to speed the rendering up.
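
For per-thread data that is written frequently, the usual countermeasure to false sharing is padding each thread's data out to a full cache line. This is a generic sketch of the technique (assuming a typical 64-byte line), not code from the renderer:

```cpp
#include <cstddef>

// Typical x86 cache line size (an assumption; not queried from the CPU).
const std::size_t kCacheLineBytes = 64;

// Pad each thread's hot value to a whole cache line, so one thread's
// writes never dirty a line that another thread is also using.
struct PaddedCounter {
    unsigned long value;
    char padding[kCacheLineBytes - sizeof(unsigned long)];
};
```

The trade-off is exactly the one seen above: padding avoids false sharing on writes, but gives up the read-side benefit of several threads sharing one line.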

I have also fixed multithreaded run length encoding, which on 8 threads reduced the test skull render from 0.125 to 0.109 seconds. I will do more detailed tests in the future.

Thursday 26 February 2009

Multithreaded Shear Warp - Partially Complete

Earlier this week I refactored the shear warp code to make it more manageable. Rendering for each primary axis was done with separate code; now all three situations are handled by the same code. There is no measurable performance loss from this.

I then started implementing the multithreaded version of the shear warp algorithm. Each thread renders a consecutive range of slices of the volume to its own image buffer. This image shows the voxels rendered by one thread (with 8 threads running):



These are then composited together at the end. This required altering the rendering slightly in order to also write transparency to the image buffer, otherwise compositing would not be possible.
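
Compositing the per-thread buffers is what needs that transparency information. A minimal sketch of the "over" operator I have in mind, applied front layer over back layer (assuming premultiplied-alpha colours; the struct and names are mine, not the project's):

```cpp
// Premultiplied-alpha colour (assumed representation).
struct RGBA { float r, g, b, a; };

// Standard "over" compositing: the front layer's remaining transparency
// (1 - front.a) lets the back layer show through.
RGBA over(const RGBA& front, const RGBA& back)
{
    RGBA out;
    out.r = front.r + back.r * (1.0f - front.a);
    out.g = front.g + back.g * (1.0f - front.a);
    out.b = front.b + back.b * (1.0f - front.a);
    out.a = front.a + back.a * (1.0f - front.a);
    return out;
}
```

Because each thread owns a depth-contiguous range of slices, the final image is just the thread buffers folded together in depth order with this operator.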

In a couple of quick tests, the render time of normal non-run-length encoded shear warp was reduced:
0.484 seconds -> 0.125 seconds
0.672 seconds -> 0.152 seconds

The main optimisation I wish to attempt is along similar lines to the 'alternate pixels' method of multithreading the raycasting. Instead of allocating each thread's image buffer separately, I will do it as one block with alternate pixels belonging to each thread. This should mean the threads are working in roughly the same area of memory at any one time. Also, when compositing the separate images back together, the pixels in the same position on each buffer would all be adjacent in memory, which is ideal.

The work could be split between threads differently, for example alternating slices, rows or voxels. However, doing this would mean implementing measures to ensure that voxels are written in the correct order, and potentially making threads wait for others to complete, which would involve something along the lines of a depth buffer. This would likely ruin any performance gain, so for now I will concentrate on other more important tasks.

I could also potentially use threading when compositing the separate image buffers into one.

What also needs looking at is that run-length encoded rendering breaks when in multithreaded mode:



Currently the RLE volume data object tracks which voxel is to be read next, so with multiple threads sampling from it, it breaks. This should be simple to fix once I get round to it.
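
One plausible shape for that fix (hypothetical types, not the current code): keep the run data itself read-only and shared, and give each thread its own cursor into it, so no mutable traversal state lives in the volume object.

```cpp
#include <cstddef>
#include <vector>

// A run of voxels that are all empty or all non-empty.
struct Run { bool empty; int length; };

// Per-thread traversal state over shared, read-only run data.
struct RleCursor {
    const std::vector<Run>* runs;
    std::size_t runIndex;  // which run this thread is currently in
    int offsetInRun;       // position within that run

    // Move forward by a number of voxels, stepping over run boundaries.
    void advance(int voxels)
    {
        offsetInRun += voxels;
        while (runIndex < runs->size() &&
               offsetInRun >= (*runs)[runIndex].length) {
            offsetInRun -= (*runs)[runIndex].length;
            ++runIndex;
        }
    }

    bool inEmptyRun() const { return (*runs)[runIndex].empty; }
};
```

Each thread would construct a cursor positioned at the first slice of its own range, and the shared data is never written during rendering.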

I'm happy with the direction this is going in and the possible things I can do to improve it, drawing from my experience with multithreading the ray casting. Tomorrow's meeting will be spent discussing these results. Over the next week I aim to make a lot of progress, implementing these optimisations and possibly trying out some SIMD optimisations, though I haven't put much thought into where that could be used yet.

Saturday 21 February 2009

Run Length Encoding

I have finished implementing the run length encoding optimisation. For the skull, render times have been reduced from around 2 seconds to 1.3 seconds. The run length encoding gives the best improvement on images such as the skull where there is a lot of empty space that can be skipped. When most of the volume is only partially transparent there is less improvement.
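
For reference, the idea boils down to grouping voxels into empty and non-empty runs, so the renderer can skip a whole empty run in one step instead of testing every voxel. A minimal sketch of encoding one scanline (the names and layout are mine, not the project's code):

```cpp
#include <vector>

// A run of consecutive voxels that are all empty or all non-empty.
struct Run { bool empty; int length; };

// Group a scanline of opacity values into runs: a voxel is "empty"
// when its opacity is zero.
std::vector<Run> encodeScanline(const std::vector<unsigned char>& opacity)
{
    std::vector<Run> runs;
    for (std::size_t i = 0; i < opacity.size(); ++i) {
        bool empty = (opacity[i] == 0);
        if (!runs.empty() && runs.back().empty == empty)
            ++runs.back().length;   // extend the current run
        else
            runs.push_back(Run{ empty, 1 });
    }
    return runs;
}
```

This is why the skull benefits most: long stretches of zero opacity collapse into single runs, while a mostly semi-transparent volume produces runs of length one and gains little.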

Since I now have a decent interactive framerate even on my single core PC I will probably look at tweaking the image warp to get it correct before moving onto multithreading.

Thursday 19 February 2009

Basic Shear Warp - First Pass

I have done a first pass on the warp part of the shear warp algorithm, to skew the image back to its proper perspective. It works for the most part, but is still off at times, especially at extreme angles.



For now I am happy enough with it while I work on run-length encoding and multithreading. Once the algorithm is running much faster (potentially at mostly interactive frame rates) I'll be able to see what's going on better and fix the warp then.

There are also obvious quality issues that I'll look at then too (mainly involving using a bilinear filter during the warp rather than reading the nearest pixel).
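
For reference, bilinear filtering during the warp would look something like this sketch (single-channel image, edge-clamped; the function shape is an assumption):

```cpp
#include <cmath>

// Sample an image at a fractional position by blending the four
// surrounding pixels, weighted by the fractional offsets.
float bilinearSample(const float* image, int width, int height,
                     float x, float y)
{
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    int x1 = (x0 + 1 < width) ? x0 + 1 : x0;   // clamp at the right edge
    int y1 = (y0 + 1 < height) ? y0 + 1 : y0;  // clamp at the bottom edge
    float fx = x - x0, fy = y - y0;
    float top = image[y0 * width + x0] * (1 - fx)
              + image[y0 * width + x1] * fx;
    float bot = image[y1 * width + x0] * (1 - fx)
              + image[y1 * width + x1] * fx;
    return top * (1 - fy) + bot * fy;
}
```

During the warp, each output pixel maps back to a fractional position in the intermediate image and samples it like this, instead of snapping to the nearest pixel.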

The speed of the algorithm is promising: currently a render such as the one above takes 2.3 seconds, compared to 17.3 seconds for a raycasting version (single threaded).

I have begun to lay down the groundwork for RLE. I was planning on having it finished for this week's tutor meeting, which is tomorrow, but time has run out. The new plan is to have it done by the weekend. My plan for the meeting is just the usual checkup on progress.

Thursday 12 February 2009

More Shear Warp Progress

The algorithm now works for all three axes in both positive and negative directions.







The warp is not implemented yet, as I haven't completely got my head around how it should work. I tried some transformations out in a graphics program and achieved correct-looking results using only 2D shears and scales, so it should be simple enough to work out how to apply that in code. These are the mockup versions:





I do not really have an agenda for the meeting tomorrow other than the usual discussion of results. Even in situations like this when I have no specific questions the meetings do always prove to be useful, and the weekly deadlines help to motivate me.

Wednesday 11 February 2009

Shear Warp Progress

After a bit more work the first part of the algorithm is on screen. The volume is rendered to a temporary image buffer, where every voxel maps to a single pixel. This buffer is currently just copied to the screen; eventually this step will include the warping part of the algorithm.
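
The shear part of that step can be sketched like this (hypothetical names): each slice k of the volume is drawn into the intermediate buffer offset by a per-slice shear, and only the later warp step resamples anything.

```cpp
struct Pixel { int x, y; };

// Project voxel (u, v) in slice k into the intermediate image: the
// voxel keeps its slice coordinates, shifted by k times the per-slice
// shear, rounded to the nearest integer pixel.
Pixel shearProject(int u, int v, int k, double shearX, double shearY)
{
    Pixel p;
    p.x = u + (int)(k * shearX + 0.5);
    p.y = v + (int)(k * shearY + 0.5);
    return p;
}
```

This is why the intermediate render is fast: it is integer offsets and one-to-one voxel-to-pixel writes, with no per-voxel resampling.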



----------------------

Without any major problems the shear is now implemented. I also put lighting back in, so the shear can be seen much more easily.

The lit skull, looking directly along the X axis so there is no shear.



Rotated to the left and right, the shear can be seen working.





These are my maths scribblings for calculating how big the shear should be. The result was that the distance of the shear (r) is

width * (depth / sin(angle)) / (width / cos(angle))

Width and depth are just the terms used in the working. In the actual code they could be the width, depth or height of the volume, depending on which axis is being looked down and which shear is being calculated (horizontal or vertical in screen-space).
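
As a sanity check, the formula drops straight into code; note that the width terms cancel algebraically, so it reduces to depth / tan(angle):

```cpp
#include <cmath>

// Shear distance r, exactly as derived in the scribblings:
// width * (depth / sin(angle)) / (width / cos(angle)),
// which simplifies to depth / tan(angle).
double shearDistance(double width, double depth, double angle)
{
    return width * (depth / std::sin(angle)) / (width / std::cos(angle));
}
```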



The next step is to implement the warp to display the volume properly, and to correctly handle viewing the volume down both the positive and negative X axis. Then the Y and Z axes need to be implemented, which should be simple once X is done.

Friday 6 February 2009

Shear Warp Algorithm Started

My time this week has once again been biased towards the team project, something I will try harder to avoid next week.

Nevertheless I have made some progress with implementing the shear warp algorithm: the volume data is copied into three axis-aligned formats, and the rendering has been started. Nothing is on screen as of yet though.
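
The three copies amount to transposing the volume so that each principal axis in turn becomes the outer (slice) axis, keeping the slices the renderer will traverse contiguous in memory. A sketch assuming a Z-major input layout (the function and layout are my illustration, not the project's code):

```cpp
#include <cstddef>
#include <vector>

// Re-order a Z-major volume (index = (z * ny + y) * nx + x) so that
// the requested axis becomes the outermost, slice-by-slice dimension.
std::vector<unsigned char> transposeForAxis(
    const std::vector<unsigned char>& vol,
    int nx, int ny, int nz, int axis)
{
    std::vector<unsigned char> out(vol.size());
    std::size_t i = 0;
    if (axis == 0) {           // slices perpendicular to X
        for (int x = 0; x < nx; ++x)
            for (int z = 0; z < nz; ++z)
                for (int y = 0; y < ny; ++y)
                    out[i++] = vol[(std::size_t)(z * ny + y) * nx + x];
    } else if (axis == 1) {    // slices perpendicular to Y
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                for (int z = 0; z < nz; ++z)
                    out[i++] = vol[(std::size_t)(z * ny + y) * nx + x];
    } else {                   // axis == 2: already Z-major
        out = vol;
    }
    return out;
}
```

The cost is three times the volume's memory, but whichever axis is closest to the view direction can then be rendered with purely sequential reads.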

This week there is no meeting as I am away for the weekend. For next week's meeting I hope to have plenty to show of the new technique.