Earlier this week I refactored the shear warp code to make it more manageable. Rendering for each primary axis was done with seperate code, now all three situations are handled with the same code. There is no measurable performance loss for this.
I then started implementing the multithreaded version of the shear warp algorithm. Each thread renders concurrent slices of the volume to its own image buffer. This image shows the voxels rendered by one thread (with 8 threads running):
These are then composited together at the end. This required altering the rendering slightly in order to also write transparency to the image buffer, otherwise compositing would not be possible.
In a couple of quick tests, the render time of normal non-run-length encoded shear warp was reduced:
0.484 seconds -> 0.125 seconds
0.672 seconds -> 0.152 seconds
The main optimisation I wish to attempt is along similar lines to the 'alternate pixels' method of multithreading the raycasting. Instead of allocating each thread's image buffer seperately, I will do it as one block with alternate pixels belonging to each thread. This should mean the threads are working in roughly the same area of memory at any one time. Also when compositing the seperate images back together, the pixels in the same position on each buffer would all be concurrent in memory, which is ideal.
The work could be split between threads differently, for example alternating slices, rows or voxels. However doing this would mean implementing some measures to ensure that voxels are written in the correct order, and potentially making threads wait for others to complete, which would involve something along the lines of a depth buffer. This is likely to ruin any possible performance gain so for now I will concentrate on other more important tasks.
I could also potentially use threading when compositing the seperate image buffers into one.
What also needs looking at is that run-length encoded rendering breaks when in multithreaded mode:
Currently the RLE volume data object handles which voxel is to be read next, so with multiple threads sampling from it, it breaks. This should be simple to do once I get round to it.
I'm happy with the direction this is going in and the possible things I can do to improve it, drawing from my experience with multithreading the ray casting. Tomorrow's meeting will be spent discussing these results. Over the next week I aim to make a lot of progress, implementing these optimisations and possibly trying out some SIMD optimisations, though I haven't put much thought into where that could be used yet.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment