Thursday, 27 November 2008

Linking Parallel Processing and Volume Rendering

I feel I've got a good grasp of the ideas behind parallel processing over the past week. My primary aim for the personal meeting tomorrow morning will be to assess whether this is true by discussing the things in this post.

I wrote up the following notes today on how I would apply some parallel processing code design ideas to some volume rendering techniques:

-------------------

Splatting

Representations of each voxel are layered up in a back-to-front order on the image buffer to produce the final image.

The order in which voxels that overlap in image space are rendered is very important here. Dividing the voxels between processes based on their screen space position could be problematic, as overlapping voxels may get rendered by separate processes, and also there could be multiple processes needing to render to the same area of the image buffer.

Instead the voxels could be split up based on their depth from the viewpoint, in a Divide & Conquer fashion. Each process would render its voxels to its own individual image buffer, which would need to store both colour and opacity per pixel. These image buffers would then be composited together in back-to-front order, making use of the opacity data stored.

Each process would only write to its individual image buffer and would only need to access the set of voxels it is working on, so memory locking would not occur. The extra memory required for multiple image buffers could be a potential problem, although this is unlikely with today’s hardware unless the image size was extremely large.

The image composition step would also be something to watch out for, for large image sizes or large numbers of buffers to composite. If it proved to be an issue it would be possible to also divide this step between processes, for example one process composities the front half of the buffers while another composites the back half. Alternatively it could be kept in one process and done whilst the following image is rendered by other processes, however this would require a second set of image buffers for double-buffering.

The final point to consider here is load balancing. Each slice of the image to be processed should ideally contain the same number of voxels. In order to achieve this the shape of the data set needs to be taken into account: for example if viewing an axis-aligned cube of voxels from a direction of (1, 1, 1), if the voxels are split evenly by their depth, the majority of the voxels will be distributed in the central slices. The distribution of the voxels within the dataset could also be taken into account.

An alternate load balancing method would be to divide the data set into more slices than the number of processes, and delegate tasks in a Master & Worker style. This would greatly increase the number of image buffers to be allocated / composited however.

Shear Warp

The view matrix is sheared so that volume slices can be rendered orthogonally with optimisations such as run-length encoding.

In terms of the actual rendering this can be parallelized in a similar fashion to splatting. The volume can be split based on the depth of slices from the camera, and the images composited on top of each other at the end.

The additional steps here are the warping of the image at the end, and recalculation of run-length encoding if the opacity threshold/table is altered.

Since the warp is an operating affecting the entire image differently it could be a problem to parallelize it (I currently don’t know much about what the warp involves). A simpler option to keep all the processors working could be to do the warp alongside the rendering of the next image.

Run-length encoding could potentially be split up. Whilst calculation of it is very dependant on the surrounding voxels in a slice, presumably the RLE does not carry over from one slice to the next and so the slices can be calculated independently. Also note that the RLE needs to be calculated for all three versions of the data (one for each axis) so these can easily be parallelized also.

Marching Cubes

A mesh is created by splitting the volume into cubes and analysing the volume data inside each cube.

Generating the mesh for a cube is not dependent on any of the calculations for neighbouring cubes. Divide & Conquer can be used, the cubes can be divided between multiple processes, and the partial meshes created welded together at the end.

One of the optimisations for the standard version of the algorithm is to reuse values sampled from the volume data where the cubes meet. As long as the cubes given to each process are adjacent (i.e. a direct block from the volume, not an unordered set) this optimisation can still be done on the majority of cubes. Whilst it would be possible to communicate sampled values for cubes on the border between processes, this would far less efficient than just sampling the volume seperately per process.

Ray Casting

A ray is created for each pixel in the output image and traversed along, sampling the volume along the way.

The rays are completely independent, so parallelization should be straightforward. Each process could be given a portion of the final image buffer to work on, and once they are all complete the results combined together.

The rays will all be sampling from the same dataset. There should be little overlap of rays from different processes sampling the same data, as they naturally all pass through different parts of the volume. Even in the case of an overlap, the chances of both processes wanting to sample that data at the same time should be slim – so as far as my understanding goes I don’t think it should pose a problem.

The more crucial problem here in order to attain maximum rendering speed is load balancing. Areas of the volume that are highly opaque will enable early ray termination, which greatly reduces the number of samples required. Areas of the volume that feature a lot of low-opacity data will require the ray to traverse most/all of the way through the volume, which will be considerably slower.

The speed of rendering the image is limited by the longest time it takes for one of the processes to complete its part of the image, so all parts should take roughly the same time to process.

In order to achieve this, each process could report back the rendering time for its part of the image. The times for each part could be compared, and used on the next frame to govern how the image is divided up, for example if a section was rendered very quickly it can be expanded on the next frame, and another section reduced. This will work best when the viewpoint does not change a great deal between frames. If the viewpoint does change there will potentially be a drop in rendering time while the division of the image is reevaluated. The main advantage of dividing the image this way however is that it doesn’t require a lot of extra processing to evaluate how to divide the image, as the rendering needs to be done anyway.


-------------------

I also sketched up this very basic design (it feels overly simple really) for the parallelized ray tracing renderer, just to check if my general ideas are on the right track. I currently haven't had chance to look into the actual code-level implementation of threads, so precisely how memory is accessed or what can be shared is still unclear for me.

No comments: