Friday, 28 November 2008

Friday 28th November Meeting

The meeting today was very useful and went quickly.

The main thing I need to look at in preparation for writing parallelized code is how the hardware works. For example, how does the memory cache work - is there a single cache shared between all threads, or does each core have its own? Also, how will memory sharing work on an architecture with multiple CPUs (rather than simply multiple cores on the same CPU)?

The cache findings will govern how best to structure the algorithms. For example, with the ray tracer, if the memory cache is global it would be best for each thread to work on different pixels within the same area of the image, rather than each being assigned a completely separate section of the image.

This raises the question of whether it would be faster to synchronise the threads in order to maintain the cache benefits, or to let them work independently without the synchronisation overhead but potentially drifting out of sequence.

Something else to consider is how the volume data is stored. Storing small cubes of the volume contiguously, rather than storing the data in one straight linear order, could reduce cache misses: the cached region would then cover the area of the volume surrounding the last sample point, rather than just a single row of data. I stored data like this for another project recently and saw a speed increase, so it will be worth trying out here.
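As a rough sketch of the idea (the brick size of 8 is a guess, the function names are mine, and the dimensions are assumed to be exact multiples of the brick size):

    #include <cstddef>

    // Volume stored as 8x8x8 bricks laid out contiguously, so a sample
    // and its 3D neighbours tend to share cached memory. The best brick
    // size would need profiling.
    const std::size_t BRICK = 8;

    // Plain row-major index: neighbours in y or z are far apart in memory.
    std::size_t linearIndex(std::size_t x, std::size_t y, std::size_t z,
                            std::size_t dimX, std::size_t dimY)
    {
        return x + y * dimX + z * dimX * dimY;
    }

    // Bricked index: assumes dimX and dimY are exact multiples of BRICK.
    std::size_t brickedIndex(std::size_t x, std::size_t y, std::size_t z,
                             std::size_t dimX, std::size_t dimY)
    {
        std::size_t bricksX = dimX / BRICK;
        std::size_t bricksY = dimY / BRICK;
        std::size_t brick   = (x / BRICK)
                            + (y / BRICK) * bricksX
                            + (z / BRICK) * bricksX * bricksY;
        std::size_t inBrick = (x % BRICK)
                            + (y % BRICK) * BRICK
                            + (z % BRICK) * BRICK * BRICK;
        return brick * BRICK * BRICK * BRICK + inBrick;
    }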

I will need to be careful with the terminology used when it comes to writing the final report. So far I have not used particularly consistent (or correct) terminology; for example, a 'process' really describes a completely separate application running, rather than a thread within an existing application.

Thursday, 27 November 2008

Linking Parallel Processing and Volume Rendering

I feel I've got a good grasp of the ideas behind parallel processing over the past week. My primary aim for the personal meeting tomorrow morning will be to assess whether this is true by discussing the things in this post.

I wrote up the following notes today on how I would apply some parallel processing code design ideas to some volume rendering techniques:

-------------------

Splatting

Representations of each voxel are layered up in a back-to-front order on the image buffer to produce the final image.

The order in which voxels that overlap in image space are rendered is very important here. Dividing the voxels between processes based on their screen space position could be problematic, as overlapping voxels may get rendered by separate processes, and also there could be multiple processes needing to render to the same area of the image buffer.

Instead the voxels could be split up based on their depth from the viewpoint, in a Divide & Conquer fashion. Each process would render its voxels to its own individual image buffer, which would need to store both colour and opacity per pixel. These image buffers would then be composited together in back-to-front order, making use of the opacity data stored.

Each process would only write to its individual image buffer and would only need to access the set of voxels it is working on, so no memory locking would be needed. The extra memory required for multiple image buffers is a potential problem, although an unlikely one with today's hardware unless the image size was extremely large.

The image composition step would also be something to watch out for with large image sizes or large numbers of buffers to composite. If it proved to be an issue, this step could also be divided between processes, for example one process composites the front half of the buffers while another composites the back half. Alternatively it could be kept in one process and done whilst the following image is rendered by other processes; however, this would require a second set of image buffers for double-buffering.
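A minimal sketch of the composite I have in mind, assuming each buffer stores straight (non-premultiplied) colour and opacity per pixel - the names are placeholders rather than anything from my actual code:

    #include <cstddef>
    #include <vector>

    struct Pixel { float r, g, b, a; };   // colour plus opacity

    // Composite 'src' over the accumulated result 'dst'. With straight
    // alpha this is the standard over operator.
    void compositeOver(std::vector<Pixel>& dst, const std::vector<Pixel>& src)
    {
        for (std::size_t i = 0; i < dst.size(); ++i) {
            const Pixel& s = src[i];
            Pixel& d = dst[i];
            d.r = s.r * s.a + d.r * (1.0f - s.a);
            d.g = s.g * s.a + d.g * (1.0f - s.a);
            d.b = s.b * s.a + d.b * (1.0f - s.a);
            d.a = s.a + d.a * (1.0f - s.a);
        }
    }

Starting with the farthest buffer as the accumulated result, each nearer process's buffer would then be composited over it in turn.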

The final point to consider here is load balancing: each depth slice of the volume to be processed should ideally contain the same number of voxels. To achieve this the shape of the data set needs to be taken into account. For example, when viewing an axis-aligned cube of voxels from the direction (1, 1, 1), splitting the voxels evenly by depth puts the majority of them in the central slices. The distribution of the voxels within the data set could also be taken into account.

An alternative load-balancing method would be to divide the data set into more slices than the number of processes and delegate tasks in a Master & Worker style. However, this would greatly increase the number of image buffers to be allocated and composited.

Shear Warp

The view matrix is sheared so that volume slices can be rendered orthogonally with optimisations such as run-length encoding.

In terms of the actual rendering this can be parallelized in a similar fashion to splatting. The volume can be split based on the depth of slices from the camera, and the images composited on top of each other at the end.

The additional steps here are the warping of the image at the end, and recalculation of run-length encoding if the opacity threshold/table is altered.

Since the warp is an operation that affects the entire image non-uniformly, it could be a problem to parallelize (I currently don't know much about what the warp involves). A simpler option to keep all the processors working could be to do the warp alongside the rendering of the next image.

Run-length encoding could potentially be split up. Whilst its calculation is very dependent on the surrounding voxels in a slice, presumably the RLE does not carry over from one slice to the next, so the slices can be encoded independently. Also note that the RLE needs to be calculated for all three versions of the data (one for each axis), so these can easily be parallelized as well.
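A rough sketch of the per-slice independence I mean, simplified to runs of transparent vs non-transparent voxels under an opacity threshold (all names are hypothetical, and a real version would use a thread pool rather than a thread per slice):

    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // A run of consecutive voxels in a slice that are all transparent
    // or all non-transparent under the current opacity threshold.
    struct Run { std::uint32_t length; bool opaque; };

    // Encode one slice independently; no data from other slices is
    // needed, which is what makes the per-slice parallelism safe.
    std::vector<Run> encodeSlice(const std::uint8_t* slice, std::size_t count,
                                 std::uint8_t threshold)
    {
        std::vector<Run> runs;
        for (std::size_t i = 0; i < count; ) {
            bool opaque = slice[i] >= threshold;
            std::uint32_t len = 0;
            while (i < count && (slice[i] >= threshold) == opaque) { ++i; ++len; }
            runs.push_back({len, opaque});
        }
        return runs;
    }

    // One thread per slice; each thread writes only its own output slot.
    void encodeAllSlices(const std::vector<const std::uint8_t*>& slices,
                         std::size_t sliceSize, std::uint8_t threshold,
                         std::vector<std::vector<Run>>& out)
    {
        out.resize(slices.size());
        std::vector<std::thread> threads;
        for (std::size_t s = 0; s < slices.size(); ++s)
            threads.emplace_back([&, s] {
                out[s] = encodeSlice(slices[s], sliceSize, threshold);
            });
        for (auto& t : threads) t.join();
    }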

Marching Cubes

A mesh is created by splitting the volume into cubes and analysing the volume data inside each cube.

Generating the mesh for a cube is not dependent on any of the calculations for neighbouring cubes. Divide & Conquer can be used: the cubes can be divided between multiple processes, and the partial meshes welded together at the end.

One of the optimisations for the standard version of the algorithm is to reuse values sampled from the volume data where the cubes meet. As long as the cubes given to each process are adjacent (i.e. a direct block from the volume, not an unordered set), this optimisation can still be applied to the majority of cubes. Whilst it would be possible to communicate sampled values for cubes on the border between processes, this would be far less efficient than just sampling the volume separately per process.
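A sketch of the block division (one contiguous slab of cubes per thread; polygoniseSlab is a stub standing in for the real marching cubes step, and the 'weld' here is just a concatenation):

    #include <thread>
    #include <vector>

    struct Vertex { float x, y, z; };
    typedef std::vector<Vertex> Mesh;   // flat triangle list

    // Stub standing in for the real marching cubes work on the slab of
    // cubes with z in [z0, z1).
    Mesh polygoniseSlab(int z0, int z1) { (void)z0; (void)z1; return Mesh(); }

    Mesh marchingCubesParallel(int cubesZ, unsigned numThreads)
    {
        std::vector<Mesh> partial(numThreads);
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < numThreads; ++t) {
            // Contiguous slabs, so the sample-reuse optimisation still
            // works everywhere except the slab boundaries.
            int z0 = cubesZ * t / numThreads;
            int z1 = cubesZ * (t + 1) / numThreads;
            threads.emplace_back([&partial, t, z0, z1] {
                partial[t] = polygoniseSlab(z0, z1);
            });
        }
        for (auto& th : threads) th.join();

        // 'Weld' step - here just concatenation; a real weld would merge
        // duplicate vertices along the slab boundaries.
        Mesh mesh;
        for (const Mesh& p : partial)
            mesh.insert(mesh.end(), p.begin(), p.end());
        return mesh;
    }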

Ray Casting

A ray is created for each pixel in the output image and traversed along, sampling the volume along the way.

The rays are completely independent, so parallelization should be straightforward. Each process could be given a portion of the final image buffer to work on, and once they are all complete the results combined together.
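Something like this minimal sketch is what I have in mind (castRay is a stub for the real per-pixel ray march, and the other names are placeholders):

    #include <thread>
    #include <vector>

    struct Colour { float r, g, b; };

    // Stub for the real per-pixel ray march.
    Colour castRay(int x, int y) { (void)x; (void)y; return Colour{0.0f, 0.0f, 0.0f}; }

    // Each thread renders its own horizontal strip of the shared image
    // buffer. The strips never overlap, so no locking is needed on writes.
    void renderParallel(std::vector<Colour>& image, int width, int height,
                        unsigned numThreads)
    {
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < numThreads; ++t) {
            int y0 = height * t / numThreads;
            int y1 = height * (t + 1) / numThreads;
            threads.emplace_back([&image, width, y0, y1] {
                for (int y = y0; y < y1; ++y)
                    for (int x = 0; x < width; ++x)
                        image[y * width + x] = castRay(x, y);
            });
        }
        for (auto& t : threads) t.join();
    }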

The rays will all be sampling from the same dataset. There should be little overlap between rays from different processes sampling the same data, as they naturally pass through different parts of the volume. Even in the case of an overlap, the volume is only being read, not written, so as far as my understanding goes it should not pose a problem.

The more crucial problem here in order to attain maximum rendering speed is load balancing. Areas of the volume that are highly opaque will enable early ray termination, which greatly reduces the number of samples required. Areas of the volume that feature a lot of low-opacity data will require the ray to traverse most/all of the way through the volume, which will be considerably slower.

The speed of rendering the image is limited by the longest time it takes for one of the processes to complete its part of the image, so all parts should take roughly the same time to process.

In order to achieve this, each process could report back the rendering time for its part of the image. The times for each part could be compared, and used on the next frame to govern how the image is divided up, for example if a section was rendered very quickly it can be expanded on the next frame, and another section reduced. This will work best when the viewpoint does not change a great deal between frames. If the viewpoint does change there will potentially be a drop in rendering time while the division of the image is reevaluated. The main advantage of dividing the image this way however is that it doesn’t require a lot of extra processing to evaluate how to divide the image, as the rendering needs to be done anyway.
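A rough sketch of that feedback idea, assuming horizontal strips (the proportional rule is just one possibility, and none of this is tested):

    #include <cstddef>
    #include <vector>

    // Given the time each strip took last frame, recompute strip heights
    // so slower strips shrink and faster ones grow: each strip's share of
    // the image is made proportional to its measured rows-per-second.
    std::vector<int> rebalanceStrips(const std::vector<double>& lastTimes,
                                     const std::vector<int>& lastHeights,
                                     int imageHeight)
    {
        std::size_t n = lastTimes.size();
        std::vector<double> speed(n);
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            speed[i] = lastHeights[i] / lastTimes[i];   // rows per second
            total += speed[i];
        }
        std::vector<int> heights(n);
        int assigned = 0;
        for (std::size_t i = 0; i + 1 < n; ++i) {
            heights[i] = static_cast<int>(imageHeight * speed[i] / total);
            assigned += heights[i];
        }
        heights[n - 1] = imageHeight - assigned;   // remainder to last strip
        return heights;
    }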


-------------------

I also sketched up this very basic design (it feels overly simple really) for the parallelized ray tracing renderer, just to check that my general ideas are on the right track. I haven't yet had a chance to look into the actual code-level implementation of threads, so precisely how memory is accessed and what can be shared is still unclear to me.

Parallel Processing Design Patterns

These are the notes I have written for a number of parallel processing design patterns. The information is largely from http://wing.cs.uiuc.edu/group/patterns/patterns.html, but my general understanding of the designs comes from various other sources.

Repositories
- Completely independent operations are applied to a centralized repository of data, with no restriction on order.
- The only restriction is that any piece of data can only be read/written by one UE (unit of execution) at any one time.
- The simplest design. Components are completely separated other than their communication with the main repository.
- Noise is introduced to the results due to the non-determinism of having no synchronisation between the operations. Results can be hard to reproduce.
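A minimal sketch of the single-access restriction, using a mutex around the shared data (the names are mine, and an int repository is obviously a toy example):

    #include <mutex>
    #include <vector>

    // Any thread may read or write, but only one at a time, matching
    // the single-access restriction of the pattern.
    class Repository {
    public:
        void write(int value) {
            std::lock_guard<std::mutex> lock(mutex_);
            data_.push_back(value);
        }
        std::vector<int> snapshot() {
            std::lock_guard<std::mutex> lock(mutex_);
            return data_;
        }
    private:
        std::mutex mutex_;
        std::vector<int> data_;
    };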


Pipe and Filter
- Used for when multiple data sets must be processed using a sequence of ordered operations.
- The operations are split onto separate UEs. Parallelism comes into play when the first data set finishes the first operation and moves on to the second; the first operation can then be run on the second data set.
- Eventually all operations will be running on a data set.
- The main problem to watch out for is that a delay in one operation will propagate throughout the entire system. The following operation will have nothing to do, and the previous operation cannot pass on the data. This means that the load of each UE must be balanced carefully so all operations take roughly the same amount of time.
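A toy two-stage sketch of this, using a queue between the stages so both operations run at once on different data sets (a real pipeline would generalise the stages and bound the queue):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Stage 1 produces items into the queue as it finishes each one.
    void stage1() {
        for (int i = 0; i < 10; ++i) {
            int item = i * i;                       // "operation 1"
            { std::lock_guard<std::mutex> lock(m); q.push(item); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }

    // Stage 2 consumes items concurrently, stalling only when the
    // queue is empty - the delay propagation mentioned above.
    void stage2() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !q.empty() || done; });
            if (q.empty() && done) break;
            int item = q.front(); q.pop();
            lock.unlock();
            std::printf("%d\n", item + 1);          // "operation 2"
        }
    }

    int main() {
        std::thread t1(stage1), t2(stage2);
        t1.join(); t2.join();
    }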


Master and Worker
- The same computation must be performed on each data set, but completely independently of the others (an example of an Embarrassingly Parallel problem).
- Parallelizing is easy, simply process multiple data sets at the same time.
- The master gives tasks to the workers once they are done with the previous task. All workers are almost always busy, which implies optimal load balancing.
- If the number of workers is large, the communication overhead for the master can become significant. This can be combated by increasing the size of tasks, giving a coarser granularity (more processing per communication).
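A minimal sketch, with an atomic counter playing the role of the master handing out task indices (processTask is a stub for the real work):

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Stub for the real per-task work.
    void processTask(int task) { std::printf("task %d\n", task); }

    // Each worker takes the next unclaimed task index until none
    // remain, so faster workers simply end up taking more tasks.
    void masterWorker(int numTasks, unsigned numWorkers)
    {
        std::atomic<int> next(0);
        std::vector<std::thread> workers;
        for (unsigned w = 0; w < numWorkers; ++w)
            workers.emplace_back([&next, numTasks] {
                for (;;) {
                    int task = next.fetch_add(1);
                    if (task >= numTasks) break;
                    processTask(task);
                }
            });
        for (auto& t : workers) t.join();
    }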


Replicable
- The same computation must be performed on each data set, completely separately of each other. The computation requires a lot of reading from a set of global data.
- Similar to Master and Worker except for the reading of global data. The solution is similar, but in addition each UE stores a copy of this global data, in order that it can access it quickly without having to wait for other UEs to stop reading the data.
- Replicating the data may require a very large amount of memory, if the data set is large and the number of workers also large.
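A small sketch of the replication step, where each worker copies the shared table once before its main loop (the computation itself is a stand-in):

    #include <thread>
    #include <vector>

    // The copy means the hot loop touches only thread-local data
    // instead of the shared global table.
    void worker(const std::vector<float>& globalTable, int id)
    {
        std::vector<float> localTable = globalTable;   // replicate once
        float sum = 0.0f;                              // stand-in computation
        for (float v : localTable)
            sum += v * static_cast<float>(id);
        (void)sum;
    }

    void runReplicated(const std::vector<float>& table, unsigned numWorkers)
    {
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < numWorkers; ++i)
            threads.emplace_back([&table, i] { worker(table, static_cast<int>(i)); });
        for (auto& t : threads) t.join();
    }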


Divide & Conquer
- The operation required can be split into smaller suboperations and the results merged into the final solution later on. Suboperations are independent of each other, only the final solution is dependent on all of the suboperations.
- Ideally each suboperation should be of equal size to maintain load balance.
- Fits naturally to many problems such as sorting and searching; merge sort is a classic example.
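A minimal merge sort flavoured sketch, with just one level of division for brevity:

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Divide & Conquer: sort each half on its own thread (the halves
    // are independent), then merge the two results at the end.
    void parallelMergeSort(std::vector<int>& v)
    {
        auto mid = v.begin() + v.size() / 2;
        std::thread left ([&v, mid] { std::sort(v.begin(), mid); });
        std::thread right([&v, mid] { std::sort(mid, v.end()); });
        left.join();
        right.join();
        std::inplace_merge(v.begin(), mid, v.end());
    }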


Geometric Decomposition
- Computation needs to be performed on a set of data, which are semi-independent from each other. For example, the data is in an array where the calculation is dependent on the neighbouring cells.
- The data is divided between processes, each only accessing its local data. Only the data which is required by other processes is sent (in the array example, the cells on the border between processors' areas). The network diagram would be a geometric shape related to how the array was split up.
- The communication overhead can be quite large, so care must be taken that this overhead doesn’t outweigh the processing advantage gained.
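A sketch of this on a 1D array, where each thread relaxes its own chunk and only the border cells are read by more than one thread (the three-point average is a stand-in computation):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // One relaxation step, divided geometrically: each thread updates
    // its own chunk of 'next' from the read-only 'cur'. 'cur' is never
    // written during the step, so no locking is needed.
    void stepParallel(const std::vector<float>& cur, std::vector<float>& next,
                      unsigned numThreads)
    {
        std::size_t n = cur.size();
        if (n < 3) return;   // nothing to relax between the fixed ends
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < numThreads; ++t) {
            std::size_t i0 = 1 + (n - 2) * t / numThreads;
            std::size_t i1 = 1 + (n - 2) * (t + 1) / numThreads;
            threads.emplace_back([&cur, &next, i0, i1] {
                for (std::size_t i = i0; i < i1; ++i)
                    next[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0f;
            });
        }
        for (auto& t : threads) t.join();
    }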

Tomorrow I will look into how these patterns could relate to some of the volume rendering techniques I have learnt about.

Tuesday, 25 November 2008

Parallel Processing - General Ideas Research

[I intended on posting this on the evening of Tuesday 25th November but saved a draft instead by accident. Here it is anyway]

I began reading about and looking into parallel processing design patterns yesterday evening and this evening. I don't have anything solid to show currently, as most of this evening ended up with a friend and me discussing the problems and benefits of parallel processing, its future, and the future of computing in general.

I've enjoyed generally getting into the ideas and learning more about it. I will aim to begin more formally putting design patterns etc together tomorrow.

Monday, 24 November 2008

Data loaded successfully

I tried loading the other two data sets on the Stanford Volume Data Archive, "CT Head" and "Bunny". Both suffered the same problems as the first one I tried: the general shape was definitely there, but there were noise issues.

I then had a look at a different volume data source that I had noted down in my research: Pre-Integrated Volume Rendering. The third data set, "CT Head", appears to be from the same scan as the "CT Head" from the Stanford Volume Data Archive. I downloaded it and was able to load it without any problems. That the page is called "pre-integrated" suggests that I was missing a step in resolving the data from Stanford into something I could use.

A screenshot of the successfully loaded CT Head.

Sunday, 23 November 2008

20th November Meeting

In the meeting on Thursday I found out that online papers from publishers such as the IEEE and ACM can be accessed fine from on campus or through the university LIS website. This will be helpful for further research.

Writing the report was also discussed. It is ok for me to start writing the report now, in a first draft style that I can later go back to and piece together properly.

The tasks for the next week are to look into parallel processing design patterns and draw up some plans for the design of my program. Also I will continue the preliminary coding for loading and sampling a volume.

Thursday, 20 November 2008

Loading data, drawing to screen

Tonight's work involved loading a data set (MRbrain from the Stanford volume data archive), and getting a representation of a slice of it drawn to the screen.

The data is there in some form, but does not look correct. The black area is presumably the part where some of the data was removed (see the info in the above link), but the rest is extremely noisy.

The next step will be to check and fix this problem, firstly by loading an alternate set of data to see how that turns out. I might try out the terracotta bunny (also on the above link), as the densities involved are very contrasting so errors should be clear.

This week's meeting is tomorrow. I intend to discuss the possibility of obtaining papers through the university (for example papers published through the IEEE), and also whether it is feasible to start writing parts of my final report soon, as I would like to spread out the writing as much as possible. In terms of code I don't have much to talk about, as at the moment it's just a case of working through it.