Sunday, 28 December 2008
Multithreading Begun
The first set is a scan of the inner ear:
This data set does not contain cube voxels, instead they are rectangles, longer in the Z axis. I added some code to take this into account.
The second new data set is a mathematical function. This contains none of the problems of a scan (innaccuracies, noise etc) so it is good for testing the rendering is correct. I used the colour table function to map different colours and transparencies to various values.
Finally this evening I got multithreading working at a basic level using the boost::threads library.
The red component of the image indicates which thread rendered those pixels, just for testing purposes - here there are 8 threads. This computer only has a single core CPU and so I can't fully test this yet, however I will have a good point to start from when I get back to university after the christmas holidays.
The next task is to implement features such as selecting how the pixels are divided up between threads (in blocks or alternate pixels), possibly some improvements to the lighting, and some better output of rendering times to make testing easier.
Tuesday, 23 December 2008
Back to work
After talking to my tutor I spent a bit longer attempting to optimise the basic rendering. When storing the volume data non-linearly the blocks are now power of two only, which allows for bitshifting instead of divides. Unfortunately it is still slower than linear storage.
I also tonight implemented interpolation between voxels when sampling. As expected rendering is quite a bit slower but image quality is improved quite a lot.
Some example render times:
Linear storage:
single sample: 18.8 seconds
interpolated sample: 34.3 seconds
Block storage (size 4)
single sample: 22.2 seconds
interpolated sample: 84.8 seconds
I will try to work when possible over the next couple of weeks. The main aims are to get round to researching the multi-CPU hardware details, and to multithread the ray caster.
Friday, 12 December 2008
A Bit Of Research & Tomorrow's Meeting
However, if multiple threads are accessing data in the same cache sector (the smallest amount of memory that the cache works with), this will cause one to stall whilst waiting for the other read operation to finish. Since this applies to a whole cache sector it can occur just when threads are reading data that is close together, not necessarily just the exact same data.
This suggests each thread should stick to its own seperate portion of the image, which will mean they are largely all working with different areas of the volume also. When I implement the multi-threading I will be able to test both methods and see if the results match this theory.
The final project meeting of the year is tomorrow. I would like to talk a bit about the actual coding of multithreading, including potential libraries to use (the Boost thread library looks promising), and discuss the results of the data storage tests in my previous post.
Links:
http://www.embedded.com/design/multicore/202805545?pgno=3
http://communities.intel.com/openport/community/embedded/multicore/multicore-blog/blog/2008/10/08/cache-efficiency-the-multicore-performance-linchpin-to-packet-processing-applications
"Optimisation"
When storing the volume data in a standard 3D array manner (all voxels in a row, then all rows in a layer, then all layers in a volume), a cache hit is likely to occur when marching along a ray through the volume.
If the ray is travelling directly along the X axis, the data it is reading will be concurrent in memory. However when it moves in any other direction each sample of the data will be spread out in memory, and will not be in the cache from the previous read.
The idea for the optimisation was to store the data in a similar fashion (voxels, rows, layers), but to do this seperately for small blocks of the volume. Then when moving through the volume a cache hit is far less likely, potentially occurring only on the boundary between the blocks.
Unfortunately implementing this actually caused an increase in rendering times. Here are some render times for identical images, using various sizes of block:
1 (original storage method):
12.863 seconds
12.293
12.293
2:
16.816
16.781
16.742
4:
16.641
16.727
16.664
8:
17.180
17.156
17.137
16:
17.238
17.266
17.977
32:
17.227
17.227
17.207
I would guess that the slow-down is due to the extra computation involved in computing the memory location of a voxel. Some optimisations could be made in this area, for example if the block size was limited to powers of two I could replace divides with bit-shifts. However this would also require the size of the volume data to be a power of two (or at least a multiple of the selected block size). I may do this at some point, however for now I think I need to focus more on the parallel processing since that is the main purpose of the project.
Wednesday, 10 December 2008
Lighting
The last day or two have been spent fixing problems with the colour table, and adding basic lighting.
I'm really pleased with the results, I can now render some impressive-looking images. It's a shame that it took me a lot longer than I originally expected to get to this stage, but I have been working hard so I can't be too disappointed with slow progress.
As previously mentioned the next tasks are to optimise the ray casting and to do further research on, and implement, multithreading.
Sunday, 7 December 2008
Colour Table
This is the colour table that maps voxel value to a colour and opacity. What I plan on implementing next is a small preview of slices of the volume, so it is easy to visualise how the colours will map onto the volume without having to actually render it.
After this, the next major feature is simple lighting, which hopefully won't take too long to do, and then to optimise the ray casting technique.
Thursday, 4 December 2008
GUI Started
The groundwork for the pop-up GUI is now there. This is available by right-clicking and will contain all the controls for colour tables, rendering setup/stats, loading volume data, etc.
Tomorrow I will begin work on the colour table (mapping different volume data values to different colours/opacities).
Going by my original schedule the single-core raytracing should be finished by the end of the week, which is now highly unlikely to happen. I think this is partly due to not focusing enough time on this module in the earlier weeks, and also due to underestimating the work involved in the framework code that doesn't directly contribute to raytracing.
Once the framework is there it will be easier to complete later parts of the project, and so the time should balance itself out. Nevertheless I will make a concerted effort over the coming weeks (especially in the run up to the christmas break) to make good progress.
Wednesday, 3 December 2008
Ray Casting - First Working Demo
It's extremely simple (and slow) at the moment. The next stages will be to look at some optimisations (in method, data storage, etc) and to improve general functionality of the program. Also I will be researching into multithreading more as stated in the previous post.
Friday, 28 November 2008
Friday 28th November Meeting
The main thing I need to look at in preparation for doing parallelized code is how the hardware will work. For example, how does the memory cache work - is it a global one or is there one for each thread? Also, how will memory sharing work on an architecture with multiple CPUs (rather than simply multiple cores on the same CPU).
The cache findings will govern how the algorithms will work best. For example with the ray tracer, if the memory cache is global it would be best for each thread to be working on different pixels within the same area of the image, rather than each being assigned a completely seperate section of the image to work on.
This then adds the question of whether it would be faster to synchronise the threads in order to maintain the cache benefits, or let them work on their own without the synchronisation overhead but potentially getting out of sequence.
Something else to consider is how the volume data is stored. Storing small cubes of the volume sequentially rather than a straight linear storage of the data could prevent cache hits, as the cache will encompass the area of the volume surrounding the last sample point, rather than just a single row of data. I did store data like this for another project recently and saw a speed increase so it will be worth trying out here.
I will need to be careful with terminology used when it comes to writing the final report. So far I have not used overly consistent (or correct) terminology, for example a 'process' really describes a completely seperate application running rather than a thread of an existing application.
Thursday, 27 November 2008
Linking Parallel Processing and Volume Rendering
I wrote up the following notes today on how I would apply some parallel processing code design ideas to some volume rendering techniques:
-------------------
Splatting
Representations of each voxel are layered up in a back-to-front order on the image buffer to produce the final image.
The order in which voxels that overlap in image space are rendered is very important here. Dividing the voxels between processes based on their screen space position could be problematic, as overlapping voxels may get rendered by separate processes, and also there could be multiple processes needing to render to the same area of the image buffer.
Instead the voxels could be split up based on their depth from the viewpoint, in a Divide & Conquer fashion. Each process would render its voxels to its own individual image buffer, which would need to store both colour and opacity per pixel. These image buffers would then be composited together in back-to-front order, making use of the opacity data stored.
Each process would only write to its individual image buffer and would only need to access the set of voxels it is working on, so memory locking would not occur. The extra memory required for multiple image buffers could be a potential problem, although this is unlikely with today’s hardware unless the image size was extremely large.
The image composition step would also be something to watch out for, for large image sizes or large numbers of buffers to composite. If it proved to be an issue it would be possible to also divide this step between processes, for example one process composities the front half of the buffers while another composites the back half. Alternatively it could be kept in one process and done whilst the following image is rendered by other processes, however this would require a second set of image buffers for double-buffering.
The final point to consider here is load balancing. Each slice of the image to be processed should ideally contain the same number of voxels. In order to achieve this the shape of the data set needs to be taken into account: for example if viewing an axis-aligned cube of voxels from a direction of (1, 1, 1), if the voxels are split evenly by their depth, the majority of the voxels will be distributed in the central slices. The distribution of the voxels within the dataset could also be taken into account.
An alternate load balancing method would be to divide the data set into more slices than the number of processes, and delegate tasks in a Master & Worker style. This would greatly increase the number of image buffers to be allocated / composited however.
Shear Warp
The view matrix is sheared so that volume slices can be rendered orthogonally with optimisations such as run-length encoding.
In terms of the actual rendering this can be parallelized in a similar fashion to splatting. The volume can be split based on the depth of slices from the camera, and the images composited on top of each other at the end.
The additional steps here are the warping of the image at the end, and recalculation of run-length encoding if the opacity threshold/table is altered.
Since the warp is an operating affecting the entire image differently it could be a problem to parallelize it (I currently don’t know much about what the warp involves). A simpler option to keep all the processors working could be to do the warp alongside the rendering of the next image.
Run-length encoding could potentially be split up. Whilst calculation of it is very dependant on the surrounding voxels in a slice, presumably the RLE does not carry over from one slice to the next and so the slices can be calculated independently. Also note that the RLE needs to be calculated for all three versions of the data (one for each axis) so these can easily be parallelized also.
Marching Cubes
A mesh is created by splitting the volume into cubes and analysing the volume data inside each cube.
Generating the mesh for a cube is not dependent on any of the calculations for neighbouring cubes. Divide & Conquer can be used, the cubes can be divided between multiple processes, and the partial meshes created welded together at the end.
One of the optimisations for the standard version of the algorithm is to reuse values sampled from the volume data where the cubes meet. As long as the cubes given to each process are adjacent (i.e. a direct block from the volume, not an unordered set) this optimisation can still be done on the majority of cubes. Whilst it would be possible to communicate sampled values for cubes on the border between processes, this would far less efficient than just sampling the volume seperately per process.
Ray Casting
A ray is created for each pixel in the output image and traversed along, sampling the volume along the way.
The rays are completely independent, so parallelization should be straightforward. Each process could be given a portion of the final image buffer to work on, and once they are all complete the results combined together.
The rays will all be sampling from the same dataset. There should be little overlap of rays from different processes sampling the same data, as they naturally all pass through different parts of the volume. Even in the case of an overlap, the chances of both processes wanting to sample that data at the same time should be slim – so as far as my understanding goes I don’t think it should pose a problem.
The more crucial problem here in order to attain maximum rendering speed is load balancing. Areas of the volume that are highly opaque will enable early ray termination, which greatly reduces the number of samples required. Areas of the volume that feature a lot of low-opacity data will require the ray to traverse most/all of the way through the volume, which will be considerably slower.
The speed of rendering the image is limited by the longest time it takes for one of the processes to complete its part of the image, so all parts should take roughly the same time to process.
In order to achieve this, each process could report back the rendering time for its part of the image. The times for each part could be compared, and used on the next frame to govern how the image is divided up, for example if a section was rendered very quickly it can be expanded on the next frame, and another section reduced. This will work best when the viewpoint does not change a great deal between frames. If the viewpoint does change there will potentially be a drop in rendering time while the division of the image is reevaluated. The main advantage of dividing the image this way however is that it doesn’t require a lot of extra processing to evaluate how to divide the image, as the rendering needs to be done anyway.
-------------------
I also sketched up this very basic design (it feels overly simple really) for the parallelized ray tracing renderer, just to check if my general ideas are on the right track. I currently haven't had chance to look into the actual code-level implementation of threads, so precisely how memory is accessed or what can be shared is still unclear for me.
Parallel Processing Design Patterns
Repositories
- Completely independent operations are applied to a centralized repository of data, with no restriction on order.
- The only restriction is that data can only be read/written by one UE at any one time.
- The simplest design. Components are completely seperated other than communication with the main repository.
- Noise is introduced to the results thanks to the non-deterministic nature of having no synchronisation between the operations. Results can be hard to reproduce.
Pipe and Filter
- Used for when multiple data sets must be processed using a sequence of ordered operations.
- The operations are split onto separate UEs. Parallelization comes into play when the first set of data is finished, and moves to the second operation. The first operation can then be ran on the second set of data.
- Eventually all operations will be running on a data set.
- The main problem to watch out for is that a delay in one operation will propagate throughout the entire system. The following operation will have nothing to do, and the previous operation cannot pass on the data. This means that the load of each UE must be balanced carefully so all operations take roughly the same amount of time.
Master and Worker
- The same computation must be performed on each data set, but completely independently of each other (an example of an Embarassingly Parallel problem)
- Parallelizing is easy, simply process multiple data sets at the same time.
- The master gives tasks to the workers once they are done with the previous task. All workers are almost always busy, which implies optimal load balancing.
- If the number of workers is large, the communication overhead for the master can become large. This can be combated by increasing the size of tasks, which will lead to a more coarse granulation (more processing per communication).
Replicable
- The same computation must be performed on each data set, completely seperately of each other. The computation requires a lot of reading from a set of global data.
- Similar to Master and Worker except for the reading of global data. The solution is similar, but in addition each UE stores a copy of this global data, in order that it can access it quickly without having to wait for other UEs to stop reading the data.
- Replicating the data may require a very large amount of memory, if the data set is large and the number of workers also large.
Divide & Conquer
- The operation required can be split into smaller suboperations and the results merged into the final solution later on. Suboperations are independent of each other, only the final solution is dependent on all of the suboperations.
- Ideally each suboperation should be of equal size to maintain load balance.
- Fits naturally to many problems such as sorting or searching, especially merge sort.
Geometric Decomposition
- Computation needs to be performed on a set of data, which are semi-independent from each other. For example, the data is in an array where the calculation is dependent on the neighbouring cells.
- The data is divided between processess, each only accessing their local data. On the data which is required by other processes is sent (in the array example, the cells on the border between processor’s areas). The network diagram would be a geometric shape related to how the array was split up.
- The communication overhead can be quite large, so care must be taken that this overhead doesn’t outweigh the processing advantage gained.
Tomorrow I will look into how these patterns could relate to some of the volume rendering techniques I have learnt about.
Tuesday, 25 November 2008
Parallel Processing - General Ideas Research
I began reading about and looking into parallel processing design patterns yesterday evening and this evening. I don't have anything solid to show currently as most of this evening ended up with me and a friend discussing the problems / benefits of parallel processing, the future of it and the future of computing in general.
I've enjoyed generally getting into the ideas and learning more about it. I will aim to begin more formally putting design patterns etc together tomorrow.
Monday, 24 November 2008
Data loaded successfully
I then had a look at a different volume data source that I had noted down in my research: Pre-Integrated Volume Rendering. The third data set "CT Head" appears to be from the same scan as the "CT Head" from the Stanford Volume Data Archive. I downloaded it and was able to load it without any problems. That the page is called "pre-integrated" suggests that I was missing a step in resolving the data from Stanform into something I could use.
A screenshot of the successfully loaded CT Head.
Sunday, 23 November 2008
20th November Meeting
Writing the report was also discussed. It is ok for me to start writing the report now, in a first draft style that I can later go back to and piece together properly.
The tasks for the next week are to look into parallel processing design patterns and draw up some plans for the design of my program. Also I will continue the preliminary coding for loading and sampling a volume.
Thursday, 20 November 2008
Loading data, drawing to screen
The data is there in some form, but does not look correct. The black area is pesumably the part where some of the data was removed (see the info in the above link), but the rest is extremely noisy.
The next step will be to check and fix this problem, firstly by loading an alternate set of data to see how that turns out. I might try out the terracotta bunny (also on the above link), as the densities involved are very contrasting so errors should be clear.
This week's meeting is tomorrow. I intend to discuss the possibility of obtaining papers through the university (for example papers published through the IEEE), and also whether it is feasible to start writing parts of my final report soon, as I would like to spread out the writing as much as possible. In terms of code I don't have much to talk about, as at the moment it's just a case of working through it.
Wednesday, 19 November 2008
Beginning of Code
I was hoping to get more done tonight than I have, but unfortunately some other work got in the way. The Visual Studio project is set up and I have a blank window on screen so tomorrow I can press on with the actual program structure.
I neglected to do a diary entry after the meeting on Friday 14th. It was interesting to learn more about other people's projects and useful to talk about mine a bit. The main thing that came out of it for me was that I don't really have anything new going on in my project at the moment, for example, could a volume rendering technique be improved by doing something differently. As I implement the techniques I will bear this in mind.
Friday, 14 November 2008
Further Volume Rendering Research
Splatting
Object-order technique. Every voxel is projected onto the image plane. Each one is then rendered, in back to front order, to produce the final image.
Each voxel’s contribution to the image is based on a gaussian (or sometimes linear) distribution, based on the voxel’s colour and opacity. The size of the splat should be large enough that gaps do not appear between neighbouring voxels, and small enough to avoid excessive overlapping which leads to blurring. The size can also be altered per voxel in the case of a perspective projection.
- Fast rendering technique, but generally results in a lower-quality image
- Every voxel is taken into account in the final image, compared to ray tracing which may miss details if the rays are far apart or the step it too large.
Shear Warp
The view matrix is created using a shear, so axis-aligned voxel slices remain aligned with the image plane. To achieve perspective, the slices can also be scaled.
These slices are then efficiently rendered. Run length encoding is used to easily skip areas of transparent voxels. The RLE is calculated in a preprocess step. The preprocessing is done for the volume in the 3 principle axes, so the data does not need to be transposed at runtime.
Finally this image is warped to counteract the shearing effect. The resulting image is as if rendered from the originally desired viewpoint.
- Very fast because of the optimisations.
- Shearing / warping can reduce the image quality, especially at an angle of 45 degrees which is the worst case (requires the most extreme warp).
- There is no loss of small details due to undersampling
- More memory usage due to the need to store the volume data multiple times
- 3D Volume is only traversed once
- If the opacity threshold/table is changed, the RLE must be recalculated. Alternatively an alternate version of the algorithm can be used.
Marching Cubes
8 samples are taken around a location in the volume (equivalent to the 8 vertices in a cube). Each sample is determined to be either inside or outside the volume (the sample value is less than or greater than the threshold value).
The 8 binary values are used to lookup one of 256 (28) configurations of polygons, which describe the mesh surface within the sampled cube. Vertices that lie on the edge of the cube can be weighted between that edge’s vertices based on the two sample values. These polygons are then inserted into the mesh.
The algorithm continues across the entire volume to produce the complete mesh. The size of the sample cube can be altered to achieve a higher resolution mesh.
- If the sample size is too large, the algorithm can produce false positives (bridging a gap between two surfaces that are not connected) or false negatives (creating a gap between two connected surfaces, or missing a small surface entirely).
Marching Tetrahedrons
This is similar to the Marching Cubes algorithm. It was originally developed in order to circumvent a patent on Marching Cubes which was in effect at the time. The theory is essentially the same, but using tetrahedrons instead of cubes. This results in a higher number of sample points and therefore a more detailed mesh. This is at the cost of a longer processing time and higher memory usage.
Main sources:
Splatting:
http://www.ifi.uzh.ch/vmml/files/Seminar/Huebner_Handouts.pdf
Marching Cubes:
http://local.wasp.uwa.edu.au/~pbourke/geometry/polygonise/
I also read:
Frequency Domain Volume Rendering
Takashi Totsuka and Marc Levoy, Proc. SIGGRAPH '93, Anaheim, California, August, 1993, pp. 271-278.
http://www-graphics.stanford.edu/papers/fvr/
My understanding of the technique, involving Fourier transforms, is currently very limited however.
I have prepared some notes and a handout for the Friday presentation. I intend to give an overview of Ray Casting, Splatting, Shear Warp, and Marching Cubes. I will then state which algorithm I have chosen to implement first:
The first technique will be ray tracing, because:
- It produces high quality images so it is a useful technique.
- It is one of the slower techniques so speeding it up is desired (either for faster rendering or to render a better quality image in the same time).
- It appears to be fairly easy to implement, so I will have more time to work on the rendering program at the same time as the technique.
- It should also be quite straightforward to parallelize, which is currently desirable as I still need to learn the syntax and methods for parallel processing.
I have not currently decided which technique to implement second, I think that will be better decided in a few weeks once I have a greater understanding from implementing ray tracing. The second technique will be a more in-depth one that is less directly parallelizable, possibly Shear Warp, Marching Cubes or another technique found through further research.
Currently I don't have any problems to address in the next meeting, I am quite happy to begin work on the programming next week.
Monday, 10 November 2008
Initial Parallel Processing Research
Important things to consider in parallel processing code design are:
* Load Balancing - Spreading processing out as evenly as possible between all the available processors, to avoid any of them being idle.
* Granularity - The balance between processing data and communicating with other processors. Low granularity (little communication) is more efficient for an individual processor in terms of the amount of data processed, but is more likely to create idle time as processors wait for each other to communicate.
* Data Dependency - Identifying which data needs to be used in some way by more than one processor, and how to handle it. Data could be sent between processors when ready (distributed memory), or the processors must synchronise reading/writing operations on the same area of memory (shared memory).
The next task is to research Volume Rendering further and consider the methods in relation to Parallel Processing.
Friday, 7 November 2008
Initial Research & First Individual Meeting
Research
Began looking into volume rendering techniques:
Two main types:
Direct Volume Rendering and Surface Fitting.
- SF – A 3D surface is modelled based on the data, and then that is rendered.
- SF – doesn’t take into account that materials may be partially transparent
- SF – can produce false positives (artefacts not in the data set) and false negatives (discarding small details from the data set)
- DVR – Volume data is rendered directly into screen space.
- DVR – Works especially well for data sets with non-solid features such as liquids and gases.
- In DVR the entire data set must be used to render each image, so rendering times are usually much higher than for SF.
- If extra things need to be displayed – for example a 3D primitive indicating an area of interest – in SF it could be rendered directly along with the 3D surface, whereas in a DVR technique the primitive must be accomodated for specially.
Colour selection:
For CT scan data each data point represents the density of the material at that location. The colour and opacity of a point can be based on that density, via a colour table. An opacity of zero can be used to completely hide material of a certain density, to reduce the amount of data shown.
Ray Casting:
Cast a ray for every pixel in the output image. At evenly spaced intervals, sample the 3D data for RGBA. Build up the colour of the pixel by combining the samples.
- Settings are very simple, direct quality/speed tradeoff: number of rays per pixel / number of samples per ray.
- When the sample is between data points, the colour/opacity is linearly interpolated.
- Rays are independent of each other. Should scale very well for parallel processing.
- Can included shadowing calculations using rays from the sample point to the light source. Often not desired, for example in medical imaging.
- Optimisation methods: Space Partitioning; Adaptive Termination of ray
Splatting:
For every element of volume data, a voxel is projected onto the 2D plane. This is done in back-to-front order to build up the final image.
Other techniques: V-Buffer, Shear Warp. Contour Tracking, Marching Cubes, Opaque Cubes, Dividing Cubes, Marching Tetrahedra (Surface Fitting)
Primary information sources:
Scientific Visualisation Tutorials
Brief description of Ray Casting and Splatting
http://www.cc.gatech.edu/scivis/tutorial/linked/volrend.html
Volume Visualisation With Ray Casting
In depth description of Ray Casting, including lighting and optimisation. Brief discussion of problems with BSP Voxels and Marching Cubes. Some more volume rendering links.
http://web.cs.wpi.edu/~matt/courses/cs563/talks/powwie/p1/ray-cast.htm
Volume Visualisation and Rendering
Description of a wide variety of DVR and SF techniques, and description of some techniques for colouring volume data.
http://www.siggraph.org/education/materials/HyperVis/vistech/volume/volume.htm
Found potential sources for volume data to render:
The Stanford volume data archive – 3 data sets
http://www-graphics.stanford.edu/data/voldata/
“Sample Datasets” two thirds down the page.
http://www.sph.sc.edu/comd/rorden/render.html
A variety of volume data sets.
http://www.vis.uni-stuttgart.de/~engel/pre-integrated/data.html
Meeting
Issues identified in meeting:
- Went over project diary details, what to include, when to post.
- If I use any 3rd party libraries at the programming stage, make sure to pay attention and follow any licence and usage restrictions.
Plan for next week:
- Continue research into volume rendering techniques
- Choose which techniques to use, and why. Also think about what features need to be included in the program.
- Begin research into general parallel processing principles
- Prepare for the small 10-15 minute presentation to other FYP students next friday.