Sunday 8 March 2009
After finally sitting down and putting some thought into how to use SIMD with the Shear Warp algorithm, it doesn't seem that applicable. Unlike Ray Casting, where there was additional processing going on, this algorithm basically consists of copying each voxel to its pixel in the output image while handling the voxel's transparency value.
The warp part of the algorithm involves a few calculations, but compared to generating the image it adds only a very small amount of time to the render, so I didn't feel it was worth spending time on.
I implemented some SIMD code to aid the calculations involved in writing a transparent voxel to the output image, but only succeeded in slowing things down.
I also looked into processing 4 voxels at once, which would mean doing the calculations for the red, green and blue of each pixel at the same time. This offers no benefit over processing single voxels, though, as there are no calculations common to all four.
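For reference, here is a minimal sketch of the kind of per-voxel blend the SIMD attempt covered, using SSE intrinsics. The function and buffer names are illustrative rather than the actual project code, and it assumes the pixel is held as four floats.

#include <xmmintrin.h>  // SSE intrinsics

// Blend one voxel over the current output pixel, both held as four packed
// floats (R, G, B, A). With only four lanes of simple arithmetic, the loads
// and stores cost as much as the maths, which is roughly why this ended up
// slower than the scalar code.
inline void blendVoxelSSE(float* pixel, const float* voxel, float alpha)
{
    __m128 d = _mm_loadu_ps(pixel);    // current pixel colour
    __m128 s = _mm_loadu_ps(voxel);    // voxel colour
    __m128 a = _mm_set1_ps(alpha);     // voxel opacity in all four lanes

    // pixel = pixel + alpha * (voxel - pixel)
    __m128 r = _mm_add_ps(d, _mm_mul_ps(a, _mm_sub_ps(s, d)));
    _mm_storeu_ps(pixel, r);
}

Processing four voxels at once would need the equivalent of this done for four different output pixels, each with its own opacity, which is where the lack of common work shows up.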
On Monday I will start the report. My plan has altered slightly after the tutor meeting, to include design sections for the two algorithms and the program. I had also missed out the Evaluation section.
Thursday 5 March 2009
Polishing
I spent today fixing various bugs and generally improving the program, to get it to the point where I am happy to call the coding finished. The only remaining task is to try some SIMD on the Shear Warp algorithm over the next couple of days. I have been getting render times in the region of 0.05 seconds on some volumes; however, transparency still causes a slowdown, so if I can speed up the general calculations with SIMD it will be helpful.
On Monday I will begin the report, and will ideally get around 4000 words written over the week, although it is hard to plan it out exactly.
I have written a list of the sections I intend to include in the report, to talk about with my tutor tomorrow.
Abstract
Introduction
Overview of Volume Rendering Techniques
Overview of Parallel Processing
Implementation of Ray Casting
- Basic
- Parallel Processing
- SIMD
Implementation of Shear Warp
- Basic
- Parallel Processing
- SIMD
Implementation of Rendering Program
Conclusion
References
Appendices
- Render time test results
Tuesday 3 March 2009
RLE Fixed
My work so far tonight has seen some nice results.
Earlier I reported that enabling the RLE to work both forwards and backwards had slowed it down considerably. As it turns out, I had simply broken it, which I have now fixed.
On the above image, a standard single-thread render took 1.32 seconds. With RLE enabled, it took 0.64 seconds. This is slower than previous times quoted for the skull, as this render contains a lot of non-empty voxels.
To improve this further, I have added a setting where voxels that are surrounded by other solid voxels, and are therefore never visible, are considered empty and not rendered at all. This lengthens the pre-processing of the volume, though not by an unacceptable amount. The rendering time for the above image was reduced further to 0.13 seconds.
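A minimal sketch of that pre-processing pass, assuming the opacities sit in a flat x-fastest array; the function names, layout and threshold parameter are illustrative, not the program's actual data structures.

#include <vector>

// Index into a flat volume stored x-fastest (the layout is an assumption).
inline int voxelIndex(int x, int y, int z, int w, int h)
{
    return (z * h + y) * w + x;
}

// Flag solid voxels whose six face neighbours are all solid; they can never
// be seen, so the renderer can treat them as empty and skip them entirely.
// "Solid" here means opacity at or above some threshold.
std::vector<bool> findHiddenVoxels(const std::vector<unsigned char>& opacity,
                                   int w, int h, int d,
                                   unsigned char solidThreshold)
{
    std::vector<bool> hidden(opacity.size(), false);

    for (int z = 1; z < d - 1; ++z)
        for (int y = 1; y < h - 1; ++y)
            for (int x = 1; x < w - 1; ++x)
            {
                bool surrounded =
                    opacity[voxelIndex(x, y, z, w, h)]     >= solidThreshold &&
                    opacity[voxelIndex(x - 1, y, z, w, h)] >= solidThreshold &&
                    opacity[voxelIndex(x + 1, y, z, w, h)] >= solidThreshold &&
                    opacity[voxelIndex(x, y - 1, z, w, h)] >= solidThreshold &&
                    opacity[voxelIndex(x, y + 1, z, w, h)] >= solidThreshold &&
                    opacity[voxelIndex(x, y, z - 1, w, h)] >= solidThreshold &&
                    opacity[voxelIndex(x, y, z + 1, w, h)] >= solidThreshold;

                hidden[voxelIndex(x, y, z, w, h)] = surrounded;
            }

    return hidden;
}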
I have not been able to test it on a multi-threaded render yet.
I also added linear interpolation sampling when warping and scaling the render to fill the screen. This helps to remove some artifacts and gives a generally smoother (though blurrier) appearance. Render time with this option enabled increased slightly, to 0.15 seconds.
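The sampling itself is plain bilinear interpolation; here is a sketch for a single colour channel, assuming a row-major float image (the layout is an assumption).

// Bilinearly sample one channel of a row-major float image at a
// non-integer position (x, y). The four surrounding pixels are blended by
// the fractional parts of the coordinates, which is what smooths out the
// nearest-pixel artifacts during the warp and scale.
float sampleBilinear(const float* image, int width, int height,
                     float x, float y)
{
    // Clamp so the 2x2 neighbourhood stays inside the image.
    if (x < 0.0f) x = 0.0f;
    if (y < 0.0f) y = 0.0f;
    if (x > width - 2.0f)  x = (float)(width - 2);
    if (y > height - 2.0f) y = (float)(height - 2);

    int   x0 = (int)x,  y0 = (int)y;
    float fx = x - x0,  fy = y - y0;

    float p00 = image[y0 * width + x0];
    float p10 = image[y0 * width + x0 + 1];
    float p01 = image[(y0 + 1) * width + x0];
    float p11 = image[(y0 + 1) * width + x0 + 1];

    float top    = p00 + fx * (p10 - p00);   // interpolate along x (top row)
    float bottom = p01 + fx * (p11 - p01);   // interpolate along x (bottom row)
    return top + fy * (bottom - top);        // then along y
}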
Saturday 28 February 2009
Shear Warp Optimising
I tried out an optimisation I suggested in my last post: storing all the threads' image buffers interlaced in memory rather than in separate blocks. So the memory for two threads would look like this:
[thread 1 (0,0)] [thread 2 (0,0)] [thread 1 (1,0)] [thread 2 (1,0)] [thread 1 (2,0)] etc
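In code the layout just changes the index calculation; a sketch, with the buffer and parameter names assumed (and 32-bit pixels assumed for simplicity):

// One allocation holds every thread's intermediate image, interleaved per
// pixel: the numThreads copies of pixel (x, y) sit next to each other.
// This also means that, when compositing at the end, all the values needed
// for one output pixel are adjacent in memory.
inline unsigned int& threadPixel(unsigned int* interleaved,
                                 int width, int numThreads,
                                 int x, int y, int threadId)
{
    return interleaved[(y * width + x) * numThreads + threadId];
}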
In a test image where most voxels were empty, rendering time on 8 threads was reduced from 0.141 to 0.123 seconds. In a second test with most voxels being drawn, render time increased from 0.234 to 0.281 seconds. I believe this is due to the threads writing to the same cache line, as discussed under "false sharing" here: http://www.ddj.com/embedded/212400410
On an image with more empty voxels, false sharing has less effect, allowing the advantages of sharing cache lines for reading purposes to speed the rendering up.
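To make the false-sharing point concrete, here is the back-of-the-envelope arithmetic, assuming 32-bit pixels and 64-byte cache lines (both assumptions rather than measured facts about the data or the CPU):

// With the interleaved layout, the 8 entries for one pixel (one per thread)
// occupy 8 * 4 = 32 bytes, so one 64-byte cache line holds the entries of
// two adjacent pixels for all 8 threads. Two threads writing their own
// entries for nearby pixels are therefore writing to the same line, which
// is the false-sharing case; when most voxels are empty those writes are
// rare, and sharing the line for reads becomes a benefit instead.
const int bytesPerPixelEntry = 4;    // one 32-bit pixel (assumed)
const int numThreads         = 8;
const int cacheLineBytes     = 64;   // assumed cache line size
const int pixelsPerLine      = cacheLineBytes / (bytesPerPixelEntry * numThreads);  // = 2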
I have also fixed multithreaded run length encoding, which on 8 threads reduced the test skull render from 0.125 to 0.109 seconds. I will produce more detailed test results in the future.
Thursday 26 February 2009
Multithreaded Shear Warp - Partially Complete
Earlier this week I refactored the shear warp code to make it more manageable. Rendering for each primary axis was previously done with separate code; now all three cases are handled by the same code. There is no measurable performance loss from this.
I then started implementing the multithreaded version of the shear warp algorithm. Each thread renders consecutive slices of the volume to its own image buffer. This image shows the voxels rendered by one thread (with 8 threads running):
These buffers are then composited together at the end. This required altering the rendering slightly to also write transparency to the image buffer; otherwise compositing would not be possible.
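A sketch of that compositing step, blending the per-thread buffers in depth order with the standard "over" operation. RGBA floats, straight alpha and the assumption that thread 0's slices are nearest the viewer are illustrative choices, not necessarily how the program stores things.

// Composite the per-thread intermediate images into one final image by
// blending them back to front with the "over" operator. Each buffer holds
// numPixels RGBA float pixels; buffers[0] is assumed to cover the slices
// nearest the viewer.
void compositeBuffers(float* out, const float* const* buffers,
                      int numThreads, int numPixels)
{
    for (int p = 0; p < numPixels; ++p)
    {
        float r = 0.0f, g = 0.0f, b = 0.0f, a = 0.0f;

        for (int t = numThreads - 1; t >= 0; --t)   // furthest buffer first
        {
            const float* src = &buffers[t][p * 4];
            float srcA = src[3];

            // src over the accumulated result
            r = src[0] * srcA + r * (1.0f - srcA);
            g = src[1] * srcA + g * (1.0f - srcA);
            b = src[2] * srcA + b * (1.0f - srcA);
            a = srcA + a * (1.0f - srcA);
        }

        out[p * 4 + 0] = r;
        out[p * 4 + 1] = g;
        out[p * 4 + 2] = b;
        out[p * 4 + 3] = a;
    }
}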
In a couple of quick tests, the render time of normal non-run-length encoded shear warp was reduced:
0.484 seconds -> 0.125 seconds
0.672 seconds -> 0.152 seconds
The main optimisation I wish to attempt is along similar lines to the 'alternate pixels' method of multithreading the ray casting. Instead of allocating each thread's image buffer separately, I will allocate them as one block, with alternate pixels belonging to each thread. This should mean the threads are working in roughly the same area of memory at any one time. Also, when compositing the separate images back together, the pixels in the same position in each buffer would all be contiguous in memory, which is ideal.
The work could be split between threads differently, for example by alternating slices, rows or voxels. However, doing this would mean implementing some measure to ensure that voxels are written in the correct order, and potentially making threads wait for others to complete, which would involve something along the lines of a depth buffer. This is likely to ruin any possible performance gain, so for now I will concentrate on other more important tasks.
I could also potentially use threading when compositing the separate image buffers into one.
What also needs looking at is that run-length encoded rendering breaks in multithreaded mode:
Currently the RLE volume data object keeps track of which voxel is to be read next, so with multiple threads sampling from it, it breaks. This should be simple to fix once I get round to it.
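The likely fix is to take the "next voxel" state out of the shared volume object and give each thread its own cursor over the (read-only) run data; a sketch, with the run layout and names assumed:

// A run of consecutive voxels that are either all empty or all non-empty.
struct Run
{
    int  length;   // number of voxels in the run
    bool empty;    // an empty run can be skipped without reading its voxels
};

// Per-thread position within a shared, read-only run list. Each rendering
// thread owns one of these, so threads no longer interfere with each other
// through a single read position stored in the volume object.
class RunCursor
{
public:
    RunCursor(const Run* runs, int numRuns)
        : runs_(runs), numRuns_(numRuns), run_(0), offset_(0) {}

    bool atEnd() const { return run_ >= numRuns_; }

    bool currentRunEmpty() const { return !atEnd() && runs_[run_].empty; }

    // Step forward by one voxel within the current run.
    void nextVoxel()
    {
        if (atEnd()) return;
        if (++offset_ >= runs_[run_].length) { ++run_; offset_ = 0; }
    }

    // Jump past the rest of the current run (used for empty runs).
    void skipRun() { if (!atEnd()) { ++run_; offset_ = 0; } }

private:
    const Run* runs_;
    int numRuns_;
    int run_;      // index of the current run
    int offset_;   // position within the current run
};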
I'm happy with the direction this is going in and the possible things I can do to improve it, drawing from my experience with multithreading the ray casting. Tomorrow's meeting will be spent discussing these results. Over the next week I aim to make a lot of progress, implementing these optimisations and possibly trying out some SIMD optimisations, though I haven't put much thought into where that could be used yet.
Saturday 21 February 2009
Run Length Encoding
I have finished implementing the run length encoding optimisation. For the skull, render times have been reduced from around 2 seconds to 1.3 seconds. The run length encoding gives the best improvement on images such as the skull where there is a lot of empty space that can be skipped. When most of the volume is only partially transparent there is less improvement.
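For reference, a minimal sketch of the idea: each scanline of voxels is stored as runs that are either entirely empty or entirely non-empty, so the renderer can jump over an empty stretch in one step. The "empty" threshold and the names are illustrative rather than the program's actual structures.

#include <vector>

// A run of consecutive voxels along a scanline that are either all empty
// or all non-empty; empty runs are skipped in a single step when rendering.
struct Run
{
    int  length;
    bool empty;
};

// Build the run-length encoding for one scanline of opacities. "Empty"
// means opacity below a threshold.
std::vector<Run> encodeScanline(const unsigned char* opacity, int count,
                                unsigned char emptyThreshold)
{
    std::vector<Run> runs;
    int i = 0;

    while (i < count)
    {
        bool empty = opacity[i] < emptyThreshold;
        Run run = { 0, empty };

        // Extend the run while the empty/non-empty state stays the same.
        while (i < count && (opacity[i] < emptyThreshold) == empty)
        {
            ++run.length;
            ++i;
        }
        runs.push_back(run);
    }
    return runs;
}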
Since I now have a decent interactive framerate even on my single core PC I will probably look at tweaking the image warp to get it correct before moving onto multithreading.
Thursday 19 February 2009
Basic Shear Warp - First Pass
I have done a first pass on the warp part of the shear warp algorithm, which skews the image back to its proper perspective. It works for the most part, but is still off at times, especially at extreme angles.
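A sketch of the general idea behind the warp step, as inverse mapping with nearest-pixel sampling (the way the shear-warp papers usually describe it); the matrix, translation and buffer names below are placeholders, not the project's actual code.

// Warp the sheared intermediate image into the final image using an
// inverse 2D affine mapping with nearest-pixel sampling. m is the 2x2
// inverse warp matrix (row-major) and (tx, ty) its translation; these
// stand in for whatever the shear-warp factorisation actually produces.
void warpImage(const unsigned int* intermediate, int iw, int ih,
               unsigned int* output, int ow, int oh,
               const float m[4], float tx, float ty)
{
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
        {
            // Map the output pixel back into the intermediate image.
            float sx = m[0] * x + m[1] * y + tx;
            float sy = m[2] * x + m[3] * y + ty;

            int ix = (int)(sx + 0.5f);   // nearest pixel
            int iy = (int)(sy + 0.5f);

            if (ix >= 0 && ix < iw && iy >= 0 && iy < ih)
                output[y * ow + x] = intermediate[iy * iw + ix];
            else
                output[y * ow + x] = 0;  // outside the intermediate image
        }
}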
For now I am happy enough with it while I work on run-length encoding and multithreading. Once the algorithm is running much faster (potentially at mostly interactive frame rates) I'll be able to see what's going on better and fix the warp then.
There are also obvious quality issues that I'll look at then too (mainly involving using a bilinear filter during the warp rather than reading the nearest pixel).
The speed of the algorithm is promising: currently a render such as the one above takes 2.3 seconds, compared to 17.3 seconds for a ray casting version (single-threaded).
I have begun to lay down the groundwork for RLE. I was planning on having it finished for this week's tutor meeting, which is tomorrow, but time has run out. The new plan is to have it done by the weekend. My plan for the meeting is just the usual checkup on progress.