Sunday, 8 March 2009

Shear Warp + SIMD

After finally sitting down and putting some thought into how to use SIMD with the Shear Warp algorithm, it doesn't seem that applicable. Unlike Ray Casting where there was additional processing going on, this algorithm basically consists of copying each voxel to it's pixel in the output image, handling the transparency value of the voxel.

The warp part of the algorithm involves a few calculations, however compared to generating the image it only adds a very small amount of time to the render, so I didn't feel it was worth spending time on.

I implemented some SIMD code to aid the calculations involved in writing a transparent voxel to the output image, but only succeeded in slowing things down.

I looked into processing 4 voxels at once. This would involve doing calculations for the red, green and blue of each pixel at the same time. This offers no benefits over processing single voxels though, as there are no common calculations happening to all four.

On Monday, I will start the report. My plan has altered slightly after the tutor meeting to include design sections for the two algorithms and the program. Also I missed out the Evaluation section.

Thursday, 5 March 2009

Polishing

I spent today fixing various bugs and generally improving the program, to get it to the point where I am happy to call the coding finished. The only remaining thing is to try some SIMD on the Shear Warp algorithm over the next couple of days. I have been getting render times in the region of 0.05 seconds on some volumes, however transparency still causes slowdown so if I can speed up the general calculation with SIMD it will be helpful.

On Monday I will begin the report, and will ideally get around 4000 words written over the week, although it is hard to plan it out exactly.

I have written a list of the sections I intend to include in the report, to talk about with my tutor tomorrow.

Abstract
Introduction
Overview of Volume Rendering Techniques
Overview of Parallel Processing
Implementation of Ray Casting
- Basic
- Parallel Processing
- SIMD
Implementation of Shear Warp
- Basic
- Parallel Processing
- SIMD
Implementation of Rendering Program
Conclusion
References
Appendices
- Render time test results

Tuesday, 3 March 2009

RLE Fixed

My work so far tonight has seen some nice results.

Earlier I reported that enabled the RLE to work both forwards and backwards had slowed it down considerably. As it turns out I had simply broken it, which I have now fixed.



On the above image, a standard single-thread render took 1.32 seconds. With RLE enabled, it took 0.64 seconds. This is slower than previous times quoted for the skull, as this render contains a lot of non-empty voxels.

In order to improve this further I have added a setting where voxels that are surrounded by other solid voxels and therefore are not visible are considered empty, and not rendered at all. This lengthens the pre-processing of the volume, though not by an unacceptable amount. The rendering time for the above image was reduced further to 0.13 seconds.

I have not been able to test it on a multi threaded render yet.

I also added a linear interpolation sampling when warping and scaling the render to fill the screen. This helps to remove some artifacts and give a generally smoother (though blurry) appearance. Render time with this option enabled increased slightly to 0.15 seconds.

Saturday, 28 February 2009

Shear Warp Optimising

I tried out an optimisation I suggested in my last post, storing all the threads image buffers interlaced in memory rather than in seperate blocks. So the memory for two threads would look like this:

[thread 1 (0,0)] [thread 2 (0,0)] [thread 1 (1,0)] [thread 2 (1,0)] [thread 1 (2,0)] etc

In an test image where most voxels were empty, rendering time on 8 threads was reduced from 0.141 to 0.123. On a second test with most voxels being drawn, render time increased from 0.234 to 0.281. I believe this is due to the threads writing to the same cache line, as discussed here under "false sharing": http://www.ddj.com/embedded/212400410

On an image with more empty voxels false sharing has less effect, allowing the advantages of sharing cache lines for reading purposes to speed the rendering up.

I have also fixed multithreaded run length encoding, which on 8 threads reduced test skull render from 0.125 to 0.109. I will do more detailed test results in the future.

Thursday, 26 February 2009

Multithreaded Shear Warp - Partially Complete

Earlier this week I refactored the shear warp code to make it more manageable. Rendering for each primary axis was done with seperate code, now all three situations are handled with the same code. There is no measurable performance loss for this.

I then started implementing the multithreaded version of the shear warp algorithm. Each thread renders concurrent slices of the volume to its own image buffer. This image shows the voxels rendered by one thread (with 8 threads running):



These are then composited together at the end. This required altering the rendering slightly in order to also write transparency to the image buffer, otherwise compositing would not be possible.

In a couple of quick tests, the render time of normal non-run-length encoded shear warp was reduced:
0.484 seconds -> 0.125 seconds
0.672 seconds -> 0.152 seconds

The main optimisation I wish to attempt is along similar lines to the 'alternate pixels' method of multithreading the raycasting. Instead of allocating each thread's image buffer seperately, I will do it as one block with alternate pixels belonging to each thread. This should mean the threads are working in roughly the same area of memory at any one time. Also when compositing the seperate images back together, the pixels in the same position on each buffer would all be concurrent in memory, which is ideal.

The work could be split between threads differently, for example alternating slices, rows or voxels. However doing this would mean implementing some measures to ensure that voxels are written in the correct order, and potentially making threads wait for others to complete, which would involve something along the lines of a depth buffer. This is likely to ruin any possible performance gain so for now I will concentrate on other more important tasks.

I could also potentially use threading when compositing the seperate image buffers into one.

What also needs looking at is that run-length encoded rendering breaks when in multithreaded mode:



Currently the RLE volume data object handles which voxel is to be read next, so with multiple threads sampling from it, it breaks. This should be simple to do once I get round to it.

I'm happy with the direction this is going in and the possible things I can do to improve it, drawing from my experience with multithreading the ray casting. Tomorrow's meeting will be spent discussing these results. Over the next week I aim to make a lot of progress, implementing these optimisations and possibly trying out some SIMD optimisations, though I haven't put much thought into where that could be used yet.

Saturday, 21 February 2009

Run Length Encoding

I have finished implementing the run length encoding optimisation. For the skull, render times have been reduced from around 2 seconds to 1.3 seconds. The run length encoding gives the best improvement on images such as the skull where there is a lot of empty space that can be skipped. When most of the volume is only partially transparent there is less improvement.

Since I now have a decent interactive framerate even on my single core PC I will probably look at tweaking the image warp to get it correct before moving onto multithreading.

Thursday, 19 February 2009

Basic Shear Warp - First Pass

I have done a first pass on the warp part of the shear warp algorithm, to skew the image back to its proper perspective. It works for the most part, but is still off at times especially at extreme angles.



For now I am happy enough with it while I work on run-length encoding and multithreading. Once the algorithm is running much faster (potentially at mostly interactive frame rates) I'll be able to see what's going on better and fix the warp then.

There are also obvious quality issues that I'll look at then too (mainly involving using a bilinear filter during the warp rather than reading the nearest pixel).

The speed of the algorithm is promising, currently a render such as the one above takes 2.3 seconds compared to 17.3 seconds for a raycasting version (single threaded).

I have begun to lay down the groundwork for RLE. I was planning on having it finished for this week's tutor meeting, which is tomorrow, but time has run out. The new plan is to have it done by the weekend. My plan for the meeting is just the usual checkup on progress.

Thursday, 12 February 2009

More Shear Warp Progress

The algorithm now works for all three axes in both positive and negative directions.







The warp is not implemented yet, as I haven't completely got my head around how it should work yet. I tried some transformations out in a graphics program, and achieved correct-looking results using only 2D shears and scales, so it should be simple enough to work out how to apply that in code. These are the mockup versions:





I do not really have an agenda for the meeting tomorrow other than the usual discussion of results. Even in situations like this when I have no specific questions the meetings do always prove to be useful, and the weekly deadlines help to motivate me.

Wednesday, 11 February 2009

Shear Warp Progress

After a bit more work the first part of the algorithm is on screen. The volume is rendered to a temporary image buffer, where every voxel maps to a single pixel. This buffer is currently just copied to the screen, eventually this step will include the warping part of the algorithm.



----------------------

Without any major problems the shear is now implemented. I also put lighting back in, so the shear can be seen much more easily.

The lit skull, looking directly along the X axis so there is no shear.



Rotated to the left and right, the shear can be seen working.





This is my maths scribblings for calculating how big the shear should be. The result was that the distance of the shear (r) is

width * (depth / sin(angle)) / (width / cos(angle))

Width and depth here are terms for figuring out the calculation here. In the actual code they could be the width/depth/height of the volume, depending which axis is being looked down and which shear is being calculated (horizontal or vertical in screen-space).



The next step is to implement the warp to display the volume properly, and also to correctly handle viewing the volume down both the positive and negative X axis. Then the Y and Z axes need to be implemented, which once X is done should be simple.

Friday, 6 February 2009

Shear Warp Algorithm Started

My time this week has once again been biased towards the team project, which I will try harder to avoid next week.

Nevertheless I have made some progress with implementing the shear warp algorithm, the volume data is copied into three axis-aligned formats, and the rendering has been started. Nothing on screen as of yet though.

This week there is no meeting as I am away for the weekend. For next week's meeting I hope to have plenty to show of the new technique.

Thursday, 29 January 2009

SIMD rendering implemented

Over the last week I have implemented a couple of SIMD versions of the ray casting as suggested by my tutor. In one method SIMD is used to process the vector and colour calculations of a single ray/pixel (for example simultaneous x, y, z operations). In the second, it is used to process the calculations for four pixels at once (simulations x, x, x, x operations, etc).

I got them working without too much hassle. The quad pixel method contained an unforeseen problem, that the early ray termination optimisation can occur for the rays at different times. Rather than introduce overhead for checking which rays have finished, I carried all the rays on until all could be stopped. This revealed a bug with my colour calculations, where the opacity went above 1.0, but this was easily remedied.

I then did some thorough testing of the variable settings currently available in the renderer (previous tests were only quick ones, these are a lot more reliable). All tests are a render of a solid lit skull from diagonal viewpoint.

First: number of worker threads

-thread count-

1
9.32813
9.34375
9.32813

2
4.97656
4.96094
4.99219

3
3.29297
3.32031
3.33594

4
2.52734
2.52734
2.54297

5
1.99609
2.02734
2.04688

6
1.69922
1.70313
1.69922

7
1.46875
1.46875
1.45313

8
1.29297
1.27734
1.28125

9
1.58984
1.58984
1.57422

10
1.49609
1.48438
1.52734


As seen before 8 threads is the optimal number on an 8 core PC.

Next: SIMD usage in the renderer, and how the image pixels are divided between threads.

-simd and delegation type, with 2 threads-

concurrent pixels, no simd
4.99219
5.03906
5.11719

concurrent pixels, single simd
4.88281
5.08594
4.88281

concurrent pixels, quad simd
4.36719
4.33984
4.36719

alternate pixels, no simd
4.92969
4.78906
4.74219

alternate pixels, single simd
4.61719
4.64844
4.64844

alternate pixels (alternate groups of 4), quad simd
4.11719
4.14844
4.12109


Again as I have seen before, alternate pixels is generally faster than giving each thread a whole block to itself. Single SIMD gives a small decrease in time over normal rendering, whilst quad SIMD shows a more significant speed up. It is possible (and probably likely) that my SIMD code could be faster, however for now at least I consider it a success and will move on.

Next: filter type when sampling from the volume data.

-sample filter type, with 2 threads-

point sample
5.03906
5.00781
4.99219

trilinear interpolation
8.12891
8.16016
8.09766

simd trilinear interpolation
8.25391
8.22266
8.25


Trilinear is much slower but this is acceptable for the vast increase in visual quality. My use of SIMD calculations in the trilinear interpolation makes things slower, but this is not really a surprise as this type of calculation is not suited to SIMD.

Next a test of what should be the fastest way my program can render the image:

-8 threads, alternate pixels, quad simd, trilinear interpolation-

7.10156
7.06641
7.09766


Point sampling would be faster but the image quality would not be useful for a medical purpose. I'm quite happy with just over 7 seconds for a nice looking render of a complex volume.

Finally I played around with low quality rendering possibilities. By degrading image quality significantly with point sampling and a large ray step size, times in the range of 0.6-0.25 seconds can be achieved on the simpler volumes (such as the inner ear and the maths function). I consider these to be just about interactive framerates. The image quality is low, but still gives a good impression of the volume with its lighting and colouring present. By downsampling the volume to a smaller size (and eighth or lower, depending on the size of the original volume) it could be rendered very quickly as a "preview" of the render. I am not really intending to cover the area of low quality rendering, but it is interesting as a side note.

As with last week I do not have anything specific to ask in the meeting, simply a discussion of my recent progress.

Thursday, 22 January 2009

More SIMD

Firstly I did a quick test to check the speed increase of SIMD on when adding vectors of 4 floats. At first SIMD was actually taking longer, but that turned out to be a side effect of debug mode. Results were better in release mode:

to add 100,000,000 vectors:
0.797 seconds (normal)
0.328 seconds (SIMD)


SIMD was about 2.4 times faster.

Next I used SIMD to aid with the calculation of the trilinear interpolation of the volume data sample, as it involves quite a few operations to blend the 8 values together. In a test, a render of 44.3 seconds was reduced to 39.8 seconds. Not a massive reduction but certainly better than nothing.

This shows that SIMD speedups are certainly possible, and that my methods for using it are correct. After tomorrow's meeting I will see how well this works on the multi-core PCs in uni, and then make use of it in the actual rendering code, the aim being to have looked at it to a decent extent before next week.

In terms of the actual meeting I don't have a lot to discuss this week. I'm pretty happy with my understanding of SIMD and the results achieved so far so it's just a case of continuing as I am.

SIMD Initial Implementation

In the previous meeting my tutor pointed out that SIMD is in fact just a form of parallel processing and is perfectly relevant to the project. My main aim for this week is to look at SIMD, and hopefully get some of it running to achieve a speed increase. Then next week I will make what feels like a long overdue start on the second volume rendering technique.

SIMD comes in different forms on various architectures but the most common one seems to be SSE (Streaming SIMD Extensions), originally introduced by Intel and now supported on (most/all?) AMD chips.

The first tutorial I found was this one:
http://www.neilkemp.us/v3/tutorials/SSE_Tutorial_1.html

After spending a while trying to get the assembly to not crash I continued searching for other tutorials, and found this one:
http://www.codeproject.com/KB/recipes/sseintro.aspx

This one resulted in success. I added two vectors together in a small testbed program. Not having to use assembly (which I am in no way familiar with) is a nice bonus and helps me feel more positive about it. Tomorrow I will begin integrating some SIMD in the actual volume renderer.

To finish this post, a picture of the recently improved lighting calculations. It now uses two light sources from opposite directions (based on the same dot product), rather than one, and has higher contrast. The lighting is all precalculated before rendering so there is no extra cost to the rendering time.

Friday, 16 January 2009

Week's update

I'm disappointed in the amount of work achieved on the project this week. I mainly focused on the team project, and then have been busy sorting various things out. Next week I will probably limit the team project to just wednesday and try to do more on this project.

I tested non-linear (block) storage of volume data, which I hadn't done yet:

With 6 threads:
Block size............Time
Linear................3.837
2.......................4.587
4.......................5.132
8.......................4.164
16......................4.257
32......................4.914


Blocks are still slower than linear. I optimised the code for sampling from the volume data but only very slightly. I will look more closely at it soon, as in theory block storage should be faster, although I don't want to spend too much time on something that isn't directly related to my project specification.

I checked the Windows Task Manager as was suggested in last week's meeting to confirm that all 8 CPU cores were being maxed out in multithreaded rendering, and they indeed are.

I switched the parallel raycaster to call thread->join() on all of the worker threads, instead of manually polling whether all the threads have finished yet.

I upped the maximum number of threads that can be set in the program to 16, to try more threads, also suggested in last week's meeting. Time A is the threads rendering blocks, time B is the threads rendering alternate pixels.

Threads.....Time A....Time B
1...........7.85......N/A
2...........4.17......4.53
3...........3.05......3.04
4...........2.40......2.01
5...........1.91......1.64
6...........1.64......1.35
7...........1.39......1.16
8...........1.23......1.07
9...........1.39......1.36
10..........1.36......1.26
11..........1.21......1.15
12..........1.15......1.09
13..........1.20......1.16
14..........1.21......1.09
15..........1.28......1.13
16..........1.19......1.12

The times for running more threads than there are CPU cores are not as slow as I expected. Here there are only 2 threads per core so I guess the overhead of splitting time between them is small. I did a quick test of 160 threads and as expected it was very slow.

The times seem generally fastest around the 8 thread mark, compared to 6 threads in a previous test. This could be down to the improved method of waiting for threads to finish. It is hard to draw any real conclusions from this test however as it is only a single test and fluctuations have not been averaged out.

I had a look into SIMD processing. I'm not sure how much performance there is to gain from it, as I imagine the vast majority of the time in this program is down to reading the volume data rather than the processing involved in marching along the rays. I will discuss the idea further in the meeting tomorrow, as mentioned earlier I don't want to spend a lot of time on something not related to my specification when time is starting to look a bit on the short side.

Finally I also had a look into memory alignment to do with the cache, also suggested in the previous meeting, but I'm not sure where it could be applied to best effect and also want to discuss this further in tomorrow's meeting.

Thursday, 8 January 2009

Multithreading Test

Today I went into the university labs to test the multithreaded version of the renderer on an 8-core machine.

Short parallel ray casting test: solid skull with lighting. Volume samples are interpolated. Blocks/Alternate Pixels refer to how the image is divided up between threads. In Blocks mode each thread gets a share of consecutive pixels, in Alternate Pixels each consecutive pixel is handled by a different thread.

Threads........Blocks...........Alternate Pixels
1................15.60................-
2................8.51................7.97
3................5.91................5.35
4................6.55................4.04
5................5.24................3.27
6................3.17................2.72
7................3.49................3.40
8................3.71................3.07


This was a quick test with only one render per result so anomalies are possible.

In each case 6 threads was the optimal amount. This makes sense: the thread count is the number of worker threads so additionally there is the master thread waiting for all the worker threads to finish. On top of this there will almost certainly be a small amount of processing going on in other applications or the OS, which could account for the 8th thread.

These results show that a program can actually run on all 8 cores at once, seemingly sharing a single cache, at least on the specific setup in these computers. This is something I was unsure about, and will be worth discussing in the meeting tomorrow.

An optimisation that needs looking at is to partly remove the idea of a master thread, and give that thread a share of the work to do also. Once it has finished its share it can then start checking if the other threads have finished.

Looking at the best result, 2.72 seconds compared to the single thread time of 15.60 seconds, I have currently achieved a speedup of 5.73 times. This is using 6 worker threads so the predicted near-linear speedup of the raycasting method is looking to be correct.

Comparing results between Blocks and Alternate Pixels, Alternate Pixels is always faster. This shows that the cache is being shared between all the cores rather than individual, and since they should all be working in the same area of the image as they progress through it, cache hits are low. Individual caches would benefit the Blocks method as each thread is working on its own consecutive pixels.

So the plan for the meeting tomorrow is to generally discuss these results, and to come up with a plan for what to do next.

(A note on multithreading that I meant to add to an earlier post: in terms of coding, the multithreading turned out to be much easier than I was expecting. Once I twigged that a mutex isn't actually related to the memory you're using it to lock, the coding was quick and simple.)