Thursday 29 January 2009

SIMD rendering implemented

Over the last week I have implemented two SIMD versions of the ray casting, as suggested by my tutor. In the first, SIMD is used to process the vector and colour calculations of a single ray/pixel (for example simultaneous x, y, z operations). In the second, it is used to process the calculations for four pixels at once (simultaneous x, x, x, x operations, and so on).

I got them working without too much hassle. The quad pixel method hit an unforeseen problem: the early ray termination optimisation can trigger for the four rays at different times. Rather than introduce overhead for checking which rays have finished, I carry all the rays on until all of them can be stopped. This revealed a bug in my colour calculations, where the accumulated opacity could go above 1.0, but this was easily remedied. A rough sketch of the idea is below.
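
For reference, a minimal sketch of the quad pixel loop as I've described it. This is illustrative code, not my renderer's actual classes; dummySample() stands in for the real volume lookup.

#include <xmmintrin.h>
#include <cstdio>

// Pretend volume lookup: every step returns 10% opacity for each ray.
static __m128 dummySample()
{
    return _mm_set1_ps(0.1f);
}

int main()
{
    const __m128 one       = _mm_set1_ps(1.0f);
    const __m128 threshold = _mm_set1_ps(0.99f); // early termination cutoff
    __m128 accum = _mm_setzero_ps();             // opacity of the 4 rays

    for (int step = 0; step < 512; ++step)
    {
        __m128 sample = dummySample();

        // Front-to-back compositing: accum += (1 - accum) * sample.
        accum = _mm_add_ps(accum,
                _mm_mul_ps(_mm_sub_ps(one, accum), sample));

        // Clamp so the accumulated opacity can never drift above 1.0
        // (the bug mentioned above).
        accum = _mm_min_ps(accum, one);

        // Stop only when ALL four rays have saturated: _mm_movemask_ps
        // packs one comparison bit per ray into the low 4 bits.
        if (_mm_movemask_ps(_mm_cmpgt_ps(accum, threshold)) == 0xF)
        {
            printf("all four rays terminated at step %d\n", step);
            break;
        }
    }
    return 0;
}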

I then did some thorough testing of the variable settings currently available in the renderer (previous tests were only quick ones; these are a lot more reliable). All tests render a solid lit skull from a diagonal viewpoint.

First: number of worker threads

-thread count (times in seconds, 3 runs each)-

Threads.....Run 1......Run 2......Run 3
1...........9.32813....9.34375....9.32813
2...........4.97656....4.96094....4.99219
3...........3.29297....3.32031....3.33594
4...........2.52734....2.52734....2.54297
5...........1.99609....2.02734....2.04688
6...........1.69922....1.70313....1.69922
7...........1.46875....1.46875....1.45313
8...........1.29297....1.27734....1.28125
9...........1.58984....1.58984....1.57422
10..........1.49609....1.48438....1.52734


As seen before, 8 threads is the optimal number on an 8-core PC.

Next: SIMD usage in the renderer, and how the image pixels are divided between threads.

-simd and delegation type, with 2 threads (times in seconds, 3 runs each)-

Configuration........................................Run 1......Run 2......Run 3
concurrent pixels, no simd...........................4.99219....5.03906....5.11719
concurrent pixels, single simd.......................4.88281....5.08594....4.88281
concurrent pixels, quad simd.........................4.36719....4.33984....4.36719
alternate pixels, no simd............................4.92969....4.78906....4.74219
alternate pixels, single simd........................4.61719....4.64844....4.64844
alternate pixels (alternate groups of 4), quad simd..4.11719....4.14844....4.12109


Again, as I have seen before, alternate pixels is generally faster than giving each thread a whole block to itself. Single SIMD gives a small decrease in time over normal rendering, whilst quad SIMD shows a more significant speedup. It is possible (and indeed likely) that my SIMD code could be made faster still, but for now at least I consider it a success and will move on.

Next: filter type when sampling from the volume data.

-sample filter type, with 2 threads (times in seconds, 3 runs each)-

Filter..........................Run 1......Run 2......Run 3
point sample....................5.03906....5.00781....4.99219
trilinear interpolation.........8.12891....8.16016....8.09766
simd trilinear interpolation....8.25391....8.22266....8.25


Trilinear is much slower, but this is acceptable for the vast increase in visual quality. My use of SIMD calculations in the trilinear interpolation makes things slower, but this is not really a surprise, as this type of calculation is not suited to SIMD: most of the cost is in gathering the eight samples from scattered memory locations and packing them into registers, which SIMD does nothing to help.

Next a test of what should be the fastest way my program can render the image:

-8 threads, alternate pixels, quad simd, trilinear interpolation (times in seconds)-

7.10156
7.06641
7.09766


Point sampling would be faster, but the image quality would not be useful for a medical purpose. I'm quite happy with just over 7 seconds for a nice-looking render of a complex volume.

Finally I played around with low quality rendering possibilities. By degrading image quality significantly with point sampling and a large ray step size, times in the range of 0.25-0.6 seconds can be achieved on the simpler volumes (such as the inner ear and the maths function). I consider these to be just about interactive framerates. The image quality is low, but it still gives a good impression of the volume, with its lighting and colouring present. By downsampling the volume to a smaller size (an eighth or lower, depending on the size of the original volume) it could be rendered very quickly as a "preview" of the full render. I am not really intending to cover the area of low quality rendering, but it is interesting as a side note.
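
A minimal sketch of that downsampling idea, assuming a simple 8-bit voxel array (the layout and names here are made up for illustration): each pass averages 2x2x2 blocks, cutting the data to an eighth.

#include <vector>

// Halve a w*h*d volume along each axis by averaging 2x2x2 voxel
// blocks; one pass leaves an eighth of the original data.
// Assumes w, h and d are even.
std::vector<unsigned char> downsample(const std::vector<unsigned char>& src,
                                      int w, int h, int d)
{
    std::vector<unsigned char> dst((w / 2) * (h / 2) * (d / 2));
    for (int z = 0; z < d / 2; ++z)
        for (int y = 0; y < h / 2; ++y)
            for (int x = 0; x < w / 2; ++x)
            {
                int sum = 0;
                for (int k = 0; k < 2; ++k)      // gather the 2x2x2 block
                    for (int j = 0; j < 2; ++j)
                        for (int i = 0; i < 2; ++i)
                            sum += src[(2 * z + k) * w * h
                                     + (2 * y + j) * w
                                     + (2 * x + i)];
                dst[z * (w / 2) * (h / 2) + y * (w / 2) + x] =
                    (unsigned char)(sum / 8);
            }
    return dst;
}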

As with last week I do not have anything specific to ask in the meeting, simply a discussion of my recent progress.

Thursday 22 January 2009

More SIMD

Firstly I did a quick test to check the speed increase of SIMD when adding vectors of 4 floats. At first SIMD was actually taking longer, but that turned out to be a side effect of debug mode. Results were better in release mode:

to add 100,000,000 vectors:
0.797 seconds (normal)
0.328 seconds (SIMD)


SIMD was about 2.4 times faster.
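
The test was along these lines; a sketch only (my testbed differs in details), using clock() for timing and printing the results afterwards so the compiler can't optimise the loops away:

#include <xmmintrin.h>
#include <ctime>
#include <cstdio>

int main()
{
    const int N = 100000000; // 100,000,000 additions of 4-float vectors

    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 0.001f, 0.002f, 0.003f, 0.004f };

    // Normal: four scalar adds per iteration.
    clock_t t0 = clock();
    for (int n = 0; n < N; ++n)
        for (int i = 0; i < 4; ++i)
            a[i] += b[i];
    clock_t t1 = clock();

    // SIMD: one packed add per iteration covers all four floats.
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    for (int n = 0; n < N; ++n)
        va = _mm_add_ps(va, vb);
    clock_t t2 = clock();

    float out[4];
    _mm_storeu_ps(out, va);
    printf("normal %.3f s, simd %.3f s (sink: %g %g)\n",
           (t1 - t0) / (double)CLOCKS_PER_SEC,
           (t2 - t1) / (double)CLOCKS_PER_SEC, a[0], out[0]);
    return 0;
}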

Next I used SIMD to aid with the calculation of the trilinear interpolation of the volume data sample, as it involves quite a few operations to blend the 8 values together. In a test, a render of 44.3 seconds was reduced to 39.8 seconds. Not a massive reduction but certainly better than nothing.
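
For reference, this is the blend in question, sketched in scalar form (names are illustrative): seven linear interpolations over the eight corner samples. The four x-direction lerps are independent of each other, so they are the natural candidates for packing into a single SSE operation.

static float lerp(float a, float b, float t)
{
    return a + t * (b - a);
}

// Trilinear interpolation of the 8 corner samples c[0..7], indexed in
// xyz bit order (c[0] = corner 000, c[7] = corner 111). tx, ty, tz are
// the fractional position within the cell.
float trilinear(const float c[8], float tx, float ty, float tz)
{
    float x00 = lerp(c[0], c[1], tx); // these four lerps are independent,
    float x10 = lerp(c[2], c[3], tx); // so one SSE add/mul pair can do
    float x01 = lerp(c[4], c[5], tx); // all of them at once
    float x11 = lerp(c[6], c[7], tx);
    float y0  = lerp(x00, x10, ty);
    float y1  = lerp(x01, x11, ty);
    return lerp(y0, y1, tz);
}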

This shows that SIMD speedups are certainly possible, and that my methods for using it are correct. After tomorrow's meeting I will see how well this works on the multi-core PCs in uni, and then make use of it in the actual rendering code, the aim being to have looked at it to a decent extent before next week.

In terms of the actual meeting I don't have a lot to discuss this week. I'm pretty happy with my understanding of SIMD and the results achieved so far so it's just a case of continuing as I am.

SIMD Initial Implementation

In the previous meeting my tutor pointed out that SIMD is in fact just a form of parallel processing and is perfectly relevant to the project. My main aim for this week is to look at SIMD, and hopefully get some of it running to achieve a speed increase. Then next week I will make what feels like a long overdue start on the second volume rendering technique.

SIMD comes in different forms on various architectures but the most common one seems to be SSE (Streaming SIMD Extensions), originally introduced by Intel and now supported on (most/all?) AMD chips.

The first tutorial I found was this one:
http://www.neilkemp.us/v3/tutorials/SSE_Tutorial_1.html

After spending a while trying to get the assembly to not crash I continued searching for other tutorials, and found this one:
http://www.codeproject.com/KB/recipes/sseintro.aspx

This one resulted in success. I added two vectors together in a small testbed program. Not having to use assembly (which I am in no way familiar with) is a nice bonus and helps me feel more positive about it. Tomorrow I will begin integrating some SIMD in the actual volume renderer.
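
The testbed boils down to something like this, a minimal sketch of the intrinsics approach from the tutorial:

#include <xmmintrin.h>
#include <cstdio>

int main()
{
    // Note: _mm_set_ps takes its arguments in reverse (w, z, y, x) order.
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
    __m128 c = _mm_add_ps(a, b); // four float additions in one instruction

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // 6 8 10 12
    return 0;
}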

To finish this post, a picture of the recently improved lighting calculations. It now uses two light sources from opposite directions (based on the same dot product) rather than one, and has higher contrast. The lighting is all precalculated before rendering, so there is no extra cost to the rendering time.
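
A sketch of the two-light idea as described; Vec3 and the weighting constants are illustrative, not my actual values.

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Precalculated per-voxel intensity from two opposed lights that
// share one dot product: the main light uses d, the fill light -d.
float lightVoxel(const Vec3& normal, const Vec3& lightDir)
{
    float d     = dot(normal, lightDir);
    float front = d > 0.0f ? d : 0.0f;          // light from lightDir
    float back  = d < 0.0f ? -0.5f * d : 0.0f;  // weaker light from -lightDir
    float i     = 0.2f + 1.1f * (front + back); // slight contrast boost
    return i > 1.0f ? 1.0f : i;
}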

Friday 16 January 2009

Week's update

I'm disappointed with the amount of work achieved on the project this week. I mainly focused on the team project, and have since been busy sorting various things out. Next week I will probably limit the team project to just Wednesday and try to do more on this one.

I tested non-linear (block) storage of volume data, which I hadn't done yet:

With 6 threads:
Block size....Time (s)
Linear........3.837
2.............4.587
4.............5.132
8.............4.164
16............4.257
32............4.914


Blocks are still slower than linear. I optimised the code for sampling from the volume data, but only very slightly. I will look more closely at it soon, as in theory block storage should be faster, although I don't want to spend too much time on something that isn't directly related to my project specification. For reference, the indexing difference between the two schemes is sketched below.
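
This sketch assumes a block edge length B and volume dimensions that divide evenly by it. The extra divides and modulos per sample are one suspect for the slowdown, even though a ray inside a block should touch nearby memory.

// Linear storage: one big x-major array.
int linearIndex(int x, int y, int z, int w, int h)
{
    return z * w * h + y * w + x;
}

// Block storage: the volume is split into BxBxB blocks, each stored
// contiguously, to improve cache locality for rays inside a block.
int blockIndex(int x, int y, int z, int w, int h, int B)
{
    int bx = x / B, by = y / B, bz = z / B; // which block
    int ix = x % B, iy = y % B, iz = z % B; // position inside it
    int blocksPerRow   = w / B;
    int blocksPerSlice = blocksPerRow * (h / B);
    int block = bz * blocksPerSlice + by * blocksPerRow + bx;
    return block * B * B * B + iz * B * B + iy * B + ix;
}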

I checked the Windows Task Manager, as suggested in last week's meeting, to confirm that all 8 CPU cores were being maxed out during multithreaded rendering, and they indeed are.

I switched the parallel raycaster to call thread->join() on all of the worker threads, instead of manually polling whether all the threads have finished yet.
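
Roughly the shape of that change, sketched with boost::thread (my renderer's actual thread classes may differ):

#include <boost/thread.hpp>
#include <vector>

void renderShare(int threadIndex)
{
    // ... ray cast this thread's share of the pixels (omitted) ...
    (void)threadIndex;
}

void renderAll(int threadCount)
{
    std::vector<boost::thread*> workers;
    for (int i = 0; i < threadCount; ++i)
        workers.push_back(new boost::thread(renderShare, i));

    // join() blocks until each worker finishes, replacing the old
    // loop that polled "finished" flags.
    for (size_t i = 0; i < workers.size(); ++i)
    {
        workers[i]->join();
        delete workers[i];
    }
}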

I upped the maximum number of threads that can be set in the program to 16, to try running more threads than cores, as also suggested in last week's meeting. Time A is the threads rendering blocks; time B is the threads rendering alternate pixels.

Threads.....Time A....Time B
1...........7.85......N/A
2...........4.17......4.53
3...........3.05......3.04
4...........2.40......2.01
5...........1.91......1.64
6...........1.64......1.35
7...........1.39......1.16
8...........1.23......1.07
9...........1.39......1.36
10..........1.36......1.26
11..........1.21......1.15
12..........1.15......1.09
13..........1.20......1.16
14..........1.21......1.09
15..........1.28......1.13
16..........1.19......1.12

The times for running more threads than there are CPU cores are not as slow as I expected. Here there are only 2 threads per core, so I guess the overhead of splitting time between them is small. I did a quick test of 160 threads and, as expected, it was very slow.

The times seem generally fastest around the 8 thread mark, compared to 6 threads in a previous test. This could be down to the improved method of waiting for threads to finish. It is hard to draw any real conclusions from this test, however, as it is only a single run per setting and fluctuations have not been averaged out.

I had a look into SIMD processing. I'm not sure how much performance there is to gain from it, as I imagine the vast majority of the time in this program goes on reading the volume data rather than on the arithmetic involved in marching along the rays. I will discuss the idea further in the meeting tomorrow; as mentioned earlier, I don't want to spend a lot of time on something not related to my specification when time is starting to look a bit short.

Finally I also had a look into memory alignment with respect to the cache, as also suggested in the previous meeting, but I'm not sure where it could be applied to best effect, and I want to discuss this further in tomorrow's meeting as well.

Thursday 8 January 2009

Multithreading Test

Today I went into the university labs to test the multithreaded version of the renderer on an 8-core machine.

Short parallel ray casting test: a solid skull with lighting, volume samples interpolated. Blocks/Alternate Pixels refers to how the image is divided up between the threads: in Blocks mode each thread gets a share of consecutive pixels, while in Alternate Pixels each consecutive pixel is handled by a different thread (both schemes are sketched below the results).

Threads........Blocks...........Alternate Pixels
1................15.60................-
2................8.51................7.97
3................5.91................5.35
4................6.55................4.04
5................5.24................3.27
6................3.17................2.72
7................3.49................3.40
8................3.71................3.07


This was a quick test with only one render per result so anomalies are possible.
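
For reference, the two division schemes in sketch form; renderPixel() is a stand-in for casting the ray for one pixel.

void renderPixel(int pixelIndex)
{
    // ... cast the ray for this pixel (omitted) ...
    (void)pixelIndex;
}

// Blocks: thread t renders one consecutive run of pixels.
void renderBlocks(int t, int threadCount, int totalPixels)
{
    int share = totalPixels / threadCount;
    int first = t * share;
    int last  = (t == threadCount - 1) ? totalPixels : first + share;
    for (int p = first; p < last; ++p)
        renderPixel(p);
}

// Alternate pixels: thread t renders every threadCount-th pixel.
void renderAlternate(int t, int threadCount, int totalPixels)
{
    for (int p = t; p < totalPixels; p += threadCount)
        renderPixel(p);
}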

In each case 6 threads was the optimal amount. This makes sense: the thread count is the number of worker threads, so there is additionally the master thread waiting for all the workers to finish. On top of this there will almost certainly be a small amount of processing going on in other applications or the OS, which could account for the 8th core.

These results show that a program can actually run on all 8 cores at once, seemingly sharing a single cache, at least on the specific setup in these computers. This is something I was unsure about, and will be worth discussing in the meeting tomorrow.

An optimisation that needs looking at is to partly remove the idea of a master thread, and give that thread a share of the work to do also. Once it has finished its share it can then start checking if the other threads have finished.

Looking at the best result, 2.72 seconds compared to the single-thread time of 15.60 seconds, I have currently achieved a speedup of 5.73 times. This is using 6 worker threads, so the predicted near-linear speedup of the ray casting method is looking to be correct.

Comparing results between Blocks and Alternate Pixels, Alternate Pixels is always faster. This suggests that the cache is shared between all the cores rather than being per-core: with alternate pixels all the threads work in the same area of the image at any one time, so they pull in much the same volume data and cache hits are high. Individual caches would instead benefit the Blocks method, as each thread works on its own run of consecutive pixels.

So the plan for the meeting tomorrow is to generally discuss these results, and to come up with a plan for what to do next.

(A note on multithreading that I meant to add to an earlier post: in terms of coding, the multithreading turned out to be much easier than I was expecting. Once I twigged that a mutex isn't actually related to the memory you're using it to lock, the coding was quick and simple.)
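
To illustrate that point with a minimal sketch (boost::mutex here; the exact library doesn't matter): the mutex and the data are separate objects, and the lock only protects the data because every access goes through it by convention.

#include <boost/thread/mutex.hpp>

int finishedCount = 0;      // shared between threads
boost::mutex countMutex;    // not "attached" to finishedCount in any way

void markFinished()
{
    boost::mutex::scoped_lock lock(countMutex); // taken around every access
    ++finishedCount;
}   // lock released automatically when it goes out of scope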