Thursday, 29 January 2009

SIMD rendering implemented

Over the last week I have implemented a couple of SIMD versions of the ray casting, as suggested by my tutor. In the first method, SIMD is used to process the vector and colour calculations of a single ray/pixel (for example, simultaneous x, y, z operations). In the second, it is used to process the calculations for four pixels at once (simultaneous x, x, x, x operations, etc.).
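To make the difference between the two layouts concrete, here is a minimal sketch in portable C++. It uses a plain four-float array as a stand-in for a SIMD register, so it is not my actual renderer code; real SIMD code would use intrinsics (for example SSE's `_mm_add_ps`). The names `Lanes`, `QuadVec`, `advanceSingle` and `advanceQuad` are made up for illustration.

```cpp
#include <array>

// Stand-in for one 4-wide SIMD register (hypothetical name).
using Lanes = std::array<float, 4>;

// Lane-wise multiply-add: r[i] = a[i] * b[i] + c[i].
// With real SIMD this loop would collapse to one or two instructions.
inline Lanes mulAdd(const Lanes& a, const Lanes& b, const Lanes& c) {
    Lanes r{};
    for (int i = 0; i < 4; ++i) r[i] = a[i] * b[i] + c[i];
    return r;
}

// Method 1: one ray per register. The lanes hold x, y, z of a single ray
// and the fourth lane is unused.
inline Lanes advanceSingle(const Lanes& pos, const Lanes& dir, float step) {
    const Lanes s{step, step, step, 0.0f};
    return mulAdd(dir, s, pos);  // pos + dir * step
}

// Method 2: four rays per register. Each component is stored separately,
// so xs holds the x coordinate of four different rays, and so on.
struct QuadVec {
    Lanes xs, ys, zs;
};

inline QuadVec advanceQuad(const QuadVec& pos, const QuadVec& dir, float step) {
    const Lanes s{step, step, step, step};
    return QuadVec{mulAdd(dir.xs, s, pos.xs),
                   mulAdd(dir.ys, s, pos.ys),
                   mulAdd(dir.zs, s, pos.zs)};
}
```

In the single-ray layout one lane in four does no useful work, whereas the quad layout keeps all four lanes busy.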

I got them working without too much hassle. The quad pixel method had one unforeseen problem: the early ray termination optimisation can trigger at different times for each of the four rays. Rather than introduce overhead for checking which rays have finished, I carried all four rays on until every one of them could be stopped. This revealed a bug in my colour calculations, where the opacity went above 1.0, but this was easily remedied.
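The lock-step termination can be sketched as below. This is an illustration under assumptions rather than my renderer's code: it uses standard front-to-back compositing, a hypothetical `cutoff` of 0.95, and a plain array in place of a SIMD register. The clamp on the accumulated opacity stands in for the kind of fix mentioned above.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using Lanes = std::array<float, 4>;  // stand-in for a 4-wide SIMD register

// One sample for each of the four rays: colour intensity and opacity in [0, 1].
struct QuadSample {
    Lanes colour;
    Lanes alpha;
};

struct QuadResult {
    Lanes colour;
    Lanes alpha;
    std::size_t stepsTaken;
};

// Front-to-back compositing of four rays in lock step.
// The loop stops only once every ray has reached the opacity cutoff.
inline QuadResult compositeQuad(const std::vector<QuadSample>& samples,
                                float cutoff = 0.95f) {
    QuadResult out{};  // colours, opacities and step count all start at zero
    for (const QuadSample& s : samples) {
        bool allDone = true;
        for (int i = 0; i < 4; ++i) allDone = allDone && (out.alpha[i] >= cutoff);
        if (allDone) break;

        for (int i = 0; i < 4; ++i) {
            // A ray that is already opaque gets a weight near zero,
            // so carrying it on does not change its colour.
            const float weight = (1.0f - out.alpha[i]) * s.alpha[i];
            out.colour[i] += weight * s.colour[i];
            // Clamp so the accumulated opacity can never exceed 1.0.
            out.alpha[i] = std::min(out.alpha[i] + weight, 1.0f);
        }
        ++out.stepsTaken;
    }
    return out;
}
```

The cost of this approach is that a group of four rays always runs for as long as its slowest member.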

I then did some thorough testing of the variable settings currently available in the renderer (previous tests were only quick ones; these are a lot more reliable). All tests are a render of a solid, lit skull from a diagonal viewpoint.

First: number of worker threads

-thread count- (render times in seconds, three runs per setting)

threads   run 1     run 2     run 3
1         9.32813   9.34375   9.32813
2         4.97656   4.96094   4.99219
3         3.29297   3.32031   3.33594
4         2.52734   2.52734   2.54297
5         1.99609   2.02734   2.04688
6         1.69922   1.70313   1.69922
7         1.46875   1.46875   1.45313
8         1.29297   1.27734   1.28125
9         1.58984   1.58984   1.57422
10        1.49609   1.48438   1.52734


As seen before, 8 threads is the optimal number on an 8 core PC; with 9 or 10 threads the render times rise again.
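To put a number on the scaling: averaging the three runs, 8 threads comes out roughly 7.3 times faster than 1 thread, which is about 91% parallel efficiency. A small sketch of that arithmetic follows; the function names are mine, chosen for illustration.

```cpp
#include <array>
#include <numeric>

// Mean of three timed runs, in seconds.
inline double meanOf(const std::array<double, 3>& runs) {
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// How many times faster a run is than the single-thread baseline.
inline double speedup(double baselineSeconds, double seconds) {
    return baselineSeconds / seconds;
}

// Fraction of ideal linear scaling achieved with the given thread count.
inline double efficiency(double speedupValue, int threads) {
    return speedupValue / threads;
}
```

For example, `speedup(meanOf({9.32813, 9.34375, 9.32813}), meanOf({1.29297, 1.27734, 1.28125}))` gives the 8 thread figure from the measurements above.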

Next: SIMD usage in the renderer, and how the image pixels are divided between threads.

-simd and delegation type, with 2 threads- (render times in seconds)

setting                                               run 1     run 2     run 3
concurrent pixels, no simd                            4.99219   5.03906   5.11719
concurrent pixels, single simd                        4.88281   5.08594   4.88281
concurrent pixels, quad simd                          4.36719   4.33984   4.36719
alternate pixels, no simd                             4.92969   4.78906   4.74219
alternate pixels, single simd                         4.61719   4.64844   4.64844
alternate pixels (alternate groups of 4), quad simd   4.11719   4.14844   4.12109


Again, as I have seen before, alternate pixels is generally faster than giving each thread a whole block to itself. Single SIMD gives a small decrease in time over normal rendering, whilst quad SIMD shows a more significant speed up. It is possible (indeed likely) that my SIMD code could be faster; however, for now at least I consider it a success and will move on.
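For reference, the two delegation schemes can be written as a mapping from pixel index to thread index. This is a hedged sketch with made-up function names, not the renderer's real code. A plausible reason for the difference, though I have not measured it directly, is that interleaving spreads the expensive and the empty parts of the image more evenly across threads.

```cpp
#include <cassert>
#include <cstddef>

// Concurrent (block) delegation: each thread gets one contiguous run of pixels.
inline std::size_t ownerBlock(std::size_t pixel, std::size_t pixelCount,
                              std::size_t threads) {
    assert(threads > 0 && pixel < pixelCount);
    const std::size_t perThread = (pixelCount + threads - 1) / threads;  // rounded up
    return pixel / perThread;
}

// Alternate delegation: groups of groupSize pixels are dealt out round-robin.
// groupSize is 1 for single pixels and 4 for the quad SIMD case.
inline std::size_t ownerAlternate(std::size_t pixel, std::size_t threads,
                                  std::size_t groupSize = 1) {
    assert(threads > 0 && groupSize > 0);
    return (pixel / groupSize) % threads;
}
```

With 2 threads and `groupSize` 4, pixels 0-3 go to thread 0, pixels 4-7 to thread 1, pixels 8-11 back to thread 0, and so on.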

Next: filter type when sampling from the volume data.

-sample filter type, with 2 threads- (render times in seconds)

filter                         run 1     run 2     run 3
point sample                   5.03906   5.00781   4.99219
trilinear interpolation        8.12891   8.16016   8.09766
simd trilinear interpolation   8.25391   8.22266   8.25


Trilinear is much slower, but this is acceptable for the vast increase in visual quality. My use of SIMD calculations in the trilinear interpolation makes things slightly slower, but this is not really a surprise as this type of calculation is not well suited to SIMD.
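To show where the extra cost comes from, here is a minimal scalar sketch of the two filters. The `Volume` struct and its memory layout are assumptions made for the example, not the renderer's actual data structure. Point sampling needs one voxel read, while trilinear interpolation needs eight reads and seven blends per sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical volume container: voxels stored with x varying fastest.
struct Volume {
    std::size_t nx, ny, nz;
    std::vector<float> voxels;
    float at(std::size_t x, std::size_t y, std::size_t z) const {
        return voxels[(z * ny + y) * nx + x];
    }
};

// Linear blend between a and b, with t in [0, 1].
inline float mix(float a, float b, float t) { return a + (b - a) * t; }

// Point sample: one read from the nearest voxel.
inline float samplePoint(const Volume& v, float x, float y, float z) {
    return v.at(static_cast<std::size_t>(x + 0.5f),
                static_cast<std::size_t>(y + 0.5f),
                static_cast<std::size_t>(z + 0.5f));
}

// Trilinear sample: eight reads and seven blends.
// Coordinates are in voxel units and must lie inside the volume.
inline float sampleTrilinear(const Volume& v, float x, float y, float z) {
    assert(x >= 0.0f && x <= static_cast<float>(v.nx - 1));
    assert(y >= 0.0f && y <= static_cast<float>(v.ny - 1));
    assert(z >= 0.0f && z <= static_cast<float>(v.nz - 1));
    const std::size_t x0 = static_cast<std::size_t>(x);
    const std::size_t y0 = static_cast<std::size_t>(y);
    const std::size_t z0 = static_cast<std::size_t>(z);
    // Clamp the upper neighbour so samples on the far faces stay in bounds.
    const std::size_t x1 = (x0 + 1 < v.nx) ? x0 + 1 : x0;
    const std::size_t y1 = (y0 + 1 < v.ny) ? y0 + 1 : y0;
    const std::size_t z1 = (z0 + 1 < v.nz) ? z0 + 1 : z0;
    const float fx = x - static_cast<float>(x0);
    const float fy = y - static_cast<float>(y0);
    const float fz = z - static_cast<float>(z0);
    // Blend along x on the four edges, then along y, then along z.
    const float c00 = mix(v.at(x0, y0, z0), v.at(x1, y0, z0), fx);
    const float c10 = mix(v.at(x0, y1, z0), v.at(x1, y1, z0), fx);
    const float c01 = mix(v.at(x0, y0, z1), v.at(x1, y0, z1), fx);
    const float c11 = mix(v.at(x0, y1, z1), v.at(x1, y1, z1), fx);
    return mix(mix(c00, c10, fy), mix(c01, c11, fy), fz);
}
```

The eight reads come from scattered memory locations and the blends depend on each other in sequence, which may be part of why a SIMD version of a single sample gains so little.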

Next, a test of what should be the fastest way my program can render the image at full quality:

-8 threads, alternate pixels, quad simd, trilinear interpolation-

7.10156
7.06641
7.09766


Point sampling would be faster but the image quality would not be useful for a medical purpose. I'm quite happy with just over 7 seconds for a nice looking render of a complex volume.

Finally, I played around with low quality rendering possibilities. By degrading image quality significantly with point sampling and a large ray step size, times in the range of 0.25-0.6 seconds can be achieved on the simpler volumes (such as the inner ear and the maths function). I consider these to be just about interactive framerates. The image quality is low, but still gives a good impression of the volume with its lighting and colouring present. By downsampling the volume to a smaller size (an eighth or lower, depending on the size of the original volume) it could be rendered very quickly as a "preview" of the render. I am not really intending to cover the area of low quality rendering, but it is interesting as a side note.
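As a rough idea of how the downsampling could work, halving each dimension gives a volume one eighth of the original size. The sketch below averages each 2x2x2 block of voxels into one; the averaging filter, the `Volume` struct and the function name are all assumptions for illustration, since I have not implemented this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical volume container: voxels stored with x varying fastest.
struct Volume {
    std::size_t nx, ny, nz;
    std::vector<float> voxels;
};

// Halve each dimension by averaging 2x2x2 blocks, giving 1/8 of the voxels.
// For simplicity this sketch requires even dimensions.
inline Volume downsampleHalf(const Volume& in) {
    assert(in.nx % 2 == 0 && in.ny % 2 == 0 && in.nz % 2 == 0);
    Volume out{in.nx / 2, in.ny / 2, in.nz / 2, {}};
    out.voxels.resize(out.nx * out.ny * out.nz);
    for (std::size_t z = 0; z < out.nz; ++z) {
        for (std::size_t y = 0; y < out.ny; ++y) {
            for (std::size_t x = 0; x < out.nx; ++x) {
                float sum = 0.0f;
                // Sum the eight source voxels covered by this output voxel.
                for (std::size_t dz = 0; dz < 2; ++dz)
                    for (std::size_t dy = 0; dy < 2; ++dy)
                        for (std::size_t dx = 0; dx < 2; ++dx)
                            sum += in.voxels[((2 * z + dz) * in.ny + (2 * y + dy)) * in.nx
                                             + (2 * x + dx)];
                out.voxels[(z * out.ny + y) * out.nx + x] = sum / 8.0f;
            }
        }
    }
    return out;
}
```

Applying the function twice would give a volume one sixty-fourth of the original size, for the "or lower" case.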

As with last week, I do not have anything specific to ask in the meeting; I would simply like to discuss my recent progress.
