Thursday, 8 January 2009

Multithreading Test

Today I went into the university labs to test the multithreaded version of the renderer on an 8-core machine.

Short parallel ray casting test: solid skull with lighting, with interpolated volume samples. Blocks/Alternate Pixels refer to how the image is divided between the threads: in Blocks mode each thread gets a contiguous run of consecutive pixels; in Alternate Pixels mode consecutive pixels are handed to different threads, so the threads interleave across the image (a rough code sketch of the two schemes follows the table).

Threads......Blocks (s)......Alternate Pixels (s)
1................15.60................-
2................8.51.................7.97
3................5.91.................5.35
4................6.55.................4.04
5................5.24.................3.27
6................3.17.................2.72
7................3.49.................3.40
8................3.71.................3.07

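For reference, here is a minimal sketch of the two schemes as I have described them. It is not the renderer's actual code: renderPixel, renderBlock and renderAlternate are placeholder names, and std::thread just stands in for whatever threading API is really being used.

// A sketch of the two work-splitting schemes. renderPixel() is a placeholder
// for the real per-pixel ray cast.
#include <thread>
#include <vector>

const int WIDTH = 512, HEIGHT = 512;

void renderPixel(int x, int y) { /* cast a ray for pixel (x, y) */ }

// Blocks: thread t renders one contiguous run of pixels.
void renderBlock(int t, int threadCount) {
    int total = WIDTH * HEIGHT;
    int begin = t * total / threadCount;
    int end = (t + 1) * total / threadCount;
    for (int i = begin; i < end; ++i)
        renderPixel(i % WIDTH, i / WIDTH);
}

// Alternate Pixels: thread t renders pixels t, t + N, t + 2N, ...
void renderAlternate(int t, int threadCount) {
    for (int i = t; i < WIDTH * HEIGHT; i += threadCount)
        renderPixel(i % WIDTH, i / WIDTH);
}

int main() {
    const int threadCount = 6;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
        workers.emplace_back(renderAlternate, t, threadCount); // or renderBlock
    for (std::thread& w : workers) w.join();
    return 0;
}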

This was a quick test with only one render per result, so anomalies are possible.

In each case six threads was the optimal number. This makes sense: the thread count here is the number of worker threads, so on top of those there is the master thread waiting for all the workers to finish. Beyond that there will almost certainly be a small amount of processing going on in other applications or the OS, which could account for the 8th core.

These results show that a program can actually run on all 8 cores at once, seemingly sharing a single cache, at least on the specific setup in these computers. This is something I was unsure about, and will be worth discussing in the meeting tomorrow.

An optimisation worth looking at is to partly remove the idea of a dedicated master thread and give that thread a share of the work as well. Once it has finished its own share it can then start checking whether the other threads have finished (a rough sketch below).
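A rough sketch of that change, under the same assumptions as the earlier sketch (placeholder names, std::thread standing in for the real threading API):

#include <thread>
#include <vector>

// Placeholder for one thread's share of the rendering work.
void renderShare(int t, int threadCount) { /* render share t of threadCount */ }

void renderFrame(int threadCount) {
    std::vector<std::thread> workers;
    // Launch one fewer worker thread than the total count...
    for (int t = 0; t < threadCount - 1; ++t)
        workers.emplace_back(renderShare, t, threadCount);
    // ...and have the master thread render the remaining share itself,
    renderShare(threadCount - 1, threadCount);
    // so it only starts waiting once its own share is done.
    for (std::thread& w : workers) w.join();
}

int main() {
    renderFrame(6);
    return 0;
}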

Looking at the best result, 2.72 seconds against the single-thread time of 15.60 seconds, I have currently achieved a speedup of roughly 5.73 times (15.60 / 2.72). This is with 6 worker threads, so the predicted near-linear speedup of the ray casting method is looking correct.

Comparing Blocks against Alternate Pixels, Alternate Pixels is always faster. This suggests the cache is shared between the cores rather than each core having its own: in Alternate Pixels mode the threads are all working in the same area of the image as they progress through it, so they sample nearby parts of the volume and cache hits are high. Individual caches would instead favour the Blocks method, where each thread works through its own run of consecutive pixels.

So the plan for the meeting tomorrow is to discuss these results and decide what to do next.

(A note on multithreading that I meant to add to an earlier post: in terms of coding, the multithreading turned out to be much easier than I was expecting. Once I twigged that a mutex isn't actually tied to the memory you're using it to lock, just a separate object that all the threads agree to lock before touching that memory, the coding was quick and simple.)
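A minimal sketch of what I mean, using std::mutex purely for illustration (the renderer doesn't necessarily use this API, and the shared row counter is just an example of some memory being protected):

#include <mutex>
#include <thread>
#include <vector>

std::mutex rowLock; // the mutex is its own object, not tied to any memory
int nextRow = 0;    // shared counter the threads agree to protect with rowLock

void worker(int imageHeight) {
    while (true) {
        rowLock.lock();
        int row = nextRow++;   // safe only because every thread locks first
        rowLock.unlock();
        if (row >= imageHeight) break;
        // ... render row ...
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 6; ++t)
        threads.emplace_back(worker, 512);
    for (std::thread& t : threads) t.join();
    return 0;
}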
