Acorncomputer wrote:
Hi Paul
It does all seem to come back to the same thing each time - software that can make efficient use of multi-core processors.
It is very difficult, especially with real-time systems such as a simulator. There are a lot of issues to consider.
If you have one thread running on one core, it will happily process away. If you have two threads running on one core, the scheduler will time-slice between the two (at one point a time slice was typically 50 ms - that in itself could lead to stuttering performance, depending on the nature of the application).
If you can have threads happily processing independently of each other, each on its own core, then two cores are twice as fast as one, and four cores are four times as fast as one. However, more often than not you need to share some data and resources between the processing threads. Where that data is modifiable, it needs to be protected with a thread synchronisation lock. This is where two cores are not going to perform twice as fast as one - when they are fighting each other for resources.
A while ago I ripped a simple spin lock out of the Linux kernel source code and ran some tests in user space. A spin lock is a form of mutex: only one thread is allowed to hold the lock at any one time. The other threads will simply spin on it (retrying the lock in a busy loop) until it is free.
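A minimal sketch of the spin-and-retry logic described above, written in Python purely for illustration. This is not the kernel code from the test: the `SpinLock` class and `run_workers` helper are made-up names, and the non-blocking form of `threading.Lock.acquire` stands in for the atomic test-and-set step. Python's global interpreter lock means it shows only the locking logic, not real multi-core contention or cache effects.

```python
import threading


class SpinLock:
    """Illustrative spin lock: waiters busy-loop instead of sleeping."""

    def __init__(self):
        # threading.Lock supplies the atomic "test and set" step.
        self._flag = threading.Lock()

    def lock(self):
        # acquire(blocking=False) returns immediately: True if we took
        # the lock, False if another thread holds it. Spin until True.
        while not self._flag.acquire(blocking=False):
            pass

    def unlock(self):
        self._flag.release()


def run_workers(num_threads, iterations):
    """Increment one shared counter from several threads under the lock."""
    spin = SpinLock()
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(iterations):
            spin.lock()
            counter += 1  # critical section: one thread at a time
            spin.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    # Every increment happens under the lock, so none are lost.
    print(run_workers(num_threads=2, iterations=10_000))  # prints 20000
```

The waiting thread does no useful work while it spins; it just burns CPU until the holder calls `unlock()`, which is why a holder being time-sliced out is so costly.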
On a dual CPU machine, with one thread I got 1,000,000 lock/unlock operations per second, or about 40 clock cycles per operation. With two threads running in contention against the lock, I got 100,000 lock/unlock operations per second (400 clock cycles per operation). It was slower by a factor of ten (and using twice the CPU)! This is due to cache-line ping-pong: when the lock is under contention, each CPU's cache has to keep synchronising the state of the spin lock with the other's. Memory cache speeds and bus speeds will have an effect here. Also, a thread can be time-sliced out whilst it holds a lock, locking out every other thread that is waiting on it.
In short, multi-processing systems which frequently pass through contended thread synchronisation locks will be slower than their single-threaded counterparts. Throwing more CPUs at the problem may just increase CPU usage with no performance benefit!
EDIT: I was performing the test on a 400MHz CPU, so at 40 cycles per operation the uncontended case would have been 10,000,000 operations per second. The 100,000 contended figure may still have been true though!
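The arithmetic behind that correction can be checked with a few lines of Python. The 400 MHz clock, the 40-cycle cost and the 100,000-per-second rate are the figures quoted in the post; the function names are made up for the example, and it assumes every clock cycle goes on lock operations.

```python
CLOCK_HZ = 400_000_000  # 400 MHz test machine, as stated in the EDIT


def ops_per_second(cycles, clock_hz=CLOCK_HZ):
    """Operations per second if each one costs `cycles` clock cycles."""
    return clock_hz / cycles


def cycles_per_operation(ops_per_sec, clock_hz=CLOCK_HZ):
    """Clock cycles spent per operation at a measured rate."""
    return clock_hz / ops_per_sec


if __name__ == "__main__":
    uncontended = ops_per_second(40)
    contended_cost = cycles_per_operation(100_000)
    print(f"uncontended: {uncontended:,.0f} ops/s")         # 10,000,000 ops/s
    print(f"contended:   {contended_cost:,.0f} cycles/op")  # 4,000 cycles/op
    print(f"slowdown:    {uncontended / 100_000:.0f}x")     # 100x
```

If the 100,000 contended figure did hold on that machine, it would work out at roughly 4,000 clock cycles per operation, which puts the slowdown nearer a hundredfold than tenfold.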