The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: Mark Jones on April 21, 2005, 08:45:38 PM

Title: CPU & FPU concurrency
Post by: Mark Jones on April 21, 2005, 08:45:38 PM
 Way back in the day, the FPU was a separate chip and not necessarily included with the PC. I remember buying a 386 and installing a little square 387 myself, boy what fun! Windows 3.11 just screamed after that, I could not believe the difference! Surely, the 387 must be concurrently working alongside the CPU... right?

But can the FPU really run concurrently with the CPU? Since learning a little MASM32 and the Win32 flat memory model, true concurrency seems improbable. Granted, I know little of FPU programming yet, but both the CPU and FPU seem to execute instructions from (effectively) the same linear memory in one thread. Are the FPU instructions heavily cached? Because that would create the possibility of an underflow/overflow condition, and I can't say I've ever seen an error like that involving the FPU.

Could someone please shed some light on the mysteries of x86 concurrency? Thanks. :)
Title: Re: CPU & FPU concurrency
Post by: Mark_Larson on April 21, 2005, 09:52:32 PM
Quote from: Mark Jones on April 21, 2005, 08:45:38 PM
Way back in the day, the FPU was a separate chip and not necessarily included with the PC. I remember buying a 386 and installing a little square 387 myself, boy what fun! Windows 3.11 just screamed after that, I could not believe the difference! Surely, the 387 must be concurrently working alongside the CPU... right?

But can the FPU really run concurrently with the CPU? Since learning a little MASM32 and the Win32 flat memory model, true concurrency seems improbable. Granted, I know little of FPU programming yet, but both the CPU and FPU seem to execute instructions from (effectively) the same linear memory in one thread. Are the FPU instructions heavily cached? Because that would create the possibility of an underflow/overflow condition, and I can't say I've ever seen an error like that involving the FPU.

Could someone please shed some light on the mysteries of x86 concurrency? Thanks. :)

  You can execute FP and ALU code in parallel, if that is what you are asking. Interleaving the two is a standard optimization trick. It is not limited to x87 code either: you can mix ALU with MMX/SSE/SSE2 and it will run in parallel. You don't have to do anything "special" to do it.

Title: Re: CPU & FPU concurrency
Post by: Mark Jones on April 22, 2005, 02:08:22 PM
Quote from: Mark_Larson on April 21, 2005, 09:52:32 PM
  You can execute FP and ALU code in parallel, if that is what you are asking. Interleaving the two is a standard optimization trick. It is not limited to x87 code either: you can mix ALU with MMX/SSE/SSE2 and it will run in parallel. You don't have to do anything "special" to do it.

How do you program the two processors in parallel then, from one thread, without creating some kind of underflow or overflow? When coding, do you have to say to yourself "okay, well the FPU will take 50 clock cycles to complete this instruction, so we'll start that and then keep the CPU busy for 49 clocks?" Does code execution not wait for FPU commands to complete? Sorry if I didn't sound noob-ish enough. :toothy
Title: Re: CPU & FPU concurrency
Post by: hutch-- on April 22, 2005, 02:24:24 PM
Mark,

Threads are more OS based, and code running in different threads does not automatically synchronise. Within a single thread you can mix x87 and integer instructions without any problems, but MMX and FP use the same registers, so those two can't be mixed without a very severe performance penalty.
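
To make the single-thread point concrete, here is a minimal sketch in C rather than MASM. The helper name `mixed_sums` and its parameters are made up for illustration. The loop carries one integer chain and one floating-point chain that do not depend on each other, so the processor is free to overlap them, yet the results are exactly what sequential execution would give. No cycle counting or manual synchronisation is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one loop, two independent dependency chains.
 * The integer chain and the floating-point chain never read each
 * other's results, so the hardware may execute them side by side. */
static void mixed_sums(const uint32_t *ints, const double *reals, size_t count,
                       uint32_t *int_sum, double *real_sum)
{
    uint32_t isum = 0;
    double fsum = 0.0;
    size_t i;

    for (i = 0; i < count; ++i) {
        isum += ints[i] * 3u + 1u;   /* integer ALU work */
        fsum += reals[i] * 0.5;      /* floating-point work */
    }

    *int_sum = isum;
    *real_sum = fsum;
}
```

Whether the two chains really overlap depends on the CPU and on the instructions the compiler emits, so treat this as an illustration of the program shape, not a guaranteed speed-up.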
Title: Re: CPU & FPU concurrency
Post by: raymond on April 23, 2005, 02:52:07 AM
Quote: "okay, well the FPU will take 50 clock cycles to complete this instruction, so we'll start that and then keep the CPU busy for 49 clocks?"

That is exactly what you can do. I've just run a test timing 10,000,000 iterations of loading an integer on the FPU and getting its square root: 317 ms.
Timing 10,000,000 iterations of extracting the square root of the same integer using CPU instructions gave 855 ms. When I start extracting the square root with CPU instructions while the FPU is busy extracting its own square root, the timing for 10,000,000 iterations was still 855 ms, so the FPU work cost nothing extra.

However, the CPU mul instructions (and maybe the div also) may be using the same hardware as the FPU for mul instructions. My previous tests on this seem to indicate that those may not run in parallel.
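
The shape of a test like the one described above can be sketched in C. This is not the original test code, which was not posted. All names here (`isqrt_u32`, `fsqrt_newton`, `time_loop`, the mode flags) are assumptions for illustration. The floating-point side uses a Newton iteration instead of the hardware `fsqrt`, purely to keep the sketch free of library dependencies, and a modern compiler will usually emit SSE2 rather than x87 code for `double`.

```c
#include <stdint.h>
#include <time.h>

enum { FP_ONLY = 1, INT_ONLY = 2, BOTH = 3 };   /* hypothetical mode flags */

/* Integer square root, one result bit per step; integer ALU work only. */
static uint32_t isqrt_u32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = 1u << 30;          /* highest power of four in 32 bits */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* Floating-point square root by Newton's method (stand-in for fsqrt).
 * 40 steps are enough for inputs that fit in 32 bits. */
static double fsqrt_newton(double x)
{
    double r;
    int i;

    if (x <= 0.0)
        return 0.0;
    r = (x > 1.0) ? x : 1.0;          /* start at or above the root */
    for (i = 0; i < 40; ++i)
        r = 0.5 * (r + x / r);
    return r;
}

/* Run the selected work `iterations` times; return elapsed CPU time in ms. */
static double time_loop(int mode, long iterations, uint32_t value)
{
    volatile double fp_sink = 0.0;    /* volatile stops the optimizer */
    volatile uint32_t int_sink = 0;   /* from deleting the work */
    volatile uint32_t input = value;
    clock_t start = clock();
    long i;

    for (i = 0; i < iterations; ++i) {
        uint32_t v = input;
        if (mode & FP_ONLY)
            fp_sink = fsqrt_newton((double)v);
        if (mode & INT_ONLY)
            int_sink = isqrt_u32(v);
    }
    (void)fp_sink;
    (void)int_sink;
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_loop(FP_ONLY, 10000000L, n)`, `time_loop(INT_ONLY, 10000000L, n)` and `time_loop(BOTH, 10000000L, n)` gives the three numbers to compare. If the combined time sits near the slower of the two single runs instead of near their sum, the two kinds of work overlapped. The actual figures will vary with the compiler, its flags and the CPU.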

Raymond
Title: Re: CPU & FPU concurrency
Post by: Mark Jones on May 20, 2005, 02:47:45 AM
Has anyone considered using the GPU for added concurrency?

http://www.gpgpu.org/
Title: Re: CPU & FPU concurrency
Post by: Mark_Larson on May 20, 2005, 03:51:27 AM
Quote from: Mark Jones on May 20, 2005, 02:47:45 AM
Has anyone considered using the GPU for added concurrency?

http://www.gpgpu.org/

Yes.  I was looking at doing a raytracer that used the GPU and the CPU in parallel to get the equivalent of dual processors.  Didn't finish it though.