CPU & FPU concurrency

Started by Mark Jones, April 21, 2005, 08:45:38 PM

Mark Jones

 Way back in the day, the FPU was a separate chip and not necessarily included with the PC. I remember buying a 386 and installing a little square 387 myself, boy what fun! Windows 3.11 just screamed after that, I could not believe the difference! Surely, the 387 must be concurrently working alongside the CPU... right?

But can the FPU really run concurrently with the CPU? Since learning a little MASM32 and the Win32 flat memory model, true concurrency seems improbable. Granted, I know little of FPU programming yet, but both the CPU and FPU seem to execute instructions from (effectively) the same linear memory in one thread. Are the FPU instructions heavily cached? Because that would create the possibility of an underflow/overflow condition, and I can't say I've ever seen an error like that involving the FPU.

Could someone please shed some light on the mysteries of x86 concurrency? Thanks. :)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Mark_Larson

  You can execute FP and ALU code in parallel, if that is what you are asking. Interleaving the two is a standard optimization trick. However, it does not apply to mixing FP with MMX/SSE/SSE2 (you can mix ALU with MMX/SSE/SSE2 and that will run in parallel). You don't have to do anything "special" to do it.
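
To illustrate the "nothing special" point, here is a minimal C sketch (not from the thread; the names `mixed_result` and `mixed_work` are made up for illustration). One loop carries two independent dependency chains, one floating-point and one integer. Nothing in the code synchronises them. Whether they actually overlap is up to the compiler and the CPU, and a modern compiler may well emit SSE2 rather than x87 instructions for the FP chain. The integer chain deliberately avoids `mul`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one value per dependency chain. */
typedef struct {
    double   fp_sum;    /* produced by the floating-point chain */
    uint32_t int_hash;  /* produced by the integer ALU chain    */
} mixed_result;

/* Runs an FP accumulation and an integer hash in the same loop.
   Neither chain reads the other's result, so the hardware is free
   to work on both at once; no explicit waiting is written here. */
mixed_result mixed_work(const double *samples, size_t count)
{
    assert(samples != NULL || count == 0);

    mixed_result r = { 0.0, 5381u };
    for (size_t i = 0; i < count; ++i) {
        /* FP chain: depends only on fp_sum and samples[i]. */
        r.fp_sum += samples[i] * 0.5;

        /* Integer chain: hash * 33 + i, built from shift and add. */
        r.int_hash = (r.int_hash << 5) + r.int_hash + (uint32_t)i;
    }
    return r;
}
```

The only "design" involved is keeping the two chains free of data dependencies on each other; the instruction order inside the loop body is left to the compiler.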

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Mark Jones

Quote from: Mark_Larson on April 21, 2005, 09:52:32 PM
  You can execute FP and ALU code in parallel, if that is what you are asking. Interleaving the two is a standard optimization trick. However, it does not apply to mixing FP with MMX/SSE/SSE2 (you can mix ALU with MMX/SSE/SSE2 and that will run in parallel). You don't have to do anything "special" to do it.

How do you program the two processors in parallel then, from one thread, without creating some kind of underflow or overflow? When coding, do you have to say to yourself "okay, well the FPU will take 50 clock cycles to complete this instruction, so we'll start that and then keep the CPU busy for 49 clocks?" Does code execution not wait for FPU commands to complete? Sorry if I didn't sound noob-ish enough. :toothy

hutch--

Mark,

Threads are more OS based, and when you run code in different threads, they do not automatically synchronise. Within a single thread you can mix x87 and integer instructions without any problems, but MMX and FP use the same registers, so they can't be mixed without a very severe penalty in performance terms.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

raymond

Quote: "okay, well the FPU will take 50 clock cycles to complete this instruction, so we'll start that and then keep the CPU busy for 49 clocks?"

That is exactly what you can do. I've just run a test timing 10,000,000 iterations of loading an integer on the FPU and getting its square root: 317 ms.
Timing 10,000,000 iterations of extracting the square root of the same integer using CPU instructions gave 855 ms. When I start extracting the square root with CPU instructions while the FPU is busy extracting its own square root, the timing for 10,000,000 iterations was still 855 ms, so the FPU work was effectively hidden behind the CPU work.

However, the CPU mul instruction (and maybe div as well) may be using the same hardware as the FPU multiply. My previous tests on this seem to indicate that those may not run in parallel.
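
For anyone who wants to repeat this kind of three-way timing without writing assembly, the following is a rough C sketch under several assumptions, and it is not the test that produced the numbers above. All names (`isqrt_u32`, `fsqrt_newton`, `time_overlap`, `print_overlap`) are hypothetical. The FP side uses a Newton iteration as a stand-in for the FPU square-root instruction so that only standard headers are needed, and the compiler may emit SSE2 instead of x87 code. The integer square root uses only shifts, adds and compares, so it stays away from `mul`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Integer square root, bit-by-bit method: shifts, adds, compares only. */
uint32_t isqrt_u32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = 1u << 30;        /* highest power of four in 32 bits */
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* Floating-point square root by Newton iteration (stand-in for fsqrt). */
double fsqrt_newton(double x)
{
    if (x <= 0.0) return 0.0;
    double guess = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 20; ++i)
        guess = 0.5 * (guess + x / guess);
    return guess;
}

typedef struct {
    double   fp_only, int_only, both;  /* elapsed CPU seconds          */
    double   fp_check;                 /* sums kept so the work cannot */
    uint64_t int_check;                /* be optimised away            */
} overlap_timings;

static double seconds_since(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Times FP only, integer only, then both interleaved in one loop. */
overlap_timings time_overlap(uint32_t value, long iterations)
{
    assert(iterations >= 0);
    overlap_timings t = {0};
    volatile uint32_t input = value;   /* reloaded on every iteration */
    clock_t start;

    start = clock();
    for (long i = 0; i < iterations; ++i)
        t.fp_check += fsqrt_newton((double)input);
    t.fp_only = seconds_since(start);

    start = clock();
    for (long i = 0; i < iterations; ++i)
        t.int_check += isqrt_u32(input);
    t.int_only = seconds_since(start);

    start = clock();
    for (long i = 0; i < iterations; ++i) {
        t.fp_check += fsqrt_newton((double)input);
        t.int_check += isqrt_u32(input);
    }
    t.both = seconds_since(start);
    return t;
}

void print_overlap(const overlap_timings *t)
{
    printf("FP only: %.3f s, integer only: %.3f s, interleaved: %.3f s\n",
           t->fp_only, t->int_only, t->both);
}
```

Calling `time_overlap(144, 10000000L)` from `main` and passing the result to `print_overlap` gives the three figures. If the interleaved time is close to the larger of the two separate times rather than to their sum, the two kinds of work overlapped. The absolute numbers will differ from the ones quoted in this thread, since they depend on the CPU, the compiler and its optimisation settings.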

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark Jones

Has anyone considered using the GPU for added concurrency?

http://www.gpgpu.org/

Mark_Larson

Quote from: Mark Jones on May 20, 2005, 02:47:45 AM
Has anyone considered using the GPU for added concurrency?

http://www.gpgpu.org/

Yes.  I was looking at doing a raytracer that used the GPU and the CPU in parallel to get the equivalent of dual processors.  Didn't finish it though.