Saving FPU registers: fxam, tag word etc

Started by jj2007, June 27, 2009, 10:34:51 PM


jj2007

I wanted to do something simple: save up to three valid FPU registers before entering a routine that uses the FPU, and restore them afterwards. However, it turns out to be very tricky. Here are the timings (Celeron M) for a skeleton:

Saving 3 - 2 - 1 - 0 valid FPU registers:

27      cycles for FpAlgo1, mem check after
197     cycles for FpAlgo1, mem check after
366     cycles for FpAlgo1, mem check after
534     cycles for FpAlgo1, mem check after

55      cycles for FpAlgo2, mem check before
282     cycles for FpAlgo2, mem check before
264     cycles for FpAlgo2, mem check before
247     cycles for FpAlgo2, mem check before

137     cycles for FpAlgo3, tag word
132     cycles for FpAlgo3, tag word
126     cycles for FpAlgo3, tag word
114     cycles for FpAlgo3, tag word

34      cycles for FpAlgo4, fxam
171     cycles for FpAlgo4, fxam
161     cycles for FpAlgo4, fxam
151     cycles for FpAlgo4, fxam

33      cycles for FpAlgo5, fucom
199     cycles for FpAlgo5, fucom
180     cycles for FpAlgo5, fucom
173     cycles for FpAlgo5, fucom


The bottleneck is the decision whether or not to save a register. To make that decision, one must check whether ST(0) is empty.

Apparently, merely testing whether ST is empty costs about 100 cycles. Any ideas?
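
For readers following along, here is a minimal sketch of the fxam-based emptiness test discussed in this thread. It is not the code from the deleted attachment. The helper name ReportSt0 and the label InUse are made up for illustration, and the sketch assumes the MASM32 SDK (masm32rt.inc) is installed in its default location.

```masm
    include \masm32\include\masm32rt.inc

    .code
start:
    finit                       ; all eight registers empty
    call  ReportSt0             ; should report "empty"
    fldpi                       ; push a valid value
    call  ReportSt0             ; should report "in use"
    fstp  st                    ; leave the FPU clean
    exit

ReportSt0 proc                  ; hypothetical helper, not from the thread
    fxam                        ; classify ST(0) into C3, C2, C0
    fnstsw ax                   ; status word: C0 = bit 8, C2 = bit 10, C3 = bit 14
    and   ah, 01000101b         ; keep C3, C2, C0 (bits 6, 2, 0 of AH)
    cmp   ah, 01000001b         ; C3=1, C2=0, C0=1 is the "empty" class
    jne   InUse
    print "ST(0) is empty",13,10
    ret
InUse:
    print "ST(0) is in use",13,10
    ret
ReportSt0 endp

end start
```

The test itself is only the four instructions from fxam to cmp; the rest is scaffolding to make the sketch assemble on its own.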


dedndave

it is surprising it takes that many clock cycles
these "condition codes" or states for each ST register are stored internally
it must have to work its way through a maze or something to get out - lol
i used to have a nice big diagram of the insides of an FPU - wish i knew where it was

raymond

I don't know if you used fstenv or fsave to access the tag word. The former should be faster.

I also don't know what you checked in the tag word. The number of free registers would be indicated by its lower bits. You then save and restore only the number of registers which must become free to perform your operations.

For example, if you need 3 free registers and only 2 are indicated as free, you would then need to save and restore only 1 register which you know is not free (regardless of what it contains). There's no need to store 3 registers if 5 others are already free!!
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com
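
A rough sketch of the tag word approach, under some assumptions: the 28-byte environment layout of 32-bit protected mode (control word at offset 0, tag word at offset 8), a MASM32 SDK install, and made-up names FpuEnv, CountFreeRegs and CountLoop. Note that the tag word holds two bits per physical register (11b = empty), so the sketch simply counts the empty ones. Also, fnstenv masks all FPU exceptions after storing, which is why the control word is reloaded.

```masm
    include \masm32\include\masm32rt.inc

    .data?
FpuEnv  db 28 dup(?)            ; hypothetical buffer for the 32-bit environment

    .code
start:
    finit
    fldz
    fld1                        ; two registers in use, six should be free
    call  CountFreeRegs
    print ustr$(eax),13,10      ; number of empty registers
    fstp  st
    fstp  st
    exit

CountFreeRegs proc              ; hypothetical helper: eax = empty registers
    fnstenv FpuEnv              ; store environment (this masks all exceptions)
    fldcw word ptr FpuEnv       ; so reload the original control word
    movzx edx, word ptr FpuEnv+8 ; tag word
    mov   ecx, edx
    shr   ecx, 1
    and   edx, ecx              ; even bit stays set only if both tag bits are 1
    and   edx, 5555h            ; one bit per register
    xor   eax, eax
CountLoop:
    shr   edx, 1                ; CF = next bit
    adc   eax, 0                ; count it
    test  edx, edx
    jnz   CountLoop
    ret
CountFreeRegs endp

end start
```

This only shows how the information can be read; it makes no claim about speed, since fnstenv is itself an expensive instruction.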

ToutEnMasm

Hello,
perhaps
Quote
.data
   SauveFpu        FLOATING_SAVE_AREA <>   
.code
   FNSAVE  SauveFpu   ; save the complete FPU state, then reinitialize the FPU (as FINIT does)
   FRSTOR  SauveFpu   ; reload the saved state
with
Quote
FLOATING_SAVE_AREA   STRUCT
   ControlWord DWORD ?
   StatusWord DWORD ?
   TagWord DWORD ?
   ErrorOffset DWORD ?
   ErrorSelector DWORD ?
   DataOffset DWORD ?
   DataSelector DWORD ?
   RegisterArea BYTE SIZE_OF_80387_REGISTERS dup (?)
   Cr0NpxState DWORD ?
FLOATING_SAVE_AREA      ENDS



and extract the registers from RegisterArea


jj2007

Quote from: raymond on June 28, 2009, 05:08:23 AM
I don't know if you used fstenv or fsave to access the tag word. The former should be faster.

I used fstenv in FpAlgo3, 3*fxam in algo 4 and 3*fucom in algo 5.

Quote
I also don't know what you checked in the tag word. The number of free registers would be indicated by its lower bits. You then save and restore only the number of registers which must become free to perform your operations.

For example, if you need 3 free registers and only 2 are indicated as free, you would then need to save and restore only 1 register which you know is not free (regardless of what it contains). There's no need to store 3 registers if 5 others are already free!!

That's true, but saving a reg is not the bottleneck. Fstenv costs over 100 cycles. For comparison, the "rude" alternative ...

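ffree st(7)    ; mark ST(7) empty so the next load cannot overflow the stack
fldz           ; push a scratch value
ffree st(7)
fldz
ffree st(7)
fldz
fstp st        ; pop the three scratch values again
fstp st
fstp st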


... costs 9 cycles, 3 per trashed register

Quote from: ToutEnMasm on June 28, 2009, 05:45:57 AM
.data
   SauveFpu        FLOATING_SAVE_AREA <>   
.code
   FNSAVE  SauveFpu   ;save the whole registers + FINIT
   FRSTOR SauveFpu   

>280 cycles

raymond

Quote
and extract the registers from RegisterArea

And that part is completely useless in this context because the intended procedure then has no use for that data. The only intent is to avoid trashing any current data, which would be taken care of with the fsave and frstor instructions.

Quote
the "rude" alternative ...

For the "rude" alternative, there is no need to ffree each register and reload them with fldz. Simply use fstp st. The FPU doesn't mind if the register is already free and no exception would be raised if it is.

jj2007

Quote from: raymond on June 28, 2009, 08:59:13 PM
For the "rude" alternative, there is no need to ffree each register and reload them with fldz. Simply use fstp st. The FPU doesn't mind if the register is already free and no exception would be raised if it is.

Sorry, my snippet was probably too short. What I meant is that, assuming some other code has already filled all 8 FPU regs, you have to ffree st(7) before each fld; otherwise ST will be BAD. This case is not very common, though. And a good question is whether letting it crash with a BAD ST isn't perhaps the better option: at least you see that you tried to trash a valid ST(7)...
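
To illustrate the full-stack case, here is a small sketch (assuming the MASM32 SDK; the layout and messages are made up, not taken from the attachment). It fills all eight registers, tries a plain fld, then repeats the load after ffree st(7), and compares the stack fault flag (SF, bit 6 of the status word) in both cases. With the invalid-operation exception masked, as after finit, the overflowing load leaves the indefinite NaN in ST(0).

```masm
    include \masm32\include\masm32rt.inc

    .code
start:
    finit                       ; empty stack, all exceptions masked
    REPEAT 8
      fldz                      ; fill ST(0)..ST(7)
    ENDM

    fld1                        ; ninth load: no free register left
    fnstsw ax
    movzx esi, ax               ; keep the status word of the plain fld
    fnclex                      ; clear the exception flags, including SF

    ffree st(7)                 ; sacrifice the deepest register ...
    fld1                        ; ... so this load finds a free slot
    fnstsw ax
    movzx edi, ax               ; keep the status word of ffree + fld
    finit                       ; done with the FPU before printing

    and   esi, 40h              ; SF after the plain fld
    and   edi, 40h              ; SF after ffree + fld
    .if esi != 0
      print "plain fld on a full stack: stack fault",13,10
    .endif
    .if edi == 0
      print "ffree st(7) + fld: no stack fault",13,10
    .endif
    exit

end start
```

All FPU work is finished before the first print, so the result does not depend on whether the console output routines touch the FPU.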

Edit: The difference is marginal; apparently ffree needs exactly one cycle:
596     cycles for 300*fld/fstp
898     cycles for 300*ffree/fld/fstp

The question is really how realistic it is to assume that some other code wants a printout at a stage where the FPU is full...

Jimg

As a general purpose routine, I'm assuming the user might want to print out any fpu register for debugging purposes, and would be upset if anything about the fpu changed.

raymond

Maybe a tree is preventing me from seeing the forest. What I can't understand is that a user of a program CANNOT change the code. Whatever a user may want must be supplied by the code written by the programmer (unless you are building a library of functions used by other programmers such as the Fpulib).

If your code cannot be used by other applications, that code (i.e. the programmer) should be fully aware whether FPU registers are free or not. Such code should not have any need for additional instructions to verify the status of the FPU registers. Other applications running concurrently with your program and not interacting with it CANNOT have any effect on the content of the FPU, any more than on the ALU registers. Your program has its own environment. With a multitasking OS, that's why switching tasks on a single CPU box can take some 15 ms because the entire environment must be preserved and restored for each task.

If the code uses functions from other libraries, the programmer must know if such functions have any effect on the content of the FPU and take appropriate action. For example, under WinXP it is known that some functions may trash the registers by using MMX code (although it leaves them all free). In such a case, any data left in registers MUST be preserved before calling such a function, but there is no need to free registers for other computations after calling such functions.

Could someone enlighten me on the actual purpose of this exercise?

dedndave

that is what i was wondering Ray
i was thinking to myself - what a pain in the arse to use the FPU - lol
but, i thought it was a n00b question, so i have been more or less trying to learn from this thread
when the OS switches tasks, it saves the state of the FPU as well as the CPU regs ?
that makes me curious - how many clock cycles does it take to switch between tasks ?

EDIT
that raises the next question - lol
can we control how often the OS switches between tasks?

raymond

Quote
how many clock cycles does it take to switch between tasks ?

I've got an old single-core P4 Model 1 running at 1.5 GHz. One of my hobbies is solving math problems and I like to time my algos with the GetTickCount function. Some of my algos may require less than 1 millisec to solve the problem and I may occasionally loop it a number of times to get a more accurate timing just for curiosity.

If I happen to exceed my allocated time slice at the wrong time, the return timing generally jumps to 15 ms. Repeating it several times may return several values of 15 ms interspersed with some values of 0 ms. At 1.5 GHz, 15 ms is a long time so I don't really know if it's due only to task switching between the other resident programs or if it's due to excessive time taken by some of the background apps.

dedndave

i like the math too, Ray
i wrote some good stuff for the 8088 with no 8087
i also wrote some nice code for the 8087
things have changed a lot since then - lol
i am a bit lost for now, but i learn quickly

jj2007

Quote from: raymond on June 29, 2009, 02:06:37 AM
Maybe a tree is preventing me from seeing the forest. ... Could someone enlighten me on the actual purpose of this exercise.

I am working on a (still somewhat buggy) implementation of a BASIC-style print Str$(MyRealVar) routine called tentatively float$. It uses the FPU, and should leave it intact in case other parts of the code have put valid entries inside. Masm32 lib FloatToStr, which does a similar job, trashes ST(5)...ST(7), and I wanted to avoid that in my algo. But it seems very costly, in terms of cycles.

Quote from: raymond on June 29, 2009, 03:45:25 AM
At 1.5 GHz, 15 ms is a long time so I don't really know if it's due only to task switching

GetTickCount has a granularity of about 16 ms. Task switching takes only nanoseconds.

MichaelW

For GetTickCount I get a resolution of 10ms.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

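    invoke Sleep, 3000                ; let system activity settle before measuring

    xor edi, edi                      ; edi accumulates the observed tick steps

    REPEAT 100
      invoke GetTickCount
      mov ebx, eax                    ; remember the current tick value
      .WHILE eax == ebx               ; spin until the tick value changes
        invoke GetTickCount
      .ENDW
      sub eax, ebx                    ; size of this step in ms
      add edi, eax
    ENDM
    mov eax, edi
    mov ecx, 100
    xor edx, edx
    div ecx                           ; average step over the 100 samples
    print ustr$(eax),13,10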

    timer_begin 1, HIGH_PRIORITY_CLASS
      invoke GetTickCount
      mov ebx, eax
      .WHILE eax == ebx
        invoke GetTickCount
      .ENDW
    timer_end
    print ustr$(eax),13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


eschew obfuscation

jj2007

Quote from: MichaelW on June 29, 2009, 08:17:38 AM
For GetTickCount I get a resolution of 10ms.

15 for my Prescott - with one core, it should be 10ms indeed.

There is a detailed article here, saying a switch costs roughly 2,600 cycles = 1,000 nanoseconds = 1 microsecond = 0.001 ms.
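
As a sanity check on those figures: the equality 2,600 cycles = 1,000 ns only holds for one particular clock speed, which works out to 2.6 GHz. That clock speed is inferred here from the numbers, not stated in the thread. The second line compares the result with the 15 ms timer step reported above.

```latex
t_{\text{switch}} = \frac{N_{\text{cycles}}}{f_{\text{clk}}}
                  = \frac{2600}{2.6 \times 10^{9}\,\text{Hz}}
                  = 1.0 \times 10^{-6}\,\text{s} = 1\,\mu\text{s}

\frac{15\,\text{ms}}{1\,\mu\text{s}} = 15000
```

So, on these numbers, a single task switch is about four orders of magnitude shorter than one GetTickCount step, which suggests the 15 ms jumps reflect the timer granularity rather than the cost of a switch.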