How do you save the Flags in your code?

Started by MazeGen, May 26, 2005, 11:08:37 AM

MazeGen

Hello,
I was thinking about the quickest way - I use an assembler, so why not use the best way, right?  :wink

I got used to PUSHF/POPF under DOS, but according to Agner Fog's optimization manual, the push needs 4 uops plus a further 4 uops from microcode (P4), while POPFD needs 4 uops plus a further 8 uops from microcode. It seems that PUSHFD/POPFD is not the best way.

By contrast, LAHF and SAHF each need only one uop and no microcode (I know they don't save OF and DF, but I rarely need them). But I often need the EAX value, so it has to be PUSHed/POPed additionally.

Finally, we have two constructs:


; 1.
PUSHFD
...
POPFD



; 2.
PUSH EAX
LAHF
...
SAHF
POP EAX ; avoid partial register stall by writing to the whole register


What do you think about them?
The latter one seems quicker to me...

Phil

Wow ... The things we learn by reading, thinking, and exploring! I'm fairly new to the world of 32-bit 80x86 programming but I think you have a fine idea! If we're going to do assembly then, why *not* do it right!

I've attached a zip file containing MichaelW's timers.asm and pushtime.asm, which wraps the instructions you asked about in timer loops. Here are the results. I'm not sure why it says zero for the PUSH/LAHF cycles; clearly it takes some time to do the operations. You might try adding one or two more instructions from your actual code until you see at least some cycles for the second case. Anyway, I thought I'd pass this along so you can play with it. These results came from a 996MHz P3:

C:\ASM\EXAMPLES>PUSHTIME
21 PUSHFD cycles
0 PUSH/LAHF cycles

You'll need to build the example code as a console application in order to use it.


[attachment deleted by admin]

QvasiModo

Note that the two pieces of code do not perform the same task. The second only uses the lowest byte of the flags register.
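
To make the difference concrete: LAHF copies only the low byte of EFLAGS into AH, so OF (bit 11) and DF (bit 10) are not covered. If OF is needed as well, one commonly used trick is to capture it with SETO and rebuild it with an ADD just before SAHF. This is not from the thread and has not been timed here; it is only a sketch, and the code in between must leave AX alone.

```asm
; AH after LAHF:  bit7=SF  bit6=ZF  bit4=AF  bit2=PF  bit0=CF
push  eax
lahf                ; AH = SF, ZF, AF, PF, CF
seto  al            ; AL = 1 if OF was set, else 0 (SETcc does not modify the Flags)
; ... code that changes the Flags but preserves AX ...
add   al,7Fh        ; 1 + 7Fh = 80h overflows, 0 + 7Fh does not, so OF = saved OF
sahf                ; restore SF, ZF, AF, PF, CF; SAHF leaves OF as ADD set it
pop   eax           ; restore the whole EAX
```

DF is still not saved by this variant.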

MichaelW

Hi Phil,

I modified your code so it would return a more realistic cycle count. The macros return the count for a single pass through the block of code, so for code that takes only a few cycles to execute the count is smaller than the accumulated timing inaccuracies.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000
    REPEAT_COUNT EQU 10

    ; Reality check
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        add   eax,1
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" add reg,imm cycles * 10",13,10)
   
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        ; 1.
        PUSHFD
        ; ...
        POPFD
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" PUSHFD cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        ; 2.
        PUSH EAX
        LAHF
        ; ...
        SAHF
        POP EAX ; avoid partial register stall by writing to the whole register
      ENDM       
    counter_end
    print ustr$(eax)
    print chr$(" PUSH/LAHF cycles * 10",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


The results on my P3:

5 add reg,imm cycles * 10
222 PUSHFD cycles * 10
28 PUSH/LAHF cycles * 10

eschew obfuscation

Phil

MichaelW: Great! I hadn't figured out how to do that yet. It's also nice to have your code displayed so the person who'd originally asked about it can see how your timers work before downloading the zip. Great code there. Thanks again for taking the time to put it all together and keeping it tuned up!

MazeGen: Thanks for asking this question about saving flags. I've also got an application that I need to tweak quite a bit and I think your suggestion might help me considerably!

MazeGen

Phil: Thanks for your effort with measuring the timings!

QvasiModo: You're right - as I said, I never change DF and rarely need OF, since I work most of the time with unsigned arithmetic.

Michael: Thanks again for your ready-to-assemble sample. I had never looked at your "code timing macros" thread, but now I see it is veeery useful  :thumbu

I didn't expect that LAHF/SAHF could be several times faster than PUSHFD/POPFD  :eek

MazeGen

Eh, when I tried the measurement again and again on an AMD Athlon XP 1800, I always got very different results. I really have no idea why...

Quote
6 add reg,imm cycles * 10
142 PUSHFD cycles * 10
274 PUSH/LAHF cycles * 10

6 add reg,imm cycles * 10
144 PUSHFD cycles * 10
115 PUSH/LAHF cycles * 10

6 add reg,imm cycles * 10
145 PUSHFD cycles * 10
488 PUSH/LAHF cycles * 10

6 add reg,imm cycles * 10
144 PUSHFD cycles * 10
185 PUSH/LAHF cycles * 10

6 add reg,imm cycles * 10
145 PUSHFD cycles * 10
336 PUSH/LAHF cycles * 10

In general, the PUSH EAX/LAHF version seems to be somewhat slower than PUSHFD/POPFD here.

Here is also the result from a Pentium M 1400. I ran more measurements there too, but all of them were very similar:

Quote
5 add reg,imm cycles * 10
219 PUSHFD cycles * 10
46 PUSH/LAHF cycles * 10

MichaelW

Hi MazeGen,

For the Athlon, try increasing REPEAT_COUNT. If that does not stabilize the returned counts you could try REALTIME_PRIORITY_CLASS. But note that I now avoid posting code that uses REALTIME_PRIORITY_CLASS because I have had Windows (2000 SP4 running on a very reliable system) hang when the test loop took too long to run. And when I do use REALTIME_PRIORITY_CLASS, for any test, I take care to save everything to disk first.
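
As a sketch of the first suggestion, only the constant and the labels in the test program need to change. The value 100 is an assumption; any larger count should work the same way, as long as the printed label is updated to match.

```asm
    LOOP_COUNT   EQU 10000000
    REPEAT_COUNT EQU 100          ; was 10; 100 is an assumed value

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        PUSHFD
        ; ...
        POPFD
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" PUSHFD cycles * 100",13,10)   ; label matches the new count
```

The unrolled block gets ten times larger, so the fixed measurement overhead is spread over more instructions.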

MazeGen

Hi Michael,
you're right :U Thanks!

Check the results now (no need for REALTIME_PRIORITY_CLASS):

Quote
100 add reg,imm cycles * 100
1530 PUSHFD cycles * 100
652 PUSH/LAHF cycles * 100

103 add reg,imm cycles * 100
1534 PUSHFD cycles * 100
655 PUSH/LAHF cycles * 100

104 add reg,imm cycles * 100
1534 PUSHFD cycles * 100
652 PUSH/LAHF cycles * 100

104 add reg,imm cycles * 100
1534 PUSHFD cycles * 100
647 PUSH/LAHF cycles * 100

102 add reg,imm cycles * 100
1519 PUSHFD cycles * 100
645 PUSH/LAHF cycles * 100

MazeGen

I was thinking about a sort of PUSHFD macro to avoid trashing EAX and here is the result:


pushfdszapc MACRO
push eax ;; will be rewritten by *
push eax
lahf
mov [esp+4],eax ;; *
pop eax
ENDM

popfdszapc MACRO
push eax
mov eax,[esp+4]
sahf
pop eax
lea esp,[esp+4] ;; use LEA to leave Flags unchanged
ENDM


I use the suffix "szapc" since it saves only the sign, zero, adjust, parity, and carry flags.

BTW, I don't mind the code size - the macros are intended for small, optimized pieces of code, which always fit into the trace cache.

It is still faster than PUSHFD/POPFD on my processors.

But - I don't know how to avoid the partial register stall on MOV [ESP+4],EAX in the PUSHFDSZAPC macro when it is used on a PPro, P2, or P3. I can't use AND EAX,0FF00h or similar because I can't touch the Flags. MOVZX EAX,AX will probably not work because it reads AX, not AH. MOVZX EAX,AH makes the macros too complex and slow.

Michael, could you please test it on your P3 as-it-is? The code is below.

Any ideas how to improve the macros? :8)

My timings:

Quote from: AMD Athlon XP 1800
98 add reg,imm cycles * 100
1469 PUSHFD cycles * 100
954 PUSHFDSZAPC cycles * 100

Quote from: Pentium M 1400
4 add reg,imm cycles * 10
219 PUSHFD cycles * 10
127 PUSHFDSZAPC cycles * 10

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

pushfdszapc MACRO
push eax ;; will be rewritten by *
push eax
lahf
mov [esp+4],eax ;; *
pop eax
ENDM

popfdszapc MACRO
push eax
mov eax,[esp+4]
sahf
pop eax
lea esp,[esp+4] ;; use LEA to leave Flags unchanged
ENDM

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000
    REPEAT_COUNT EQU 10

    ; Reality check
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        add   eax,1
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" add reg,imm cycles * 10",13,10)
   
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        ; 1.
        PUSHFD
        ; ...
        POPFD
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" PUSHFD cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        pushfdszapc
        popfdszapc
      ENDM       
    counter_end
    print ustr$(eax)
    print chr$(" PUSHFDSZAPC cycles * 10",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
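
One possible answer to the partial-register question above, offered only as an untested sketch: store just AH with a byte-sized MOV, so the whole EAX is never read after LAHF. The macro names pushfdszapc2 and popfdszapc2 are made up for illustration, and whether this is faster on any given CPU would have to be measured with the same timers.

```asm
pushfdszapc2 MACRO
    push  eax           ;; reserve a 4-byte slot (its contents don't matter)
    push  eax           ;; save EAX
    lahf                ;; AH = SF, ZF, AF, PF, CF
    mov   [esp+4],ah    ;; byte store: reads AH only, never the whole EAX
    pop   eax           ;; restore EAX; the slot stays on the stack
ENDM

popfdszapc2 MACRO
    push  eax           ;; save EAX
    mov   ah,[esp+4]    ;; reload the saved flags byte
    sahf                ;; restore SF, ZF, AF, PF, CF
    pop   eax           ;; restore EAX (writes the whole register)
    lea   esp,[esp+4]   ;; drop the slot; LEA leaves the Flags unchanged
ENDM
```

Neither MOV touches the Flags, so the macros keep the same contract as the originals.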

MichaelW

The results repeated exactly for 5 runs on my P3:

5 add reg,imm cycles * 10
221 PUSHFD cycles * 10
154 PUSHFDSZAPC cycles * 10


And after I tried adding a xor eax,eax before the lahf:

5 add reg,imm cycles * 10
221 PUSHFD cycles * 10
164 PUSHFDSZAPC cycles * 10

MazeGen

Thanks again, Michael :8) Nice to see it is faster also on P3 without any modifications.

To avoid the partial register access, we need to write to the whole EAX just after LAHF. That's because LAHF writes only to AH, but MOV [ESP+4],EAX reads from the whole EAX.

If it is not annoying you yet, try the following modification; it should be faster:


lahf
and eax,0FF00h ; avoid partial register stall
mov [esp+4],eax


In fact, we can't use AND or similar in the real macro, because it changes the Flags, including OF, which is not saved by LAHF.

MichaelW


pushfdszapc MACRO
push  eax         ;; will be rewritten by *
push  eax
lahf
and   eax,0FF00h  ; avoid partial register stall
mov   [esp+4],eax
pop   eax
ENDM


5 add reg,imm cycles * 10
222 PUSHFD cycles * 10
160 PUSHFDSZAPC cycles * 10

MichaelW

I think there is no partial-register stall problem with LAHF, or at least on a P3.

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT * 10
        xor   ecx,ecx
        lahf
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" xor ecx,ecx lahf cycles * 100",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT * 10
        xor   eax,eax
        lahf
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" xor eax,eax lahf cycles * 100",13,10)


117 xor ecx,ecx lahf cycles * 100
117 xor eax,eax lahf cycles * 100


This coding is faster on a P3:

pushfdszapc MACRO
sub   esp,4
;push  eax         ;; will be rewritten by *
push  eax
lahf
mov   [esp+4],eax
pop   eax
ENDM


5 add reg,imm cycles * 10
222 PUSHFD cycles * 10
144 PUSHFDSZAPC cycles * 10

MazeGen

Hehe, we can't SUB before we save the Flags :wink SUB changes them, so the macro would lose its meaning. That's why I use PUSH EAX.
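
A possible middle ground, as an untested sketch: reserve the slot with LEA instead of SUB. LEA does not modify the Flags (the pop macro already relies on this), so the macro keeps its meaning. Whether it matches the timing of the SUB version has not been measured.

```asm
pushfdszapc MACRO
    lea   esp,[esp-4]   ;; reserve the slot; unlike SUB, LEA leaves the Flags unchanged
    push  eax           ;; save EAX
    lahf                ;; AH = SF, ZF, AF, PF, CF
    mov   [esp+4],eax   ;; store the flags into the reserved slot
    pop   eax           ;; restore EAX
ENDM
```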

According to your previous post, it really seems that ANDing EAX didn't speed up the macro.

But now it seems that one of us doesn't understand what a partial register stall means (no offense meant, Michael, I'm a newbie in optimization, I'm just trying to be humorous in English too :green)

According to Agner Fog's manual, on a P3, when you write to part of a 32-bit register (LAHF -> AH) and later read from the whole register (MOV [ESP+4],EAX), you get a partial register stall, as in this example:


MOV AL, BYTE PTR [M8]
MOV EBX, EAX ; partial register stall


I would expect the stall here:


lahf ; write to a part of 32-bit EAX
mov [esp+4],eax ; read from the whole EAX -> the stall


That's why I tried to put AND EAX between LAHF and MOV, in order to write the whole EAX.

In your test:


        xor   eax,eax ; write to the whole 32-bit register
        lahf ; write to a part of 32-bit register


You can't get a partial register stall, because you only ever write to the register; there is no partial write followed by a whole read.
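
A test that does contain the partial write followed by a whole read could look like the sketch below. It reuses the counter_begin/counter_end macros from timers.asm exactly as in the earlier listings; the second loop is a control that reads a different register. No results are claimed here, it would have to be run on the P3.

```asm
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT * 10
        lahf              ; partial write: AH only
        mov   ecx,eax     ; whole read of EAX right after -> possible stall
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" lahf mov ecx,eax cycles * 100",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT * 10
        lahf              ; same partial write
        mov   ecx,edx     ; control: reads a register LAHF did not touch
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" lahf mov ecx,edx cycles * 100",13,10)
```

If the first count is clearly higher than the second, the stall is real on that processor.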