How do you save the Flags in your code?

Started by MazeGen, May 26, 2005, 11:08:37 AM

MichaelW

Oops, I was not considering the effect on the flags, just looking for a faster way to adjust the stack pointer, because that is the essential function of the first push instruction.

By my reasoning, EAX is being read by the following XOR EAX,EAX for all but the last repeated pair. But to make the test valid for all of the repeated instructions I added a MOV EBX,EAX.

    ; Test for partial-register stall with LAHF
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   ecx,ecx
        lahf
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" xor ecx,ecx lahf mov ebx,eax cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        lahf
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" xor eax,eax lahf mov ebx,eax cycles * 10",13,10)


71 xor ecx,ecx lahf mov ebx,eax cycles * 10
71 xor eax,eax lahf mov ebx,eax cycles * 10

eschew obfuscation

MazeGen

It seems that we are still missing the point. XOR EAX, EAX in the following code has no influence on the instructions that follow it:


        xor   eax,eax
        lahf
        mov   ebx,eax


The important part should be what you put in between LAHF and MOV, I think.

Please try the following code, based on examples from Agner's manual:


    ; Test for the simplest partial-register stall on PPro, P2, P3
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov al,byte ptr [esp]
        mov ebx,eax        ; partial register stall
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" partial register stall expected * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        movzx ebx,byte ptr [esp]
        and eax,0FFFFFF00h
        or ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" no partial register stall * 10",13,10)


Quote from: Pentium M 1400
5 add reg,imm cycles * 10
87 partial register stall expected * 10
11 no partial register stall * 10

Unfortunately, even ANDing EAX before MOV [ESP+4],EAX in the pushfszapc macro doesn't seem to improve the performance...

MichaelW

You're right. I remembered reading this:
Quote
The PPro, P2 and P3 processors make a special case out of this combination to avoid a partial register stall when later reading from EAX. The trick is that a register is tagged as empty when it is XOR'ed with itself. The processor remembers that the upper 24 bits of EAX are zero, so that a partial stall can be avoided. This mechanism works only on certain combinations:
But I remembered only the base concept, and not the exceptions.

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        mov   al,3
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" Agner Fog's no partial register stall example cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        mov   ah,3
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" Agner Fog's partial register stall example cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        lahf
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" substituting lahf for mov ah,3 in above cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov al,byte ptr [esp]
        mov ebx,eax        ; partial register stall
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" partial register stall expected * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        movzx ebx,byte ptr [esp]
        and eax,0FFFFFF00h
        or ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" no partial register stall * 10",13,10)


10 Agner Fog's no partial register stall example cycles * 10
69 Agner Fog's partial register stall example cycles * 10
71 substituting lahf for mov ah,3 in above cycles * 10
81 partial register stall expected * 10
10 no partial register stall * 10
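
As a rough cross-check of the numbers above, the short Python sketch below estimates the extra cost per repetition. It assumes that "cycles * 10" means each figure covers ten repetitions of the sequence, and it uses the 10-cycle no-stall result as the baseline even though the sequences are not identical. The names REPEATS, BASELINE and extra_cycles_per_repeat are made up for this illustration.

```python
# Rough estimate of the stall cost from the timings posted above.
# Assumption: "cycles * 10" means the figure covers ten repetitions.
REPEATS = 10

# Assumption: the no-stall sequence (xor eax,eax / mov al,3 / mov ebx,eax)
# is an acceptable baseline for all three stalling sequences.
BASELINE = 10

measured = {
    "xor eax,eax / mov ah,3 / mov ebx,eax": 69,
    "xor eax,eax / lahf / mov ebx,eax": 71,
    "mov al,[esp] / mov ebx,eax": 81,
}


def extra_cycles_per_repeat(total, base=BASELINE, repeats=REPEATS):
    """Extra cycles per repetition relative to the no-stall baseline."""
    return (total - base) / repeats


for name, total in measured.items():
    extra = extra_cycles_per_repeat(total)
    print(f"{name}: about {extra:.1f} extra cycles per repetition")
```

The result can be compared with the delay of 5 - 6 clocks that Agner Fog's manual gives for a partial register stall on PPro, P2 and P3.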

MichaelW

Experimenting with the code some more I could not find any way around the stall. But I did determine that a stall (apparently) occurs not only when the full register is read, after a partial write, but also when the full register is the destination in a logical instruction, after a partial write.

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        mov   al,3
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" Agner Fog's no partial register stall example cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        mov   ah,3
        mov   ebx,eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" Agner Fog's partial register stall example cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        mov   ah,3       
        or    eax,ebx
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" Above with or eax,ebx substituted for mov ebx,eax cycles * 10",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor   eax,eax
        mov   ah,3       
        mov   eax,ebx
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" Above with mov eax,ebx substituted for or eax,ebx cycles * 10",13,10)


10 Agner Fog's no partial register stall example cycles * 10
69 Agner Fog's partial register stall example cycles * 10
71 Above with or eax,ebx substituted for mov ebx,eax cycles * 10
10 Above with mov eax,ebx substituted for or eax,ebx cycles * 10

MazeGen

Quote from: MichaelW on May 30, 2005, 10:39:54 PM
Experimenting with the code some more I could not find any way around the stall.

Yeah, it really seems we can't avoid the stall. Nevertheless, the macros are still faster than PUSHFD/POPFD.

Quote from: MichaelW on May 30, 2005, 10:39:54 PM
But I did determine that a stall (apparently) occurs not only when the full register is read, after a partial write, but also when the full register is the destination in a logical instruction, after a partial write.

The only idea that comes to mind: I don't think the tests you used are quite right. In the last example:


        xor   eax,eax
        mov   ah,3       
        mov   eax,ebx


you've overwritten the whole of EAX, and the processor probably knows that there's no need to wait until MOV AH is finished. (But I've never read about such behaviour.)

By contrast, you compare it with the following code:


        xor   eax,eax
        mov   ah,3       
        or    eax,ebx


here, the processor has to wait until MOV AH is finished, because OR EAX depends on MOV AH.

Try modifying the last example to


        xor   eax,eax
        mov   ah,3       
        mov   ebx,eax


and the timings will be much more similar:

Quote
5 add reg,imm cycles * 10
79 Above with or eax,ebx substituted for mov ebx,eax cycles * 10
74 Above with mov ebx,eax substituted for or eax,ebx cycles * 10

MichaelW

Quote

xor eax,eax
mov ah,3
or eax,ebx

here, the processor has to wait until MOV AH is finished, because OR EAX depends on MOV AH.

Yes, just as a following MOV EBX,EAX would depend on the MOV AH,3.

From Agner Fog's Pentium Optimization manual:
Quote
Partial register stall is a problem that occurs in PPro, P2 and P3 when you write to part of a 32-bit register and later read from the whole register or a bigger part of it. Example:

MOV AL, BYTE PTR [M8]
MOV EBX, EAX ; partial register stall

This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL (to make it independent of AH). The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.

I can't help but wonder why Intel chose to make the special case handling (that allows a preceding XOR EAX,EAX to prevent the stall) work only for AL, and not for AH.

MazeGen

Quote from: MichaelW on May 31, 2005, 09:52:34 AM
...but wonder why Intel chose to make the special case handling (that allows a preceding XOR EAX,EAX to prevent the stall) work only for AL, and not for AH.

As I see it, it is very simple for the processor to remember that a specific range of bits starting from the MSB (in the case of EAX, according to Agner's manual, bits 31..8) is zero. Now, when only the range outside the zero area is read, it is sufficient to read only that part. Imagine some variable which holds the number of the lowest bit known to be zero (here 8).

For this to work for AH as well, the processor would additionally have to know whether the AL range (7..0) is zero or not. That would probably be too complicated and slow, because the processor would need one more such variable for the additional range, and therefore it works only for AL.
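
The behaviour measured in this thread can be summarised as a small toy model. The Python sketch below only illustrates the rule as described in the posts and in the quoted passages from Agner Fog's manual. It is not a description of the real hardware. The tag names, the instruction kinds and the function predicted_stalls are all assumptions made for the sketch, and it treats XOR reg,reg as a pure write.

```python
# Toy model of the partial-register rule discussed in this thread.
# Tags per 32-bit register (all names are assumptions of this sketch):
#   "full"    - last write covered the whole register
#   "zeroed"  - register was cleared with XOR reg,reg
#   "low_ok"  - low byte written after XOR reg,reg; upper 24 bits known zero
#   "partial" - a byte was written without a usable zero tag
#
# Instruction kinds: "zero" (xor eax,eax), "write_low" (mov al,x),
# "write_high" (mov ah,x or lahf), "write_full" (mov eax,ebx),
# "read_full" (mov ebx,eax), "rmw_full" (or eax,ebx).


def predicted_stalls(program):
    """Return the indices of instructions the model expects to stall."""
    state = {}
    stalled = []
    for index, (kind, reg) in enumerate(program):
        tag = state.get(reg, "full")
        # Reading the whole register after an untracked byte write stalls.
        if kind in ("read_full", "rmw_full") and tag == "partial":
            stalled.append(index)
            tag = "full"  # the byte write has retired once the stall is over
        if kind == "zero":
            state[reg] = "zeroed"
        elif kind == "write_low":
            state[reg] = "low_ok" if tag in ("zeroed", "low_ok") else "partial"
        elif kind == "write_high":
            state[reg] = "partial"  # the zero tag does not cover AH
        elif kind in ("write_full", "rmw_full"):
            state[reg] = "full"
        elif kind == "read_full":
            state[reg] = tag
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return stalled


sequences = {
    "xor eax,eax / mov al,3 / mov ebx,eax":
        [("zero", "eax"), ("write_low", "eax"), ("read_full", "eax")],
    "xor eax,eax / mov ah,3 / mov ebx,eax":
        [("zero", "eax"), ("write_high", "eax"), ("read_full", "eax")],
    "xor eax,eax / mov ah,3 / or eax,ebx":
        [("zero", "eax"), ("write_high", "eax"), ("rmw_full", "eax")],
    "xor eax,eax / mov ah,3 / mov eax,ebx":
        [("zero", "eax"), ("write_high", "eax"), ("write_full", "eax")],
    "mov al,[esp] / mov ebx,eax":
        [("write_low", "eax"), ("read_full", "eax")],
}

for name, program in sequences.items():
    verdict = "stall" if predicted_stalls(program) else "no stall"
    print(f"{name} -> {verdict}")
```

A single tag per register is enough to reproduce the fast and slow cases timed in this thread, which fits the idea that tracking AH separately would need extra state.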

I hope you can understand it in my funny English ;)

MichaelW

Quote
I hope you can understand it in my funny English

No problem at all, your English is better than that of many of the native speakers I know.