xchg r32, esp: evil or friend

Started by UncannyDude, February 02, 2005, 02:21:08 AM

UncannyDude

When I saw [mov edx, [eax]; add eax, 4] in my code, it occurred to me to use the stack's auto-updating addressing instead: point esp at the data and read it with pop. Since popping is such a common operation, it is well optimized, and the encoding is shorter.

On my home machine the difference was hardly visible, but on my machine at work it was something like +20%.

What if we used this as a common idiom? It should not affect the cache in any way, so performance should be at least the same. The code must be written carefully so that it never triggers a non-local goto (an exception, say), and function calls are out of the question. It worked fine on Windows, but if a context switch occurs before the stack pointer is restored, can this trick crash the environment? (Is there an OS that assumes a valid esp?)

What can you say about that?
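
For readers who want the idiom spelled out, here is a minimal sketch of what is being asked about, not code from the original post. The procedure name SumViaPop and its two parameters are hypothetical. It assumes MASM with .model flat, stdcall, a count greater than zero, and that nothing inside the loop calls a function or raises an exception while esp points at the data.

```asm
; Sketch only: sums `count` dwords starting at `pArray`, reading them with POP.
; SumViaPop, pArray and count are made-up names for illustration.
SumViaPop proc uses esi pArray:DWORD, count:DWORD
    mov   edx, pArray       ; edx -> first dword of the data
    mov   ecx, count        ; ecx = number of dwords (assumed > 0)
    xor   eax, eax          ; eax = running sum
    xchg  edx, esp          ; esp -> data, edx = real stack pointer
  @@:
    pop   esi               ; load one dword and advance esp by 4
    add   eax, esi
    sub   ecx, 1
    jnz   @B
    xchg  edx, esp          ; restore the real stack pointer before the epilogue
    ret                     ; MASM emits pop esi / leave / ret 8 here
SumViaPop endp
```

The arguments are read through ebp, so they stay reachable while esp is borrowed. The price is one extra register (edx here) held for the whole loop.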

hutch--

Generally XCHG is slow on any modern hardware. It's easy enough to benchmark, but from memory either the stack method or using registers is faster.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

UncannyDude

Quote from: hutch-- on February 02, 2005, 03:07:01 AM
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark, but from memory either the stack method or using registers is faster.
Yeah, that may be so. But I'm not considering it for trivial loads, only for a loop like this:

    xchg    eax, esp
.loop:
    pop     edx
...
    jmp     .loop
    xchg    eax, esp


So, there's one XCHG per hundreds or thousands of cycles.

Thank you.

Randall Hyde

Quote from: hutch-- on February 02, 2005, 03:07:01 AM
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark, but from memory either the stack method or using registers is faster.

XCHG reg,mem is slow, because of the implicit LOCK that gets done. AFAIK, XCHG reg, reg is a perfectly reasonable instruction to use.

As to the original question, interrupts and task switches use a different stack, so you don't have to worry about those events corrupting your stack.
Cheers,
Randy Hyde

MichaelW

I'm not sure I implemented parts of this correctly, and the clock cycle counts aren't very repeatable for a single loop of a process that takes millions of cycles.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    include \masm32\include\oleaut32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\oleaut32.lib

    include \masm32\macros\macros.asm

    include macros2.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        lpString1  dd 0
        lpString2  dd 0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    STRLENGTH EQU 1048576 * 4
    LOOPCOUNT EQU 1

    mov lpString1, alloc$(STRLENGTH)
    mov lpString2, alloc$(STRLENGTH)

    print chr$("rep movsd                  : ")
    clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
      mov   esi, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
      cld
      rep   movsd
    clockctr_end
    print ustr$(eax)
    print chr$(" clock cycles",13,10)

    print chr$("move through reg32 in loop : ")
    clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
     
      mov   esi, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
    @@:
      mov   eax, [esi]
      mov   [edi], eax
      add   esi, 4
      add   edi, 4
      sub   ecx, 1
      jnz   @B
    clockctr_end
    print ustr$(eax)
    print chr$(" clock cycles",13,10)

    print chr$("stack auto-increment       : ")
    clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
      mov   esi, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
      xchg  esi, esp
    @@:
      pop   [edi]
      add   edi, 4
      sub   ecx, 1
      jnz   @B
      xchg  esi, esp
    clockctr_end
    print ustr$(eax)
    print chr$(" clock cycles",13,10)

    free$ lpString1
    free$ lpString2
    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Typical output on a P3:

rep movsd                  : 16461048 clock cycles
move through reg32 in loop : 19885052 clock cycles
stack auto-increment       : 20035101 clock cycles



eschew obfuscation

Randall Hyde

Quote from: MichaelW on February 02, 2005, 06:23:23 AM
I'm not sure I implemented parts of this correctly, and the clock cycle counts aren't very repeatable for a single loop of a process that takes millions of cycles.

Interrupts, multitasking, and other events pretty much guarantee that you will not get consistent results if your sample period is long enough.
Cheers,
Randy Hyde
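
One common way to cope with that noise is to time the same code several times and keep the smallest count, on the assumption that the fastest run is the one least disturbed by interrupts. The sketch below is not taken from the thread or from the timing macros used above. TimeMin and TRIES are hypothetical names, and it assumes .586 or later (for CPUID and RDTSC) and a code sequence short enough that the low 32 bits of the counter do not wrap.

```asm
    TRIES EQU 10                ; hypothetical number of repetitions

; Sketch only: returns in eax the smallest cycle count seen over TRIES runs.
TimeMin proc uses ebx esi edi
    LOCAL best:DWORD
    mov   best, 0FFFFFFFFh      ; start with the largest possible value
    mov   esi, TRIES
  again:
    xor   eax, eax
    cpuid                       ; serialize; clobbers eax, ebx, ecx, edx
    rdtsc
    mov   edi, eax              ; low 32 bits of the starting count
    ; code under test goes here; it must preserve esi, edi and ebp
    xor   eax, eax
    cpuid                       ; serialize again before reading the counter
    rdtsc
    sub   eax, edi              ; elapsed cycles, including CPUID/RDTSC overhead
    cmp   eax, best
    jae   @F                    ; not a new minimum
    mov   best, eax
  @@:
    sub   esi, 1
    jnz   again
    mov   eax, best
    ret
TimeMin endp
```

The result still includes the fixed cost of the CPUID and RDTSC pair, so it is better suited to comparing variants than to reading off absolute cycle counts.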

Mark_Larson

  With only 8 registers it is hard to free up one to save ESP to.  Try this instead.


    movd   mm0, esp
.loop:
    pop     edx
...
    jmp     .loop
    movd    esp, mm0

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

hutch--

I have used this technique in the past.


funcname proc args etc ...

    .data?
      reg_esp dd ?
    .code

    mov reg_esp, esp

    ; write code here using ESP

    mov esp, reg_esp

    ret

funcname endp

Kestrel

You must be careful when using [xchg xxx, esp],
because esp (the stack pointer) has been changed.

What happens when an NMI comes in ...?

Candy

All these interrupt handlers are still implemented in ring0 in every operating system I know of, and according to the Intel System Programming Manual it is even forbidden to run them in ring3. Each switch to ring0 means that the stack is swapped first (away from your non-stack, but who cares except for a debugger) and the kernel stack assigned to you (or a generic kernel one) is then used for storing those registers. Only a few exceptions (see also this link on a different forum where people are discussing implementing exceptions in ring3 and their limitations) could be expected to be handled in user space at some point, but no current OS does this as far as I know.

Summary: Nothing yet, might be a few in the future, just don't do them.

UncannyDude

@Randall Hyde: I agree with you, since one CPU cannot access another CPU's registers. And thank you for the information.

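@MichaelW: in this case, the routine using the trick could avoid xchg altogether. I would use the following (assuming it works, obviously):
      mov   esi, esp                    ; save the real stack pointer
      mov   esp, lpString1              ; esp -> source
      mov   edi, lpString2
      add   edi, STRLENGTH              ; edi -> end of destination
      mov   ecx, -(STRLENGTH SHR 2)     ; negative dword count, so the copy runs front to back
    @@:
      pop   dword ptr [edi+ecx*4]       ; copy one dword, esp advances by 4
      inc   ecx                         ; sets ZF when the count reaches zero
      jnz   @B
      mov   esp, esi                    ; restore the real stack pointer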

But I see this trick is not intended for move operations.

@Mark Larson: indeed, an MMX register could be used, but in some cases (if a register already contains the new pointer for esp) xchg performs better. And isn't an emms needed afterwards?

Mark_Larson

Quote from: UncannyDude on February 03, 2005, 11:14:17 AM
@Mark Larson: indeed, an MMX register could be used, but in some cases (if a register already contains the new pointer for esp) xchg performs better. And isn't an emms needed afterwards?

  All times are for a P4.  The actual timing will vary depending on dependencies and other stalls.  Xchg runs in 1.5 cycles.  "movd register,mmx register" runs in 5 cycles.  "movd mmx register,register" runs in 2 cycles.  So xchg is faster, however the way you are doing  it wastes a CPU register to save it.  So what is the point in even using it when you have to waste another register to save the value in the register?  Does that make sense?

  Also, if you aren't doing floating point code you don't have to use EMMS.  If you do use floating point, then put the EMMS once right before any floating point code.  As an alternative you can use this, and it won't require an EMMS, but it's slower.


movd xmm0,esp
movd esp,xmm0


UncannyDude

At the time I considered using "xchg esp, eax" I felt no register pressure, so the point was precisely to "waste" a spare register.