xchg r32, esp: evil or friend

UncannyDude · February 02, 2005, 02:21:08 AM

When I saw [mov edx, [eax]; add eax, 4] in my code, it occurred to me to use the stack's auto-updating addressing instead. Since popping is such a common operation, it is heavily optimized and has a shorter encoding.

On my machine at home the difference was barely visible, but on the one at work it was something like +20%.

What if we used this as a common idiom? It would not affect the cache in any way, so performance should be at least the same. The code must be carefully written so that it does not trigger a non-local goto (exceptions, say), and function calls are out of the question. On Windows it worked okay, but if a context switch occurs before the stack pointer is restored, can this trick crash the environment (is there an OS that assumes a valid esp)?

What can you say about that?

hutch-- · February 02, 2005, 03:07:01 AM

Generally XCHG is slow on any modern hardware. It's easy enough to benchmark it, but from memory either the stack method or using registers is faster.

UncannyDude · February 02, 2005, 03:44:16 AM

Quote from: hutch-- on February 02, 2005, 03:07:01 AM
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark it, but from memory either the stack method or using registers is faster.

Yeah, that may be so. But I'm not considering it for trivial loads; I mean a loop like this:

Code:

    xchg    eax, esp
.loop:
    pop     edx
...
    jmp     .loop
    xchg    eax, esp

So, there's one XCHG per hundreds or thousands of cycles.

Thank you.

Randall Hyde · February 02, 2005, 05:00:42 AM

Quote from: hutch-- on February 02, 2005, 03:07:01 AM
Generally XCHG is slow on any modern hardware. It's easy enough to benchmark it, but from memory either the stack method or using registers is faster.

XCHG reg,mem is slow, because of the implicit LOCK that gets done. AFAIK, XCHG reg, reg is a perfectly reasonable instruction to use.

As to the original question, interrupts and task switches use a different stack, so you don't have to worry about those events corrupting your stack.
Cheers,
Randy Hyde

MichaelW · February 02, 2005, 06:23:23 AM

I'm not sure I implemented parts of this correctly, and the clock cycle counts aren't very repeatable for a single loop of a process that takes millions of cycles.

Code:


; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive
 
    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    include \masm32\include\oleaut32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\oleaut32.lib

    include \masm32\macros\macros.asm

    include macros2.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        lpString1  dd 0
        lpString2  dd 0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    STRLENGTH EQU 1048576 * 4
    LOOPCOUNT EQU 1

    mov lpString1, alloc$(STRLENGTH)
    mov lpString2, alloc$(STRLENGTH)

    print chr$("rep movsd                  : ")
    clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
      mov   esi, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
      cld
      rep   movsd
    clockctr_end
    print ustr$(eax)
    print chr$(" clock cycles",13,10)

    print chr$("move through reg32 in loop : ")
    clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
      
      mov   esi, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
    @@:
      mov   eax, [esi]
      mov   [edi], eax
      add   esi, 4
      add   edi, 4
      sub   ecx, 1
      jnz   @B
    clockctr_end
    print ustr$(eax)
    print chr$(" clock cycles",13,10)

    print chr$("stack auto-increment       : ")
    clockctr_begin LOOPCOUNT, REALTIME_PRIORITY_CLASS
      mov   esi, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
      xchg  esi, esp
    @@:
      pop   [edi]
      add   edi, 4
      sub   ecx, 1
      jnz   @B
      xchg  esi, esp
    clockctr_end
    print ustr$(eax)
    print chr$(" clock cycles",13,10)

    free$ lpString1
    free$ lpString2
    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Typical output on a P3:

Code:


rep movsd                  : 16461048 clock cycles
move through reg32 in loop : 19885052 clock cycles
stack auto-increment       : 20035101 clock cycles


Randall Hyde · February 02, 2005, 04:48:05 PM

Quote from: MichaelW on February 02, 2005, 06:23:23 AM
I'm not sure I implemented parts of this correctly, and the clock cycle counts aren't very repeatable for a single loop of a process that takes millions of cycles.

Interrupts, multitasking, and other events pretty much guarantee that you will not get consistent results if your sample period is long enough.
Cheers,
Randy Hyde

Mark_Larson · February 03, 2005, 04:05:32 AM

With only 8 registers it is hard to free up one to save ESP to. Try this instead.

Code:


    movd   mm0, esp
.loop:
    pop     edx
...
    jmp     .loop
    movd    esp, mm0

hutch-- · February 03, 2005, 05:41:39 AM

I have used this technique in the past.

funcname proc args etc ...

.data?
reg_esp dd ?
.code

mov reg_esp, esp

; write code here using ESP

mov esp, reg_esp

ret

funcname endp

Kestrel · February 03, 2005, 08:02:27 AM

You must be careful when using [xchg xxx, esp], because ESP (the stack pointer) gets changed.

What happens when an NMI comes in, maybe ...???

Candy · February 03, 2005, 08:24:09 AM

All these interrupt handlers are implemented in ring 0 in every operating system I know, and according to the Intel System Programming Manual it's even forbidden to do them in ring 3. Each switch to ring 0 means that the stack is swapped first (away from your non-stack, but who cares except for a debugger) and the kernel stack assigned to you (or a generic kernel one) is then used for storing those registers. Only for a few exceptions (see also this link on a different forum where people are discussing implementing exceptions in ring 3 and their limitations) can you expect them to be handled in user space at some point, but no current OS does this as far as I know.

Summary: Nothing yet, might be a few in the future, just don't do them.

UncannyDude · February 03, 2005, 11:14:17 AM

@Randall Hyde: I agree with you, as a CPU cannot access another's registers. And thank you for your information.

@MichaelW: in this case, the routine using the trick could avoid xchg altogether. I would use the following (assuming, of course, that it works):

Code:

      mov   esi, esp
      mov   esp, lpString1
      mov   edi, lpString2
      mov   ecx, STRLENGTH SHR 2
    @@:
      dec   ecx
      pop   [edi+ecx*4]      ; note: the destination is filled back to front, so the dword order is reversed
      jnz   @B
      mov   esp, esi

But I see this trick is not intended for move operations.

@Mark Larson: indeed, an MMX register could be used, but in some cases (if a register already contains the new pointer for esp) xchg performs better. But is there a need for emms afterwards?

Mark_Larson · February 03, 2005, 02:16:20 PM

Quote from: UncannyDude on February 03, 2005, 11:14:17 AM
@Mark Larson: indeed, an MMX register could be used, but in some cases (if a register already contains the new pointer for esp) xchg performs better. But is there a need for emms afterwards?

All times are for a P4. The actual timing will vary depending on dependencies and other stalls. Xchg runs in 1.5 cycles. "movd register,mmx register" runs in 5 cycles. "movd mmx register,register" runs in 2 cycles. So xchg is faster, however the way you are doing it wastes a CPU register to save it. So what is the point in even using it when you have to waste another register to save the value in the register? Does that make sense?

Also, if you aren't doing floating point code you don't have to use EMMS. If you do use floating point, then put the EMMS once, right before any floating point code. As an alternative you can use this, and it won't require an EMMS, but it's slower.

Code:


movd xmm0,esp
movd esp,xmm0
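
To make the EMMS placement concrete, here is a minimal sketch in MASM syntax. The procedure name SumToFloat and its parameters are hypothetical and appear nowhere else in this thread. The sketch assumes count is greater than zero and that no call or exception happens while ESP points into the data.

```asm
    .686
    .mmx
    .model flat, stdcall
    option casemap :none
    .code
; Hypothetical example: sum `count` dwords at `pSrc`, store the sum as a float at `pResult`.
SumToFloat proc pSrc:DWORD, count:DWORD, pResult:DWORD
    mov   ecx, count            ; number of dwords, assumed > 0
    xor   eax, eax              ; running sum
    movd  mm0, esp              ; park the real stack pointer in an MMX register
    mov   esp, pSrc             ; args are ebp-relative, so they stay reachable
@@:
    pop   edx                   ; edx = [esp], esp += 4
    add   eax, edx
    sub   ecx, 1
    jnz   @B
    movd  esp, mm0              ; restore the real stack pointer
    emms                        ; leave the MMX state once, before any x87 code
    push  eax                   ; the stack is valid again, so push is safe here
    fild  dword ptr [esp]       ; load the integer sum onto the FPU stack
    add   esp, 4
    mov   edx, pResult
    fstp  dword ptr [edx]       ; store it as a 32-bit float
    ret
SumToFloat endp
    end
```

If the routine never touches the FPU, the emms line can be dropped, as noted above.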

UncannyDude · February 03, 2005, 09:49:38 PM

At the time I considered using "xchg esp, eax", I felt no register pressure, so the point is precisely to "waste" a spare register.
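
For reference, the spare-register form of the idiom might look like the following sketch in MASM syntax. SumDwords and its parameters are hypothetical names chosen for illustration. It assumes count is greater than zero and, as discussed earlier in the thread, that no exception or function call occurs between the two points where ESP is swapped and restored.

```asm
    .386
    .model flat, stdcall
    option casemap :none
    .code
; Hypothetical example: return in eax the sum of `count` dwords starting at `pSrc`.
SumDwords proc uses esi pSrc:DWORD, count:DWORD
    mov   ecx, count            ; number of dwords, assumed > 0
    xor   eax, eax              ; running sum
    mov   esi, pSrc
    xchg  esi, esp              ; esp -> data, esi keeps the real stack pointer
@@:
    pop   edx                   ; edx = [esp], esp += 4
    add   eax, edx
    sub   ecx, 1
    jnz   @B
    mov   esp, esi              ; restore before the epilogue pops esi and ebp
    ret
SumDwords endp
    end
```

The single xchg is paid once per call, so its cost is spread over the whole loop. A plain mov is enough on the way out because the data pointer is no longer needed.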
