Possible problems with SSE usage.

Started by KeepingRealBusy, July 07, 2010, 12:57:11 AM


sinsi

I know xchg is slow but how does a push/mov compare?
I also used esp, not ebp, would that make a difference?

It's all voodoo anyway eh?
Light travels faster than sound, that's why some people seem bright until you hear them.

clive

Quote from: sinsi
I know xchg is slow but how does a push/mov compare?

They would go via the write buffer, and cache. PUSH/POP pairs, figure 6 cycles. MOV EAX,[EBP+x]; XCHG EAX,EBX; MOV [EBP+x],EAX; also around 6 cycles (P4 Prescott) in some synthetic testing.

Quote
I also used esp, not ebp, would that make a difference?

No. XCHG reg,mem is intrinsically locked, so ESP, EBP, etc. all perform the same.
It could be a random act of randomness. Those happen a lot as well.

MichaelW

Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:

182 cycles, (xchg reg,reg)*100
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100
183 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
eschew obfuscation

clive

Quote from: MichaelW
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100

How fast is the P3 running?

I'll note that the encoding for both is XCHG mem,reg

00000000  87 45 08         xchg    eax,[ebp+8]
00000003  87 45 08         xchg    [ebp+8],eax



00000000 874508                 xchg    [ebp+8],eax
00000003 874508                 xchg    [ebp+8],eax
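As a rough illustration of why both spellings assemble to the same bytes: opcode 87h is followed by a ModRM byte that names one register and one register-or-memory operand, and nothing in it records which operand was written first. The Python sketch below uses hypothetical helper names (it is not part of any assembler) and only handles the simple [reg+disp8] form seen in the listing.

```python
# Hypothetical helpers for illustration: decode the 3-byte XCHG from the listing.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_modrm(modrm):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    mod = (modrm >> 6) & 0b11
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    return mod, reg, rm

def describe_xchg(code):
    """Describe 'XCHG r/m32, r32' encoded as 87 /r with an 8-bit displacement."""
    opcode, modrm, disp8 = code
    assert opcode == 0x87, "only opcode 87h (XCHG r/m32, r32) is handled"
    mod, reg, rm = decode_modrm(modrm)
    # mod == 01 means [reg + disp8]; rm == 100 would require a SIB byte.
    assert mod == 0b01 and rm != 0b100, "sketch handles [reg+disp8] only"
    # The displacement is treated as unsigned here, which is enough for +8.
    return f"xchg [{REG32[rm]}+{disp8}], {REG32[reg]}"

print(describe_xchg(bytes([0x87, 0x45, 0x08])))  # prints: xchg [ebp+8], eax
```

Whichever order the source uses, the memory operand ends up in the r/m field and the register in the reg field, so a disassembler can only show one form.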


MichaelW

Quote
How fast is the P3 running?

If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.

Quote
I'll note that the encoding for both is XCHG mem,reg

I did it both ways to see if there would be any significant difference in the cycle counts. On my P3 there wasn't; the difference in the results is within the run-to-run variation that is typical for cycle counts in the thousands.

hutch--

I have not bothered to benchmark the following test piece, but from memory, within an algorithm XCHG was usually slow and could be replaced by MOV with a faster result. The 3 tests are mem-mem, reg-mem and reg-reg, with the first being the slowest and the last the fastest. I have mainly seen this operation in exchange sorts (pointers or values), and usually XCHG is off the pace.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL var1  :DWORD
    LOCAL var2  :DWORD

    push esi
    push edi

  ; ---------
  ; mem - mem
  ; ---------
    mov var1, 1234
    mov var2, 5678

    mov eax, var1
    mov ecx, var2
    mov var1, ecx
    mov var2, eax

    print str$(var1),13,10
    print str$(var2),13,10

  ; ---------
  ; reg - mem
  ; ---------
    mov esi, 1234
    mov var1, 5678

    mov eax, var1
    mov var1, esi
    mov esi, eax

    print str$(esi),13,10
    print str$(var1),13,10

  ; ---------
  ; reg - reg
  ; ---------
    mov esi, 1234
    mov edi, 5678

    mov edx, esi
    mov esi, edi
    mov edi, edx

    print str$(esi),13,10
    print str$(edi),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

Rockoon

Quote from: dedndave on July 16, 2010, 04:42:09 AM
out of old-school habit, i avoid using stack space under the stack pointer

Are we sure that no debuggers trash the area under the stack?

I remember at one time back in 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Quote from: MichaelW on July 16, 2010, 12:05:31 PM
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:

Prescott P4:
146 cycles, (xchg reg,reg)*100
9247 cycles, (xchg reg,mem)*100
9277 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
306 cycles, (exchange reg,mem)*100 using mov
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]


The latter are intermediate cases using the stack:
        push edx
        mov edx, [ebx]
        pop [ebx]
...
        push [ebx]
        mov [ebx], edx
        pop edx


Slower than exchange reg,mem using mov but a lot faster than xchg.
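To check that both stack-based sequences really do swap the register with the memory operand, here is a toy Python model. It is only a sketch: `edx` and `mem` stand in for EDX and the dword at [ebx], a list plays the stack, and the function names are made up for illustration.

```python
# Toy model for illustration only: one register, one memory cell, one stack.
def exchange_via_pop_mem(edx, mem):
    stack = []
    stack.append(edx)   # push edx
    edx = mem           # mov edx, [ebx]
    mem = stack.pop()   # pop [ebx]
    return edx, mem

def exchange_via_push_mem(edx, mem):
    stack = []
    stack.append(mem)   # push [ebx]
    mem = edx           # mov [ebx], edx
    edx = stack.pop()   # pop edx
    return edx, mem

print(exchange_via_pop_mem(1234, 5678))   # prints: (5678, 1234)
print(exchange_via_push_mem(1234, 5678))  # prints: (5678, 1234)
```

The model says nothing about timing. It only shows that the two orderings are functionally equivalent, so the difference between the last two Prescott results comes from how the CPU handles a pop to memory versus a push from memory.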

MichaelW

Quote from: Rockoon on July 16, 2010, 03:22:58 PM
Are we sure that no debuggers trash the area under the stack?

I remember at one time back in 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.

In the 16-bit RM days hardware interrupts would use whatever stack was active when the interrupt occurred.

MichaelW


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    call main
    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL var1  :DWORD
    LOCAL var2  :DWORD

    push esi
    push edi

    invoke Sleep, 4000

  ; ---------
  ; mem - mem
  ; ---------
    mov var1, 1234
    mov var2, 5678

    counter_begin 1000, HIGH_PRIORITY_CLASS

    REPEAT 8
    mov eax, var1
    mov ecx, var2
    mov var1, ecx
    mov var2, eax
    ENDM

    counter_end
    print str$(eax)," cycles, mem - mem",13,10

    ;print str$(var1),13,10
    ;print str$(var2),13,10

  ; ---------
  ; reg - mem
  ; ---------
    mov esi, 1234
    mov var1, 5678

    counter_begin 1000, HIGH_PRIORITY_CLASS

    REPEAT 8
    mov eax, var1
    mov var1, esi
    mov esi, eax
    ENDM

    counter_end
    print str$(eax)," cycles, reg - mem",13,10

    ;print str$(esi),13,10
    ;print str$(var1),13,10

  ; ---------
  ; reg - reg
  ; ---------
    mov esi, 1234
    mov edi, 5678

    counter_begin 1000, HIGH_PRIORITY_CLASS

    REPEAT 8
    mov edx, esi
    mov esi, edi
    mov edi, edx
    ENDM

    counter_end
    print str$(eax)," cycles, reg - reg",13,10

    ;print str$(esi),13,10
    ;print str$(edi),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


Running on a P3:

35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg


Queue

Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on an old Athlon 1.3 GHz:

147 cycles, (xchg reg,reg)*100
1630 cycles, (xchg reg,mem)*100
1631 cycles, (xchg mem,reg)*100
148 cycles, (exchange reg,reg)*100 using mov
270 cycles, (exchange reg,mem)*100 using mov
406 cycles, (exchange reg,mem)*100 using pop [ebx]
406 cycles, (exchange reg,mem)*100 using push [ebx]


Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a Core2Duo 2.8 GHz:

219 cycles, (xchg reg,reg)*100
1842 cycles, (xchg reg,mem)*100
1835 cycles, (xchg mem,reg)*100
184 cycles, (exchange reg,reg)*100 using mov
299 cycles, (exchange reg,mem)*100 using mov
507 cycles, (exchange reg,mem)*100 using pop [ebx]
507 cycles, (exchange reg,mem)*100 using push [ebx]


Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a P4 2.8 GHz:

146 cycles, (xchg reg,reg)*100
9271 cycles, (xchg reg,mem)*100
9158 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
312 cycles, (exchange reg,mem)*100 using mov
1005 cycles, (exchange reg,mem)*100 using pop [ebx]
497 cycles, (exchange reg,mem)*100 using push [ebx]


Why would xchg mem,reg be so extra costly on a P4?

Queue

dedndave

that has always been that way - even on the 8088
i am a little surprised to see the xchg reg,reg comparison, though
for a long time, i have used XCHG EAX,reg32 (AX,reg16 in DOS) because it is a single byte op-code
still - it doesn't compare too badly against MOV

i see the test uses XCHG EDX,ECX - a 2-byte instruction
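The size difference can be sketched from the two 32-bit encodings: XCHG EAX,r32 is the single byte 90h+r, while the general register form is 87h plus a ModRM byte. The Python helper below is a hypothetical illustration, not an assembler. For the two-byte form it puts the first operand in the r/m field, and a real assembler may just as validly pick the other arrangement (87 D1 instead of 87 CA for XCHG EDX,ECX).

```python
# Hypothetical encoder for illustration: 32-bit XCHG reg,reg only.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_xchg_reg_reg(a, b):
    """Return the machine-code bytes for 'xchg a, b' with 32-bit registers."""
    ra, rb = REG32[a], REG32[b]
    if ra == 0 or rb == 0:
        # Short form: 90h + register number of the operand that is not EAX.
        other = rb if ra == 0 else ra
        return bytes([0x90 + other])
    # General form: 87 /r with mod = 11 (register direct),
    # first operand in the r/m field, second operand in the reg field.
    return bytes([0x87, 0xC0 | (rb << 3) | ra])

print(encode_xchg_reg_reg("eax", "ebx").hex())  # prints: 93
print(encode_xchg_reg_reg("edx", "ecx").hex())  # prints: 87ca
```

With EAX on both sides the short form gives 90h, which is the byte used for NOP.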

clive

Quote from: MichaelW
Quote
How fast is the P3 running?

If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.

Trying to quantify the memory speed. The number of cycles relates to one SDRAM READ followed by a WRITE, occurring back-to-back at the same address across the entire bit line width of the memory subsystem. In your case here that is about 19 ns for the READ and 19 ns for the WRITE, say 52 MHz.
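The arithmetic behind those figures can be reproduced from the P3 numbers posted earlier (1919 cycles for 100 XCHGs at 500 MHz). The Python sketch below assumes, as the estimate above does, that the time splits evenly between the read and the write. The function name is made up for illustration.

```python
def ns_per_op(total_cycles, repeats, clock_mhz):
    """Convert a cycle count for 'repeats' operations into nanoseconds per operation."""
    cycles_each = total_cycles / repeats
    return cycles_each * 1000.0 / clock_mhz  # one cycle at f MHz lasts 1000/f ns

each_ns = ns_per_op(1919, 100, 500)  # whole XCHG reg,mem on the 500 MHz P3
half_ns = each_ns / 2                # assumed even split: one READ, one WRITE
rate_mhz = 1000.0 / half_ns          # equivalent access rate

print(f"{each_ns:.1f} ns per xchg, {half_ns:.1f} ns per access, about {rate_mhz:.0f} MHz")
```

This gives roughly 38 ns per XCHG, 19 ns per access and about 52 MHz, matching the figures quoted.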

Quote from: Queue
Why would xchg mem,reg be so extra costly on a P4?

As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW), and a serializing event. Therefore the processor must complete/retire all pending operations (out-of-order, pipeline), entirely flush the write buffers (at whatever depth it has) in the CPU, flush out everything pending/deferred to memory in the chipset, and then complete an indivisible READ (setting up addresses, with CAS/RAS latencies) followed by a WRITE. This is pretty much the worst case for synchronous memories (SDRAM, DDRAM, RAMBUS, etc), exposing the nasty CL (CAS Latency) numbers printed on the DIMMs.

In order to allow the processor to speed along, most everything sent to memory is buffered/deferred/delayed to write back in a lazy manner, and prefetching/cache line reads are prioritized so as not to stall forward motion of the processor.

It's not so much a cycles issue as a time issue.
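One way to look at the posted results in terms of time rather than cycles is to divide each (xchg reg,mem)*100 count by the clock speed the poster gave. The Python sketch below does that. It assumes the cycle counter ticks at the stated nominal clock, which the thread does not confirm, and the function name is hypothetical.

```python
# (label, clock in MHz, cycles reported for (xchg reg,mem)*100), taken from the posts above.
RESULTS = [
    ("P3 500 MHz", 500, 1919),
    ("Athlon 1.3 GHz", 1300, 1630),
    ("Core2Duo 2.8 GHz", 2800, 1842),
    ("P4 2.8 GHz", 2800, 9271),
]

def xchg_ns(clock_mhz, cycles_per_100):
    """Nanoseconds per single XCHG reg,mem, given the cycle count for 100 of them."""
    return (cycles_per_100 / 100) * 1000.0 / clock_mhz

for label, mhz, cycles in RESULTS:
    print(f"{label:18s} {cycles / 100:6.1f} cycles  {xchg_ns(mhz, cycles):5.1f} ns")
```

On that conversion the P4 needs nearly five times as many cycles per XCHG as the P3 (about 93 versus 19) but slightly less wall-clock time (about 33 ns versus 38 ns). The Athlon and Core2Duo come out well below both, at roughly 12.5 ns and 6.6 ns.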

Rockoon

Which P4 core is that? A Northwood?


jj2007

Celeron M timings:
165 cycles, (xchg reg,reg)*100
1910 cycles, (xchg reg,mem)*100
1910 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
495 cycles, (exchange reg,mem)*100 using pop [ebx]
495 cycles, (exchange reg,mem)*100 using push [ebx]


Note the symmetry of the last two, in contrast to the Prescott P4:
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]