
stack vs memory

Started by loki_dre, April 21, 2008, 07:13:17 PM


MichaelW

This is an attempt to compare the cycle counts when the memory being accessed is not in the cache.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    mov edi, halloc(70000000)         ; allocate a ~70MB heap buffer, base address in EDI

    invoke Sleep, 3000

    REPEAT 10

      counter_begin 1000, HIGH_PRIORITY_CLASS   ; average over 1000 loops - the stack area stays cached

        push ebp
        push ebx
        push esi
        push edi
        push eax
        push ecx
        push edx

        pop edx
        pop ecx
        pop eax
        pop edi
        pop esi
        pop ebx
        pop ebp

      counter_end
      print ustr$(eax)," cycles",9,"push/pop",13,10

      counter_begin 1, HIGH_PRIORITY_CLASS      ; single pass so the widely spaced lines are not already cached

        mov [edi], ebp
        mov [edi+10000000], ebx
        mov [edi+20000000], esi
        mov [edi+30000000], edi
        mov [edi+40000000], eax
        mov [edi+50000000], ecx
        mov [edi+60000000], edx

        mov edx, [edi+60000000]
        mov ecx, [edi+50000000]
        mov eax, [edi+40000000]
        mov edi, [edi+30000000]
        mov esi, [edi+20000000]
        mov ebx, [edi+10000000]
        mov ebp, [edi]

      counter_end
      print ustr$(eax)," cycles",9,"mov/mov",13,10

    ENDM

    print chr$(13,10)

    hfree edi

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Typical results on my P3:

10 cycles       push/pop
19454 cycles    mov/mov
10 cycles       push/pop
606 cycles      mov/mov
10 cycles       push/pop
597 cycles      mov/mov
10 cycles       push/pop
487 cycles      mov/mov
10 cycles       push/pop
682 cycles      mov/mov
10 cycles       push/pop
516 cycles      mov/mov
10 cycles       push/pop
625 cycles      mov/mov
10 cycles       push/pop
487 cycles      mov/mov
10 cycles       push/pop
597 cycles      mov/mov
10 cycles       push/pop
534 cycles      mov/mov


There is a lot of variation in the mov/mov counts because I'm making only one pass through the instructions. For the push/pop counts I just assumed that the active area of the stack would be in the cache and averaged the counts over 1000 loops. I'm guessing that no processor's L1 cache can (yet) span a 10-million byte separation. I think the consistently higher cycle count for the first access is because at that point the memory is not in the L2 cache.

[attachment deleted by admin]
eschew obfuscation

donkey

Quote from: MichaelW on April 24, 2008, 11:00:22 PM
There is a lot of variation in the mov/mov counts because I'm making only one pass through the instructions. For the push/pop counts I just assumed that the active area of the stack would be in the cache and averaged the counts over 1000 loops. I'm guessing that no processor's L1 cache can (yet) span a 10-million byte separation. I think the consistently higher cycle count for the first access is because at that point the memory is not in the L2 cache.

This demonstrates my point; it is certainly much more realistic than the other tests in the thread.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

This is my problem with the comparison method: it was supposed to be a comparison between push/pop and the load/store versions of MOV, but when you use push/pop on local stack memory with linear addressing and test load/store MOV with large spacings between each memory address, you are no longer comparing the same thing.

Instead of being an instruction-choice comparison as per the original question, it is a test of in-cache versus out-of-cache access, and the result is nothing contentious; it's usually referred to as page thrashing.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

The original question specified "a variable in memory", with no indication of where in memory. It seems reasonable to assume that memory in the vicinity of ESP would be in the cache, but for "a variable in memory" it would depend on whether or not the variable was local. At least my second test illustrates why, given a choice between push/pop and a mov to/from a non-local variable, push/pop would likely be the best choice.
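
To make the comparison concrete, here is a minimal sketch of the two alternatives being weighed (the variable name and the clobbering work are hypothetical, not part of the test code above):

    .data?
      savedecx dd ?                 ; the non-local "variable in memory"

    .code
        ; option 1: the stack - memory around ESP is almost certainly cached
        push ecx
        ; ... register-clobbering work ...
        pop ecx

        ; option 2: load/store MOV to a global - can miss the cache if the
        ; variable has not been touched recently
        mov savedecx, ecx
        ; ... register-clobbering work ...
        mov ecx, savedecx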
eschew obfuscation

NightWare

 ::) And you have an algo where you use data with a distance of 60000000 bytes between them? (Because it sounds like virtual memory usage here...) Reducing it to a few KB of distance will certainly give more realistic results...

ic2

Quote: As I said long, long ago, mov to a register is always faster

I got the saved web page in my folder somewhere, donkey, and it was more like "Nothing is faster than ..."  You were a member of the Code Warriors.  You guys did not sound that nice to me...  But you were the one who would calm things down a bit.  It was my best-ever pastime reading.  Now I do decent ASM.  I just had to crack my joke while the getting was good.  :wink


MichaelW

Quote: And you have an algo where you use data with a distance of 60000000 bytes between them? (Because it sounds like virtual memory usage here...) Reducing it to a few KB of distance will certainly give more realistic results...

Allocating 1000000 bytes and using a distance of 100000:

10 cycles       push/pop
14108 cycles    mov/mov
10 cycles       push/pop
331 cycles      mov/mov
10 cycles       push/pop
418 cycles      mov/mov
10 cycles       push/pop
305 cycles      mov/mov
10 cycles       push/pop
400 cycles      mov/mov
10 cycles       push/pop
343 cycles      mov/mov
10 cycles       push/pop
400 cycles      mov/mov
10 cycles       push/pop
257 cycles      mov/mov
10 cycles       push/pop
372 cycles      mov/mov
10 cycles       push/pop
378 cycles      mov/mov


Allocating 100000 bytes and using a distance of 10000:

10 cycles       push/pop
12262 cycles    mov/mov
10 cycles       push/pop
342 cycles      mov/mov
10 cycles       push/pop
437 cycles      mov/mov
10 cycles       push/pop
406 cycles      mov/mov
10 cycles       push/pop
350 cycles      mov/mov
10 cycles       push/pop
344 cycles      mov/mov
10 cycles       push/pop
415 cycles      mov/mov
10 cycles       push/pop
299 cycles      mov/mov
10 cycles       push/pop
399 cycles      mov/mov
10 cycles       push/pop
345 cycles      mov/mov


Allocating 10000 bytes and using a distance of 1000:

10 cycles       push/pop
496 cycles      mov/mov
10 cycles       push/pop
258 cycles      mov/mov
10 cycles       push/pop
305 cycles      mov/mov
10 cycles       push/pop
262 cycles      mov/mov
10 cycles       push/pop
287 cycles      mov/mov
10 cycles       push/pop
284 cycles      mov/mov
10 cycles       push/pop
322 cycles      mov/mov
10 cycles       push/pop
195 cycles      mov/mov
10 cycles       push/pop
287 cycles      mov/mov
10 cycles       push/pop
236 cycles      mov/mov


The only really large difference between these tests is the first-access count in the last one. Considering that each cycle count covers 14 accesses and at least 7 cache misses, the numbers don't seem out of line to me.
eschew obfuscation

hutch--

Here is a modified version of the earlier test piece that uses global memory in the .DATA? section for the variables the registers are stored in.


687 local memory push / pop
672 global memory load /store MOV
703 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
672 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
--------------------------------
689 --- average local push/pop
684 --- average global MOV load/store
--------------------------------
Press any key to continue ...


Although slower than MOV code using the same space allocated on the stack, it is no slower than push/pop, so I would conclude from the tests that push/pop is slower than load/store MOV, but not by enough to affect most algorithms.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    pushem PROTO
    movem  PROTO

    .data?
      align 4
      _eax dd ?
      _ecx dd ?
      _edx dd ?
      _ebx dd ?
      _ebp dd ?
      _esi dd ?
      _edi dd ?

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    LOCAL clock1    :DWORD
    LOCAL clock2    :DWORD

    mov clock1, 0
    mov clock2, 0

    push esi

    REPEAT 8

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000              ; 100 million calls per timed run
  @@:
    call pushem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add clock1, eax

    print str$(eax)," local memory push / pop",13,10

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000              ; 100 million calls per timed run
  @@:
    call movem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add clock2, eax

    print str$(eax)," global memory load /store MOV",13,10

  ; **********************************

    ENDM

    shr clock1, 3                   ; divide the totals by 8 for the averages
    shr clock2, 3

    print "--------------------------------",13,10
    print str$(clock1)," --- average local push/pop",13,10
    print str$(clock2)," --- average global MOV load/store",13,10
    print "--------------------------------",13,10

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

movem proc

;     sub esp, 28
;
;     mov [esp], ebp
;     mov [esp+4], ebx
;     mov [esp+8], esi
;     mov [esp+12], edi
;     mov [esp+16], eax
;     mov [esp+20], ecx
;     mov [esp+24], edx
;
;     mov edx, [esp+24]
;     mov ecx, [esp+20]
;     mov eax, [esp+16]
;     mov edi, [esp+12]
;     mov esi, [esp+8]
;     mov ebx, [esp+4]
;     mov ebp, [esp]
;
;     add esp, 28


    mov _eax, eax
    mov _ecx, ecx
    mov _edx, edx
    mov _ebx, ebx
    mov _ebp, ebp
    mov _esi, esi
    mov _edi, edi

    mov eax, _eax
    mov ecx, _ecx
    mov edx, _edx
    mov ebx, _ebx
    mov ebp, _ebp
    mov esi, _esi
    mov edi, _edi

    ret

movem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

pushem proc

    push ebp
    push ebx
    push esi
    push edi
    push eax
    push ecx
    push edx

    pop edx
    pop ecx
    pop eax
    pop edi
    pop esi
    pop ebx
    pop ebp

    ret

pushem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

donkey

I think both MichaelW's and Hutch's tests are fine. As I said, it is a very difficult problem to write a test for: how do you decide on the spacing to simulate cache misses? Since in a normal assembly program you could probably get the entire data section into the L2 cache (512K to 1MB), you could make a case for Hutch's implementation, and it becomes obvious that Intel's comment that MOV had been optimized on the PIV was correct. However, since the original question did not specify any particular storage scheme, a wider approach such as MichaelW's is equally correct and realistic. For myself, I stick to my original thought: under normal circumstances the gain is negligible and can be ignored, since in real life it would normally be a one-shot deal and losing a nanosecond would not adversely affect run time.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

NightWare

Quote from: donkey on April 25, 2008, 05:31:45 AM
it becomes obvious that Intel's comment that MOV had been optimized on the PIV was correct.
Hmm... if that's true then they have optimised push/pop later... because here are the results of hutch's test on my Core 2 Duo 2 GHz:
592 local memory push / pop
656 global memory load /store MOV
577 local memory push / pop
686 global memory load /store MOV
578 local memory push / pop
670 global memory load /store MOV
578 local memory push / pop
624 global memory load /store MOV
577 local memory push / pop
639 global memory load /store MOV
562 local memory push / pop
671 global memory load /store MOV
561 local memory push / pop
655 global memory load /store MOV
562 local memory push / pop
671 global memory load /store MOV
--------------------------------
573 --- average local push/pop
659 --- average global MOV load/store
--------------------------------
Or maybe it's because of the independent L1 caches... or the shared L2 cache... Damn... too many parameters to take care of...  :lol

donkey

Hi NightWare,

It could very well be that they have a different scheme for different PIVs; certainly NetBurst would favor the MOV optimization, while smaller-cache models (i.e. 512K) might favor the push/pop optimization. Intel tends to change things like this on the fly. A good example is the change from DEC reg in favor of SUB reg,1 on the PIV that Hutch pointed out in a much earlier thread, which no longer seems to hold true on newer models such as the Core 2 Duo in my other box; though I have to admit I have not tested the instructions thoroughly, it seems like DEC reg is back in favor. I think that DEC reg is an optimization that could make a difference, since it is used in loops quite often and can have a real influence on run time.
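
For reference, a minimal sketch of the two loop-counter idioms in question (label names and loop counts are only illustrative):

        ; DEC form - shorter encoding, but leaves CF unchanged, which caused a
        ; partial-flags penalty on the PIV
        mov ecx, 1000000
    decloop:
        ; ... loop body ...
        dec ecx
        jnz decloop

        ; SUB form - the PIV-era recommendation, writes all the flags
        mov ecx, 1000000
    subloop:
        ; ... loop body ...
        sub ecx, 1
        jnz subloop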

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

If I remember correctly, the core design for the later Core 2 Duo is much closer to a PIII design than a PIV. What I would be interested to see is whether LEA is as fast on a Core 2 Duo as it was, relatively, on a PIII. On a PIV it is off the pace enough to not use it much.
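
For anyone who wants to time it, this is the sort of LEA usage in question: address arithmetic doubling as a flag-free multiply/add (illustrative only):

        lea eax, [eax+eax*4]        ; eax = eax*5 in one instruction, flags untouched

        ; the equivalent without LEA
        mov edx, eax
        shl eax, 2
        add eax, edx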
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

The .data? section should be in the cache anyway, and the same goes for the stack. But what about GlobalAlloc, HeapAlloc & Co.? Is there anything to watch for in terms of optimisation?

donkey

Quote from: jj2007 on April 26, 2008, 04:43:45 PM
The .data? section should be in the cache anyway, same for the stack. But what about GlobalAlloc, HeapAlloc & Co? Is there anything to watch for in view of optimisation?

Use VirtualAlloc for allocations over 4MB, use HeapAlloc for allocations up to 4MB, and avoid GlobalAlloc. That's according to Microsoft...

Quote from MSDN: Note: The global functions are slower than other memory management functions and do not provide as many features. Therefore, new applications should use the heap functions. However, the global functions are still used with DDE, the clipboard functions, and OLE data objects.
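
A minimal sketch of the two recommended routes, with hypothetical buffer names and sizes and no error checking:

    .data?
      hHeap  dd ?
      pSmall dd ?
      pLarge dd ?

    .code
        ; allocations up to a few MB: the process heap
        invoke GetProcessHeap
        mov hHeap, eax
        invoke HeapAlloc, hHeap, HEAP_ZERO_MEMORY, 100000
        mov pSmall, eax
        ; ... use the buffer ...
        invoke HeapFree, hHeap, 0, pSmall

        ; large allocations: straight from the virtual memory manager
        invoke VirtualAlloc, NULL, 8000000, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
        mov pLarge, eax
        ; ... use the buffer ...
        invoke VirtualFree, pLarge, 0, MEM_RELEASE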
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

NightWare

Quote from: hutch-- on April 26, 2008, 05:56:17 AM
What I would be interested to see is whether LEA is as fast on a Core 2 Duo as it was, relatively, on a PIII. On a PIV it is off the pace enough to not use it much.
hi,
Off topic, but LEA has the same speed for me on the C700/P4/Core 2 Duo... but it's true the Core 2 Duo is relatively close to the P3... for example, it's the same for ROL/ROR on the C700 and Core 2 Duo (and horribly slow on my P4...). I've also noted, in recent speed tests, that they show a disparity between the P3/Core 2 and the P4... and even between different P4s...

Now, there is also a difference between the P3 and Core 2: push/pop was slower than MOV on the P3...  :wink