stack vs memory

Started by loki_dre, April 21, 2008, 07:13:17 PM

loki_dre

Does anyone know whether it is generally faster to push and pop a value on the stack than to save the value to a variable in memory?

donkey

It depends on the processor you are targeting, though the gains are minimal at best. Latency on earlier Pentiums favoured the PUSH based model, while I believe the Pentium IV favours the MOV based model. This type of thing can change from processor to processor, and I am pretty sure that the AMD vs Intel question also comes into play: I believe AMD is now optimizing in favour of the PUSH based model, which would be at odds with the Pentium IV. As a general rule you can treat the two storage schemes as equals and concentrate your optimization efforts on using registers as much as you can and immediates where possible.
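
For reference, the three storage schemes being compared can be sketched as below. This is only an illustration, assuming a standard MASM32 install; DoWork and savedEcx are hypothetical names that do not come from this thread, and the sketch makes no claim about which variant is fastest.

```asm
    include \masm32\include\masm32rt.inc

    .data?
      savedEcx dd ?             ; hypothetical memory variable

    .code

DoWork proc                     ; hypothetical routine that clobbers ecx
    xor ecx, ecx
    ret
DoWork endp

start:
    mov ecx, 12345

    ; 1. PUSH based: keep the value on the stack
    push ecx
    call DoWork
    pop ecx

    ; 2. MOV based: keep the value in a memory variable
    mov savedEcx, ecx
    call DoWork
    mov ecx, savedEcx

    ; 3. register based: keep the value in a register DoWork leaves alone
    mov esi, ecx
    call DoWork
    mov ecx, esi

    ; ecx still holds the original value at this point
    print str$(ecx),13,10
    exit

end start
```

The third variant touches no memory at all, which is why the advice above is to prefer registers when one is free.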
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

ic2

You have been corrupted... Too much C++ for you, donkey. I still live by your old-school style that you seem to have forgotten...

Nothing is faster than a mov... It's like the hand being quicker than the eye.

It's in the masm32 help files. If AMD can change that, please show me.

hutch--

Over a long time I have never seen the advantage of one over the other. At times I do stack preservations using MOV to locals, but it's generally in a procedure with no stack frame, as that simplifies the stack address calculations. The theory is that MOV is faster than either PUSH or POP, but in practice it does not seem to matter.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Tedd

I think it comes down to caching.
If the memory is already in cache, then mov will be faster. Otherwise push/pop would be faster, since the top of the stack is likely to be in cache anyway.
But in reality, instructions don't execute in isolation, so it's really not going to make much difference :P
No snowflake in an avalanche feels responsible.

ic2

Opcodes.hlp is what you need to check for the others. You'll get a general idea. Things haven't changed that much. There are too many different processors to worry about what each one changed to make one thing faster, and what it may have taken from another to make that possible. It will be your best friend. Or just read the entire Intel/AMD manual(s). Agner Fog's is the most popular optimization manual for ASM coding.

jj2007

From opcodes, for 486 - obviously not taking account of caching etc.
Any idea why pop is 4 times slower than push?

instruction      cycles  bytes
push ecx         1       1
pop ecx          4       1

mov r32, mem     1       2-4
mov mem, r32     1       2-4

MichaelW

As far as I know the most recent cycle counts published by Intel were for the P1 and PMMX. I don't have my P1 manual available, but you can get what appear to be the same cycle counts from Agner Fog's instruction_tables PDF, available here. This code measures and displays the cycle counts for several sequences of push, pop, and mov instructions, with 12 instructions executed per sequence.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        MEM dd 0
      ENDM
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        push REG
      ENDM
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        pop REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push reg + 6 * pop reg",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        push REG
        pop REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push/pop reg",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        push MEM
      ENDM
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        pop MEM
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push mem + 6 * pop mem",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        push MEM
        pop MEM
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push/pop mem",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov REG, eax
      ENDM
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov eax, REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * mov reg, eax + 6 * mov eax, reg",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov REG, edx
      ENDM
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov edx, REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * mov reg, edx + 6 * mov edx, reg",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        mov MEM, eax
      ENDM
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        mov eax, MEM
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * mov mem, eax + 6 * mov eax, mem",13,10

    print chr$(13,10)
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

8 cycles        6 * push reg + 6 * pop reg
8 cycles        6 * push/pop reg

21 cycles       6 * push mem + 6 * pop mem
20 cycles       6 * push/pop mem

2 cycles        6 * mov reg, eax + 6 * mov eax, reg
2 cycles        6 * mov reg, edx + 6 * mov edx, reg
7 cycles        6 * mov mem, eax + 6 * mov eax, mem


I could not think of any good way to test push and pop independently, but if pop actually is slower, I too would like to know why.
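
One possible way to get closer to separate numbers is sketched below. It assumes the same MASM32 setup and the same timers.asm macros as the listing above, and it balances the stack with a single add or sub instead of the opposite instruction. The add/sub cost and the loop overhead are still included, so the baseline run is there for comparison, and the pops read whatever stale values happen to sit below the stack pointer. Treat it as an untested idea, not a verified measurement method.

```asm
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm

    .code
start:
    invoke Sleep, 3000

    ; baseline: only the stack pointer adjustment
    counter_begin 1000, HIGH_PRIORITY_CLASS
      sub esp, 24
      add esp, 24
    counter_end
    print ustr$(eax)," cycles",9,"sub esp, 24 + add esp, 24",13,10

    ; six pushes, stack rebalanced with one add instead of six pops
    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        push REG
      ENDM
      add esp, 24
    counter_end
    print ustr$(eax)," cycles",9,"6 * push reg + add esp, 24",13,10

    ; six pops, stack space reserved with one sub instead of six pushes
    ; (the popped values are stale stack contents and are not used)
    counter_begin 1000, HIGH_PRIORITY_CLASS
      sub esp, 24
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        pop REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"sub esp, 24 + 6 * pop reg",13,10

    inkey "Press any key to exit..."
    exit
end start
```

Subtracting the baseline from each of the other two counts gives a rough per-sequence figure, though on an out-of-order processor the add or sub may overlap with the pushes and pops, so the subtraction is only approximate.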
eschew obfuscation

zooba

Pop may be slower since a memory write (push) can be scheduled and forgotten about, while pop has to wait for the read to occur before continuing.

That's just a suggestion. It's so hard to follow what goes on in modern processors. The older ones are much easier :P

Cheers,

Zooba

ic2

Quote: From opcodes, for 486 - obviously not taking account of caching etc.
I guess it's time to read some Intel/AMD manual(s). Are there any in particular that you would suggest as good reading? Something to stick with for a few years.

hutch--

Here is a test piece that shows the mov is faster than push / pop on my PIV. Typical results (GetTickCount milliseconds for 100,000,000 calls of each procedure) are,


703 push / pop
641 load /store MOV
687 push / pop
656 load /store MOV
688 push / pop
656 load /store MOV
688 push / pop
640 load /store MOV
703 push / pop
641 load /store MOV
703 push / pop
641 load /store MOV
687 push / pop
641 load /store MOV
703 push / pop
641 load /store MOV
Press any key to continue ...


The code.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    pushem PROTO
    movem  PROTO

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    push esi

    REPEAT 8

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000
  @@:
    call pushem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," push / pop",13,10

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000
  @@:
    call movem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," load /store MOV",13,10

  ; **********************************

    ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

movem proc

    sub esp, 28

    mov [esp], ebp
    mov [esp+4], ebx
    mov [esp+8], esi
    mov [esp+12], edi
    mov [esp+16], eax
    mov [esp+20], ecx
    mov [esp+24], edx

    mov edx, [esp+24]
    mov ecx, [esp+20]
    mov eax, [esp+16]
    mov edi, [esp+12]
    mov esi, [esp+8]
    mov ebx, [esp+4]
    mov ebp, [esp]

    add esp, 28

    ret

movem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

pushem proc

    push ebp
    push ebx
    push esi
    push edi
    push eax
    push ecx
    push edx

    pop edx
    pop ecx
    pop eax
    pop edi
    pop esi
    pop ebx
    pop ebp

    ret

pushem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


end start

jj2007

Quote from: MichaelW on April 24, 2008, 04:11:08 AM
..cycle counts from Agner Fog's instruction_tables PDF

8 cycles        6 * push reg + 6 * pop reg
7 cycles        6 * mov mem, eax + 6 * mov eax, mem

Which is perfectly in line with the 10-15% difference observed by Hutch.
Agner, P1 (chapter 2):

Instruction  Operands  Cycles
MOV          r/m/i     1
PUSH         r/i       1
POP          r         1

So the opcodes.hlp might need an update ;-)
I only quote r/m because that's how we basically use a push/pop.

donkey

Quote from: ic2 on April 23, 2008, 06:14:00 AM
You have been corrupted... Too much C++ for you, donkey. I still live by your old-school style that you seem to have forgotten...

Nothing is faster than a mov... It's like the hand being quicker than the eye.

It's in the masm32 help files. If AMD can change that, please show me.

As I said long, long ago, mov to a register is always faster. However, mov to a memory location is not guaranteed to be faster; it depends on data cache issues, instruction latency, and the ability of the processor to execute the mov out of order (non-temporally). With the newer processors some things are counter-intuitive, and what would logically appear to always be faster might not be. Even testing timings for specific opcodes, especially when they involve memory latency, is virtually impossible, since once the first write/read is complete the memory address is in the cache and subsequent iterations will execute more quickly. However, it is very rare that you will be writing or reading the same data at the same address thousands of times. It is usually a one-shot deal, so you must take into account memory and instruction latency and the fact that the address will probably not be cached. The stack, however, is nearly always in the cache and does not suffer to the same degree as general memory when it comes to these problems. Registers do not suffer any of these slowdowns, with the exception of stalls.

Donkey

Vortex

The results on my PIV :

657 push / pop
578 load /store MOV
656 push / pop
578 load /store MOV
656 push / pop
579 load /store MOV
656 push / pop
578 load /store MOV
641 push / pop
578 load /store MOV
641 push / pop
578 load /store MOV
656 push / pop
578 load /store MOV
657 push / pop
578 load /store MOV
Press any key to continue ...

donkey

Hi Vortex,

As I said in my post, the data is meaningless. Push/pop will generally be consistent from the first execution to the second, while the difference between the first execution of a mov mem, imm and its second execution could be quite large due to the state of the cache. Averaging over thousands of iterations is generally an acceptable way to gauge the execution speed of an instruction, but in this case it will normally be a one-shot mov, without any consideration of what's already in the cache, so that method of timing is very misleading. In my opinion, though it is not easily proven, push/pop will be faster.

Donkey