The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: loki_dre on April 21, 2008, 07:13:17 PM

Title: stack vs memory
Post by: loki_dre on April 21, 2008, 07:13:17 PM
Anyone know if it is generally faster to push and pop a value on the stack versus saving it to a variable in memory?
Title: Re: stack vs memory
Post by: donkey on April 21, 2008, 10:46:11 PM
It depends on the processor you are targeting, though the gains are minimal at best. Latency on earlier Pentiums favoured the PUSH based model, while I believe the Pentium IV favours the MOV based model. This type of thing can change from processor to processor, and the AMD vs Intel question would also come into play: I believe AMD is now optimizing in favour of the PUSH based model, which would be at odds with the Pentium IV. As a general rule you can treat the two storage schemes as equals and concentrate your optimization efforts on using registers as much as you can, and immediates where possible.
Title: Re: stack vs memory
Post by: ic2 on April 23, 2008, 06:14:00 AM
You have been corrupted... Too much C++ for you, donkey. I still live by your old school style that you seem to have forgotten...

Nothing is faster than a mov... It's like the hand is quicker than the eye.

It's in the masm32 help files. If AMD can change that, please show me.
Title: Re: stack vs memory
Post by: hutch-- on April 23, 2008, 06:39:33 AM
Over a long time I have never seen the advantage of one over the other. At times I do stack preservations using MOV to locals, but it's generally with a no-stack-frame procedure, as that simplifies the stack address calculations. The theory is that MOV is faster than either PUSH or POP, but in practice it does not seem to matter.
Title: Re: stack vs memory
Post by: Tedd on April 23, 2008, 11:43:42 AM
I think it comes down to caching.
If the memory is already in the cache, then mov will be faster; if it isn't, push/pop is likely to win, since the top of the stack is almost always in the cache anyway.
But in reality, instructions don't execute in isolation, so it's really not going to make much difference :P
Title: Re: stack vs memory
Post by: ic2 on April 23, 2008, 01:57:28 PM
Opcodes.hlp is what you need to check about the others. You'll get a general idea; things haven't changed that much. There are too many different processors to worry about what they changed to make one thing faster, and what they may have taken from another to make that possible. It will be your best friend. Or just read the entire Intel/AMD manuals. Agner Fog's is the most popular optimizing manual for ASM coding.
Title: Re: stack vs memory
Post by: jj2007 on April 24, 2008, 12:09:07 AM
From opcodes, for 486 - obviously not taking account of caching etc.
Any idea why pop is 4 times slower than push?

instruction    cycles  bytes
push ecx       1       1
pop ecx        4       1

mov r32, mem   1       2-4
mov mem, r32   1       2-4
Title: Re: stack vs memory
Post by: MichaelW on April 24, 2008, 04:11:08 AM
As far as I know, the most recent cycle counts published by Intel were for the P1 and PMMX. I don't have my P1 manual available, but you can get what appear to be the same cycle counts from Agner Fog's instruction_tables PDF, available here (http://www.agner.org/optimize/). This code measures and displays the cycle counts for several sequences of push, pop, and mov instructions, with 12 instructions executed per sequence.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        MEM dd 0
      ENDM
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        push REG
      ENDM
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        pop REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push reg + 6 * pop reg",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        push REG
        pop REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push/pop reg",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        push MEM
      ENDM
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        pop MEM
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push mem + 6 * pop mem",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        push MEM
        pop MEM
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * push/pop mem",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov REG, eax
      ENDM
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov eax, REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * mov reg, eax + 6 * mov eax, reg",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov REG, edx
      ENDM
      FOR REG,<eax,ebx,ecx,edx,esi,edi>
        mov edx, REG
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * mov reg, edx + 6 * mov edx, reg",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        mov MEM, eax
      ENDM
      FOR MEM, <m1,m2,m3,m4,m5,m6>
        mov eax, MEM
      ENDM
    counter_end
    print ustr$(eax)," cycles",9,"6 * mov mem, eax + 6 * mov eax, mem",13,10

    print chr$(13,10)
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

8 cycles        6 * push reg + 6 * pop reg
8 cycles        6 * push/pop reg

21 cycles       6 * push mem + 6 * pop mem
20 cycles       6 * push/pop mem

2 cycles        6 * mov reg, eax + 6 * mov eax, reg
2 cycles        6 * mov reg, edx + 6 * mov edx, reg
7 cycles        6 * mov mem, eax + 6 * mov eax, mem


I could not think of any good way to test push and pop independently, but if pop actually is slower, I too would like to know why.
Title: Re: stack vs memory
Post by: zooba on April 24, 2008, 05:33:24 AM
Pop may be slower since a memory write (push) can be scheduled and forgotten about, while pop has to wait for the read to occur before continuing.

That's just a suggestion. It's so hard to follow what goes on in modern processors. The older ones are much easier :bg  :P

Cheers,

Zooba :U
Title: Re: stack vs memory
Post by: ic2 on April 24, 2008, 06:37:05 AM
Quote: From opcodes, for 486 - obviously not taking account of caching etc.
I guess it's time to read some Intel/AMD manuals. Are there any particular ones that would be suggested as good reading? Something to stick with for a few years...
Title: Re: stack vs memory
Post by: hutch-- on April 24, 2008, 07:21:17 AM
Here is a test piece that shows the mov is faster than push / pop on my PIV. Typical results are,


703 push / pop
641 load /store MOV
687 push / pop
656 load /store MOV
688 push / pop
656 load /store MOV
688 push / pop
640 load /store MOV
703 push / pop
641 load /store MOV
703 push / pop
641 load /store MOV
687 push / pop
641 load /store MOV
703 push / pop
641 load /store MOV
Press any key to continue ...


The code.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    pushem PROTO
    movem  PROTO

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    push esi

    REPEAT 8

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000
  @@:
    call pushem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," push / pop",13,10

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000
  @@:
    call movem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," load /store MOV",13,10

  ; **********************************

    ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

movem proc

    sub esp, 28

    mov [esp], ebp
    mov [esp+4], ebx
    mov [esp+8], esi
    mov [esp+12], edi
    mov [esp+16], eax
    mov [esp+20], ecx
    mov [esp+24], edx

    mov edx, [esp+24]
    mov ecx, [esp+20]
    mov eax, [esp+16]
    mov edi, [esp+12]
    mov esi, [esp+8]
    mov ebx, [esp+4]
    mov ebp, [esp]

    add esp, 28

    ret

movem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

pushem proc

    push ebp
    push ebx
    push esi
    push edi
    push eax
    push ecx
    push edx

    pop edx
    pop ecx
    pop eax
    pop edi
    pop esi
    pop ebx
    pop ebp

    ret

pushem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


end start
Title: Re: stack vs memory
Post by: jj2007 on April 24, 2008, 08:21:57 AM
Quote from: MichaelW on April 24, 2008, 04:11:08 AM
..cycle counts from Agner Fog's instruction_tables PDF

8 cycles        6 * push reg + 6 * pop reg
7 cycles        6 * mov mem, eax + 6 * mov eax, mem

Which is perfectly in line with the 10-15% difference observed by Hutch.
Agner, P1 (chapter 2):
Instruction, Operands, cycles
MOV r/m/i 1
PUSH r/i 1
POP r 1

So the opcodes.hlp might need an update ;-)
I only quote r/m because that's how we basically use a push/pop.
Title: Re: stack vs memory
Post by: donkey on April 24, 2008, 08:18:31 PM
Quote from: ic2 on April 23, 2008, 06:14:00 AM
You have been corrupted... Too much C++ for you, donkey. I still live by your old school style that you seem to have forgotten...

Nothing is faster than a mov... It's like the hand is quicker than the eye.

It's in the masm32 help files. If AMD can change that, please show me.

As I said long, long ago, mov to a register is always faster. However, mov to a memory location is not guaranteed to be faster, depending on data cache issues, instruction latency, and the ability of the processor to execute the mov out of order (non-temporally). With the newer processors some things are counter-intuitive, and what would logically appear to always be faster might not be. Even testing timings for specific opcodes, especially when they involve memory latency, is virtually impossible, since once the first write/read is complete the memory address is in the cache and subsequent iterations will execute more quickly. However, it is very rare that you will be writing or reading the same data at the same address thousands of times; it is usually a one-shot deal, and you must therefore take into account that the address will probably not be cached, as well as memory and instruction latency. The stack, however, is always in the cache and does not suffer to the same degree as general memory when it comes to these problems; registers do not suffer any of these slowdowns, with the exception of stalls.

Donkey
Title: Re: stack vs memory
Post by: Vortex on April 24, 2008, 08:41:52 PM
The results on my PIV :

657 push / pop
578 load /store MOV
656 push / pop
578 load /store MOV
656 push / pop
579 load /store MOV
656 push / pop
578 load /store MOV
641 push / pop
578 load /store MOV
641 push / pop
578 load /store MOV
656 push / pop
578 load /store MOV
657 push / pop
578 load /store MOV
Press any key to continue ...
Title: Re: stack vs memory
Post by: donkey on April 24, 2008, 09:06:57 PM
Hi Vortex,

As I said in my post, the data is meaningless. push/pop will generally be consistent from first execution to second, while the difference between the first execution of a mov mem, imm and its second execution could be quite large due to the state of the cache. Averaging over thousands of iterations is generally an acceptable way to gauge the execution speed of an instruction, but in this case, since it will normally be a one-shot mov without any consideration of what's already in the cache, that method of timing is very misleading. In my opinion, though it is not easily proven, push/pop will be faster.

Donkey
Title: Re: stack vs memory
Post by: MichaelW on April 24, 2008, 11:00:22 PM
This is an attempt to compare the cycle counts when the memory being accessed is not in the cache.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    mov edi, halloc(70000000)

    invoke Sleep, 3000

    REPEAT 10

      counter_begin 1000, HIGH_PRIORITY_CLASS

        push ebp
        push ebx
        push esi
        push edi
        push eax
        push ecx
        push edx

        pop edx
        pop ecx
        pop eax
        pop edi
        pop esi
        pop ebx
        pop ebp

      counter_end
      print ustr$(eax)," cycles",9,"push/pop",13,10

      counter_begin 1, HIGH_PRIORITY_CLASS

        mov [edi], ebp
        mov [edi+10000000], ebx
        mov [edi+20000000], esi
        mov [edi+30000000], edi
        mov [edi+40000000], eax
        mov [edi+50000000], ecx
        mov [edi+60000000], edx

        mov edx, [edi+60000000]
        mov ecx, [edi+50000000]
        mov eax, [edi+40000000]
        mov edi, [edi+30000000]
        mov esi, [edi+20000000]
        mov ebx, [edi+10000000]
        mov ebp, [edi]

      counter_end
      print ustr$(eax)," cycles",9,"mov/mov",13,10

    ENDM

    print chr$(13,10)

    hfree edi

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Typical results on my P3:

10 cycles       push/pop
19454 cycles    mov/mov
10 cycles       push/pop
606 cycles      mov/mov
10 cycles       push/pop
597 cycles      mov/mov
10 cycles       push/pop
487 cycles      mov/mov
10 cycles       push/pop
682 cycles      mov/mov
10 cycles       push/pop
516 cycles      mov/mov
10 cycles       push/pop
625 cycles      mov/mov
10 cycles       push/pop
487 cycles      mov/mov
10 cycles       push/pop
597 cycles      mov/mov
10 cycles       push/pop
534 cycles      mov/mov


There is a lot of variation in the mov/mov counts because I'm making only one pass through the instructions. For the push/pop counts I just assumed that the active area of the stack would be in the cache and averaged the counts over 1000 loops. I'm guessing that no processor's L1 cache can (yet) span a 10-million byte separation. I think the consistently higher cycle count for the first access is because at that point the memory is not in the L2 cache.

[attachment deleted by admin]
Title: Re: stack vs memory
Post by: donkey on April 24, 2008, 11:55:46 PM
Quote from: MichaelW on April 24, 2008, 11:00:22 PM
There is a lot of variation in the mov/mov counts because I'm making only one pass through the instructions. For the push/pop counts I just assumed that the active area of the stack would be in the cache and averaged the counts over 1000 loops. I'm guessing that no processor's L1 cache can (yet) span a 10-million byte separation. I think the consistently higher cycle count for the first access is because at that point the memory is not in the L2 cache.

This demonstrates my point; it is certainly much more realistic than the other tests in the thread.

Donkey
Title: Re: stack vs memory
Post by: hutch-- on April 25, 2008, 01:18:35 AM
This is my problem with the comparison method: it was supposed to be a comparison between push/pop and the load/store versions of MOV, but when you use push/pop on local stack memory with linear addressing and do the test on load/store MOV with large spacings between each memory address, you are no longer comparing the same thing.

Instead of being an instruction choice comparison as per the original question, it is a test of in-cache versus out-of-cache, and the result is nothing contentious; it's usually referred to as page thrashing.
Title: Re: stack vs memory
Post by: MichaelW on April 25, 2008, 02:44:40 AM
The original question specified "a variable in memory", with no indication of where in memory. It seems reasonable to assume that memory in the vicinity of ESP would be in the cache, but for "a variable in memory" it would depend on whether or not the variable was local. At least my second test illustrates why, given a choice between push/pop and a mov to/from a non-local variable, push/pop would likely be the best choice.
Title: Re: stack vs memory
Post by: NightWare on April 25, 2008, 03:27:07 AM
 ::) And you have an algo where you use data with a distance of 60000000 bytes between them? (Because that sounds like virtual memory usage here...) Reducing it to a few KB of distance will certainly give more realistic results...
Title: Re: stack vs memory
Post by: ic2 on April 25, 2008, 04:39:57 AM
Quote: As I said long long ago, mov to a register is always faster

I got the saved web page in my folder somewhere, donkey, and it was more like "Nothing is faster than..." You were a member of the Code Warriors. You guys did not sound that nice to me... But you were the one who would calm things down a bit. It was my best-ever pastime reading. Now I do decent ASM. I just had to crack my joke while the getting was good.  :wink

Title: Re: stack vs memory
Post by: MichaelW on April 25, 2008, 05:13:52 AM
Quote: And you have an algo where you use data with a distance of 60000000 bytes between them? (Because that sounds like virtual memory usage here...) Reducing it to a few KB of distance will certainly give more realistic results...

Allocating 1000000 bytes and using a distance of 100000:

10 cycles       push/pop
14108 cycles    mov/mov
10 cycles       push/pop
331 cycles      mov/mov
10 cycles       push/pop
418 cycles      mov/mov
10 cycles       push/pop
305 cycles      mov/mov
10 cycles       push/pop
400 cycles      mov/mov
10 cycles       push/pop
343 cycles      mov/mov
10 cycles       push/pop
400 cycles      mov/mov
10 cycles       push/pop
257 cycles      mov/mov
10 cycles       push/pop
372 cycles      mov/mov
10 cycles       push/pop
378 cycles      mov/mov


Allocating 100000 bytes and using a distance of 10000:

10 cycles       push/pop
12262 cycles    mov/mov
10 cycles       push/pop
342 cycles      mov/mov
10 cycles       push/pop
437 cycles      mov/mov
10 cycles       push/pop
406 cycles      mov/mov
10 cycles       push/pop
350 cycles      mov/mov
10 cycles       push/pop
344 cycles      mov/mov
10 cycles       push/pop
415 cycles      mov/mov
10 cycles       push/pop
299 cycles      mov/mov
10 cycles       push/pop
399 cycles      mov/mov
10 cycles       push/pop
345 cycles      mov/mov


Allocating 10000 bytes and using a distance of 1000:

10 cycles       push/pop
496 cycles      mov/mov
10 cycles       push/pop
258 cycles      mov/mov
10 cycles       push/pop
305 cycles      mov/mov
10 cycles       push/pop
262 cycles      mov/mov
10 cycles       push/pop
287 cycles      mov/mov
10 cycles       push/pop
284 cycles      mov/mov
10 cycles       push/pop
322 cycles      mov/mov
10 cycles       push/pop
195 cycles      mov/mov
10 cycles       push/pop
287 cycles      mov/mov
10 cycles       push/pop
236 cycles      mov/mov


The only really large difference is in the last test for the first access. Considering that each cycle count is for 14 accesses and at least 7 cache misses, the numbers don't seem out of line to me.
Title: Re: stack vs memory
Post by: hutch-- on April 25, 2008, 05:16:59 AM
Here is a modified version of the earlier test piece that uses global memory in the .DATA? section for the variable to store registers in.


687 local memory push / pop
672 global memory load /store MOV
703 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
672 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
687 local memory push / pop
688 global memory load /store MOV
--------------------------------
689 --- average local push/pop
684 --- average global MOV load/store
--------------------------------
Press any key to continue ...


Although slower than the same space allocated on the stack with MOV code, it is no slower than push/pop, so I would conclude from the tests that push/pop is slower than load/store MOV, but not by enough to affect most algorithms.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    pushem PROTO
    movem  PROTO

    .data?
      align 4
      _eax dd ?
      _ecx dd ?
      _edx dd ?
      _ebx dd ?
      _ebp dd ?
      _esi dd ?
      _edi dd ?

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    LOCAL clock1    :DWORD
    LOCAL clock2    :DWORD

    mov clock1, 0
    mov clock2, 0

    push esi

    REPEAT 8

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000
  @@:
    call pushem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add clock1, eax

    print str$(eax)," local memory push / pop",13,10

  ; **********************************

    invoke GetTickCount
    push eax

    mov esi, 100000000
  @@:
    call movem
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add clock2, eax

    print str$(eax)," global memory load /store MOV",13,10

  ; **********************************

    ENDM

    shr clock1, 3
    shr clock2, 3

    print "--------------------------------",13,10
    print str$(clock1)," --- average local push/pop",13,10
    print str$(clock2)," --- average global MOV load/store",13,10
    print "--------------------------------",13,10

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

movem proc

;     sub esp, 28
;
;     mov [esp], ebp
;     mov [esp+4], ebx
;     mov [esp+8], esi
;     mov [esp+12], edi
;     mov [esp+16], eax
;     mov [esp+20], ecx
;     mov [esp+24], edx
;
;     mov edx, [esp+24]
;     mov ecx, [esp+20]
;     mov eax, [esp+16]
;     mov edi, [esp+12]
;     mov esi, [esp+8]
;     mov ebx, [esp+4]
;     mov ebp, [esp]
;
;     add esp, 28


    mov _eax, eax
    mov _ecx, ecx
    mov _edx, edx
    mov _ebx, ebx
    mov _ebp, ebp
    mov _esi, esi
    mov _edi, edi

    mov eax, _eax
    mov ecx, _ecx
    mov edx, _edx
    mov ebx, _ebx
    mov ebp, _ebp
    mov esi, _esi
    mov edi, _edi

    ret

movem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

pushem proc

    push ebp
    push ebx
    push esi
    push edi
    push eax
    push ecx
    push edx

    pop edx
    pop ecx
    pop eax
    pop edi
    pop esi
    pop ebx
    pop ebp

    ret

pushem endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


Title: Re: stack vs memory
Post by: donkey on April 25, 2008, 05:31:45 AM
I think both MichaelW's and Hutch's tests are fine; as I said, it is a very difficult problem to write a test for. How do you decide on the spacing to simulate cache misses? Since in a normal assembly program you could probably get the entire data section into the L2 cache (512K to 1MB), you could make a case for Hutch's implementation, and it becomes obvious that Intel's comment that MOV had been optimized on the PIV was correct. However, since the original question did not specify any particular storage scheme, a wider approach such as MichaelW's is equally correct and realistic. For myself, I stick to my original thought: under normal circumstances the gain is negligible and can be ignored, since in real life it would normally be a one-shot deal and losing a nanosecond would not adversely affect run time.
Title: Re: stack vs memory
Post by: NightWare on April 26, 2008, 01:29:00 AM
Quote from: donkey on April 25, 2008, 05:31:45 AM
it becomes obvious that Intel's comment that MOV had been optimized on the PIV was correct.
hmm... if that's true then they have optimised push/pop later... because here are the results of hutch's test on my Core 2 Duo 2GHz:
592 local memory push / pop
656 global memory load /store MOV
577 local memory push / pop
686 global memory load /store MOV
578 local memory push / pop
670 global memory load /store MOV
578 local memory push / pop
624 global memory load /store MOV
577 local memory push / pop
639 global memory load /store MOV
562 local memory push / pop
671 global memory load /store MOV
561 local memory push / pop
655 global memory load /store MOV
562 local memory push / pop
671 global memory load /store MOV
--------------------------------
573 --- average local push/pop
659 --- average global MOV load/store
--------------------------------
Or maybe it's because of the independent L1 caches... or the shared L2 cache... Damn... too many parameters to take care of...  :lol
Title: Re: stack vs memory
Post by: donkey on April 26, 2008, 02:19:55 AM
Hi NightWare,

It could very well be that they have a different scheme for different PIVs; certainly Netburst would favor the MOV optimization, while smaller-cache models (i.e. 512K) might favor the push/pop optimization. Intel tends to change things like this on the fly; a good example is the change from DEC reg in favor of SUB reg,1 on the PIV that Hutch pointed out in a much earlier thread, which no longer seems to hold true on the newer models such as the Core 2 Duo in my other box. Though I have to admit I have not tested the instructions thoroughly, it seems like DEC reg is back in favor. I think that DEC reg is an optimization that could make a difference, since it is used in loops quite often and can have a real influence on run time.

Donkey
Title: Re: stack vs memory
Post by: hutch-- on April 26, 2008, 05:56:17 AM
If I remember correctly, the core design for the later Core 2 Duo is much closer to a PIII design than a PIV. What I would be interested to see is whether LEA is as fast on a Core 2 Duo as it was, relatively, on a PIII. On a PIV it is off the pace enough to not use it much.
Title: Re: stack vs memory
Post by: jj2007 on April 26, 2008, 04:43:45 PM
The .data? section should be in the cache anyway, same for the stack. But what about GlobalAlloc, HeapAlloc & Co? Is there anything to watch for in view of optimisation?
Title: Re: stack vs memory
Post by: donkey on April 26, 2008, 07:38:05 PM
Quote from: jj2007 on April 26, 2008, 04:43:45 PM
The .data? section should be in the cache anyway, same for the stack. But what about GlobalAlloc, HeapAlloc & Co? Is there anything to watch for in view of optimisation?

Use VirtualAlloc for allocations over 4MB, use HeapAlloc for allocations up to 4MB, and avoid using GlobalAlloc. That's according to Microsoft...

Quote from MSDN: Note: The global functions are slower than other memory management functions and do not provide as many features. Therefore, new applications should use the heap functions. However, the global functions are still used with DDE, the clipboard functions, and OLE data objects.
Title: Re: stack vs memory
Post by: NightWare on April 26, 2008, 09:18:50 PM
Quote from: hutch-- on April 26, 2008, 05:56:17 AM
What I would be interested to see is if LEA is as fast on a Core 2 Duo as it ws relatively on a PIII. On a PIV it is off the pace enough to not use it much.
hi,
Off topic, but LEA has the same speed for me on the C700/P4/Core2Duo... but it's true the Core 2 Duo is relatively close to the P3... For example it's the same for rol/ror on the C700 and Core2Duo (and it's horribly slow on my P4...). I've also noted, in recent speed tests, a disparity between P3/Core2 and P4... and even between different P4s...

Now, there is also a difference between the P3 and Core2: push/pop were slower than mov on the P3...  :wink
Title: Re: stack vs memory
Post by: hutch-- on April 27, 2008, 12:06:59 AM
Unless Microsoft changes the timed behaviour of GlobalAlloc() in a later version, it still seems to have the legs on other memory strategies when you use the GMEM_FIXED flag. Apart from that, there are a couple of legacy applications for it with other flags, but it's usually off the pace.

It has many characteristics that I find useful: fine granularity, it will allocate up to the physical limit of the OS, and with large allocations it's fast. Recently, while I was working on the dynamic array code, I started with HeapAlloc() for the main pointer array, but it crashed after about 300000 stored DWORD pointers. Changing it to GlobalAlloc made it both fast and reliable. I used OLE string memory for the array members, as it has the "garbage collection" characteristic that reduces memory fragmentation, as well as storing the length 4 bytes below the start address.
Title: Re: stack vs memory
Post by: ic2 on April 27, 2008, 02:21:00 AM
I don't know where I was when I found the first link but maybe it's up-to-date.

http://www.freetechbooks.com/software-optimization-manuals-t330.html


http://www.agner.org/optimize/
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 03:22:52 AM
Quote from: hutch-- on April 27, 2008, 12:06:59 AM
Unless Microsoft changes the timed behaviour of GlobalAlloc() in a later version, it still seems to have the legs on other memory strategies when you use the GMEM_FIXED flag. Apart from that, there are a couple of legacy applications for it with other flags, but it's usually off the pace.

It has many characteristics that I find useful: fine granularity, it will allocate up to the physical limit of the OS, and with large allocations it's fast. Recently, while I was working on the dynamic array code, I started with HeapAlloc() for the main pointer array, but it crashed after about 300000 stored DWORD pointers. Changing it to GlobalAlloc made it both fast and reliable. I used OLE string memory for the array members, as it has the "garbage collection" characteristic that reduces memory fragmentation, as well as storing the length 4 bytes below the start address.

Well, HeapAlloc and GlobalAlloc both use the heap to allocate memory, so I don't know why you had problems with HeapAlloc; I've never experienced any stability issues with either of them. The speed difference is in the calls to ntdll.dll: the Heapxxx functions are simple forwarding wrappers for NTDLL, while the Globalxxx functions must perform various setups each time a memory management function is called in order to maintain compatibility with Windows 3.x. For the actual moving of data in and out there is no difference in speed as far as I can tell. The extra overhead involved in making the Globalxxx functions compatible with 16-bit legacy features such as the clipboard and DDE makes calls to those functions slower by comparison; however, when you look at the disassembly of Kernel32.dll, the actual allocation schemes are identical, so I can't see where one would fail and the other wouldn't.

As for granularity, both use the same heap and are subject to the same granularity, so there cannot be any difference there. But both are subject to severe fragmentation when the block of memory gets too large, so in both cases allocating over 4 MB, though possible, is not recommended, or you may experience slowdowns.

Donkey
Title: Re: stack vs memory
Post by: zooba on April 27, 2008, 04:18:18 AM
The process heap is limited in size (see the /HEAP (http://msdn2.microsoft.com/en-us/library/f90ybzkh(VS.85).aspx) linker option).

If you create a growable heap using HeapCreate() then you won't hit an upper limit until the system runs out of memory.

Cheers,

Zooba :U
Title: Re: stack vs memory
Post by: hutch-- on April 27, 2008, 05:08:10 AM
Where I found the problem with HeapAlloc() was the size of the allocation. I think Zooba is right that you can set up using HeapCreate() but I then wonder whether it's less overhead than GlobalAlloc(), which easily allocates over a gigabyte with no hiccups at all. In the array system I needed fixed memory for the pointer array but movable memory for the members.

Where you don't use GlobalAlloc() is with the movable flag set as it is purely legacy code for DDE and the clipboard, but with the fixed memory flag, nothing else is faster that I have found. I have also picked up that with later versions of XP and probably Vista the low fragmentation heap option for HeapAlloc is particularly slow, so I have yet to find a good reason to use HeapAlloc(), which has always sounded like a leftover from the old C code mentality of the DOS days.
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 05:32:12 AM
Quote from: hutch-- on April 27, 2008, 05:08:10 AM
Where I found the problem with HeapAlloc() was the size of the allocation. I think Zooba is right that you can set up using HeapCreate() but I then wonder whether it's less overhead than GlobalAlloc(), which easily allocates over a gigabyte with no hiccups at all. In the array system I needed fixed memory for the pointer array but movable memory for the members.

Where you don't use GlobalAlloc() is with the movable flag set as it is purely legacy code for DDE and the clipboard, but with the fixed memory flag, nothing else is faster that I have found. I have also picked up that with later versions of XP and probably Vista the low fragmentation heap option for HeapAlloc is particularly slow, so I have yet to find a good reason to use HeapAlloc(), which has always sounded like a leftover from the old C code mentality of the DOS days.

Why would you need moveable memory for the members? Unless you plan to copy them to the clipboard, moveable memory is useless in Win32. Even Raymond Chen admits that there is no practical use for moveable memory outside of clipboard and DDE functions in the Old New Thing (http://blogs.msdn.com/oldnewthing/archive/2004/11/08/253891.aspx) and here (http://blogs.msdn.com/oldnewthing/archive/2004/11/09/254441.aspx). And even then, the only reason clipboard functions may fail without GlobalAlloc is the rampant use of undocumented and unsupported behaviour by third party applications that may interact with yours, and those are becoming rarer as the users of co-operative multitasking disappear (thankfully).

GMEM_MOVEABLE was only ever useful in co-operative multitasking: since there was no virtual memory, the memory manager had to be able to move blocks of memory from time to time and applications could not rely on pointers in real memory being stable, so they used a handle instead and had to ask the OS where the memory actually ended up each time they wanted to access it. The advent of virtual memory made this scheme pointless and for the most part (outside of a few legacy functions) it is never used in modern coding, since it is always more efficient to use a fixed pointer. However, since Windows 1.x applications were discouraged from using fixed pointers, when 32 bit Windows was introduced most applications used moveable memory and had to be accommodated, so the kludge was added to the API and we're stuck with it. That doesn't mean we have to use it; after all, internally there are around 15 to 20 extra API calls just to allocate memory using the Global functions and the end result is exactly the same as the Heap functions with the exception of the return value type.

Donkey
Title: Re: stack vs memory
Post by: hutch-- on April 27, 2008, 08:24:47 AM
Edgar,

> Why would you need moveable memory for the members ?

Simple, after thousands to millions of allocations and deallocations you get severe memory fragmentation. Use a method that the OS can move around in memory and the memory manager solves that problem: it compacts the memory splattered all over the place and keeps more memory available in usable sized blocks. The price is speed; fixed memory is faster but makes a much bigger mess when large numbers of allocations and deallocations occur.
Title: Re: stack vs memory
Post by: zooba on April 27, 2008, 08:46:24 AM
Quote from: hutch-- on April 27, 2008, 08:24:47 AM
Use a method that the OS can move around in memory and the memory manager solves that problem, it compacts the memory splattered all over the place and keeps more memory available in usable sized blocks.

That's assuming that Windows actually does that anymore. The LFH was slower when I last tested it (XP SP2) but it is supposed to have an advantage for lots of allocations and deallocations of varying (smallish) sizes. The theory (http://msdn2.microsoft.com/en-us/library/aa366750(VS.85).aspx) behind how it works seems sound, as does movable memory. Both have performance hits over fixed memory and I am quite skeptical that GlobalAlloc() still behaves in the way it used to.

Cheers,

Zooba :U
Title: Re: stack vs memory
Post by: hutch-- on April 27, 2008, 09:34:27 AM
I should have made the distinction: GlobalAlloc() with the GMEM_FIXED flag versus SysAllocStringByteLen(), the opposite-end characteristics. Unless you were writing legacy code with DDE or the clipboard you would not bother to use GlobalAlloc() with the GMEM_MOVEABLE flag.
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 04:20:22 PM
Quote from: hutch-- on April 27, 2008, 08:24:47 AM
Use a method that the OS can move around in memory and the memory manager solves that problem: it compacts the memory splattered all over the place and keeps more memory available in usable sized blocks. The price is speed; fixed memory is faster but makes a much bigger mess when large numbers of allocations and deallocations occur.

Windows has not done memory compaction since Windows NT 3.51; that ended with Windows 95/NT 4.0. Read Raymond's blog that I linked...

Quote from: Raymond Chen
Memory blocks still had a lock count, even though it didn't really accomplish anything since Win32 never compacted memory
Title: Re: stack vs memory
Post by: jj2007 on April 27, 2008, 06:10:48 PM
Question: Does SysAllocStringByteLen do some kind of memory compaction or garbage collection? I.e. is it a good idea to use SysAllocStringByteLen for many small reallocations?
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 06:19:46 PM
Quote from: jj2007 on April 27, 2008, 06:10:48 PM
Question: Does SysAllocStringByteLen do some kind of memory compaction or garbage collection? I.e. is it a good idea to use SysAllocStringByteLen for many small reallocations?

No, Win32 does not do any memory compaction at all. This simply allocates a BSTR (OLE) string of a given length, the memory allocated is not really moveable in the Windows 1.x sense since that has no meaning in a flat memory model. Beyond that it allocates one DWORD before the pointer to hold the string length. These functions have nothing at all to do with garbage collection or compaction, a disassembly of the functions and documentation at MSDN confirm this. If you need compaction and garbage collection use a database and the ODBC.

Alternatively you can use the DSA_xxx (http://msdn2.microsoft.com/en-us/library/bb775647.aspx) dynamic structure array functions, they provide a callback for garbage collection and are quite efficient when handling large arrays of small allocations.
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 06:49:44 PM
Just to re-iterate...

Win32 uses a flat memory model, the concepts of global and local memory in a flat model are meaningless as is the concept of moveable memory. Due to legacy functions Windows simulates the outdated 640K model at the cost of speed due to the overhead involved in the simulation. The GlobalAlloc function is by far the worst allocation scheme available to Windows, even LocalAlloc is more efficient at allocating memory (though not by much). You should always use the heap functions for small allocations and the Virtual memory functions for large ones unless you are stuck with a legacy function that requires the simulated moveable memory.

You should avoid using COM functions to allocate memory that you will only be using internally. COM performs marshalling on its buffers to allow for interoperability between protected mode processes; this is very slow and creates a thread you have no control over, further slowing your application. Loading COM for things like BSTRs will also add 100-200K to your footprint, on top of the setup time for a COM application. In other words, avoid using COM functions unless you actually will be using COM.
Title: Re: stack vs memory
Post by: jj2007 on April 27, 2008, 08:43:37 PM
Quote from: donkey on April 27, 2008, 06:49:44 PM
Just to re-iterate... You should avoid using COM functions

Edgar, I don't quite understand the context of your warning - does SysAllocStringByteLen have anything to do with COM??
I am inspired by Hutch's dynamic array project but a bit confused. SysAllocStringByteLen had never crossed my path before, but I see it's pretty old; see here (http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/dnarword97/html/wdupdtwlls.asp) for an article from 1997. The Sys* functions seem to be handy and fast, so is there any reason not to use them for string management in a major application? Is the lack of mem compaction an issue? Another question, maybe a stupid one: Is there a need to de-allocate all strings with SysFreeString on ExitProcess, or does Windows handle that? Sorry for so many newbie questions...

EDIT: A sample app for SysAllocStringByteLen, assembling at 5632 bytes (it uses the str$ and chr$ macros as well as lstrcat). SysFreeString returns 48, no error. So my question again: Is there any argument NOT to use this fantastic API?

.nolist
include \masm32\include\masm32rt.inc

ShowWinErr PROTO:DWORD

crlf EQU 13, 10

.data?
My$ dd ?
ShowErrBuf db 200 dup(?) ; 200 chars for Windows

.data
MyText db "This is a little test string", 0
ShowErr db "Windows error:", crlf, 0

.code

start:
invoke SysAllocStringByteLen, addr MyText, sizeof MyText
mov My$, eax
invoke MessageBox, 0, str$(eax), chr$("String pointer; show the string now?"), MB_YESNO
.if eax==IDYES
invoke MessageBox, 0, My$, chr$("The String:"), MB_OK
.endif
invoke SysFreeString, My$
invoke MessageBox, 0, str$(eax), chr$("SysFreeString:"), MB_OK
invoke ShowWinErr, 1
invoke ExitProcess, 0

ShowWinErr proc ShowFormatted:DWORD
pushad
invoke GetLastError
.if ShowFormatted
invoke FormatMessage,FORMAT_MESSAGE_FROM_SYSTEM,
0, ; GetItFromSystem
eax, ; ErrNum
0, ; Default language
addr ShowErrBuf, ; where to send the string from system
200, 0          ; size 200, no arguments
.if ShowFormatted!=1
invoke lstrcat, addr ShowErrBuf, ShowFormatted
.endif
invoke MessageBox, 0, addr ShowErrBuf, addr ShowErr, MB_OK
.else
invoke MessageBox, 0, str$(eax), addr ShowErr, MB_OK
.endif
popad
ret
ShowWinErr endp

end start
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 09:27:56 PM
Hi jj2007,

Any of the Sys string functions are used to allocate strings to be used with COM; that's why the functions are in the oleaut32 DLL. But no, I don't believe that particular one uses IMalloc directly, at least as far as I know it only calls several other functions in OLE32 and OLEAUT32, but because of the nature of COM it is difficult to track what those functions are. I am always wary of any function that could call a COM interface and therefore load COM without you being aware. It is simpler just to create a buffer using HeapAlloc and go from there.

No, Windows does not handle deallocating the string; you must use SysFreeString to deallocate the memory. The main reason not to use it is that HeapAlloc is faster, by quite a large margin, and in reality all you are doing with it is allocating memory and copying some bytes. I mean, why do you need system strings for that ???

Donkey
Title: Re: stack vs memory
Post by: jj2007 on April 27, 2008, 09:54:39 PM
Quote from: donkey on April 27, 2008, 09:27:56 PM
all you are doing with it is allocating memory and copying some bytes, I mean why do you need system strings for that ???
Good question. For example, I will need thousands of variable length little strings, and wonder what are the implications of heap fragmentation:
"Memory allocated by HeapAlloc is not movable. Since the memory is not movable, it is possible for the heap to become fragmented."
The Masm32 alloc$ macro uses SysAllocStringByteLen, not HeapAlloc. Am I just entering an ideological war on the best memory allocation option??
Title: Re: stack vs memory
Post by: donkey on April 27, 2008, 10:53:32 PM
Quote from: jj2007 on April 27, 2008, 09:54:39 PM
Quote from: donkey on April 27, 2008, 09:27:56 PM
all you are doing with it is allocating memory and copying some bytes, I mean why do you need system strings for that ???
Good question. For example, I will need thousands of variable length little strings, and wonder what are the implications of heap fragmentation:
"Memory allocated by HeapAlloc is not movable. Since the memory is not movable, it is possible for the heap to become fragmented."
The Masm32 alloc$ macro uses SysAllocStringByteLen, not HeapAlloc. Am I just entering an ideological war on the best memory allocation option??

This is not an "ideological war"; there are quantifiable reasons to use one scheme over the other and they are all well documented at MS and by Intel. I'm not sure how many times I will have to type this before it is understood: there is no moveable memory in Win32 in the sense Hutch was using it. The best allocation scheme for what you are doing is a linked list and HeapAlloc with a private heap.

The MASM32 library waits for a process to end using a loop instead of the Wait functions, wrong but in there all the same... Please, no arguments about the value of a loop to do this, I'm tired of that one.

Donkey
Title: Re: stack vs memory
Post by: NightWare on April 28, 2008, 12:51:11 AM
hi donkey,
i'm not very familiar with win api (i avoid them as much as possible...), but it's quite predictable to determine our maximum memory need, so why not allocate just one memory area, just once, and make our own manipulations inside... we alloc an area with GlobalAlloc GMEM_FIXED/GMEM_ZEROINIT or HeapAlloc HEAP_ZERO_MEMORY for something like TotalArea = UsedArea+PotentiallyUsableArea

when we need to remove an area we just move (my movemem function has approximately the same speed as copymem) OFFSET DataToRemove+SIZEOF DataToRemove to OFFSET DataToRemove where size=(OFFSET PotentiallyUsableArea-(OFFSET DataToRemove+SIZEOF DataToRemove)), or size=(OFFSET PotentiallyUsableArea-OFFSET DataToRemove) if we need zero memory

then loop for all the pointers (yep, forgot to say you need a LUT for pointers, and the corresponding size...) after OFFSET DataToRemove, where you just subtract SIZEOF DataToRemove

and when you need to add an area you just have it at the new OFFSET PotentiallyUsableArea... by doing this, there is no memory fragmentation, no need to constantly alloc/free mem, and (depending on the numbers/size of the areas) it will be quite fast, no ?
Title: Re: stack vs memory
Post by: donkey on April 28, 2008, 01:37:25 AM
Quote from: NightWare on April 28, 2008, 12:51:11 AM
hi donkey,
i'm not very familiar with win api (i avoid them as much as possible...), but it's quite predictable to determine our maximum memory need, so why not allocate just one memory area, just once, and make our own manipulations inside... we alloc an area with GlobalAlloc GMEM_FIXED/GMEM_ZEROINIT or HeapAlloc HEAP_ZERO_MEMORY for something like TotalArea = UsedArea+PotentiallyUsableArea

when we need to remove an area we just move (my movemem function has approximately the same speed as copymem) OFFSET DataToRemove+SIZEOF DataToRemove to OFFSET DataToRemove where size=(OFFSET PotentiallyUsableArea-(OFFSET DataToRemove+SIZEOF DataToRemove)), or size=(OFFSET PotentiallyUsableArea-OFFSET DataToRemove) if we need zero memory

then loop for all the pointers (yep, forgot to say you need a LUT for pointers, and the corresponding size...) after OFFSET DataToRemove, where you just subtract SIZEOF DataToRemove

and when you need to add an area you just have it at the new OFFSET PotentiallyUsableArea... by doing this, there is no memory fragmentation, no need to constantly alloc/free mem, and (depending on the numbers/size of the areas) it will be quite fast, no ?


Hi NightWare,

That's pretty much how I would do it, using a linked list makes ordering simple and by writing your own memory management you can optimize it for your particular application.

Donkey
Title: Re: stack vs memory
Post by: hutch-- on April 28, 2008, 02:43:55 AM
There appears to be some confusion between the old 1990 win3.0 movable memory and modern garbage collection. Here is one of many links on how some of it works.

http://support.microsoft.com/kb/171414
Title: Re: stack vs memory
Post by: donkey on April 28, 2008, 04:59:34 AM
Quote from: hutch-- on April 28, 2008, 02:43:55 AM
There appears to be some confusion between the old 1990 win3.0 movable memory and modern garbage collection. Here is one of many links on how some of it works.

http://support.microsoft.com/kb/171414

COM uses IUnknown.Release when a program terminates to allow the reference count to be decremented if the COM thread for an application is no longer running (i.e. pinging the interface). The garbage collection here is simply a de-allocation of memory that is no longer referenced by any running process, not a compaction of in-use memory or even a simple garbage collect of in-use memory. This is similar to the way the API releases handles and the process heap even if your program does not explicitly do so on termination, another form of garbage collection. However, like the one described in the article, it has nothing to do with garbage collection in a large array that is still in use, which is the actual topic under discussion.

Also it is wrong to refer to memory that is being marshalled as "moveable" memory, they are not even close to the same thing, a closer parallel would be "shared" memory. Nowhere in the article did they refer to the memory as "moveable" since that concept has no meaning in a flat memory model.

Donkey
Title: Re: stack vs memory
Post by: hutch-- on April 28, 2008, 08:22:11 AM
Edgar,

> since that concept has no meaning in a flat memory model.

Memory management schemes are independent of the memory model they are used under; you are confusing meaning with implementation. There is of course a bit more to the choice of allocation method than just memory recovery (garbage collection): an OLE string stores the length when it is allocated where fixed memory schemes like GlobalAlloc() and HeapAlloc() don't, and this saves either scanning the string each time you need its length or having to store it somewhere else. OLE string is marginally slower in allocation than fixed memory strategies but that is predictable as it does slightly more.

Nonetheless I give you the point that automatic compaction does not occur without a mechanism written to do that job, in much the manner of Java or BASIC garbage collection. Unfortunately fragmentation testing is not a simple task, as sequential allocation in simple test pieces gives you linear addressing for each block. To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.
Title: Re: stack vs memory
Post by: jj2007 on April 28, 2008, 08:32:07 AM
Quote from: hutch-- on April 28, 2008, 08:22:11 AM
To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.

OK, let's try:

include \masm32\include\masm32rt.inc

ShowWinErr PROTO:DWORD

.data?
ShowErrBuf dd 100 dup(?) ; 400 bytes for Windows errors
StringTable dd 2000 dup (?) ; pointers to 2000 string variables

.data
ShowErr db "Windows error:", 13, 10, 0

s10 db "1234567890", 0 ; ten bytes
s2k db 200 dup ("x234567890"), 0 ; 2000
s64k db 6400 dup ("A234567890"), 0 ; 64000

.code

start: pushad
xor esi, esi ; outer loop counter ->1000*2000*66000=132 000 000 000
.While esi<1000
xor ebx, ebx ; inner loop counter
.While ebx<2000
lea edi, [StringTable+4*ebx]
mov eax, [edi]
.if eax
invoke SysFreeString, eax
.endif
mov edx, offset s10 ; default: a short string
mov eax, ebx
add eax, esi
bt eax, 0
.if CARRY? ; odd
mov edx, offset s2k
.else
bt eax, 1
.if CARRY?
mov edx, offset s64k
.endif
.endif
invoke SysAllocStringByteLen, edx, len(edx)
.if eax
mov [edi], eax
.else
invoke ShowWinErr, 1 ; show formatted error
mov esi, 999999 ; get out of the outer loop, too
.break
.endif
; invoke MessageBox, 0, eax, chr$("The String:"), MB_OK
inc ebx ; inner loop counter
.Endw
inc esi ; outer loop counter
; give us some feedback on where you are...
print chr$(13, 10, "Loop ")
print str$(esi)
print chr$(13, 10, "Length of first string is ")
lea edi, [StringTable]
invoke SysStringByteLen, [edi]
print str$(eax)
; invoke StdOut, [edi] ; the string itself
.Endw
.While ebx ; free the strings allocated in the last inner loop
dec ebx
lea edi, [StringTable+4*ebx]
mov eax, [edi]
.if eax
invoke SysFreeString, eax
.endif
.Endw
; invoke ShowWinErr, 1
if 0 ; try the same with The Heap?
invoke GetProcessHeap ; for FileOpen, Save, Menus
mov ProcHeap, eax
invoke HeapAlloc, ProcHeap,
HEAP_GENERATE_EXCEPTIONS or HEAP_ZERO_MEMORY, sizeof MyText
mov MyHeap$, eax
invoke lstrcpy, eax, addr MyText
invoke MessageBox, 0, MyHeap$, chr$("The Heap String:"), MB_OK
invoke HeapFree, ProcHeap, 0, MyHeap$
invoke ShowWinErr, chr$("HeapFree")
invoke HeapCompact, ProcHeap, 0
invoke MessageBox, 0, str$(eax), chr$("HeapCompact:"), MB_OK
endif
popad
invoke ExitProcess, 0

ShowWinErr proc ShowFormatted:DWORD
pushad
invoke GetLastError
.if ShowFormatted
invoke FormatMessage,FORMAT_MESSAGE_FROM_SYSTEM,
0, ; GetItFromSystem
eax, ; ErrNum
0, ; Default language
addr ShowErrBuf, ; where to send the string from system
400, 0          ; size 400, no arguments
.if ShowFormatted!=1
invoke lstrcat, addr ShowErrBuf, ShowFormatted
.endif
invoke MessageBox, 0, addr ShowErrBuf, addr ShowErr, MB_OK
.else
invoke MessageBox, 0, str$(eax), addr ShowErr, MB_OK
.endif
popad
ret
ShowWinErr endp

end start
Title: Re: stack vs memory
Post by: jj2007 on April 28, 2008, 10:34:03 AM
Quote from: hutch-- on April 28, 2008, 08:22:11 AM
To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.

Here is a testbed, hopefully correct. Set UseHeap EQU 1 to assemble either version.

Neither version crashed, but HeapAlloc takes roughly three times as long for the same task.

[attachment deleted by admin]
Title: Re: stack vs memory
Post by: hutch-- on April 28, 2008, 11:10:41 AM
jj,

Is it possible to use a faster string copy algo?


mov [edi], eax ; pointer into stringtable
mov eax, len(edx)
invoke lstrcpyn, [edi], edx, eax


lstrcpyn is particularly slow, I wondered if you used another copy algo if it would make the HeapAlloc code faster.

LATER: I got the speed of the HeapAlloc() code up by substituting lstrcpyn with the following. It dropped its time by about 30%.


invoke MemCopy,edx,[edi],eax
Title: Re: stack vs memory
Post by: jj2007 on April 28, 2008, 01:30:49 PM
Quote from: hutch-- on April 28, 2008, 11:10:41 AM
LATER: I got the speed of the HeapAlloc() code up by substituting lstrcpyn with the following. It dropped its time by about 30%.


invoke MemCopy,edx,[edi],eax

It works, yes, but HeapAlloc is still considerably slower, about 17%. Here some stats on ticks & mem usage:
Ticks Mem Handles GDI
Heap 12000 87 12 4
SysAll 10000 87 14 4

with these settings:
Loops EQU 8000
Strings EQU 5000


Note that SysAlloc uses three handles more. HeapAlloc on the other hand is somewhat clumsier to use.

8000*5000*66000=2.64E+012 bytes shoved around, no crash. Apparently both versions know how to reuse the freed slots. A weak point of the test app might be that although the three strings are different in size, there is an exact old slot available for future use every time one of them is released.

New version attached, will first use SysAlloc, then HeapAlloc.

[attachment deleted by admin]
Title: Re: stack vs memory
Post by: Jimg on April 28, 2008, 02:45:07 PM
I played around with your code just for fun.  By removing HEAP_ZERO_MEMORY, which SysAllocStringByteLen doesn't do, and putting the code from MemCopy inline, I get virtually identical times.  It certainly takes a lot more code, however.  Also I only did 99 loops because I didn't want to sit through 8000, so that may make some difference.  This was done on an AMD.  I'm sure P4's are much different.

[attachment deleted by admin]
Title: Re: stack vs memory
Post by: jj2007 on April 28, 2008, 07:22:20 PM
Quote from: Jimg on April 28, 2008, 02:45:07 PM
I played around with your code just for fun.  By removing HEAP_ZERO_MEMORY, which SysAllocStringByteLen doesn't do, and putting the code from MemCopy inline, I get virtually identical times.  It certainly takes a lot more code, however.  Also I only did 99 loops because I didn't want to sit through 8000, so that may make some difference.  This was done on an AMD.  I'm sure P4's are much different.
I see the old 17% difference when running your exe (P4, 2.4 GHz, 512 MB Ram, XP SP1):
Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 19599

Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 23554
Title: Re: stack vs memory
Post by: Jimg on April 28, 2008, 11:59:55 PM
I figured it was my flaky AMD.  I just ran it four times in a row from the command prompt:

R:\ tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17516
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17313

Press any key to continue ...

R:\ tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17406
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17296

Press any key to continue ...

R:\ tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17438
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17438

Press any key to continue ...

R:\ tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17484
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17390

Press any key to continue ...

R:\
Title: Re: stack vs memory
Post by: hutch-- on April 29, 2008, 12:38:42 AM
Here are my results using Jim's test code. This is on my old 2.8 gig Northwood core PIV, 2 gig of DDR400.


Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 11672
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 11906

Press any key to continue ...


I am certainly not seeing the so-called advantage of HeapAlloc() here, and the OLE string still has the added advantage of being designed for large counts of small allocations, complete with length storage 4 bytes below its start address. For what it is worth I have never seen HeapAlloc outperform GlobalAlloc when used with the GMEM_FIXED flag, which I use for the main pointer array.

As a humorous aside, when I was doing some testing recently on different memory strategies I forgot to change the deallocation from GlobalFree() to HeapFree(), yet GlobalFree successfully deallocated the memory handle from HeapAlloc.
Title: Re: stack vs memory
Post by: hutch-- on April 29, 2008, 01:18:43 AM
Jim's inlining the memory copy was a good move, here is an addition that selects between short and long string copies that gets the speed up.


                        .if ecx > 1024

                          push esi
                          push edi
                          push ecx
                          mov esi,edx
                          mov edi,[edi]
                          shr ecx,2
                          rep movsd
                          pop ecx
                          and ecx,3
                          rep movsb
                          pop edi
                          pop esi

                        .else

                          push ebx
                          push esi
                          push edi

                          mov esi, edx
                          mov edi, [edi]
                          mov ebx, -1
                          @@:
                          add ebx, 1
                          ; movzx eax, BYTE PTR [esi+ebx]
                          mov al, [esi+ebx]
                          mov [edi+ebx], al
                          cmp ebx, ecx
                          jle @B

                          pop edi
                          pop esi
                          pop ebx

                        .endif
Title: Re: stack vs memory
Post by: jj2007 on April 29, 2008, 08:05:20 AM
Doesn't work out for me: H=Hutch, M=Memcopy

avg m=12640, avg h=12726 on a P4, 3.4 GHz, 1.5 G Ram, XP SP2

The MemCop equate in the attachment determines how it is being assembled.

D:\wwn\asm\DynArray>omsm

Loop 80 - len of first string is 04000
Ticks, UseHeap=1 : 12984
D:\wwn\asm\DynArray>omsh

Loop 80 - len of first string is 04000
Ticks, UseHeap=1 : 12797
D:\wwn\asm\DynArray>omsh

Loop 80 - len of first string is 04000
Ticks, UseHeap=1 : 12656
D:\wwn\asm\DynArray>omsm

Loop 80 - len of first string is 04000
Ticks, UseHeap=1 : 12297
D:\wwn\asm\DynArray>

[attachment deleted by admin]