The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: Neil on November 19, 2008, 03:13:12 PM

Title: stb & lod Macros
Post by: Neil on November 19, 2008, 03:13:12 PM
Having used these macros, am I correct in assuming that their word & dword equivalents would be faster than their string counterparts, or would the inc or add instructions make them slower?
Title: Re: stb & lod Macros
Post by: Tedd on November 19, 2008, 03:50:19 PM
The string instructions can be faster under some circumstances, as long as there is the right amount of data being handled - there's a cut-off where it becomes faster to do it the 'long' way (or is it the other way around :lol)
There is another thread about this... somewhere.
Title: Re: stb & lod Macros
Post by: Neil on November 19, 2008, 03:57:41 PM
Tedd,
Are you saying that

    mov [edi],eax
    add edi,4

is faster than

    stosd
Title: Re: stb & lod Macros
Post by: Neil on November 19, 2008, 04:04:39 PM
Tedd,
I've searched for this other thread you mentioned, but can't find it; maybe I'm putting in the wrong search parameters.
Title: Re: stb & lod Macros
Post by: hutch-- on November 19, 2008, 04:18:17 PM
Neil,

There are a couple of special cases with string instructions, REP movsd and REP stosd. This only applies with the REP prefix; used on their own they are very slow and should be avoided. The REP string instructions do outperform the normal integer instructions in some contexts: if source and destination are not both in cache at the same time they are faster, as they appear to handle non-temporal writes in much the same way as the specialised SSE instructions, where the normal integer instructions don't have that option.

Most of us grew up with the string instructions, but from the early PIIs onwards they have not been competitive in most instances. Loading a register and incrementing the index is almost always faster, sometimes by a large amount.
Title: Re: stb & lod Macros
Post by: MichaelW on November 19, 2008, 04:25:56 PM
I couldn't think of a good real-world test for the macros, but on my P3 there appears to be no contest.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff db 8 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff
      REPEAT 10
        lodsb
      ENDM
    counter_end
    print ustr$(eax), " cycles, lodsb",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff
      REPEAT 10
        lob
      ENDM
    counter_end
    print ustr$(eax), " cycles, lob",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET buff
      REPEAT 10
        stosb
      ENDM
    counter_end
    print ustr$(eax), " cycles, stosb",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET buff
      REPEAT 10
        stb
      ENDM
    counter_end
    print ustr$(eax), " cycles, stb",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


17 cycles, lodsb
5 cycles, lob
16 cycles, stosb
8 cycles, stb


And this is more or less of a worst-case comparison:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff1 dd 1000 dup(0)
      buff2 dd 1000 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]    ; note: esi and edi are never advanced here,
      mov [edi], eax    ; so every pass copies the same dword
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]    ; note: esi and edi are never advanced here,
      mov [edi], eax    ; so every pass copies the same dword
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


926 cycles
3005 cycles
924 cycles
3005 cycles

Title: Re: stb & lod Macros
Post by: Neil on November 19, 2008, 04:45:59 PM
Thanks Hutch & Michael,
Michael, your demo code gives a great illustration of the speed difference: more than 3 times as fast using the lob macro. It also demonstrates how the use of rep speeds up the string instructions. Now I have one more question regarding writing macros to replace stosw, stosd etc.: are 2 incs or 4 incs faster or slower than adding the appropriate offset? There must be a point at which n incs become slower than add n.
Title: Re: stb & lod Macros
Post by: MichaelW on November 19, 2008, 05:03:40 PM
I think add would probably be faster than two or more incs.
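
For illustration, word and dword counterparts of the byte macros could be sketched with add as below. The macro names (lo_w, lo_d, st_w, st_d) are hypothetical and are not taken from the MASM32 macro file; the sketch also assumes the direction flag is clear, i.e. forward movement. One behavioural difference from the string instructions: add modifies the flags, whereas lods and stos leave them untouched.

```asm
; Hypothetical word/dword counterparts of the byte macros.
; Names are made up for this sketch; they are not MASM32 macros.

lo_w MACRO                  ; roughly lodsw (direction flag clear)
    mov ax, [esi]           ; load a word from the source pointer
    add esi, 2              ; step the pointer by one word
ENDM

lo_d MACRO                  ; roughly lodsd
    mov eax, [esi]          ; load a dword from the source pointer
    add esi, 4              ; step the pointer by one dword
ENDM

st_w MACRO                  ; roughly stosw
    mov [edi], ax           ; store a word at the destination pointer
    add edi, 2
ENDM

st_d MACRO                  ; roughly stosd
    mov [edi], eax          ; store a dword at the destination pointer
    add edi, 4
ENDM
```

A single add keeps each macro at two instructions regardless of the element size, which would not hold for a chain of incs.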
Title: Re: stb & lod Macros
Post by: Neil on November 19, 2008, 05:09:58 PM
Well, add it is then :U
Title: Re: stb & lod Macros
Post by: Neil on November 19, 2008, 05:13:52 PM
Mind you, I have a few instances where I use std; this is starting to get complicated :toothy
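
For the std case, a backward-moving variant might simply swap add for sub, as in the sketch below. The macro names are again hypothetical. Unlike the string instructions, these macros ignore the direction flag entirely, so the direction is fixed by which macro is used rather than by std/cld.

```asm
; Hypothetical backward-moving dword macros (names are made up).
; They mimic lodsd/stosd as executed with the direction flag set,
; but do not read or change the direction flag themselves.

lo_d_back MACRO
    mov eax, [esi]          ; load a dword from the source pointer
    sub esi, 4              ; step the pointer down by one dword
ENDM

st_d_back MACRO
    mov [edi], eax          ; store a dword at the destination pointer
    sub edi, 4
ENDM
```

Where std is still used with real string instructions, the usual practice is to issue cld again before calling any Windows API function, since they expect the direction flag to be clear.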
Title: Re: stb & lod Macros
Post by: KeepingRealBusy on November 19, 2008, 05:21:57 PM
Neil,

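I think it all depends on what you are doing:


    mov ecx, count              ; number of dwords to copy
    dec ecx                     ; index of the last dword
    mov esi,OFFSET str
    mov edi,OFFSET buf
@@:
    mov eax,[esi+ecx*4]         ; the scaled index addresses both buffers
    mov [edi+ecx*4],eax
    dec ecx                     ; the only pointer update needed
    jns @b                      ; loop until the index goes below zero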


Depending on what you're doing, you only need an increment or a decrement.

Dave
Title: Re: stb & lod Macros
Post by: Neil on November 19, 2008, 07:31:22 PM
Dave,
That's an interesting code snippet, but (correct me if I'm wrong) according to Hutch & Michael, rep movsd would be much quicker.
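
One way to check this would be to time the indexed loop against rep movsd with the same timing macros used earlier in the thread. The program below is only a sketch along those lines: the buffer names and sizes are carried over from the earlier test, the labels in the print strings are new, and no results are claimed for it since the outcome depends on the processor.

```asm
; Sketch: rep movsd versus an indexed copy loop, 1000 dwords each.
; Assumes the MASM32 package and its timers.asm macros are installed.
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
    .data
      buff1 dd 1000 dup(0)
      buff2 dd 1000 dup(0)
    .code
start:
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd                 ; copy 1000 dwords
    counter_end
    print ustr$(eax), " cycles, rep movsd",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000-1           ; index of the last dword
    @@:
      mov eax, [esi+ecx*4]      ; copy from the last dword down to the first
      mov [edi+ecx*4], eax
      dec ecx
      jns @B                    ; stop once the index drops below zero
    counter_end
    print ustr$(eax), " cycles, indexed loop",13,10,13,10

    inkey "Press any key to exit..."
    exit
end start
```

Unlike a loop that never advances its pointers, the indexed loop here touches every dword in both buffers, so the two timings would cover the same amount of data.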