
stb & lod Macros

Started by Neil, November 19, 2008, 03:13:12 PM


Neil

Having used these macros, am I correct in assuming that their word & dword equivalents would be faster than their string counterparts, or would the inc or add instructions make them slower?

Tedd

The string instructions can be faster under some circumstances, as long as there is the right amount of data being handled. There's a cut-off where it becomes faster to do it the 'long' way (or is it the other way around :lol)
There is another thread about this... somewhere.

Neil

Tedd,
Are you saying that

    mov [edi],eax
    add edi,4

is faster than stosd?

Neil

Tedd,
I've searched for the other thread you mentioned but can't find it. Maybe I'm putting in the wrong search parameters.

hutch--

Neil,

There are a couple of special cases with string instructions: REP movsd and REP stosd. This only works with the REP prefix; used on their own, without REP, the string instructions are very slow and should be avoided. The REP string instructions do outperform the normal integer instructions in some contexts. If source and destination are not both in cache at the same time they are faster, as they appear to handle non-temporal writes in much the same way as the specialised SSE instructions, an option the normal integer instructions don't have.

Most of us grew up with the string instructions, but from the early PIIs onwards they have not been competitive in most instances. Loading a register and incrementing the index is almost always faster, sometimes by a large amount.
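For reference, a minimal sketch of the longhand form being described, assuming byte-sized data and a clear direction flag. The MASM32 lob and stb macros are assumed (not confirmed in this thread) to expand to roughly these pairs.

```asm
; Longhand stand-in for lodsb (assumes DF = 0):
    mov al, [esi]       ; load the byte at ESI into AL
    inc esi             ; advance the source pointer by one byte

; Longhand stand-in for stosb (assumes DF = 0):
    mov [edi], al       ; store AL at EDI
    inc edi             ; advance the destination pointer by one byte
```

One difference to keep in mind: lodsb and stosb leave the flags alone, while inc changes every arithmetic flag except the carry flag, so the pairs are not a drop-in replacement in code that tests flags afterwards.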

MichaelW

I couldn't think of a good real-world test for the macros, but on my P3 there appears to be no contest.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff db 16 dup(0)   ; large enough for the 10 bytes each test touches
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff
      REPEAT 10
        lodsb
      ENDM
    counter_end
    print ustr$(eax), " cycles, lodsb",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff
      REPEAT 10
        lob
      ENDM
    counter_end
    print ustr$(eax), " cycles, lob",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET buff
      REPEAT 10
        stosb
      ENDM
    counter_end
    print ustr$(eax), " cycles, stosb",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov edi, OFFSET buff
      REPEAT 10
        stb
      ENDM
    counter_end
    print ustr$(eax), " cycles, stb",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


17 cycles, lodsb
5 cycles, lob
16 cycles, stosb
8 cycles, stb


And this is more or less of a worst-case comparison:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff1 dd 1000 dup(0)
      buff2 dd 1000 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]
      mov [edi], eax
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
      rep movsd
    counter_end
    print ustr$(eax), " cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      mov esi, OFFSET buff1
      mov edi, OFFSET buff2
      mov ecx, 1000
    @@:
      mov eax, [esi]
      mov [edi], eax
      dec ecx
      jnz @B
    counter_end
    print ustr$(eax), " cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


926 cycles
3005 cycles
924 cycles
3005 cycles
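
Note that the mov loop as posted never advances ESI or EDI, so it reads and writes the first dword 1000 times rather than copying the buffer. A sketch of the loop with both pointers advanced, for anyone rerunning the comparison; it reuses buff1 and buff2 from the listing above, and no timings are claimed for it.

```asm
    mov esi, OFFSET buff1   ; source buffer from the listing above
    mov edi, OFFSET buff2   ; destination buffer from the listing above
    mov ecx, 1000           ; number of dwords to copy
  @@:
    mov eax, [esi]          ; load one dword
    mov [edi], eax          ; store it
    add esi, 4              ; advance the source pointer
    add edi, 4              ; advance the destination pointer
    dec ecx
    jnz @B                  ; loop until all 1000 dwords are copied
```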


Neil

Thanks Hutch & Michael,
Michael, your demo code gives a great illustration of the speed difference: the lob macro is more than 3 times as fast as lodsb. It also demonstrates how the use of rep speeds up the string instructions. Now I have one more question regarding writing macros to replace stosw, stosd etc.: are 2 incs or 4 incs faster or slower than adding the appropriate offset? That is, there must be a point at which n incs become slower than add n.

MichaelW

I think add would probably be faster than two or more incs.
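
A sketch of what word and dword versions using add might look like. The macro names stw_long and stdw_long are made up for illustration and are not part of the MASM32 macro set; both assume the direction flag is clear.

```asm
; Hypothetical stand-in for stosw (assumes DF = 0)
stw_long MACRO
    mov [edi], ax       ; store the word in AX at EDI
    add edi, 2          ; advance the destination pointer by one word
ENDM

; Hypothetical stand-in for stosd (assumes DF = 0)
stdw_long MACRO
    mov [edi], eax      ; store the dword in EAX at EDI
    add edi, 4          ; advance the destination pointer by one dword
ENDM
```

Unlike stosw and stosd, add modifies the flags. If the flags have to survive, lea edi, [edi+4] advances the pointer without touching them.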

Neil

Mind you, I have a few instances where I use std; this is starting to get complicated :toothy
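
With the direction flag set by std, stosd moves EDI backwards, so a longhand replacement would have to subtract instead of add. A minimal sketch, assuming the surrounding code already relies on DF = 1:

```asm
; Longhand stand-in for stosd when DF = 1 (after std)
    mov [edi], eax      ; store the dword in EAX at EDI
    sub edi, 4          ; step the destination pointer back by one dword
```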

KeepingRealBusy

Neil,

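I think it all depends on what you are doing.


    mov ecx, count              ; number of dwords to copy
    dec ecx                     ; index of the last dword
    mov esi, OFFSET str
    mov edi, OFFSET buf
  @@:
    mov eax, [esi+ecx*4]        ; load dword number ECX
    mov [edi+ecx*4], eax        ; store it at the same index
    dec ecx
    jns @B                      ; loop until the index goes negative


Depending on what you're doing, you only need a single increment or decrement: with an index register, one dec ecx serves as both the loop counter and the pointer update.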

Dave
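
The same idea works for a front-to-back copy with a single inc, by indexing from the end of each buffer with a negative count. A sketch only: src, dst and count are placeholder names, and count is assumed to be greater than zero.

```asm
    mov ecx, count              ; number of dwords to copy (assumed > 0)
    mov esi, OFFSET src         ; placeholder source buffer
    mov edi, OFFSET dst         ; placeholder destination buffer
    lea esi, [esi+ecx*4]        ; point ESI just past the end of the source
    lea edi, [edi+ecx*4]        ; point EDI just past the end of the destination
    neg ecx                     ; index now runs from -count up to 0
  @@:
    mov eax, [esi+ecx*4]        ; first pass reads the first dword of src
    mov [edi+ecx*4], eax
    inc ecx                     ; one inc is both counter and pointer update
    jnz @B                      ; stop when the index reaches zero
```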

Neil

Dave,
That's an interesting code snippet, but (correct me if I'm wrong) according to Hutch & Michael, rep movsd would be much quicker.