szLeft and szRight

Started by Larry Hammick, December 12, 2007, 07:51:14 AM

Larry Hammick

They can be slightly abbreviated:
szLeft proc src:DWORD,dst:DWORD,ln:DWORD

    mov ecx, ln             ; length in ECX
    xchg edi, src           ; source address
    add edi, ecx            ; add required length
    mov edx, dst            ; destination address
    add edx, ecx            ; this also sets the terminator position
    neg ecx                 ; invert sign
    ;jnc short poke_terminator
  @@:
    mov al, [edi+ecx]
    mov [edx+ecx], al
    add ecx, 1
    jnz @B
;poke_terminator:
    mov [edx], cl  ; 0
    pop ebp      ;"proc" pushed ebp
    pop ecx      ;return address
    pop edi      ;restore caller's edi
    pop eax      ;return eax = dst
    pop edx      ;ln
    jmp ecx

szLeft endp

szRight proc src:DWORD,dst:DWORD,ln:DWORD

    xchg edi, src
    invoke szLen, edi
    sub eax, ln
    lea edx, [eax+edi]
    mov edi, dst
    sub edi, edx
@@: mov al, [edx]
    mov [edi+edx], al
    add edx, 1     ;is this faster than INC EDX?
    test al, al
    jne @B
    pop ebp
    pop ecx    ;return address
    pop edi    ;caller's edi
    pop eax    ;return eax = dst
    pop edx    ;ln
    jmp ecx

szRight endp

zooba

Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM

    add edx, 1     ;is this faster than INC EDX?


On some processors, yes. IIRC, ADD is much quicker than INC on the P4, while on earlier processors it matters less. I'm not sure about newer processors; I've somewhat gone off instruction-level optimisation recently.

Cheers,

Zooba :U

RuiLoureiro

Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    pop   ebp      ;"proc" pushed ebp
    pop   ecx      ;return address
    pop   edi       ;restore caller's edi
    pop   eax      ;return eax = dst
    pop   edx      ;ln
    jmp   ecx

Hi Larry,
Is this exit method much quicker?
Rui

jdoe

Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    mov [edi+edx], al
    add edx, 1     ;is this faster than INC EDX?
    test al, al
    jne @B

When you try to optimize an algo, you can't just pick the fastest opcodes. In a loop, the code alignment can be more relevant.

It's all about trying and testing. MichaelW wrote macros to help with timing algos: http://www.masm32.com/board/index.php?topic=770.0



Larry Hammick

Quote from: RuiLoureiro on December 12, 2007, 03:58:38 PM
Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    pop   ebp      ;"proc" pushed ebp
    pop   ecx      ;return address
    pop   edi       ;restore caller's edi
    pop   eax      ;return eax = dst
    pop   edx      ;ln
    jmp   ecx

Hi Larry,
Is this exit method much quicker?
Rui

I haven't timed it, but in combination with "xchg edi,(arg)" it saves a byte or two in memory and on disk.
On a different topic, about these string-copying routines, it's nice to use a mixed pointer, e.g.
mov al,[ecx]
mov [ecx+edx],al
add ecx,1

so that both pointers get incremented or decremented by the same one instruction (add ecx,1 or similar).
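
The mixed-pointer idea can be sketched as a complete routine. This is only an illustration: szCopyMixed is a hypothetical name, not a MASM32 library procedure, and the program assumes the usual MASM32 SDK layout for masm32rt.inc. One register walks the source while a second holds the constant difference dst - src, so a single ADD advances both addresses.

```asm
; Sketch only. szCopyMixed is a hypothetical name, not a MASM32 library routine.
    include \masm32\include\masm32rt.inc

    .data
      srcText db "my other brother darryl",0
      dstText db 32 dup(0)

    .code

szCopyMixed proc src:DWORD, dst:DWORD
    mov ecx, src                ; ECX walks the source
    mov edx, dst
    sub edx, ecx                ; EDX = dst - src, constant for the whole loop
  @@:
    mov al, [ecx]               ; load a source byte
    mov [ecx+edx], al           ; ECX + (dst - src) addresses the matching destination byte
    add ecx, 1                  ; one increment advances both addresses
    test al, al                 ; was that the terminator?
    jnz @B
    mov eax, dst                ; return eax = dst
    ret
szCopyMixed endp

start:
    invoke szCopyMixed, ADDR srcText, ADDR dstText
    print ADDR dstText, 13, 10
    inkey "Press any key to exit..."
    exit
end start
```

The cost is a slightly more complex addressing mode on the store. Whether that is a net win on a given processor is something to time rather than assume.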

MichaelW

On my P3, and using relatively short strings, the originals are much faster.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff1 db "my other brother darryl",0
      buff2 db 20 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_szLeft proc src:DWORD,dst:DWORD,ln:DWORD
    mov ecx, ln             ; length in ECX
    xchg edi, src           ; source address
    add edi, ecx            ; add required length
    mov edx, dst            ; destination address
    add edx, ecx            ; this also sets the terminator position
    neg ecx                 ; invert sign
    ;jnc short poke_terminator
  @@:
    mov al, [edi+ecx]
    mov [edx+ecx], al
    add ecx, 1
    jnz @B
;poke_terminator:
    mov [edx], cl  ; 0
    pop ebp      ;"proc" pushed ebp
    pop ecx      ;return address
    pop edi      ;restore caller's edi
    pop eax      ;return eax = dst
    pop edx      ;ln
    jmp ecx

_szLeft endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_szRight proc src:DWORD,dst:DWORD,ln:DWORD
    xchg edi, src
    invoke szLen, edi
    sub eax, ln
    lea edx, [eax+edi]
    mov edi, dst
    sub edi, edx
@@: mov al, [edx]
    mov [edi+edx], al
    add edx, 1     ;is this faster than INC EDX?
    test al, al
    jne @B
    pop ebp
    pop ecx    ;return address
    pop edi    ;caller's edi
    pop eax    ;return eax = dst
    pop edx    ;ln
    jmp ecx
_szRight endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke szLeft,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke szRight,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke _szLeft,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke _szRight,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      xchg eax, DWORD PTR buff2
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      REPEAT 4
        mov edx, DWORD PTR buff2
        mov DWORD PTR buff2, eax
        mov eax, edx
      ENDM
    counter_end
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


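37 cycles     ; szLeft (library original)
90 cycles     ; szRight (library original)
68 cycles     ; _szLeft (Larry's version)
129 cycles    ; _szRight (Larry's version)
25 cycles     ; xchg eax, DWORD PTR buff2
6 cycles      ; REPEAT 4 of the three-mov exchange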


The biggest problem is the xchg edi, src. Per Agner Fog’s optimizing_assembly.pdf, under Problematic Instructions, XCHG register,[memory]:
Quote
This instruction always has an implicit LOCK prefix which prevents it from using the cache. This instruction is therefore very time consuming, and should always be avoided.

The REPEAT 4 serves to get the cycle count up to something that can be reasonably measured.
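
For comparison, here is a minimal sketch of szLeft with the caller's EDI preserved by an ordinary push/pop instead of "xchg edi, src". The name szLeftMov is hypothetical, the zero-length guard is an addition for this sketch, and the routine has not been timed, so treat it as a starting point for testing rather than a measured improvement.

```asm
; Sketch only. szLeftMov is a hypothetical name, not a MASM32 library routine.
    include \masm32\include\masm32rt.inc

    .data
      buff1 db "my other brother darryl",0
      buff2 db 32 dup(0)

    .code

szLeftMov proc src:DWORD, dst:DWORD, ln:DWORD
    push edi                    ; preserve caller's EDI with a plain push
    mov ecx, ln                 ; length in ECX
    mov edi, src                ; ordinary load, no register/memory exchange
    mov edx, dst
    add edi, ecx                ; point just past the span to copy
    add edx, ecx                ; EDX is also the terminator position
    neg ecx                     ; count up from -ln to 0
    jz short poke_terminator    ; ln = 0: nothing to copy (guard added for this sketch)
  @@:
    mov al, [edi+ecx]
    mov [edx+ecx], al
    add ecx, 1
    jnz @B
  poke_terminator:
    mov byte ptr [edx], 0       ; terminate the destination
    mov eax, dst                ; return eax = dst
    pop edi                     ; restore caller's EDI
    ret                         ; MASM emits the epilogue and stack cleanup here
szLeftMov endp

start:
    invoke szLeftMov, ADDR buff1, ADDR buff2, 10
    print ADDR buff2, 13, 10    ; first 10 characters: "my other b"
    inkey "Press any key to exit..."
    exit
end start
```

This gives up the byte or two saved by the hand-written exit sequence in exchange for avoiding the locked memory exchange.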
eschew obfuscation

Larry Hammick

Thanks MW.

update:
Okay, after another hour of toil, I think I have shaved almost a whole microsecond off of szRight.
_szRight proc src:DWORD,dst:DWORD,ln:DWORD
    mov ecx,src
    mov edx,ln
    sub ecx,1
@@: add ecx,1
;    test byte ptr[ecx],-1
;    jz invalid_len
    test byte ptr[ecx+edx],-1
    jnz @B
    push edi
    mov edi,dst
align 4       ;the original version also benefits from this move
@@: mov al,[ecx+edx]
    mov [edi+edx],al
    sub edx,1
    jns @B       ;to copy the terminator as well -- ln+1 bytes in all
    mov eax, dst
    pop edi
;invalid_len:
    ret

_szRight endp

This library is going to have to be pretty popular if our efforts are ever going to show a profit.  :green2

MichaelW

Quote
Okay, after another hour of toil, I think I have shaved almost a whole microsecond off of szRight.

The execution time went down by only 52 nanoseconds on my 500 MHz P3, but if your app calls it 19230769 times, you will have saved a whole second  :toothy
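
As a quick check of the arithmetic, using only the 52 ns and 500 MHz figures quoted in this post:

\[
52\,\text{ns} \times 500\,\text{MHz} = 26\ \text{cycles per call},
\qquad
\frac{1\,\text{s}}{52\,\text{ns}} \approx 19\,230\,769\ \text{calls}
\]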


hutch--

Larry,

Give this a whirl.


; change this
@@: mov al,[ecx+edx]
    mov [edi+edx],al

; to this.
@@: movzx eax, BYTE PTR [ecx+edx]
    mov [edi+edx],al


It tends to be Pentium 4 (PIV) style code, but it has often worked well in dropping the time of an algo.
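
Applied to Larry's revised routine, the suggestion might look like the sketch below. The name szRightMovzx is hypothetical and the code has not been timed. It assumes, as the revised routine does, that ln is no greater than the length of the source string. The idea is that MOVZX writes all of EAX, so the load does not have to merge with the old upper bits of the register the way a write to AL alone can on some processors.

```asm
; Sketch only. szRightMovzx is a hypothetical name, not a MASM32 library routine.
    include \masm32\include\masm32rt.inc

    .data
      buff1 db "my other brother darryl",0
      buff2 db 32 dup(0)

    .code

szRightMovzx proc src:DWORD, dst:DWORD, ln:DWORD
    mov ecx, src
    mov edx, ln
    sub ecx, 1
  @@:
    add ecx, 1
    cmp byte ptr [ecx+edx], 0       ; advance until [ecx+ln] is the terminator
    jne @B
    push edi                        ; preserve caller's EDI
    mov edi, dst
    align 4
  @@:
    movzx eax, byte ptr [ecx+edx]   ; full-register load instead of mov al, [...]
    mov [edi+edx], al
    sub edx, 1
    jns @B                          ; copies ln+1 bytes, terminator included
    mov eax, dst                    ; return eax = dst
    pop edi
    ret
szRightMovzx endp

start:
    invoke szRightMovzx, ADDR buff1, ADDR buff2, 10
    print ADDR buff2, 13, 10
    inkey "Press any key to exit..."
    exit
end start
```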