They can be slightly abbreviated:
szLeft proc src:DWORD,dst:DWORD,ln:DWORD
  mov ecx, ln       ; length in ECX
  xchg edi, src      ; source address
  add edi, ecx      ; add required length
  mov edx, dst      ; destination address
  add edx, ecx      ; this also sets the terminator position
  neg ecx         ; invert sign
  ;jnc short poke_terminator
 @@:
  mov al, [edi+ecx]
  mov [edx+ecx], al
  add ecx, 1
  jnz @B
;poke_terminator:
  mov [edx], cl ; 0
  pop ebp    ;"proc" pushed ebp
  pop ecx    ;return address
  pop edi   ;restore caller's edi
  pop eax   ;return eax = dst
  pop edx   ;ln
  jmp ecx
szLeft endp
szRight proc src:DWORD,dst:DWORD,ln:DWORD
  xchg edi, src
  invoke szLen, edi
  sub eax, ln
  lea edx, [eax+edi]
  mov edi, dst
  sub edi, edx
@@: mov al, [edx]
  mov [edi+edx], al
  add edx, 1   ;is this faster than INC EDX?
  test al, al
  jne @B
  pop ebp
  pop ecx  ;return address
  pop edi  ;caller's edi
  pop eax  ;return eax = dst
  pop edx  ;ln
  jmp ecx
szRight endp
Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
  add edx, 1   ;is this faster than INC EDX?
On some processors. IIRC, ADD is much quicker than INC on the P4, while on the earlier ones it matters less. I'm not sure about newer processors; I've somewhat gone off instruction-level optimisation recently.
Cheers,
Zooba :U
Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
  pop ebp    ;"proc" pushed ebp
  pop ecx    ;return address
  pop edi   ;restore caller's edi
  pop eax   ;return eax = dst
  pop edx   ;ln
  jmp ecx
Hi Larry,
Is this exit method much quicker ?
Rui
Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
  mov [edi+edx], al
  add edx, 1   ;is this faster than INC EDX?
  test al, al
  jne @B
When you try to optimize an algo, you can't just pick the fastest opcodes. In a loop, the code alignment can matter more.
It's all about trial and testing. MichaelW wrote macros to help time algos: http://www.masm32.com/board/index.php?topic=770.0
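To put a number on the add-versus-inc question, something like the following could be timed with those macros. This is only a sketch: it assumes timers.asm sits at \masm32\macros\timers.asm and that counter_end leaves the cycle count in EAX. The REPEAT count of 16 is an arbitrary choice to lift the result above the measurement noise, and the results will differ from one processor to the next.
  include \masm32\include\masm32rt.inc
  .686
  include \masm32\macros\timers.asm
  .code
start:
  ; time a dependent chain of 16 ADDs
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   REPEAT 16
    add edx, 1
   ENDM
  counter_end
  print ustr$(eax)," cycles for 16 x add edx,1",13,10
  ; time the same chain written with INC
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   REPEAT 16
    inc edx
   ENDM
  counter_end
  print ustr$(eax)," cycles for 16 x inc edx",13,10
  inkey "Press any key to exit..."
  exit
end start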
Quote from: RuiLoureiro on December 12, 2007, 03:58:38 PM
Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
  pop  ebp    ;"proc" pushed ebp
  pop  ecx    ;return address
  pop  edi    ;restore caller's edi
  pop  eax   ;return eax = dst
  pop  edx   ;ln
  jmp  ecx
Hi Larry,
      Is this exit method much quicker ?
Rui
I haven't timed it, but in combination with "xchg edi,(arg)" it saves a byte or two in memory and on disk.
On a different topic, about these string-copying routines, it's nice to use a mixed pointer, e.g.
mov al,[ecx]
mov [ecx+edx],al
add ecx,1
so that both pointers get incremented or decremented by the same one instruction (add ecx,1 or similar).
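As a fuller illustration of the mixed pointer, here is a minimal sketch of a whole copy loop. The name szCopyMixed is made up for this example and is not a library routine; the key step is keeping dst - src in EDX so that [ecx+edx] always addresses the destination byte that matches [ecx].
szCopyMixed proc src:DWORD,dst:DWORD
  mov ecx, src        ; ecx = read pointer
  mov edx, dst
  sub edx, ecx        ; edx = dst - src, constant for the whole loop
@@: mov al, [ecx]       ; read from the source
  mov [ecx+edx], al   ; ecx + (dst - src) = matching destination byte
  add ecx, 1          ; one instruction advances both pointers
  test al, al
  jnz @B              ; the terminator is copied before the loop exits
  mov eax, dst        ; return eax = dst
  ret
szCopyMixed endp
Only EAX, ECX and EDX are touched, so nothing has to be saved and restored around the loop.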
On my P3, and using relatively short strings, the originals are much faster.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
  include \masm32\include\masm32rt.inc
  .686
  include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
  .data
   buff1 db "my other brother darryl",0
   buff2 db 20 dup(0)
  .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_szLeft proc src:DWORD,dst:DWORD,ln:DWORD
  mov ecx, ln       ; length in ECX
  xchg edi, src      ; source address
  add edi, ecx      ; add required length
  mov edx, dst      ; destination address
  add edx, ecx      ; this also sets the terminator position
  neg ecx         ; invert sign
  ;jnc short poke_terminator
 @@:
  mov al, [edi+ecx]
  mov [edx+ecx], al
  add ecx, 1
  jnz @B
;poke_terminator:
  mov [edx], cl ; 0
  pop ebp   ;"proc" pushed ebp
  pop ecx   ;return address
  pop edi   ;restore caller's edi
  pop eax   ;return eax = dst
  pop edx   ;ln
  jmp ecx
_szLeft endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_szRight proc src:DWORD,dst:DWORD,ln:DWORD
  xchg edi, src
  invoke szLen, edi
  sub eax, ln
  lea edx, [eax+edi]
  mov edi, dst
  sub edi, edx
@@: mov al, [edx]
  mov [edi+edx], al
  add edx, 1   ;is this faster than INC EDX?
  test al, al
  jne @B
  pop ebp
  pop ecx  ;return address
  pop edi  ;caller's edi
  pop eax  ;return eax = dst
  pop edx  ;ln
  jmp ecx
_szRight endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
  invoke Sleep, 4000
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   invoke szLeft,ADDR buff1, ADDR buff2, 10
  counter_end
  print ustr$(eax)," cycles",13,10
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   invoke szRight,ADDR buff1, ADDR buff2, 10
  counter_end
  print ustr$(eax)," cycles",13,10
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   invoke _szLeft,ADDR buff1, ADDR buff2, 10
  counter_end
  print ustr$(eax)," cycles",13,10
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   invoke _szRight,ADDR buff1, ADDR buff2, 10
  counter_end
  print ustr$(eax)," cycles",13,10
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   xchg eax, DWORD PTR buff2
  counter_end
  print ustr$(eax)," cycles",13,10
  counter_begin 1000000, HIGH_PRIORITY_CLASS
   REPEAT 4
    mov edx, DWORD PTR buff2
    mov DWORD PTR buff2, eax
    mov eax, edx
   ENDM
  counter_end
  print ustr$(eax)," cycles",13,10
  inkey "Press any key to exit..."
  exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
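37 cycles   ; szLeft (original)
90 cycles   ; szRight (original)
68 cycles   ; _szLeft (abbreviated)
129 cycles  ; _szRight (abbreviated)
25 cycles   ; one xchg eax, DWORD PTR buff2
6 cycles    ; four mov-based swaps (REPEAT 4)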
The biggest problem is the xchg edi, src. Per Agner Fog’s optimizing_assembly.pdf, under Problematic Instructions, XCHG register,[memory]:
Quote
This instruction always has an implicit LOCK prefix which prevents it from using the cache. This instruction is therefore very time consuming, and should always be avoided.
The REPEAT 4 serves to get the cycle count up to something that can be reasonably measured.
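One way to act on that, sketched here and not timed: save EDI with a push, load the argument with a plain mov, and leave through an ordinary ret. The name szLeft2 is made up for illustration, and the jz guard for ln = 0 is an addition that is not in the routine above.
szLeft2 proc src:DWORD,dst:DWORD,ln:DWORD
  push edi           ; save caller's edi without touching the argument
  mov ecx, ln        ; length in ECX
  mov edi, src       ; plain load instead of xchg edi, src
  mov edx, dst       ; destination address
  add edi, ecx       ; point both registers just past the bytes to copy
  add edx, ecx       ; this also sets the terminator position
  neg ecx            ; negative index counting up to zero
  jz done            ; ln = 0: nothing to copy, just write the terminator
@@:
  mov al, [edi+ecx]
  mov [edx+ecx], al
  add ecx, 1
  jnz @B
done:
  mov byte ptr [edx], 0
  mov eax, dst       ; return eax = dst
  pop edi            ; restore caller's edi
  ret                ; MASM emits the epilogue and removes the arguments
szLeft2 endp
This costs a few bytes more than the pop/jmp exit, but no instruction in it takes the implicit LOCK.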
Thanks MW.
update:
Okay, after another hour of toil, I think I have shaved almost a whole microsecond off of szRight.
_szRight proc src:DWORD,dst:DWORD,ln:DWORD
  mov ecx,src
  mov edx,ln
  sub ecx,1
@@: add ecx,1
;  test byte ptr[ecx],-1
;  jz invalid_len
  test byte ptr[ecx+edx],-1
  jnz @B
  push edi
  mov edi,dst
align 4    ;the original version also benefits from this move
@@: mov al,[ecx+edx]
  mov [edi+edx],al
  sub edx,1
  jns @B    ;to copy the terminator as well -- ln+1 bytes in all
  mov eax, dst
  pop edi
;invalid_len:
  ret
_szRight endp
This library is going to have to be pretty popular if our efforts are ever going to show a profit. :green2
Quote
Okay, after another hour of toil, I think I have shaved almost a whole microsecond off of szRight.
The execution time went down by only 52 nanoseconds on my 500 MHz P3, but if your app calls it 19230769 times, you will have saved a whole second :toothy
Larry,
Give this a whirl.
; change this
@@: mov al,[ecx+edx]
mov [edi+edx],al
; to this.
@@: movzx eax, BYTE PTR [ecx+edx]
mov [edi+edx],al
It tends to be PIV code but has often worked well in dropping the time of an algo.