Title: szLeft and szRight
Post by: Larry Hammick on December 12, 2007, 07:51:14 AM

They can be slightly abbreviated:

szLeft proc src:DWORD,dst:DWORD,ln:DWORD

    mov ecx, ln             ; length in ECX
    xchg edi, src           ; source address
    add edi, ecx            ; add required length
    mov edx, dst            ; destination address
    add edx, ecx            ; this also sets the terminator position
    neg ecx                 ; invert sign
    ;jnc short poke_terminator
  @@:
    mov al, [edi+ecx]
    mov [edx+ecx], al
    add ecx, 1
    jnz @B
;poke_terminator:
    mov [edx], cl  ; 0
    pop ebp      ;"proc" pushed ebp
    pop ecx      ;return address
    pop edi      ;restore caller's edi
    pop eax      ;return eax = dst
    pop edx      ;ln
    jmp ecx

szLeft endp

szRight proc src:DWORD,dst:DWORD,ln:DWORD

    xchg edi, src
    invoke szLen, edi
    sub eax, ln
    lea edx, [eax+edi]
    mov edi, dst
    sub edi, edx
@@: mov al, [edx]
    mov [edi+edx], al
    add edx, 1     ;is this faster than INC EDX?
    test al, al
    jne @B
    pop ebp
    pop ecx    ;return address
    pop edi    ;caller's edi
    pop eax    ;return eax = dst
    pop edx    ;ln
    jmp ecx

szRight endp

Title: Re: szLeft and szRight
Post by: zooba on December 12, 2007, 08:10:14 AM

Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    add edx, 1     ;is this faster than INC EDX?

On some processors. IIRC, ADD is much quicker than INC on the P4, while on earlier processors the difference matters less. I'm not sure about newer processors; I've somewhat gone off instruction-level optimisation recently.
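
A likely reason for the P4 difference, offered as background rather than anything measured in this thread: INC updates the arithmetic flags but leaves CF untouched, so its flag result depends on whichever earlier instruction last set CF, while ADD writes all of them. A minimal sketch of the two alternatives, using EDX as in the quoted loop:

```asm
; Alternative 1: ADD writes all six arithmetic flags (CF, OF, SF, ZF, AF, PF),
; so its flag result does not depend on any earlier instruction.
    add edx, 1

; Alternative 2: INC has a shorter encoding but leaves CF as it was,
; so the new flags have to be combined with the old CF value.
    inc edx
```

Either form leaves the loop's behaviour unchanged, because the test al, al that follows sets the ZF that jne reads.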

Cheers,

Zooba :U

Title: Re: szLeft and szRight
Post by: RuiLoureiro on December 12, 2007, 03:58:38 PM

Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    pop ebp      ;"proc" pushed ebp
    pop ecx      ;return address
    pop edi      ;restore caller's edi
    pop eax      ;return eax = dst
    pop edx      ;ln
    jmp ecx

Hi Larry,
Is this exit method much quicker?
Rui
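
For readers following the question, the exit sequence depends on the stack layout left by MASM's default prologue. The annotated fragment below is a sketch under stated assumptions: the default stdcall prologue (push ebp / mov ebp, esp), no LOCAL variables, no USES clause, and nothing left pushed on the stack when the exit is reached.

```asm
; Stack when the exit sequence starts (ESP = EBP, lowest address first):
;   [esp]      saved EBP            (pushed by the "proc" prologue)
;   [esp+4]    return address
;   [esp+8]    src argument slot    (holds the caller's EDI after "xchg edi, src")
;   [esp+12]   dst argument
;   [esp+16]   ln argument
    pop ebp         ; restore the caller's frame pointer
    pop ecx         ; ECX = return address
    pop edi         ; restore the caller's EDI from the src slot
    pop eax         ; EAX = dst, the return value
    pop edx         ; discard ln; all 12 bytes of arguments are now removed
    jmp ecx         ; return to the caller

; A conventional exit, had EDI been saved with PUSH EDI instead, would be:
;   pop edi
;   mov eax, dst
;   leave
;   ret 12
```

The five POPs and the JMP encode to 7 bytes, against 8 bytes for the conventional exit shown in the comments.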

Title: Re: szLeft and szRight
Post by: jdoe on December 12, 2007, 05:16:37 PM

Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    mov [edi+edx], al
    add edx, 1     ;is this faster than INC EDX?
    test al, al
    jne @B

When you try to optimize an algo, you can't just pick the fastest opcodes. In a loop, the code alignment can be more relevant.

It's all about trying and testing. MichaelW wrote macros to help with timing algos: http://www.masm32.com/board/index.php?topic=770.0

Title: Re: szLeft and szRight
Post by: Larry Hammick on December 13, 2007, 01:02:33 AM

Quote from: RuiLoureiro on December 12, 2007, 03:58:38 PM
Quote from: Larry Hammick on December 12, 2007, 07:51:14 AM
    pop  ebp      ;"proc" pushed ebp
    pop  ecx      ;return address
    pop  edi      ;restore caller's edi
    pop  eax      ;return eax = dst
    pop  edx      ;ln
    jmp  ecx

Hi Larry,
Is this exit method much quicker?
Rui

I haven't timed it, but in combination with "xchg edi,(arg)" it saves a byte or two in memory and on disk.
On a different topic, about these string-copying routines, it's nice to use a mixed pointer, e.g.

mov al,[ecx]
mov [ecx+edx],al
add ecx,1

so that both pointers get incremented or decremented by the same single instruction (add ecx,1 or similar).
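
To make the mixed-pointer idea concrete, here is a minimal sketch of a complete copy routine built around it. The name szCopySketch is hypothetical and used only for illustration; it is not part of the masm32 library. It assumes the usual masm32rt.inc setup (flat model, stdcall) and that the source and destination buffers do not overlap.

```asm
; Hypothetical illustration only; not a masm32 library routine.
szCopySketch proc src:DWORD, dst:DWORD
    mov ecx, src            ; ECX walks the source string
    mov edx, dst
    sub edx, ecx            ; EDX = dst - src, so [ecx+edx] is the destination byte
  @@:
    mov al, [ecx]           ; load one byte from the source
    mov [ecx+edx], al       ; store it at the matching destination position
    add ecx, 1              ; a single increment advances both positions
    test al, al             ; was that the terminating zero?
    jnz @B                  ; no: copy the next byte
    mov eax, dst            ; return the destination address
    ret
szCopySketch endp
```

Because only EAX, ECX and EDX are used, nothing has to be saved or restored, and it would be called like the routines above, e.g. invoke szCopySketch, ADDR buff1, ADDR buff2.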

Title: Re: szLeft and szRight
Post by: MichaelW on December 13, 2007, 06:13:50 AM

On my P3, and using relatively short strings, the originals are much faster.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff1 db "my other brother darryl",0
      buff2 db 20 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_szLeft proc src:DWORD,dst:DWORD,ln:DWORD
    mov ecx, ln             ; length in ECX
    xchg edi, src           ; source address
    add edi, ecx            ; add required length
    mov edx, dst            ; destination address
    add edx, ecx            ; this also sets the terminator position
    neg ecx                 ; invert sign
    ;jnc short poke_terminator
  @@:
    mov al, [edi+ecx]
    mov [edx+ecx], al
    add ecx, 1
    jnz @B
;poke_terminator:
    mov [edx], cl  ; 0
    pop ebp      ;"proc" pushed ebp
    pop ecx      ;return address
    pop edi      ;restore caller's edi
    pop eax      ;return eax = dst
    pop edx      ;ln
    jmp ecx

_szLeft endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_szRight proc src:DWORD,dst:DWORD,ln:DWORD
    xchg edi, src
    invoke szLen, edi
    sub eax, ln
    lea edx, [eax+edi]
    mov edi, dst
    sub edi, edx
@@: mov al, [edx]
    mov [edi+edx], al
    add edx, 1     ;is this faster than INC EDX?
    test al, al
    jne @B
    pop ebp
    pop ecx    ;return address
    pop edi    ;caller's edi
    pop eax    ;return eax = dst
    pop edx    ;ln
    jmp ecx
_szRight endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke szLeft,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke szRight,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke _szLeft,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      invoke _szRight,ADDR buff1, ADDR buff2, 10
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      xchg eax, DWORD PTR buff2
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000000, HIGH_PRIORITY_CLASS
      REPEAT 4
        mov edx, DWORD PTR buff2
        mov DWORD PTR buff2, eax
        mov eax, edx
      ENDM
    counter_end
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

37 cycles
90 cycles
68 cycles
129 cycles
25 cycles
6 cycles

The biggest problem is the xchg edi, src. Per Agner Fog's optimizing_assembly.pdf, under Problematic Instructions, XCHG register,[memory]:

Quote
This instruction always has an implicit LOCK prefix which prevents it from using the cache. This instruction is therefore very time consuming, and should always be avoided.

The REPEAT 4 serves to get the cycle count up to something that can be reasonably measured.
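
One way to act on this is to save EDI with an ordinary PUSH and load the argument with MOV. The following is a minimal sketch, not a timed or library version: the name szLeftSketch is hypothetical, it assumes the masm32rt.inc setup (flat model, stdcall), and it adds a zero-length guard that the posted routine only has as a commented-out jnc.

```asm
; Hypothetical illustration only; not timed and not a masm32 library routine.
szLeftSketch proc src:DWORD, dst:DWORD, ln:DWORD
    push edi                ; save the caller's EDI on the stack (no XCHG with memory)
    mov ecx, ln             ; length in ECX
    mov edi, src            ; source address
    add edi, ecx            ; EDI = one past the last source byte to copy
    mov edx, dst            ; destination address
    add edx, ecx            ; EDX = terminator position in the destination
    neg ecx                 ; ECX counts up from -ln to 0
    jz done                 ; ln = 0: nothing to copy, just write the terminator
  @@:
    mov al, [edi+ecx]       ; load the next source byte
    mov [edx+ecx], al       ; store it in the destination
    add ecx, 1
    jnz @B                  ; loop until the index reaches 0
  done:
    mov [edx], cl           ; CL is 0 here, so this writes the terminator
    pop edi                 ; restore the caller's EDI
    mov eax, dst            ; return the destination address
    ret                     ; MASM emits the epilogue (leave / ret 12)
szLeftSketch endp
```

The trade-off is size against speed: the PUSH/MOV entry and the standard exit cost a couple of bytes more than the xchg version, but they avoid the locked memory exchange entirely.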

Title: Re: szLeft and szRight
Post by: Larry Hammick on December 13, 2007, 08:12:45 AM

Thanks MW.

update:
Okay, after another hour of toil, I think I have shaved almost a whole microsecond off of szRight.

_szRight proc src:DWORD,dst:DWORD,ln:DWORD
    mov ecx,src
    mov edx,ln
    sub ecx,1
@@: add ecx,1
;    test byte ptr[ecx],-1
;    jz invalid_len
    test byte ptr[ecx+edx],-1
    jnz @B
    push edi
    mov edi,dst
align 4        ;the original version also benefits from this move
@@: mov al,[ecx+edx]
    mov [edi+edx],al
    sub edx,1
    jns @B       ;to copy the terminator as well -- ln+1 bytes in all
    mov eax, dst
    pop edi
;invalid_len:
    ret

_szRight endp

This library is going to have to be pretty popular if our efforts are ever going to show a profit. :green2

Title: Re: szLeft and szRight
Post by: MichaelW on December 14, 2007, 09:12:50 AM

Quote
Okay, after another hour of toil, I think I have shaved almost a whole microsecond off of szRight.

The execution time went down by only 52 nanoseconds on my 500 MHz P3, but if your app calls it 19230769 times, you will have saved a whole second :toothy
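
The call count follows directly from the numbers quoted. A quick check, assuming the clock runs at exactly 500 MHz (500 cycles per microsecond):

```latex
\frac{1\ \text{s}}{52\ \text{ns}} = \frac{10^{9}}{52} \approx 19\,230\,769 \ \text{calls},
\qquad
52\ \text{ns} \times 500\ \text{MHz} = 26 \ \text{cycles per call}.
```

So the saving is roughly 26 clock cycles per call on that machine, far short of the microsecond (500 cycles) hoped for.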

Title: Re: szLeft and szRight
Post by: hutch-- on December 14, 2007, 10:12:55 AM

Larry,

Give this a whirl.

; change this
@@: mov al,[ecx+edx]
    mov [edi+edx],al

; to this.
@@: movzx eax, BYTE PTR [ecx+edx]
    mov [edi+edx],al

It tends to be PIV code but has often worked well in dropping the time of an algo.

The MASM Forum Archive 2004 to 2012
