SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM


lingo

Hutch :lol

"Lingo,
It looks good, how does it perform on unaligned string data as source and destination and lengths that are not 4 byte boundaries ?"

My algos are for ASM professionals rather than for newbies...



MichaelW  :lol

1. This is buliaNaza's algo rather than A. Fog's algo.
A. Fog's algo has 2 additional instructions in the main loop:
       XOR   EBX, -1           ; invert all bytes
       AND   ECX, EBX          ; and these two

I know it very well because a long time ago I improved buliaNaza's algo:
http://board.win32asmcommunity.net/index.php?topic=8330.msg77056#msg77056

2. As Jimg stated above

Quote
....  Unfortunately, it get much slower if there is a high ascii character near the beginning of the string
Maybe if you replace the last part with one of the other routines when high ascii is found?

That is true. "High ascii" characters are characters above ASCII code 80h, and
if the string contains one, your algo falls into the "slow" last part, copying one byte at a time rather than a dword...

Regards,
Lingo 
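lingo's point about the two extra instructions can be illustrated in C (a sketch of mine, not code from the thread): the subtract-and-mask test alone misfires on bytes above 80h, and ANDing with the inverted dword is exactly what removes those false positives.

```c
#include <stdint.h>

// Naive dword test: reports a zero byte, but also falsely fires on
// "high ascii" bytes (e.g. 81h), because (b - 1) sets bit 7 for b >= 81h.
int has_zero_byte_naive(uint32_t v) {
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}

// Agner Fog's full test: the extra "& ~v" clears bit 7 for any byte
// that already had its high bit set, so only true zero bytes remain.
int has_zero_byte_fog(uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}
```

With the dword 0x41424381 (low byte 81h, no zero byte) the naive test fires and the full test does not, which is why the two extra instructions matter for strings containing high-ascii characters.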

hutch--

#46
 :bg

Lingo,

> My algos are for ASM professionals rather than for newbies...

I gather this means that it does not handle byte aligned data and does not handle lengths that are not multiples of 4. This is not a criticism; the algo looks good and it should be fast, as the testing shows, but it's not a general purpose algorithm, since byte data is usually of random length. One of the normal tasks for zero terminated string copy is appending to the end of a buffer where the location is not 4 byte aligned and the string being copied is not a multiple of 4 in length.

Compliments though, for its task range it is a very good algo design.

Jim,

just for my reference, what model Athlon are you testing on? My last ancient AMD died recently and I have a Sempron 2.4 in its place, which delivers similar timing spacing to the two PIVs I use, so it's worth me knowing this for writing general purpose code that is basically mixed model algo design.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark_Larson

 Hey Hutch--, there was one other problem with Lingo's code.  It writes in multiples of 8 bytes and doesn't handle string sizes that aren't divisible by 8, so it copies extra data past the end of the source string to the destination string.  If the destination string was dynamically allocated you might be writing to unowned memory and cause a crash; in either case you will be writing to variables past the destination string unless it is always 7 or more bytes longer than the source.
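The overrun Mark describes is easy to quantify. As a sketch (the function name is mine, not from the thread): a copier that moves whole 8-byte blocks and stops only after the block containing the terminator writes the string length, including the null, rounded up to the next multiple of 8 — as many as 7 bytes past the terminator.

```c
#include <stddef.h>

// Bytes written by a hypothetical copier that moves whole 8-byte
// blocks, stopping only after the block that holds the terminator:
// the length (including the NUL) rounded up to a multiple of 8.
size_t bytes_written_by_block_copy(size_t len_including_nul) {
    return (len_including_nul + 7) & ~(size_t)7;
}
```

For a 5-byte string (4 characters plus the null) this is 8 bytes written, i.e. 3 bytes past the end of an exactly-sized buffer.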

  Here's the MMX code.  I accidentally posted a string length routine in the previous post.  I haven't finished optimizing it yet, but I figured I'd post it to see if anyone can use it for their own ideas.  It does properly handle string sizes that aren't divisible by 8.  It runs in 107 cycles on my P4 for the 128 byte sample string.


szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   push esi                    ; esi is callee-saved under stdcall
   mov eax,[src_str]
   mov esi,[dst_str]

align 16
qword_copy:
   pxor mm1,mm1
   movq mm0,[eax]
   pcmpeqb mm1,mm0             ; FFh in each byte of mm1 where mm0 has a 0
   add eax,8
   pmovmskb ecx,mm1            ; pack the byte mask into ecx
   or ecx,ecx
   jnz finish_rest             ; this block contains the terminator
   movq [esi],mm0
   add esi,8
   jmp qword_copy

finish_rest:
   sub eax,8                   ; back up to the block with the null
byte_copy:
   movzx ecx,byte ptr [eax]
   add eax,1
   mov [esi],cl
   add esi,1
   test cl,cl
   jnz byte_copy               ; loop until the null byte has been copied

;   emms
   pop esi
   RET
szCopyMMX endp
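For readers without an MMX reference handy, here is a C model of the same strategy (my sketch; little-endian x86 assumed): copy 8-byte blocks while none of their bytes is zero, then finish byte by byte. Like the assembly above, it may read up to 7 bytes past the terminator.

```c
#include <stdint.h>
#include <string.h>

// C model of the 8-bytes-at-a-time copy: move whole blocks while
// none of their bytes is zero, then byte-copy the tail incl. the NUL.
// Note: like the MMX version, it can read past the end of src.
void szcopy8(char *dst, const char *src) {
    for (;;) {
        uint64_t v;
        memcpy(&v, src, 8);                          // unaligned-safe load
        // SWAR test: nonzero iff some byte of v is zero
        if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
            break;                                   // block holds the NUL
        memcpy(dst, &v, 8);
        src += 8;
        dst += 8;
    }
    while ((*dst++ = *src++) != 0)                   // byte-copy the tail
        ;
}
```

The SWAR test in the middle is the 64-bit version of the dword trick discussed earlier in the thread.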

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

hutch--

Mark,

This is looking good, very few machines don't have MMX capacity so it would be close to general purpose and certainly fast enough. Now if I understand correctly, the use of MOVQ would require at least 8 byte aligned memory in both source and destination so it cannot be used with byte aligned data.

MichaelW

#49
Jim,

Thanks for pointing out my stupidity :bg As it turns out, and as I see lingo has now pointed out, that was what the extra instructions in Agner Fog's algorithm were for. Using another method after the first high-ascii character did not seem too promising, so I added the two additional instructions from Agner Fog's algorithm so the result bytes are now 80h for a null byte, and zero for any other byte value.

Mark,

Thanks for the suggestions. Running on a P3, with the sample strings, the optimum unroll is 8. Higher values slow the code down. The current code copies the sample string in 170 cycles. The dword loop still contains a dependency, but I can't see any reasonable way to eliminate it.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm

    SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      strx db 80h,"ample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
      str2 db 128 dup(0)
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    invoke SzCopy_dw,ADDR str1,ADDR str2
    print ustr$(eax)
    print chr$(13,10)
    print ADDR str1
    print chr$("<",13,10)
    print ADDR str2
    print chr$("<",13,10,13,10)
   
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke SzCopy_dw,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250            ; reduced cycle count by ~26 ??

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke SzCopy_dw,ADDR strx,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250            ; reduced cycle count by ~26 ??

    mov   eax,input(13,10,"Press enter to exit...")
    exit

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to detect the terminating null, without
; detecting the first high-ascii character as a null, was
; adapted from the MASM32 STRLEN procedure, which was adapted
; from an algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

; Corrected another mistake that MASM was actually ignoring:
; SzCopy_dw proc uses ebx edi lpsrc:DWORD,lpdest:DWORD

SzCopy_dw proc lpsrc:DWORD,lpdest:DWORD
    push  ebx
    mov   eax,[esp+8]           ; lpsrc
    mov   edx,[esp+12]          ; lpdest
    align 4
  @@:
    REPEAT 8                    ; unroll x 8
      mov   ebx,[eax]           ; get next 4 bytes
      add   eax,4
      lea   ecx,[ebx-01010101h] ; subtract 1 from each byte
      not   ebx                 ; invert all bytes     
      and   ecx,ebx             ; and each byte with inversion
      not   ebx                 ; restore bytes
      and   ecx,80808080h       ; bytes=80h for null,otherwise 0
      jnz   @F                  ; jump if zero byte present
      mov   [edx],ebx           ; copy 4 bytes
      add   edx,4
    ENDM
    jmp   @B
  @@:
    mov   [edx],bl              ; move first byte
    add   edx,1
    test  bl,bl
    jz    @F                    ; jump if was null
    mov   [edx],bh              ; move next byte
    add   edx,1
    test  bh,bh
    jz    @F                    ; jump if was null
    shr   ebx,16
    mov   [edx],bl              ; move next byte
    add   edx,1
    test  bl,bl
    jz    @F                    ; jump if was null
    mov   [edx],bh              ; move last byte
  @@:
    mov   eax,edx               ; return length w/o null
    sub   eax,[esp+12]          ; lpdest

    pop   ebx
    ret   8
SzCopy_dw endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start



eschew obfuscation

Mark_Larson

Quote from: hutch-- on May 12, 2005, 04:13:06 AM
Mark,

This is looking good, very few machines don't have MMX capacity so it would be close to general purpose and certainly fast enough. Now if I understand correctly, the use of MOVQ would require at least 8 byte aligned memory in both source and destination so it cannot be used with byte aligned data.

  It doesn't require 8 byte aligned source and data strings. It just runs slower like a dword algorithm with unaligned dword data.  Only SSE and SSE2 require aligned source and data otherwise they generate an exception.  I also got it down to 86 cycles last night after posting it.  I need to check for bugs and make sure it works correctly before posting the changes.

Jimg

Michael-

Good job.  The best non-fpu routine so far (that works in all conditions).  On my Athlon, the best was 7 repeats: 7=162 cycles, 8=171, 9=168, 6=168, 4=166, 3=188.

Jimg


Mark- is it ever possible to have a page fault or other error reading in the string when using movq where there is only 1 byte left and it happens to be at the end of one's address space?  Or is there always enough extra space that it could never happen?
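Jimg's question can at least be made concrete. Whether an 8-byte load can touch an unmapped page depends on whether it crosses a page boundary; a sketch (4 KiB pages assumed, which is the common x86 case):

```c
#include <stdint.h>

// Does an 8-byte load starting at addr spill into the next 4 KiB page?
// If it does, and that next page is unmapped, the load can fault even
// though every byte of the string itself is readable.
int load_crosses_page(uintptr_t addr) {
    return ((addr & 0xFFF) + 8) > 0x1000;
}
```

So a qword read of a 1-byte string is only hazardous when the string sits in the last 7 bytes of a page whose successor is unmapped; heap allocators typically leave enough slack that this is rare, but it is not impossible.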

Mark_Larson

Quote from: Jimg on May 12, 2005, 02:00:08 PM
Michael-

Good job.  The best non-fpu routine so far (that works in all conditions).  On my athlon, the best was 7 repeats,7=162 cycles, 8=171, 9=168, 6=168, 4=166, 3=188.

  Jim, MMX is also non-fpu.  I think you meant "486 compatible" code.

  Hutch, what about having two routines?  An szCopy and an szCopyMMX?  And let the programmer pick which one they want to use.  Because the programmer is best going to know the minimum processor supported for his code.
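Mark's two-routine suggestion is the classic pick-once dispatch pattern. A C sketch (function names are mine; the "fast" body is a stand-in, since the real choice would point at the MMX routine when the CPU supports it):

```c
#include <string.h>

typedef char *(*copy_fn)(char *dst, const char *src);

// Plain byte-at-a-time copy, safe on any processor.
static char *szCopy_basic(char *dst, const char *src) {
    char *d = dst;
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}

// Stand-in for the MMX version; a real build would select szCopyMMX
// here after checking the CPUID feature bits.
static char *szCopy_fast(char *dst, const char *src) {
    return strcpy(dst, src);
}

// Chosen once at startup; callers then use the pointer unconditionally.
static copy_fn choose_copy(int cpu_has_mmx) {
    return cpu_has_mmx ? szCopy_fast : szCopy_basic;
}
```

This keeps the per-call cost at a single indirect call while letting the programmer (or a startup probe) decide the minimum supported processor.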

  I did some bench testing for un-aligned MMX.  I used the current version I have and did some fancy batch files to auto-check them all.  The current version I have is running in 87 cycles at work.  So I am going to use that as my baseline.  I am using Michael's posted code except I changed the timing macros to use REALTIME.

  I made some changes to the ASM file and to the batch file that compiles the code.  First, the changes to the assembler code: you pass DO_TESTING to ML.EXE with the /D switch to make it do the "special" compile that forces non-alignment, and you also pass FRED_VALUE with values ranging from 1 to 7.  So in the code snippet below, both DO_TESTING and FRED_VALUE come from the command line via /D.  Because of the "ALIGN 16" right before the DB, this forces the string to be 1 to 7 bytes off of a 16 byte alignment.


align 16
ifdef DO_TESTING
db FRED_VALUE dup (0)
endif ; ifdef DO_TESTING
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0


  I created 7 batch files, each passing one of the values from 1 to 7 for FRED_VALUE.  Here is the one that sets FRED_VALUE=2 (it also turns DO_TESTING on):


\masm32\bin\ml /D"DO_TESTING=1" /D"FRED_VALUE=2" /c /coff "szcopytest.asm"
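The seven batch files could equally be generated by a loop; a sketch in shell syntax (the original used Windows batch files, so this just echoes the equivalent command lines rather than running ml.exe):

```shell
# Print the seven ML.EXE command lines, one per FRED_VALUE
for FRED_VALUE in 1 2 3 4 5 6 7; do
  echo "\\masm32\\bin\\ml /D\"DO_TESTING=1\" /D\"FRED_VALUE=${FRED_VALUE}\" /c /coff \"szcopytest.asm\""
done
```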


  So now let's look at some timings.  I ran each one 5 times and took the fastest time running at REALTIME priority.


1 byte off of 16 byte alignment          275 clocks
2 byte off of 16 byte alignment          266 clocks
3 byte off of 16 byte alignment          283 clocks
4 byte off of 16 byte alignment          256 clocks
5 byte off of 16 byte alignment          284 clocks
6 byte off of 16 byte alignment          282 clocks
7 byte off of 16 byte alignment          284 clocks
8 byte off of 16 byte alignment          87 - ran this one manually outside of the process to verify it goes back to being the same speed.
                                          8 bytes off of a 16 byte alignment means it's 8 byte aligned, which is all MMX requires for speed.


  So the moral of the story: ALIGN, ALIGN, ALIGN (which is exactly what I said on my webpage).  However, the penalty for an un-aligned dword is a lot less than for un-aligned MMX.

Jimg

MarkL-
Quote
Jim, MMX is also non-fpu.  I think you meant "486 compatible" code.
:red (too old I guess)

pmovmskb is an instruction requiring .xmm pseudo, correct?  (xmm is not sse, right?  it's so confusing  ::) )

Also-  I tried inserting various numbers of bytes before the test string to observe the effect.  On an Athlon, there were minimal differences with or without the extra bytes.  I was using the routine you posted above, which runs about 150 cycles; where is the new one that runs in 57 cycles?

AeroASM

.mmx = MMX as normal
.xmm = MMX, SSE and SSE2

Mark_Larson

Quote from: Jimg on May 12, 2005, 04:10:19 PM

pmovmskb is an instruction requiring .xmm pseudo, correct?  (xmm is not sse, right?  it's so confusing  ::) )

Yep, pmovmskb was added with SSE support.  If I wanted to make it exclusively MMX I'd have to change it to some other instruction.

XMM is for both SSE and SSE2.  Things get confusing because when Intel added SSE support they also added additional MMX instructions.  They did a similar thing with SSE2, adding some new MMX instructions along with the new double precision floating point support.  So when I say "SSE" support I mean both the MMX instructions they added for SSE as well as the single precision floating point instructions.  That's why you have to use .xmm to use "pmovmskb" even though it's not a single precision floating point instruction: it was added as part of the P3's "SSE support".  Make sense?

The way to tell is to pull up book 2 (the instruction set reference) and go to Appendix B, where all the instructions are broken up by type.  The ones added with "SSE" support, whether they are floating point or MMX instructions, are under the sub-heading "SSE".  Very handy; I use it all the time to see which processors support which instructions.



Quote
Also-  I tried inserting a various number of bytes before the test string to observe the effect.  On an athlon, there was minimal differences with or without extra bytes.  I was using the routine you entered above, runs about 150 cycles, where is the new one that runs in 57 cycles?

  Thanks for trying that.  I had already planned on switching to AMD with my next purchase, but that's another reason why: from a programmer's perspective, AMD is a lot better.  It's 87 cycles, but it hasn't been fully checked for bugs, so I hate posting it; that isn't very professional, and people usually jump all over you :).  I guess I can post it if everyone realizes I haven't finished checking it for bugs yet.

Quote from: AeroASM on May 12, 2005, 05:33:20 PM
.mmx = MMX as normal
.xmm = MMX, SSE and SSE2

  This is correct.  I use .xmm in all my programs and it enables MMX, SSE, and SSE2; it's just a lot easier.  This, in my opinion, is a bug: try the following, and it'll give you errors if you try to use SSE or SSE2 code.  The reason is that .xmm and .mmx aren't additive; only the last one found in the code takes effect.  I found that out the hard way, so I just use .xmm in all my code to be extra sure.  You also need the ".xmm" AFTER your "option casemap" or it will start complaining that the code isn't in all CAPITAL letters (ewww).


    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive
    .xmm
    .mmx                       ; ignores .xmm and just enables MMX support



Take the code with a grain of salt, since I have not finished debugging it yet.  The only string I have tested it with is the one in Michael's test code.  It works fine with that, but with all the "special cases" I added, I need to do a lot more testing before it's ready.

  I am only showing the portion I changed.  I modified the second loop, which copies byte by byte because there are fewer than 8 bytes left to copy.

  Because of PCMPEQB and PMOVMSKB, each bit of ECX is set where the corresponding source byte is zero and clear where it is not.  So BSF lets me find, in a single instruction, which bit is set, which corresponds to where the zero is.  If there is more than one zero in a row, the first one is the one we care about, and since BSF scans from the low end of the DWORD it finds exactly that one.  After determining which bit is set, I jump to special-case code that handles the different numbers of bytes left.


bsf ecx,ecx   ; index of the first zero byte (0-7)
cmp ecx,7
je do_7
cmp ecx,6
je do_6
cmp ecx,5
je do_5
cmp ecx,4
je do_4
cmp ecx,3
je do_3
cmp ecx,2
je do_2
cmp ecx,1
je do_1

; ecx = 0: only the terminator is left
mov byte ptr [esi],0
ret 8 ; ret

do_7:
mov ecx,[eax-8]
mov edx,[eax+4-8] ;really copy 8 bytes to include the 0
mov [esi],ecx
mov [esi+4],edx
ret 8 ; ret

do_6:
mov ecx,[eax-8]
movzx edx,word ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dx
mov byte ptr [esi+6],0
ret 8 ; ret

do_5:
mov ecx,[eax-8]
movzx edx,byte ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dl
mov byte ptr [esi+5],0
ret 8 ; ret

do_4:
mov ecx,[eax-8]
mov [esi],ecx
mov byte ptr [esi+4],0
ret 8 ; ret

do_3:
movzx ecx,word ptr[eax-8]
movzx edx,byte ptr[eax+2-8]
mov [esi],cx
mov [esi+2],dl
mov byte ptr [esi+3],0
ret 8 ; ret

do_2:
movzx ecx,word ptr[eax-8]
mov [esi],cx
mov byte ptr [esi+2],0
ret 8 ; ret

do_1:
movzx ecx,byte ptr [eax-8]
mov [esi],cl
mov byte ptr [esi+1],0
ret 8 ; ret
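The mask-and-scan step above can be modeled in C (my sketch, not from the thread): pcmpeqb/pmovmskb produce a byte-mask, and a bit scan over that mask yields the index of the first zero byte — exactly what BSF computes.

```c
// Given an 8-byte block, build the pmovmskb-style mask (bit i set
// iff byte i of the block is zero) and return the index of the
// first zero byte, or 8 if the block has none.
int first_zero_byte(const unsigned char block[8]) {
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        if (block[i] == 0)
            mask |= 1u << i;      // what PCMPEQB + PMOVMSKB yield
    if (mask == 0)
        return 8;                 // no terminator in this block
    int idx = 0;                  // portable bit-scan-forward
    while (!(mask & 1u)) {
        mask >>= 1;
        idx++;
    }
    return idx;
}
```

A real BSF finds the lowest set bit directly; the loop here is just a portable stand-in for illustration.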

Jimg

Quote
Thanks for trying that.  I had already planned on switching to AMD with my next purchase, but that's another reason why.  From a programmer's perspective, AMD is a lot better.
The Athlon works good enough for me, but it's really sensitive to code alignment.  Much more difficult to fix than data alignment, imho.  And I can't get anywhere near your 87 cycles.  The code change you just presented dropped it from 149 to about 132.  I'm using an Athlon XP 3000+ at 2.16 ghz, 1g of ram.  Perhaps I have something else amiss.

Mark_Larson

Quote from: Jimg on May 13, 2005, 12:58:30 AM
Quote
Thanks for trying that.  I had already planned on switching to AMD with my next purchase, but that's another reason why.  From a programmer's perspective, AMD is a lot better.
The Athlon works good enough for me, but it's really sensitive to code alignment.  Much more difficult to fix than data alignment, imho.  And I can't get anywhere near your 87 cycles.  The code change you just presented dropped it from 149 to about 132.  I'm using an Athlon XP 3000+ at 2.16 ghz, 1g of ram.  Perhaps I have something else amiss.

In general the P4 runs MMX/SSE/SSE2 a lot faster than other processors.  Intel actually did a good job with it.  I also did one other thing to speed it up.  I got rid of the epilogue and prologue.  I forgot to post that.  It usually saves some cycles.  In case I didn't post it, align the inner loop to 16 bytes, and see how well it does.


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE


align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov eax,[esp+4]
   mov esi,[esp+8]



szCopyMMX endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


Jimg

Quote from: Mark_Larson on May 13, 2005, 01:28:31 AM
In general the P4 runs MMX/SSE/SSE2 a lot faster than other processors.  Intel actually did a good job with it.  I also did one other thing to speed it up.  I got rid of the epilogue and prologue.  I forgot to post that.  It usually saves some cycles.  In case I didn't post it, align the inner loop to 16 bytes, and see how well it does.

There is only a cycle or two difference when using the prologue/epilogue code on the Athlon.  Certainly not enough to bother with.

After extensive testing, inserting this runs the fastest-
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov eax,src_str
   mov esi,dst_str
   jmp qword_copy
   db 7 dup (0) ;7=159
qword_copy:
   pxor mm1,mm1
   movq mm0,[eax]
.
.
.
