Due to the nature of strcat, I have to 0 the memory of the buffer each test, so the following cycles include that overhead. For each trial, two strcat's are done on two different strings onto a zero'd buffer:
edit- read below for update
strcat?? What is it supposed to do? I cannot find any definition and there are no comments in the code. Huh ?
It appends the source string onto the destination string.
kunt0r,
If you have the time, would you test the speed of the two that are in the masm32 library, "szCatStr" and "szMultiCat". The second being able to handle a list of zero terminated strings.
I found an error with my "buff", it seems the larger it is, the longer these functions take. I'm not sure why, so I'll redo the tests and post them below with a fixed buff for each function. I also updated my proc to shave a few more cycles:
Type ---- 0/0 - 1/1 - 5/5 - 10/10 - 50/50 - 200/200 - 250/1000 - 1000/250
=========================================================================
kuntz 85 90 144 121 339 942 2268 3019
szCatStr 96 131 189 223 577 1709 4716 5473
ms api 200 210 252 353 723 2090 5285 6749
New code used for testing:
.686p
.model flat, stdcall
option casemap :none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib
include \masm32\include\user32.inc
includelib \masm32\lib\user32.lib
include \masm32\include\masm32.inc
includelib \masm32\lib\masm32.lib
include \masm32\kinc\strcat.inc
include \masm32\kinc\timer.inc
.data
align 4
str1 db 10 dup("a"), 0
align 4
str2 db 10 dup("a"), 0
msgstr db "Cycles: %d", 0
.data?
align 4
buff db (sizeof str1 + sizeof str2 + 3) dup(?)
.code
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
counter_begin 1000000, REALTIME_PRIORITY_CLASS
invoke RtlZeroMemory, addr buff, sizeof buff
invoke szCatStr, addr buff, addr str1
invoke szCatStr, addr buff, addr str2
counter_end
invoke wsprintf, addr str1, addr msgstr, eax
invoke MessageBox, 0, addr str1, 0, 0
invoke ExitProcess, 0
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
strcat proto :dword, :dword
.code
;##########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strcat proc xstr1:dword, xstr2:dword
mov dword ptr [esp-4], ebx
mov dword ptr [esp-8], edi
mov edi, dword ptr [esp+4]
d1: mov edx, dword ptr [edi]
mov ecx, edx
add edi, 4
sub edx, 16843009
and edx, 2155905152
jz d1
not ecx
and edx, ecx
jz d1
test dl, dl
jnz d2
test dh, dh
jnz d3
test edx, 16711680
jnz d4
jmp d5
d2: dec edi
d3: dec edi
d4: dec edi
d5: dec edi
mov eax, dword ptr [esp+8]
d6: mov edx, dword ptr [eax]
mov ecx, edx
add eax, 4
sub edx, 16843009
and edx, 2155905152
jnz d7
mov dword ptr [edi], ecx
add edi, 4
jmp d6
d7: mov ebx, ecx
not ecx
and edx, ecx
jnz d8
mov dword ptr [edi], ebx
add edi, 4
jmp d6
d8: mov dword ptr [edi], ebx
mov ebx, dword ptr [esp-4]
mov edi, dword ptr [esp-8]
ret 8
strcat endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;##########################################################################
Here is a quick rewrite of the szCatStr proc. If you have the time, will you run this one on the clock. I develop on a 2.8 gig PIV, are you timing on Intel or AMD hardware ?
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCat2 proc src:DWORD,szadd:DWORD
push edi
xor eax, eax
mov edx, [esp+8] ; src
mov edi, [esp+12] ; szadd
sub edx, 1
xor ecx, ecx
align 4
@@:
add edx, 1
cmp BYTE PTR [edx], 0
jne @B
align 4
@@:
mov al, [edi+ecx]
mov [edx+ecx], al
add ecx, 1
test al, al
jnz @B
mov eax, [esp+8] ; src return start address of source
pop edi
ret 8
szCat2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
I'm on an AMD 32-bit 2500. And the size of buff was slowing things down because of the zero memory, no idea why I didn't think of that yesterday, probably was to tired lol. I tried aligning my two loops, sped up my proc a bit on strings larger then 8:
Type ---- 0/0 - 1/1 - 5/5 - 10/10 - 50/50 - 200/200 - 250/1000 - 1000/250
=========================================================================
kuntz 90 91 143 118 333 935 2264 3010
szCatStr 96 131 189 223 577 1709 4716 5473
szCat2 71 84 122 225 595 1446 3567 4957
ms api 200 210 252 353 723 2090 5285 6749
strcat proto :dword, :dword
.code
;##########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strcat proc xstr1:dword, xstr2:dword
mov dword ptr [esp-4], ebx
mov dword ptr [esp-8], edi
mov edi, dword ptr [esp+4]
align 4
d1: mov edx, dword ptr [edi]
mov ecx, edx
add edi, 4
sub edx, 16843009
and edx, 2155905152
jz d1
not ecx
and edx, ecx
jz d1
test dl, dl
jnz d2
test dh, dh
jnz d3
test edx, 16711680
jnz d4
jmp d5
d2: dec edi
d3: dec edi
d4: dec edi
d5: dec edi
mov eax, dword ptr [esp+8]
align 4
d6: mov edx, dword ptr [eax]
mov ecx, edx
add eax, 4
sub edx, 16843009
and edx, 2155905152
jnz d7
mov dword ptr [edi], ecx
add edi, 4
jmp d6
d7: mov ebx, ecx
not ecx
and edx, ecx
jnz d8
mov dword ptr [edi], ebx
add edi, 4
jmp d6
d8: mov dword ptr [edi], ebx
mov ebx, dword ptr [esp-4]
mov edi, dword ptr [esp-8]
ret 8
strcat endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;##########################################################################
Your algo is looking good. Ther is an important test for a catstr algo and that is if it will write to unaligned addresses and uneven length strings. The way to test this is to misalign the test strings by either 1 or 3 bytes. This is an important test as string data is only ever byte aligned in most instances and the algo to be properly general purpose must be able to manage this situation.
I'm not to sure what you mean by this, but I took a wild guess and tried:
align 1
str1 db 200 dup("a"), 0
align 1
str2 db 200 dup("a"), 0
....
align 1
buff db (sizeof str1 + sizeof str2 + 3) dup(?)
Ect ect and it had little to no effect on my cycle times.
I only tested 200/200 but with all the alignment commented out I got 937, with align 1 in there I got 936, and with align 2 I got 934.
Try it this way for testing misaligned reads and writes.
align 4
dummy db 1,2,3 ; misalign by 3 bytes
TestData db "whatever"
Here's the results on Athlon xp 3000+
String1 Size 0 1 0 1 5 13 32 223 251 1031
String2 Size 0 0 1 1 5 17 16 271 1031 251
============ ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
0 byte misalignment
szCatStr 26 27 27 31 46 109 163 1320 3497 2115
hutchszCat2 12 17 18 18 33 115 152 1320 3694 2895
kuntz2 21 22 21 21 34 65 100 751 2124 1529
1 byte misalignment
szCatStr 27 27 27 29 50 114 154 1158 3494 2115
hutchszCat2 13 17 18 18 33 115 150 1309 3677 2885
kuntz2 21 22 21 22 73 105 105 818 2197 1820
2 byte misalignment
szCatStr 25 27 27 29 50 114 163 1294 3499 2116
hutchszCat2 14 17 18 18 33 115 150 1309 3676 2887
kuntz2 21 22 21 22 74 106 103 813 2189 1818
3 byte misalignment
szCatStr 25 27 27 29 50 114 163 1158 3506 2119
hutchszCat2 10 17 18 18 34 114 150 1308 3668 2885
kuntz2 21 22 21 22 34 63 126 813 2190 1818
Press enter to exit...
and the program I used-
[attachment deleted by admin]
Thank you Jimg for doing that, since the weekend is over I am busy once again with work. hutch, how does one go about supporting non-aligned data, and is it even worth it? Most compilers/assemblers would align the data for you, and I believe the address's you get from allocing memory are also aligned correct? Or am I wrong on that?
kunt0r,
I have just posted a self aligning version of Agner Fog's strlen algo in another topic on strlen algos but there may be some faster ones in that thread as well. The idea is to align the start to at least 4 bytes by running a small loop at the beginning then run the much faster DWORD technique that Agner Fog developed. This type of algo is OK in a strcat algo because it should always have memory allocated above where it may read past the end of the string.
Jim,
Thanks for running the tests, tweaking the byte scanner version made it a little aster.