News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

strcat optimize

Started by Brett Kuntz, August 13, 2005, 05:52:27 PM

Previous topic - Next topic

Brett Kuntz

Due to the nature of strcat, I have to 0 the memory of the buffer each test, so the following cycles include that overhead. For each trial, two strcat's are done on two different strings onto a zero'd buffer:

edit- read below for update

Codewarp

strcat??  What is it supposed to do?  I cannot find any definition and there are no comments in the code.  Huh ? 

Brett Kuntz

It appends the source string onto the destination string.

hutch--

kunt0r,

If you have the time, would you test the speed of the two that are in the masm32 library, "szCatStr" and "szMultiCat". The second being able to handle a list of zero terminated strings.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Brett Kuntz

I found an error with my "buff", it seems the larger it is, the longer these functions take. I'm not sure why, so I'll redo the tests and post them below with a fixed buff for each function. I also updated my proc to shave a few more cycles:


Type ---- 0/0 - 1/1 - 5/5 - 10/10 - 50/50 - 200/200 - 250/1000 - 1000/250
=========================================================================
kuntz     85    90    144   121     339     942       2268       3019
szCatStr  96    131   189   223     577     1709      4716       5473
ms api    200   210   252   353     723     2090      5285       6749


New code used for testing:


    .686p
    .model flat, stdcall
    option casemap :none

    include \masm32\include\windows.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\kernel32.lib
    include \masm32\include\user32.inc
    includelib \masm32\lib\user32.lib
    include \masm32\include\masm32.inc
    includelib \masm32\lib\masm32.lib

    include \masm32\kinc\strcat.inc
    include \masm32\kinc\timer.inc

.data
    align 4
    str1 db 10 dup("a"), 0
    align 4
    str2 db 10 dup("a"), 0
    msgstr db "Cycles: %d", 0
.data?
    align 4
    buff db (sizeof str1 + sizeof str2 + 3) dup(?)
.code
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    counter_begin 1000000, REALTIME_PRIORITY_CLASS
    invoke RtlZeroMemory, addr buff, sizeof buff
    invoke szCatStr, addr buff, addr str1
    invoke szCatStr, addr buff, addr str2
    counter_end

    invoke wsprintf, addr str1, addr msgstr, eax
    invoke MessageBox, 0, addr str1, 0, 0
    invoke ExitProcess, 0

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start



    strcat proto :dword, :dword
.code
;##########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

strcat proc xstr1:dword, xstr2:dword

    mov dword ptr [esp-4], ebx
    mov dword ptr [esp-8], edi
    mov edi, dword ptr [esp+4]
d1: mov edx, dword ptr [edi]
    mov ecx, edx
    add edi, 4
    sub edx, 16843009
    and edx, 2155905152
    jz d1
    not ecx
    and edx, ecx
    jz d1
    test dl, dl
    jnz d2
    test dh, dh
    jnz d3
    test edx, 16711680
    jnz d4
    jmp d5
d2: dec edi
d3: dec edi
d4: dec edi
d5: dec edi
    mov eax, dword ptr [esp+8]
d6: mov edx, dword ptr [eax]
    mov ecx, edx
    add eax, 4
    sub edx, 16843009
    and edx, 2155905152
    jnz d7
    mov dword ptr [edi], ecx
    add edi, 4
    jmp d6
d7: mov ebx, ecx
    not ecx
    and edx, ecx
    jnz d8
    mov dword ptr [edi], ebx
    add edi, 4
    jmp d6
d8: mov dword ptr [edi], ebx
    mov ebx, dword ptr [esp-4]
    mov edi, dword ptr [esp-8]
    ret 8

strcat endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;##########################################################################

hutch--

Here is a quick rewrite of the szCatStr proc. If you have the time, will you run this one on the clock. I develop on a 2.8 gig PIV, are you timing on Intel or AMD hardware ?


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szCat2 proc src:DWORD,szadd:DWORD

    push edi

    xor eax, eax
    mov edx, [esp+8]    ; src
    mov edi, [esp+12]   ; szadd
    sub edx, 1
    xor ecx, ecx

  align 4
  @@:
    add edx, 1
    cmp BYTE PTR [edx], 0
    jne @B

  align 4
  @@:
    mov al, [edi+ecx]
    mov [edx+ecx], al
    add ecx, 1
    test al, al
    jnz @B

    mov eax, [esp+8]    ; src   return start address of source

    pop edi

    ret 8

szCat2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Brett Kuntz

I'm on an AMD 32-bit 2500. And the size of buff was slowing things down because of the zero memory, no idea why I didn't think of that yesterday, probably was to tired lol. I tried aligning my two loops, sped up my proc a bit on strings larger then 8:


Type ---- 0/0 - 1/1 - 5/5 - 10/10 - 50/50 - 200/200 - 250/1000 - 1000/250
=========================================================================
kuntz     90    91    143   118     333     935       2264       3010
szCatStr  96    131   189   223     577     1709      4716       5473
szCat2    71    84    122   225     595     1446      3567       4957
ms api    200   210   252   353     723     2090      5285       6749



    strcat proto :dword, :dword
.code
;##########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

strcat proc xstr1:dword, xstr2:dword

    mov dword ptr [esp-4], ebx
    mov dword ptr [esp-8], edi
    mov edi, dword ptr [esp+4]
    align 4
d1: mov edx, dword ptr [edi]
    mov ecx, edx
    add edi, 4
    sub edx, 16843009
    and edx, 2155905152
    jz d1
    not ecx
    and edx, ecx
    jz d1
    test dl, dl
    jnz d2
    test dh, dh
    jnz d3
    test edx, 16711680
    jnz d4
    jmp d5
d2: dec edi
d3: dec edi
d4: dec edi
d5: dec edi
    mov eax, dword ptr [esp+8]
    align 4
d6: mov edx, dword ptr [eax]
    mov ecx, edx
    add eax, 4
    sub edx, 16843009
    and edx, 2155905152
    jnz d7
    mov dword ptr [edi], ecx
    add edi, 4
    jmp d6
d7: mov ebx, ecx
    not ecx
    and edx, ecx
    jnz d8
    mov dword ptr [edi], ebx
    add edi, 4
    jmp d6
d8: mov dword ptr [edi], ebx
    mov ebx, dword ptr [esp-4]
    mov edi, dword ptr [esp-8]
    ret 8

strcat endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;##########################################################################

hutch--

Your algo is looking good. Ther is an important test for a catstr algo and that is if it will write to unaligned addresses and uneven length strings. The way to test this is to misalign the test strings by either 1 or 3 bytes. This is an important test as string data is only ever byte aligned in most instances and the algo to be properly general purpose must be able to manage this situation.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Brett Kuntz

I'm not to sure what you mean by this, but I took a wild guess and tried:

align 1
str1 db 200 dup("a"), 0
align 1
str2 db 200 dup("a"), 0
....
align 1
buff db (sizeof str1 + sizeof str2 + 3) dup(?)

Ect ect and it had little to no effect on my cycle times.

I only tested 200/200 but with all the alignment commented out I got 937, with align 1 in there I got 936, and with align 2 I got 934.

hutch--

Try it this way for testing misaligned reads and writes.


align 4
  dummy db 1,2,3                ; misalign by 3 bytes
  TestData db "whatever"
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Here's the results on Athlon xp 3000+


String1 Size     0     1     0     1     5    13    32   223   251  1031
String2 Size     0     0     1     1     5    17    16   271  1031   251
============ ===== ===== ===== ===== ===== ===== ===== ===== ===== =====

0 byte misalignment
szCatStr        26    27    27    31    46   109   163  1320  3497  2115
hutchszCat2     12    17    18    18    33   115   152  1320  3694  2895
kuntz2          21    22    21    21    34    65   100   751  2124  1529

1 byte misalignment
szCatStr        27    27    27    29    50   114   154  1158  3494  2115
hutchszCat2     13    17    18    18    33   115   150  1309  3677  2885
kuntz2          21    22    21    22    73   105   105   818  2197  1820

2 byte misalignment
szCatStr        25    27    27    29    50   114   163  1294  3499  2116
hutchszCat2     14    17    18    18    33   115   150  1309  3676  2887
kuntz2          21    22    21    22    74   106   103   813  2189  1818

3 byte misalignment
szCatStr        25    27    27    29    50   114   163  1158  3506  2119
hutchszCat2     10    17    18    18    34   114   150  1308  3668  2885
kuntz2          21    22    21    22    34    63   126   813  2190  1818

Press enter to exit...


and the program I used-



[attachment deleted by admin]

Brett Kuntz

Thank you Jimg for doing that, since the weekend is over I am busy once again with work. hutch, how does one go about supporting non-aligned data, and is it even worth it? Most compilers/assemblers would align the data for you, and I believe the address's you get from allocing memory are also aligned correct? Or am I wrong on that?

hutch--

kunt0r,

I have just posted a self aligning version of Agner Fog's strlen algo in another topic on strlen algos but there may be some faster ones in that thread as well. The idea is to align the start to at least 4 bytes by running a small loop at the beginning then run the much faster DWORD technique that Agner Fog developed. This type of algo is OK in a strcat algo because it should always have memory allocated above where it may read past the end of the string.

Jim,

Thanks for running the tests, tweaking the byte scanner version made it a little aster.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php