szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM


GregL

Herge,

You could use MASM 6.15 or later also.


herge


Hi Greg:

I tried the ML.EXE that comes with Microsoft Visual C++ 2005.
It assembles okay, but when you run the EXE you get a C...5 access violation
error and are asked to send a report to Microsoft.

lstrlenA return value     : 1024
strlen64Lingo return value: 1024
AzmtStrLen1A return value : 1024
AzmtStrLen2A return value : 1024
markl_szlen return value  : 1024
StringLength_Mmx_Min
(MMX/SSE2) return value   : 1024
StrSizeA(SSE2) value      : 1024
_strlen return value      : 1024

align 1k

lstrlenA return value: 1024
lstrlenA      :       1082 cycles
strlen64Lingo :       84 cycles
AzmtStrLen1A  :       1061 cycles
AzmtStrLen2A  :       1061 cycles
markl_szlen   :       415 cycles
StringLength_Mmx_Min: 275 cycles
StrSizeA(SSE2):       168 cycles
_strlen (Agner Fog):  183 cycles

align 0

lstrlenA return value: 191
lstrlenA      :       264 cycles
strlen64Lingo :       85 cycles
AzmtStrLen1A  :       194 cycles
AzmtStrLen2A  :       194 cycles
markl_szlen   :       107 cycles
StringLength_Mmx_Min: 71 cycles
StrSizeA(SSE2):       27 cycles
_strlen (Agner Fog):  37 cycles

align 1

lstrlenA return value: 191
lstrlenA      :       240 cycles
strlen64Lingo :       86 cycles
AzmtStrLen1A  :       203 cycles
AzmtStrLen2A  :       203 cycles
markl_szlen   :       154 cycles
StringLength_Mmx_Min: 109 cycles
StrSizeA(SSE2):       ; It Blows up HERE!



Microsoft writes a report to
C:\DOCUME~1\User\LOCALS~1\Temp\a488_appcompat.txt
which, for reasons I don't understand, I can't find.
It shows a dump in a list box that you cannot copy,
which I must say is most helpful!

I believe we get a C5 access violation error.

Attachments: StrLenaLingo ASM, OBJ, EXE, PDB

Regards herge




[attachment deleted by admin]
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

jj2007

Hi Herge,
There is a new version towards the bottom of page 13 of this thread, in this post. You have a previous one with a tiny bug:

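StrSizeA proc lpStrA:DWORD
      mov edx, DWORD ptr [esp+4]     ; edx = string pointer (valid only without a stack frame, e.g. OPTION PROLOGUE:NONE)
      pxor xmm1, xmm1                ; xmm1 = 16 zero bytes to compare against
      mov ecx, edx
      neg ecx                        ; ecx = -(start address)
      align 16
@@:   movdqu xmm0, OWORD ptr [edx]   ; unaligned load of the next 16 bytes
      lea edx, [edx+16]              ; advance the pointer without touching the flags
      pcmpeqb xmm0, xmm1             ; FFh in every byte position that holds a 0
      pmovmskb eax, xmm0             ; one mask bit per byte
      test eax, eax
      jz @B                          ; no zero byte in this block: keep scanning

      lea ecx, [edx+ecx-16]          ; ecx = offset of the block that holds the terminator
      xor edx, edx
      bsf edx, eax                   ; edx = index of the first zero byte in that block
      lea eax, [ecx+edx]             ; eax = string length
      ret 4

StrSizeA endp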

The new version strlen32 is faster and shorter and does not crash.

herge


Hi jj2007:

We Have Lift Off!


lstrlenA return value     : 1024
strlen64Lingo return value: 1024
AzmtStrLen1A return value : 1024
AzmtStrLen2A return value : 1024
markl_szlen return value  : 1024
StringLength_Mmx_Min

(MMX/SSE2) return value   : 1024
StrSizeA(SSE2) value      : 1024
_strlen return value      : 1024

align 1k


lstrlenA return value: 1024
lstrlenA      :       1077 cycles
strlen64Lingo :       84 cycles
AzmtStrLen1A  :       1056 cycles
AzmtStrLen2A  :       1056 cycles
markl_szlen   :       413 cycles
StringLength_Mmx_Min: 275 cycles
StrSizeA(SSE2):       224 cycles
_strlen (Agner Fog):  182 cycles

align 0


lstrlenA return value: 191
lstrlenA      :       259 cycles
strlen64Lingo :       83 cycles
AzmtStrLen1A  :       194 cycles
AzmtStrLen2A  :       193 cycles
markl_szlen   :       105 cycles
StringLength_Mmx_Min: 71 cycles
StrSizeA(SSE2):       38 cycles
_strlen (Agner Fog):  37 cycles

align 1


lstrlenA return value: 191
lstrlenA      :       238 cycles
strlen64Lingo :       84 cycles
AzmtStrLen1A  :       201 cycles
AzmtStrLen2A  :       202 cycles
markl_szlen   :       152 cycles
StringLength_Mmx_Min: 109 cycles
StrSizeA(SSE2):       91 cycles
_strlen (Agner Fog):  49 cycles

align 4


lstrlenA return value: 191
lstrlenA      :       239 cycles
strlen64Lingo :       84 cycles
AzmtStrLen1A  :       191 cycles
AzmtStrLen2A  :       191 cycles
markl_szlen   :       105 cycles
StringLength_Mmx_Min: 95 cycles
StrSizeA(SSE2):       104 cycles
_strlen (Agner Fog):  44 cycles

align 7


lstrlenA return value: 191
lstrlenA      :       235 cycles
strlen64Lingo :       84 cycles
AzmtStrLen1A  :       211 cycles
AzmtStrLen2A  :       210 cycles
markl_szlen   :       98 cycles
StringLength_Mmx_Min: 106 cycles
StrSizeA(SSE2):       138 cycles
_strlen (Agner Fog):  49 cycles


Thank you jj2007.

Regards herge

NightWare

Quote from: jj2007 on March 09, 2009, 04:42:39 PM
Why is it apparently not necessary to reset xmm0 and xmm1 inside the loop?
Because pcmpeqb leaves xmm0 and xmm1 all zero after every comparison until a zero byte is found, so they are already cleared for the next pass.
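
That behaviour of pcmpeqb can be checked in isolation. What follows is only a minimal sketch, assuming the MASM32 SDK at its default path; NoZero and HasZero are made-up labels, not part of anybody's algo in this thread.

```asm
include \masm32\include\masm32rt.inc   ; assumption: MASM32 SDK installed at its default path
.686
.xmm

.data
align 16
NoZero  db 16 dup ("A")                ; 16 bytes, none of them 0
HasZero db "ABC", 0, 12 dup ("B")      ; first 0 at offset 3 (still 16-byte aligned)

.code
start:
    pxor xmm0, xmm0                    ; xmm0 = 16 zero bytes
    pcmpeqb xmm0, OWORD PTR NoZero     ; per byte: FFh if the memory byte is 0, else 00h
    pmovmskb esi, xmm0                 ; esi = 0, and xmm0 is all zero again
    pcmpeqb xmm0, OWORD PTR HasZero    ; reuse xmm0 without clearing it first
    pmovmskb edi, xmm0                 ; bit 3 set, edi = 8
    bsf edi, edi                       ; edi = 3, the offset of the first zero byte
    print str$(esi), 13, 10            ; expected: 0
    print str$(edi), 13, 10            ; expected: 3
    exit
end start
```

Both results are kept in esi and edi before anything is printed, because the API calls behind print are free to trash the xmm registers.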

Here is a new one, but it still needs testing (I only ran a few tests while writing it):
ALIGN 16
;
; Usage :
; mov esi,OFFSET String
; call NWStrLen
;
; Return :
; eax = String Length
;
NWStrLen PROC
push ecx ;; save ecx
push edx ;; save edx

mov edx,esi ;; put the start address in edx
pxor XMM0,XMM0 ;; ) clear XMM0 and XMM1 (they will be our comparison registers)
pxor XMM1,XMM1 ;; )
; here we test a block of x characters (depends on the alignment) to see whether it contains a 0
movdqu XMM2,OWORD PTR [edx] ;; load the oword at the address in edx into XMM2
and edx,0FFFFFFF0h ;; keep the preceding 16-byte-aligned address in edx
pcmpeqb XMM0,XMM2 ;; compare XMM2 with XMM0
pcmpeqb XMM1,OWORD PTR [edx+16] ;; compare the oword at address edx+16 with XMM1
por XMM1,XMM0 ;; merge XMM0 into XMM1
pmovmskb eax,XMM1 ;; build the mask of XMM1 in eax
test eax,eax ;; set the flags from eax
jz Label1 ;; if it is 0 (no 0 found), go to Label1
; here we look for the 0 in the block of x characters
shl eax,16 ;; shift eax (which already holds the XMM1 mask) left by half a dword
mov ecx,esi ;; put the original address in ecx
sub ecx,edx ;; subtract the preceding aligned address
shr eax,cl ;; shift eax right by the alignment offset
pmovmskb ecx,XMM0 ;; build the mask of XMM0 in ecx
or eax,ecx ;; merge ecx into eax
bsf eax,eax ;; scan eax for the first set bit, starting from the right

pop edx ;; restore edx
pop ecx ;; restore ecx
ret ;; return (leave the procedure)

nop ;; ) alignment needed for better performance
nop ;; )
nop ;; )
; nop ;; )
; nop ;; )
; nop ;; )
; nop ;; )
; nop ;; )
; here we test a block of 32 characters to see whether it contains a 0
Label1: add edx,OWORD*2 ;; add 32 (our step size) to edx
pcmpeqb XMM0,OWORD PTR [edx] ;; compare the oword at address edx with XMM0
pcmpeqb XMM1,OWORD PTR [edx+16] ;; compare the oword at address edx+16 with XMM1
por XMM1,XMM0 ;; merge XMM0 into XMM1
pmovmskb eax,XMM1 ;; build the mask of XMM1 in eax
test eax,eax ;; set the flags from eax
jz Label1 ;; if it is 0 (no 0 found), go to Label1
; here we look for the 0 in the block of 32 characters
pmovmskb ecx,XMM0 ;; build the mask of XMM0 in ecx
shl eax,16 ;; shift eax (which already holds the XMM1 mask) left by half a dword
or eax,ecx ;; merge ecx into eax
sub edx,esi ;; subtract the start address from edx
bsf eax,eax ;; scan eax for the first set bit, starting from the right
add eax,edx ;; add edx to eax to get the final length

pop edx ;; restore edx
pop ecx ;; restore ecx
ret ;; return (leave the procedure)
NWStrLen ENDP

lingo

I modified my strlen64 a bit and created a new strlen64A (thanks to NightWare for the movdqu idea)  :wink
I used jj's test program and have new results:

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
strlen32 retval: 5, 10, 15, 20, 25, 30, 35, 40, 45
strlen32 retval: 5, 10, 15, 20, 25, 30, 35, 40, 45
lstrlenA return value     : 1024
strlen64Lingo return value: 1024
strlen32 return value:      1024
StrSizeA(SSE2) value      : 1024
_strlen return value      : 1024

strlen64A return value      : 1024

align 1k
lstrlenA return value: 1024
strlen32 return value: 1024
strlen32      :       105 cycles
strlen64Lingo :       84 cycles
strlen64LingoA:       83 cycles
_strlen (Agner Fog):  180 cycles

align 0
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       26 cycles
strlen64Lingo :       18 cycles
strlen64LingoA:       19 cycles
_strlen (Agner Fog):  40 cycles

align 1
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       26 cycles
strlen64Lingo :       not possible
strlen64LingoA:       22 cycles
_strlen (Agner Fog):  50 cycles

align 4
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       26 cycles
strlen64Lingo :       not possible
strlen64LingoA:       23 cycles
_strlen (Agner Fog):  50 cycles

align 7
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       26 cycles
strlen64Lingo :       not possible
strlen64LingoA:       23 cycles
_strlen (Agner Fog):  50 cycles

Press any key to exit...





[attachment deleted by admin]

sinsi

This is getting good...

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
strlen32 retval: 5, 10, 15, 20, 25, 30, 35, 40, 45
strlen32 retval: 5, 10, 15, 20, 25, 30, 35, 40, 45
lstrlenA return value     : 1024
strlen64Lingo return value: 1024
strlen32 return value:      1024
StrSizeA(SSE2) value      : 1024
_strlen return value      : 1024

strlen64A return value      : 1024

align 1k
lstrlenA return value: 1024
strlen32 return value: 1024
strlen32      :       97 cycles
strlen64Lingo :       84 cycles
strlen64LingoA:       78 cycles
_strlen (Agner Fog):  178 cycles

align 0
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       24 cycles
strlen64Lingo :       19 cycles
strlen64LingoA:       20 cycles
_strlen (Agner Fog):  40 cycles

align 1
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       29 cycles
strlen64Lingo :       not possible
strlen64LingoA:       23 cycles
_strlen (Agner Fog):  49 cycles

align 4
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       25 cycles
strlen64Lingo :       not possible
strlen64LingoA:       23 cycles
_strlen (Agner Fog):  49 cycles

align 7
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       24 cycles
strlen64Lingo :       not possible
strlen64LingoA:       23 cycles
_strlen (Agner Fog):  49 cycles

Hey jj, the CPU identification is good, now add the Windows version to it as well.  :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on March 10, 2009, 05:45:06 AM
This is getting good...
...
Hey jj, the CPU identification is good, now add the Windows version to it as well.  :bg

XP unless otherwise specified. Speedwise, it should not make any difference. You may check this thread, but be warned: what M$ expects us to do to detect the version is no good for your mental health.

I have incorporated Lingo's new algo, and replaced lstrlen with crt_strlen because lstrlen is no longer a serious competitor for these algos.

              Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
codesizes: strlen32=92, strlen64A=117, _strlen=66

-- test 16k           return values jj, Lingo, Agner: 16384, 16384, 16384
crt_strlen    :       16155 cycles
strlen32      :       4819 cycles
strlen64LingoA :      6208 cycles
_strlen (Agner Fog):  10044 cycles

-- test 4k            return values jj, Lingo, Agner: 4096, 4096, 4096
crt_strlen    :       3973 cycles
strlen32      :       1144 cycles
strlen64LingoA :      1137 cycles
_strlen (Agner Fog):  2308 cycles

-- test 1k            return values jj, Lingo, Agner: 1024, 1024, 1024
crt_strlen    :       1046 cycles
strlen32      :       362 cycles
strlen64LingoA :      357 cycles
_strlen (Agner Fog):  651 cycles

-- test 0             return values jj, Lingo, Agner: 191, 191, 191
crt_strlen    :       260 cycles
strlen32      :       73 cycles
strlen64LingoA :      78 cycles
_strlen (Agner Fog):  108 cycles

-- test 1             return values jj, Lingo, Agner: 191, 191, 191
crt_strlen    :       255 cycles
strlen32      :       84 cycles
strlen64LingoA :      91 cycles
_strlen (Agner Fog):  115 cycles

-- test 4             return values jj, Lingo, Agner: 191, 191, 191
crt_strlen    :       242 cycles
strlen32      :       78 cycles
strlen64LingoA :      80 cycles
_strlen (Agner Fog):  116 cycles

-- test 7             return values jj, Lingo, Agner: 191, 191, 191
crt_strlen    :       257 cycles
strlen32      :       79 cycles
strlen64LingoA :      80 cycles
_strlen (Agner Fog):  111 cycles

[attachment deleted by admin]

herge

 Hi jj2007:

Good morning, here are my results for
strlenSSE2.exe:


Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32=92, strlen64A=117, _strlen=66

-- test 16k       return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen    :       9666 cycles
strlen32      :       1479 cycles
strlen64LingoA :      1139 cycles
_strlen (Agner Fog):  2817 cycles

-- test 4k       return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       2427 cycles
strlen32      :       405 cycles
strlen64LingoA :      333 cycles
_strlen (Agner Fog):  720 cycles

-- test 1k       return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen    :       648 cycles
strlen32      :       101 cycles
strlen64LingoA :      98 cycles
_strlen (Agner Fog):  197 cycles

-- test 0       return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       123 cycles
strlen32      :       26 cycles
strlen64LingoA :      20 cycles
_strlen (Agner Fog):  56 cycles

-- test 1       return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       122 cycles
strlen32      :       26 cycles
strlen64LingoA :      33 cycles
_strlen (Agner Fog):  40 cycles

-- test 4       return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       122 cycles
strlen32      :       26 cycles
strlen64LingoA :      23 cycles
_strlen (Agner Fog):  46 cycles

-- test 7       return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       119 cycles
strlen32      :       26 cycles
strlen64LingoA :      23 cycles
_strlen (Agner Fog):  40 cycles

Press any key to exit...


Regards herge

lingo

On my old laptop with Vista64 Ultimate SP1:  :wink
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
strlen32 retval: 5, 10, 15, 20, 25, 30, 35, 40, 45
strlen32 retval: 5, 10, 15, 20, 25, 30, 35, 40, 45
lstrlenA return value     : 1024
strlen64Lingo return value: 1024
strlen32 return value:      1024
StrSizeA(SSE2) value      : 1024
_strlen return value      : 1024

strlen64A return value      : 1024

align 1k
lstrlenA return value: 1024
strlen32 return value: 1024
strlen32      :       285 cycles
strlen64Lingo :       236 cycles
strlen64LingoA:       236 cycles
_strlen (Agner Fog):  942 cycles

align 0
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       109 cycles
strlen64Lingo :       53 cycles
strlen64LingoA:       54 cycles
_strlen (Agner Fog):  223 cycles

align 1
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       74 cycles
strlen64Lingo :       not possible
strlen64LingoA:       64 cycles
_strlen (Agner Fog):  197 cycles

align 4
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       74 cycles
strlen64Lingo :       not possible
strlen64LingoA:       64 cycles
_strlen (Agner Fog):  198 cycles

align 7
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       74 cycles
strlen64Lingo :       not possible
strlen64LingoA:       64 cycles
_strlen (Agner Fog):  197 cycles

Press any key to exit...


AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
codesizes: strlen32=92, strlen64A=117, _strlen=66

-- test 16k           return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen    :       16537 cycles
strlen32      :       3182 cycles
strlen64LingoA :      3126 cycles
_strlen (Agner Fog):  14014 cycles

-- test 4k            return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       4132 cycles
strlen32      :       867 cycles
strlen64LingoA :      815 cycles
_strlen (Agner Fog):  3537 cycles

-- test 1k            return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen    :       1051 cycles
strlen32      :       288 cycles
strlen64LingoA :      236 cycles
_strlen (Agner Fog):  939 cycles

-- test 0             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       222 cycles
strlen32      :       113 cycles
strlen64LingoA :      54 cycles
_strlen (Agner Fog):  225 cycles

-- test 1             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       217 cycles
strlen32      :       76 cycles
strlen64LingoA :      65 cycles
_strlen (Agner Fog):  197 cycles

-- test 4             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       214 cycles
strlen32      :       76 cycles
strlen64LingoA :      63 cycles
_strlen (Agner Fog):  197 cycles

-- test 7             return values Lingo, jj, Agner: 191, 191, 191
crt_strlen    :       211 cycles
strlen32      :       76 cycles
strlen64LingoA :      64 cycles
_strlen (Agner Fog):  198 cycles

Press any key to exit...



jj2007

Thanks, very interesting. It seems the two algos are roughly equivalent, with Lingo's a bit stronger on AMD and Core2 (Herge) and mine stronger on P4s and (marginally) on Celeron M. In any case, Hutch faces a difficult choice for the next Masm32 version:

-- test 1k --
Masm32 lib szLen    : 2215 cycles
crt_strlen    :       1042 cycles
strlen32      :       354 cycles
strlen64LingoA :      354 cycles
_strlen (Agner Fog):  648 cycles

-- test aligned 1, 191 bytes --
Masm32 lib szLen :    515 cycles
crt_strlen    :       262 cycles
strlen32      :       73 cycles
strlen64LingoA :      105 cycles
_strlen (Agner Fog):  111 cycles


A factor of 6-7 on one of the most popular functions is not so bad :green2

lingo

I can't understand what happened with your PC or with you... :lol
New nonsense about the same program and test:
'strlen64LingoA :      105 cycles !!!'
Please take a look at your previous messages about the same test and program.
Where is the truth?


jj2007

Quote from: lingo on March 10, 2009, 02:03:40 PM
I can't understand what happened with your PC or with you... :lol
New nonsense about the same program and test:
'strlen64LingoA :      105 cycles !!!'
Please take a look at your previous messages about the same test and program.
Where is the truth?

The truth is that timings tend not to be 100% accurate, and that I have a P4 in the office and a Celeron M at home. Your algo is marginally slower than mine on a P4 for short unaligned strings... no need to panic, dear friend :thumbu

Here are some more timings with a higher LOOP_COUNT:
-- test 0             0=perfectly aligned on 16-byte boundary
crt_strlen    :       243 cycles
strlen32      :       74 cycles
strlen64LingoA :      71 cycles
_strlen (Agner Fog):  105 cycles

-- test 1             1=misaligned 1 byte
crt_strlen    :       247 cycles
strlen32      :       75 cycles
strlen64LingoA :      90 cycles
_strlen (Agner Fog):  111 cycles

-- test 4             return values
crt_strlen    :       240 cycles
strlen32      :       76 cycles
strlen64LingoA :      81 cycles
_strlen (Agner Fog):  130 cycles

-- test 7             return values
crt_strlen    :       243 cycles
strlen32      :       74 cycles
strlen64LingoA :      83 cycles
_strlen (Agner Fog):  114 cycles


Your algo seems faster on AMD and Core Duo. In any case, you should be proud of having found an algo that is 5 times as fast as the fastest M$ algo, and (for longer strings) twice as fast as the latest Agner Fog algo. My own one is a minor adaptation of yours, so the credit goes to you anyway :U

hutch--

 :bg

> In any case, Hutch faces a difficult choice for the next Masm32 version:

Yeah ?

No I don't. I have been watching musical chairs on string length algos for at least the last 10 years. In about 99.9999999999999999999999999% of cases the slow byte scanner is more than fast enough, and in the .0 --- 0001% of other cases Agner Fog's algo is even more than fast enough. Speed is great, but the gains must also be useful, and string length algos are rarely ever a big deal.

On a native 64-bit box it should be a toss-up between native 64-bit and emulated 128-bit SSE3/4/? on paragraph alignment, shame most string data is aligned to 1.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on March 10, 2009, 02:38:50 PM
:bg
... paragraph alignment, shame most string data is aligned to 1.

Shame you don't read the posts in your own forum. Both 'winner' algos have no problem with misalignment.
:(
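
For reference, one common way to make an SSE2 scanner indifferent to misalignment is to round the pointer down to a 16-byte boundary, read the whole aligned block, and shift away the mask bits that belong to bytes in front of the string. The sources of strlen32 and strlen64A are not shown in this part of the thread, so this is not claimed to be what either of them does. SketchStrLen, Filler and TestStr are made-up names, and the MASM32 SDK is assumed at its default path.

```asm
include \masm32\include\masm32rt.inc   ; assumption: MASM32 SDK installed at its default path
.686
.xmm

.data
align 16
Filler  db "xxxxx"                     ; 5 bytes, so TestStr starts misaligned by 5
TestStr db "misaligned string", 0      ; 17 characters

.code
SketchStrLen proc lpStr:DWORD          ; hypothetical routine, for illustration only
    mov edx, lpStr
    mov ecx, edx
    and edx, -16                       ; edx = start of the aligned block holding the first byte
    and ecx, 15                        ; ecx = bytes in that block that precede the string
    pxor xmm0, xmm0
    pcmpeqb xmm0, OWORD PTR [edx]      ; aligned 16-byte read, cannot cross a page boundary
    pmovmskb eax, xmm0
    shr eax, cl                        ; drop matches located before the string start
    bsf eax, eax                       ; ZF=0 if a terminator was found, eax = length
    jnz Done
    pxor xmm0, xmm0                    ; a 0 in front of the string may have left bits set
@@: add edx, 16
    pcmpeqb xmm0, OWORD PTR [edx]      ; xmm0 stays zero until a terminator shows up
    pmovmskb eax, xmm0
    test eax, eax
    jz @B
    bsf eax, eax                       ; index of the terminator inside this block
    sub edx, lpStr                     ; distance from the string start to this block
    add eax, edx                       ; eax = string length
Done:
    ret
SketchStrLen endp

start:
    invoke SketchStrLen, addr TestStr
    print str$(eax), 13, 10            ; expected: 17
    exit
end start
```

Because 4096 is a multiple of 16, an aligned 16-byte block always lies inside a single page, so this kind of scanner never touches a page the string itself does not reach. An unaligned movdqu read gives no such guarantee.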