News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Previous topic - Next topic

herge

 Hi jj2007:


lstrlenA return value     : 1024
strlen64Lingo return value: 1024
AzmtStrLen1A return value : 1024
AzmtStrLen2A return value : 1024
markl_szlen return value  : 1024
StringLength_Mmx_Min
(MMX/SSE2) return value   : 1024
StrSizeA(SSE2) value      : 1024
_strlen return value      : 1024

align 1k

lstrlenA return value: 1024
lstrlenA      :       1090 cycles
strlen64Lingo :       96 cycles
AzmtStrLen1A  :       1071 cycles
AzmtStrLen2A  :       1074 cycles
markl_szlen   :       416 cycles
StringLength_Mmx_Min: 245 cycles
StrSizeA(SSE2):       226 cycles
_strlen (Agner Fog):  195 cycles

align 0

lstrlenA return value: 191
lstrlenA      :       262 cycles
strlen64Lingo :       99 cycles
AzmtStrLen1A  :       195 cycles
AzmtStrLen2A  :       195 cycles
markl_szlen   :       107 cycles
StringLength_Mmx_Min: 49 cycles
StrSizeA(SSE2):       40 cycles
_strlen (Agner Fog):  51 cycles

align 1

lstrlenA return value: 191
lstrlenA      :       241 cycles
strlen64Lingo :       98 cycles
AzmtStrLen1A  :       204 cycles
AzmtStrLen2A  :       205 cycles
markl_szlen   :       154 cycles
StringLength_Mmx_Min: 104 cycles
StrSizeA(SSE2):       91 cycles
_strlen (Agner Fog):  40 cycles

align 4

lstrlenA return value: 191
lstrlenA      :       240 cycles
strlen64Lingo :       97 cycles
AzmtStrLen1A  :       195 cycles
AzmtStrLen2A  :       195 cycles
markl_szlen   :       106 cycles
StringLength_Mmx_Min: 108 cycles
StrSizeA(SSE2):       104 cycles
_strlen (Agner Fog):  45 cycles

align 7

lstrlenA return value: 191
lstrlenA      :       241 cycles
strlen64Lingo :       98 cycles
AzmtStrLen1A  :       213 cycles
AzmtStrLen2A  :       214 cycles
markl_szlen   :       99 cycles
StringLength_Mmx_Min: 105 cycles
StrSizeA(SSE2):       129 cycles
_strlen (Agner Fog):  40 cycles




Result on my computer.

regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

NightWare

Quote from: drizz on March 07, 2009, 05:24:12 PM
All of the routines, except Agner Fog's, fail on unaligned read beyond end of the buffer
...
zero should be returned, not access violation.
technically, -1 should be returned (0 for empty string, and obviously here it's not the case  :wink)
anyway, agner's algo isn't (really) safe, he has just displaced the problem from the end of the area to the beginning. so yes, here it works on current systems coz of microsoft functions, but there is no warrantry it will work in the futur... anyway it's not the proper way to solve the problem, your functions dedicated to string should simply contain a "safe area"...

drizz

What are you talking about?!?

Quote from: NightWare on March 09, 2009, 12:29:11 AM
technically, -1 should be returned (0 for empty string, and obviously here it's not the case  :wink)
but _it is_ an empty string

Quote from: NightWare on March 09, 2009, 12:29:11 AM
anyway, agner's algo isn't (really) safe, he has just displaced the problem from the end of the area to the beginning. so yes, here it works on current systems coz of microsoft functions, but there is no warrantry it will work in the futur...
"coz of microsoft functions"  ?!??

Which operating system allocates memory pages that are not 4kB or bigger?
Which operating system memory allocation functions return unaligned pointer?

facts please!

VirtualAlloc/NtAllocateMemory = 4kB aligned pointer
HeapAlloc/GlobalAlloc = 8-byte aligned on 32-bit platforms and 16-bytes on 64-bit


Windows, Linux, BSD or Mac, 32-bit x86
Quote;*************************  strlenSSE2.asm  **********************************
; Author:           Agner Fog
...
; Operating system: Windows, Linux, BSD or Mac, 32-bit x86

His function is safe, others are not!

AzmtStrLen1A - requires 4-byte aligned pointer
AzmtStrLen2A - requires 4-byte aligned pointer
markl_szlen - requires 16-byte aligned pointer
StringLength_Mmx_Min - requires 8-byte aligned pointer
StrSizeA - requires 16-byte aligned pointer
strlen64 - requires 32-byte aligned pointer
_strlen - no pointer alignment requirements
The truth cannot be learned ... it can only be recognized.

jj2007

Quote from: NightWare on March 09, 2009, 12:29:11 AM
anyway, agner's algo isn't (really) safe, he has just displaced the problem from the end of the area to the beginning.

Agner's algo will fail, like all others, in the specific case more thoroughly described here. The issue is a non-issue for all real world applications; i.e. you must construct a test case where no null byte is being found near the page boundary, and the memory is allocated with VirtualAlloc. HeapAlloc'ed memory does not throw an exception if you go some bytes beyond the page boundary.

The trick with his algo is that he starts on a safe boundary, and then uses a bsf to eliminate false hits. While bsf is listed as a very slow opcode in opcodes.chm, with 6-42 cycles, my tests show that it is now down to 2 cycles. The AF algo is pretty fast, and it seems only Lingo's routine can beat it, and for longer strings only.

NightWare

Quote from: drizz on March 09, 2009, 01:20:35 AM
but _it is_ an empty string
0 must be returned only if the first byte is 0, and no other case, on a manipulated string area you can have no 0 (displacement of the 2nd part+insert) and the algo will fail... coz there is no size limit (+with this logic, you must take care of the possible fullfilled area for ALL of your string functions (copy, insert,...))

Quote from: drizz on March 09, 2009, 01:20:35 AM
"coz of microsoft functions"  ?!??
ok here i've misread agner's algo, it's safe until what's previously said

drizz

Quote from: NightWare0 for empty string, and obviously here it's not the case
Quote from: NightWare on March 09, 2009, 02:29:59 AM
0 must be returned only if the first byte is 0
the first byte IS 0  :dazzled:  :dazzled:  :dazzled:
I'm not talking nonsense situations like jj
The truth cannot be learned ... it can only be recognized.

NightWare

Quote from: NightWare on March 09, 2009, 02:29:59 AM
on a manipulated string area you can have no 0 (displacement of the 2nd part+insert) and the algo will fail... coz there is no size limit (+with this logic, you must take care of the possible fullfilled area for ALL of your string functions (copy, insert,...))
:red oops correction... here the non 0 is possible because I USE a safe area... otherwise you're right drizz, no access violation from agner's algo. but i will continue to use my security area...

lingo

jj2007, If I am not wrong may be there is an error in your test
program for my results:
strlen64Lingo :    370 cycles -> align 1k vs
strlen64Lingo :    369 cycles ->align 0      
and from herge's test:
strlen64Lingo :    96 cycles ->align 1k
strlen64Lingo :    99 cycles  ->align 0
What is this: nonsense or manipulation...  :lol


from your test:
align 1k

lstrlenA return value: 1024
lstrlenA      :       3032 cycle
strlen64Lingo :                        370 cycles
AzmtStrLen1A  :       1407 cycle
AzmtStrLen2A  :       1407 cycle
markl_szlen   :       581 cycles
StringLength_Mmx_Min: 908 cycles
StrSizeA(SSE2):       600 cycles
_strlen (Agner Fog):  635 cycles

align 0

lstrlenA return value: 191
lstrlenA      :       630 cycles
strlen64Lingo :                    369 cycles
AzmtStrLen1A  :       301 cycles
AzmtStrLen2A  :       307 cycles
markl_szlen   :       276 cycles
StringLength_Mmx_Min: 224 cycles
StrSizeA(SSE2):       114 cycles
_strlen (Agner Fog):  103 cycles







align 1k

lstrlenA return value: 1024
lstrlenA      :       1090 cycles
strlen64Lingo :       96 cycles
AzmtStrLen1A  :       1071 cycles
AzmtStrLen2A  :       1074 cycles
markl_szlen   :       416 cycles
StringLength_Mmx_Min: 245 cycles
StrSizeA(SSE2):       226 cycles
_strlen (Agner Fog):  195 cycles

align 0

lstrlenA return value: 191
lstrlenA      :       262 cycles
strlen64Lingo :       99 cycles
AzmtStrLen1A  :       195 cycles
AzmtStrLen2A  :       195 cycles
markl_szlen   :       107 cycles
StringLength_Mmx_Min: 49 cycles
StrSizeA(SSE2):       40 cycles
_strlen (Agner Fog):  51 cycles

sinsi

I get the same numbers as herge did - looks like herge has a new computer :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: drizz on March 09, 2009, 02:47:48 AM
I'm not talking nonsense situations like jj

It was precisely my intention to prove that such algos fail only in nonsense situations. There was an earlier thread where people argued that SSE2 is bad because it may read 16 bytes "beyond", except only "one harmless byte".

Quote from: lingo on March 09, 2009, 03:37:18 AM
jj2007, If I am not wrong may be there is an error in your test
..
What is this: nonsense or manipulation...  :lol

No manipulation. I posted the source above, so you can check. But the result is pretty odd indeed, worth investigating.

herge

 Hi All:

I am having trouble compiling StrLenaLingo.asm


C:\masm32\test>\masm32\bin\ml /c /coff /Zi /Zd /Fl "strlenalingo".asm 
Assembling: strlenalingo.asm
strlenalingo.asm(471) : error A2008: syntax error : xmm
strlenalingo.asm(485) : error A2008: syntax error : xmm
strlenalingo.asm(563) : error A2008: syntax error : movdqa

movdqa   xmm1, [eax]           ; read from nearest preceding boundary << 471

movdqa   xmm1, [eax]           ; read 16 bytes aligned << 485

@@: movdqa xmm0,OWORD ptr [edx]          ; << 563


I don't know much about xmm code.
So I don't know how to fix it.

Regards herge

// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

lingo

drizz,
'strlen64 - requires 32-byte aligned pointer'
should be: strlen64 - requires 16-byte aligned pointer :wink

jj2007

#192
Quote from: jj2007 on March 09, 2009, 07:44:22 AM
Quote from: lingo on March 09, 2009, 03:37:18 AM
jj2007, If I am not wrong may be there is an error in your test
..
What is this: nonsense or manipulation...  :lol

No manipulation. I posted the source above, so you can check. But the result is pretty odd indeed, worth investigating.

I investigated, and Lingo is right, there was an error: His code was always called with the address of the 1024 bytes string, which was kind of unfair :red

Now I took the best of two worlds, i.e. Lingo's speed and Agner's brilliant alignment scheme, and threw them together. The result (shown as strlen32) is, ehm, how to put it: just about good enough for my own private library: :bg

EDIT: Bug fixed - there was a "hole" of 16 bytes between the two parts.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

align 1k
lstrlenA return value: 1024
strlen32 return value: 1024
lstrlenA      :       3096 cycles
strlen32      :       347 cycles
strlen64Lingo :       371 cycles
StrSizeA(SSE2):       594 cycles
_strlen (Agner Fog):  633 cycles

align 0
lstrlenA return value: 191
strlen32 return value: 191
lstrlenA      :       637 cycles
strlen32      :       79 cycles
strlen64Lingo :       92 cycles
StrSizeA(SSE2):       122 cycles
_strlen (Agner Fog):  105 cycles

align 1
lstrlenA return value: 191
strlen32 return value: 191
lstrlenA      :       625 cycles
strlen32      :       78 cycles
strlen64Lingo :       not possible
StrSizeA(SSE2):       139 cycles
_strlen (Agner Fog):  112 cycles


Here is the algo, full code is attached.
@Herge: It won't compile with Masm v614 - use JWasm instead.


strlen32 proc src:DWORD ; jj 9 March 2007, 92 (down from 103) bytes
mov eax, [esp+4] ; get pointer to string: -- this part taken from Agner Fog --------
mov ecx, eax ; copy pointer
pxor xmm0, xmm0 ; set to zero for comparison
and eax, -16 ; align pointer by 16
and ecx, 15 ; lower 4 bits indicate misalignment
pcmpeqb xmm0, [eax] ; read 16 from nearest preceding boundary and compare with zero
pmovmskb edx, xmm0 ; get one bit for each byte result
shr edx, cl ; shift out false bits
shl edx, cl ; shift back again
bsf edx, edx ; find first 1-bit
jnz fdr1 ; found in round 1

add eax, 16 ; correct aligned pointer for bytes already treated above
pxor xmm0, xmm0 ; reset to zero for comparisons below
pxor xmm1, xmm1
; align 16 ; no good, costs about one cycle extra
@@: pcmpeqb xmm0, [eax] ; -------------- this part taken from Lingo --------------
pcmpeqb xmm1, [eax+16] ; ecx is pointer to initial string, 16-byte aligned
por xmm1, xmm0
pmovmskb edx, xmm1
add eax, 32 ; len counter (moving up costs 3 cycles for the 191 byte string)
test edx, edx
jz @B

pmovmskb ecx, xmm0
shl edx, 16
or edx, ecx
bsf edx, edx
sub eax, 32 ; subtract initial 16 bytes
fdr1: sub eax, [esp+4]
add eax, edx
ret 4
strlen32 endp

[attachment deleted by admin]

jj2007

Probably a stupid question:

pxor xmm0, xmm0 ; reset to zero for comparisons below
pxor xmm1, xmm1
if 1 ; crashtest - some values will be incorrect
movdqa xmm1, Minus1
endif
; align 16 ; no good, costs about one cycle extra
@@: pcmpeqb xmm0, [eax] ; -------------- this part taken from Lingo --------------
pcmpeqb xmm1, [eax+16] ; ecx is pointer to initial string, 16-byte aligned
por xmm1, xmm0
pmovmskb edx, xmm1
add eax, 32 ; len counter (moving up costs 3 cycles for the 191 byte string)
test edx, edx
jz @B


Why is it apparently not necessary to reset xmm0 and xmm1 inside the loop? If I insert a movdqa xmm1, Minus1 before the loop, the algo will not work correctly for some strings; but although xmm1 changes a lot inside the loop, results seem not to be affected :dazzled:

jj2007

0.183 cycles per byte seems quite acceptable - a factor 10 faster than lstrlen. In contrast to the P4, here Lingo's algo is a little bit faster for short strings.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
strlen32 codesize=92

align 4k
lstrlenA return value: 4096
strlen32 return value: 4096
strlen32      :       749 cycles
strlen64Lingo :       761 cycles
_strlen (Agner Fog):  1095 cycles

align 1k
lstrlenA return value: 1024
strlen32 return value: 1024
strlen32      :       199 cycles
strlen64Lingo :       200 cycles
_strlen (Agner Fog):  271 cycles

align 0
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       48 cycles
strlen64Lingo :       44 cycles
_strlen (Agner Fog):  91 cycles

align 1
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       48 cycles
strlen64Lingo :       not possible
_strlen (Agner Fog):  64 cycles

align 4
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       48 cycles
strlen64Lingo :       not possible
_strlen (Agner Fog):  64 cycles

align 7
lstrlenA return value: 191
strlen32 return value: 191
strlen32      :       49 cycles
strlen64Lingo :       not possible
_strlen (Agner Fog):  64 cycles