News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Previous topic - Next topic

lingo

"It seems Lingo has a little edge here."
madjj is never tired to convert my code in lame code... :lol
but, 'Lingo has a BIG edge here' :lol....
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
100000000 bytes allocated

codesizes: strlen32sLAME=85, strlen64A=120, strlen64B=84, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
strlen32sLAME        1575 cycles
strlen64LingoB       1532 cycles
_strlen (Agner Fog)  2761 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32sLAME        427 cycles
strlen64LingoB       404 cycles
_strlen (Agner Fog)  708 cycles

-- test 1k, misaligned 15, 1024 bytes
strlen32sLAME        100 cycles
strlen64LingoB       78 cycles
_strlen (Agner Fog)  193 cycles

-- test 0, misaligned 0, 95 bytes
  Masm32 lib szLen   99 cycles
  crt strlen         75 cycles
strlen32sLAME        19 cycles
strlen64LingoB       10 cycles
_strlen (Agner Fog)  19 cycles

-- test 1, misaligned 1, 95 bytes
  Masm32 lib szLen   99 cycles
  crt strlen         79 cycles
strlen32sLAME        19 cycles
strlen64LingoB       10 cycles
_strlen (Agner Fog)  20 cycles

-- test 3, misaligned 3, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         17 cycles
strlen32sLAME        3 cycles
strlen64LingoB       1 cycles
_strlen (Agner Fog)  7 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         19 cycles
strlen32sLAME        20 cycles
strlen64LingoB       17 cycles
_strlen (Agner Fog)  7 cycles

Press any key to exit...




[attachment deleted by admin]

sinsi


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
100000000 bytes allocated

codesizes: strlen32sLAME=85, strlen64A=120, strlen64B=84, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
strlen32sLAME        1454 cycles
strlen64LingoB       1183 cycles
_strlen (Agner Fog)  2759 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32sLAME        393 cycles
strlen64LingoB       330 cycles
_strlen (Agner Fog)  707 cycles

-- test 1k, misaligned 15, 1024 bytes
strlen32sLAME        101 cycles
strlen64LingoB       78 cycles
_strlen (Agner Fog)  193 cycles

-- test 0, misaligned 0, 95 bytes
  Masm32 lib szLen   99 cycles
  crt strlen         75 cycles
strlen32sLAME        19 cycles
strlen64LingoB       13 cycles
_strlen (Agner Fog)  19 cycles

-- test 1, misaligned 1, 95 bytes
  Masm32 lib szLen   99 cycles
  crt strlen         80 cycles
strlen32sLAME        19 cycles
strlen64LingoB       11 cycles
_strlen (Agner Fog)  21 cycles

-- test 3, misaligned 3, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         19 cycles
strlen32sLAME        3 cycles
strlen64LingoB       1 cycles
_strlen (Agner Fog)  7 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         19 cycles
strlen32sLAME        21 cycles
strlen64LingoB       18 cycles
_strlen (Agner Fog)  7 cycles

Press any key to exit...

This is getting ridiculous. "1 cycles" - now we need logic to print "1 cycle"  :P

Since we are on the cutting-edge here, can we have some native 64-bit code for 64-bit windows please? I guess we'll need 'timers64.inc' or something...

horse>die>flog  :bg

edit: what horrible code in the .asm - most doesn't even get used, what a nightmare to follow...
Quote; this is a comment with trailing blanks
is that zen? :bdg
Light travels faster than sound, that's why some people seem bright until you hear them.

lingo

"can we have some native 64-bit code for 64-bit windows please?"
Thanks, it works for me fine...: :wink
Usage:
lea rax, szBuffer
call strlen64


align 16
db 8Dh,0A4h,24h,0,0,0,0,8Dh,48h,0,10h
strlen64:
pop rcx
movdqu         xmm2, [rax]
pxor xmm0, xmm0
pcmpeqb         xmm2, xmm0
pxor xmm1, xmm1
pmovmskb edx, xmm2
test edx, edx
jz @f
bsf eax, edx
jmp rcx
@@:
lea rcx, [rax+16]
and rax, -16
@@:
pcmpeqb         xmm0, [rax+16]
pcmpeqb         xmm1, [rax+32]
por xmm1, xmm0
add rax, 32
pmovmskb edx, xmm1
test edx, edx
jz @b
shl edx, 16
sub rax, rcx
pmovmskb ecx, xmm0
or edx, ecx
mov rcx, [rsp-8]
bsf edx, edx
add rax, rdx
jmp rcx



jj2007

Quote from: sinsi on March 13, 2009, 06:34:06 AM

edit: what horrible code in the .asm - most doesn't even get used, what a nightmare to follow...


You are perfectly right. I took over some "organically grown" code, and certainly have not added to its readability. But the purpose was not beauty but rather to find those 88 bytes or so that would be considered good enough to replace len() for the next ten years. I know it is a bad habit to reopen old threads, but this one was unfinished business - now we have two algos that do the job. Lingo's is an edge faster, but I hate paying royalties to people who call me madjj :toothy

Quote; this is a comment with trailing blanks
is that zen? :bdg

Well, kind of. It is a leftover of my attempt to teach RichMasm to autoformat code that comes along with spaces and varying tab sizes. Example:
strlen64:
pop rcx
movdqu         xmm2, [rax]
pxor xmm0, xmm0
pcmpeqb         xmm2, xmm0
pxor xmm1, xmm1
pmovmskb edx, xmm2
test edx, edx
jz @f
bsf eax, edx
jmp rcx

This is actually nicely formatted, in comparison to many other snippets I have seen, but it forces the eye to jump a lot from left to right. Here is the autoformatted version:
strlen64:
pop rcx
movdqu xmm2, [rax]
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pxor xmm1, xmm1
pmovmskb edx, xmm2
test edx, edx
jz @f
bsf eax, edx
jmp rcx


The first one looks more beautiful, the second one is less tiresome for the eyes. A matter of taste, I guess.

sinsi

jj, you are a mad bastard mate  :bg

Switch to using ml64, no worries about cpu types - don't they all support sse3 at least?
Opening old threads? I have no problem with that if it's relevant (I do it with threads I've started - saves remembering) and saves getting 6 million search results...

lingo, the trouble is building a native pe64 and getting some timings from it. If MichaelW doesn't mind, maybe someone can change the timers.asm...when I'm sober I might try (oops then it will never happen).
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: lingo on March 13, 2009, 06:01:51 AM
"It seems Lingo has a little edge here."
madjj is never tired to convert my code in lame code... :lol

The only "conversion" I have ever made to your code is to pass the source over the stack, to make the benchmarks comparable to the others.

Quote
but, 'Lingo has a BIG edge here' :lol....

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

-- test 16k, misaligned 0, 16384 bytes
strlen32sLAME        4797 cycles   <**********
strlen64LingoB       5045 cycles   <**********
_strlen (Agner Fog)  9599 cycles
...
-- test 3, misaligned 3, 15 bytes
  Masm32 lib szLen   47 cycles
  crt strlen         35 cycles
strlen32sLAME        11 cycles   <**********
strlen64LingoB       15 cycles   <**********
_strlen (Agner Fog)  23 cycles


Ever heard about hardware differences?
:lol

sinsi

I fix a lot of computers, and most of them are as the following:

AMD Athlon(tm) XP  2600+ (SSE1)
100000000 bytes allocated
ERROR in StrSizeA at ebx=4096: 4101 bytes instead of 4096
ERROR in strlen32c at ebx=4096: 4101 bytes instead of 4096
ERROR in strlen32sLAME at ebx=4096: 4101 bytes instead of 4096
ERROR in strlen64B at ebx=4096: 4101 bytes instead of 4096

codesizes: strlen32sLAME=85, strlen64A=120, strlen64B=84, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
strlen32sLAME        22 cycles
strlen64LingoB       2758 cycles
_strlen (Agner Fog)  22683 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32sLAME        26 cycles
strlen64LingoB       729 cycles
_strlen (Agner Fog)  5713 cycles

-- test 1k, misaligned 15, 1024 bytes
strlen32sLAME        240 cycles
strlen64LingoB       220 cycles
_strlen (Agner Fog)  1476 cycles

-- test 0, misaligned 0, 95 bytes
  Masm32 lib szLen   158 cycles
  crt strlen         107 cycles
strlen32sLAME        68 cycles
strlen64LingoB       40 cycles
_strlen (Agner Fog)  192 cycles

-- test 1, misaligned 1, 95 bytes
  Masm32 lib szLen   158 cycles
  crt strlen         114 cycles
strlen32sLAME        71 cycles
strlen64LingoB       39 cycles
_strlen (Agner Fog)  161 cycles

-- test 3, misaligned 3, 15 bytes
  Masm32 lib szLen   26 cycles
  crt strlen         25 cycles
strlen32sLAME        60 cycles
strlen64LingoB       32 cycles
_strlen (Agner Fog)  50 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   26 cycles
  crt strlen         25 cycles
strlen32sLAME        26 cycles
strlen64LingoB       33 cycles
_strlen (Agner Fog)  73 cycles

Press any key to exit...


I realise that we're looking at sse2+ for ourselves, but most of them I fix are like this (p4's even without hypershite, athlons/durons).
Curious, why do I not get c0000097 (invalid opcode) running on the athlon?
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

#247
Quote from: sinsi on March 13, 2009, 09:33:48 AM
I fix a lot of computers, and most of them are as the following:

AMD Athlon(tm) XP  2600+ (SSE1)


I realise that we're looking at sse2+ for ourselves, but most of them I fix are like this (p4's even without hypershite, athlons/durons).
Curious, why do I not get c0000097 (invalid opcode) running on the athlon?
Good question indeed. A library version should include a check for the SSE version.

For the madmen, here a new version. It includes now the pretty competitive algo posted by Nightware, 10.03.2009, and Lingo's latest strlen64B with that cute pop ecx, jmp ecx trick. What a pity that it is a bit lame on an ordinary P4 :bg

EDIT: See below for current version.

[attachment deleted by admin]

herge


Hi jj2007:


The latest results from herge.



Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32s=85, strlen64A=120, strlen64B=84, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
strlen32s            1516 cycles
strlen64LingoB       1221 cycles
NWStrLen             1193 cycles
_strlen (Agner Fog)  2804 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32s            424 cycles
strlen64LingoB       322 cycles
NWStrLen             333 cycles
_strlen (Agner Fog)  722 cycles

-- test 1k, misaligned 15, 1024 bytes
strlen32s            122 cycles
strlen64LingoB       79 cycles
NWStrLen             102 cycles
_strlen (Agner Fog)  196 cycles

-- test 0, misaligned 0, 95 bytes
  Masm32 lib szLen   101 cycles
  crt strlen         62 cycles
strlen32s            43 cycles
strlen64LingoB       11 cycles
NWStrLen             15 cycles
_strlen (Agner Fog)  19 cycles

-- test 1, misaligned 1, 95 bytes
  Masm32 lib szLen   99 cycles
  crt strlen         100 cycles
strlen32s            19 cycles
strlen64LingoB       11 cycles
NWStrLen             21 cycles
_strlen (Agner Fog)  20 cycles

-- test 7, misaligned 7, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         15 cycles
strlen32s            21 cycles
strlen64LingoB       18 cycles
NWStrLen             25 cycles
_strlen (Agner Fog)  6 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   20 cycles
  crt strlen         15 cycles
strlen32s            2 cycles
strlen64LingoB       2 cycles
NWStrLen             9 cycles
_strlen (Agner Fog)  7 cycles

Press any key to exit...


Regards herge

// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

sinsi

Sorry jj but this .asm is getting way too complex and hard to follow. How about using .inc's for each algo, then we can get rid of the old, slow ones.
Cmon man, there are 16(?) procs in there and we are testing 4 of them...a 64k source file is too much. My brain gets lost in the labyrinth  :bdg
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

#250
Quote from: sinsi on March 13, 2009, 10:10:47 AM
Sorry jj but this .asm is getting way too complex and hard to follow. How about using .inc's for each algo, then we can get rid of the old, slow ones.
Cmon man, there are 16(?) procs in there and we are testing 4 of them...a 64k source file is too much. My brain gets lost in the labyrinth  :bdg

OK, I put an effort into simplifying it. 27k instead of 64...
For the legal department: My latest version contains ideas ruthlessly stolen from Lingo:
strlen32s proc src:DWORD ; with lots of inspiration from Lingo, NightWare and Agner Fog
pop eax ; trash the return address to pop...
pop eax ; ...the src pointer
...
jmp dword ptr [esp-8] ; Lingo style equivalent to ret 4 ;-)
strlen32s endp


Although I don't quite understand why he wastes opcodes:

mov ecx, [esp-8]  < not needed
bsf edx, edx
add eax, edx
jmp [esp-8]         ; ecx < not needed
strlen64B endp


EDIT: Replaced attachment, with very minor modifications (2 bytes less for strlen32s :bg).
EDIT(2): New version - 80 bytes, now as fast as Lingo's algo even on modern hardware... :green

[attachment deleted by admin]

lingo

Quote"It seems Lingo has a little edge here."
madjj is never tired to convert my code in lame code...

The only "conversion" I have ever made to your code is to pass the source over the stack, to make the benchmarks comparable to the others.

Really?
Do you want to read together?..."to make it comparable to the others"  :lol

Lingo's code:

align 16
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ; align 15
strlen64xB proc szBuffer : dword ; old 85 bytes version, before [url=http://www.masm32.com/board/index.php?topic=1807.msg81266#msg81266]this post[/url]; commented by jj
;db 0cch
mov eax, [esp+4] ; pointer to src
movdqu xmm2, [eax] ; move 16 bytes into xmm2
pxor xmm0, xmm0 ; zero for comparison
pcmpeqb xmm2, xmm0 ; set bytes in xmm2 to FF if nullbytes found in xmm2
pxor xmm1, xmm1 ; zero for comparison
pmovmskb edx, xmm2 ; set byte mask in edx
lea ecx, [eax+16] ; ecx=eax+16
test edx, edx
jnz Ex_0
and eax, -16 ; align eax to para
@@:
add eax, 32
pcmpeqb xmm0, [eax-16]
pcmpeqb xmm1, [eax]
por xmm1, xmm0
pmovmskb edx, xmm1
test edx, edx
jz @B

shl edx, 16
sub eax, ecx
pmovmskb ecx, xmm0
;sub eax, [esp+4]
or edx, ecx
;sub eax, 16
bsf edx, edx
add eax, edx
ret 4
align 16
Ex_0:
bsf eax, edx
ret 4
strlen64xB endp



and LAME code:



align 16 ; jj2007, 12 March 2007, 85 bytes; 0.176 cycles/byte at 16k on Celeron M (0.3 on P4)
strlen32s proc src:DWORD ; with lots of inspiration from Lingo, NightWare and Agner Fog
mov eax, [esp+4] ; get pointer to string
movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
pxor xmm0, xmm0 ; zero for comparison (no longer needed for xmm1 - thanks, NightWare)
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we can return the index in eax

@@:
push ecx ; all registers preserved, except eax = return value
push edx ; eax will be pointer to initial string, 16-byte aligned

mov ecx, [esp+12] ; get pointer to string
and ecx, -16 ; align initial pointer to 16-byte boundary
lea eax, [ecx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)

@@: pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B

pmovmskb ecx, xmm0 ; set byte mask in ecx (has to be repeated, sorry)
shl edx, 16 ; create space for the ecx bytes
or edx, ecx ; combine xmm0 and xmm1 results
bsf edx, edx ; bit scan for the index
lea eax, [eax+edx-32] ; add scan index, subtract initial bytes

pop edx

sub eax, [esp+8]

pop ecx

Lt16: ret 4
strlen32s endp



" ; all registers preserved, except eax = return value"
Why except eax ?   You must preserve and eax too...to make the diagnosis pretty clear  :lol

"Although I don't quite understand why he wastes opcodes:"

I like your RichMasm editor and want to see more functionality in it..for instance: asm code highlighting,
so IMO will be better to spend your efforts to get there rather than to "understand why he wastes opcodes"... :lol


and later again...: :lol

Lingo's code:

align 16
db 8Dh,0A4h,24h,0,0,0,0,8Dh,48h,0,10h
strlen64B proc szBuffer : dword
pop ecx
pop eax
movdqu xmm2, [eax]
pxor xmm0, xmm0
pcmpeqb xmm2, xmm0
pxor xmm1, xmm1
pmovmskb edx, xmm2
test edx, edx
jz @f
bsf eax, edx
jmp ecx
@@:
lea ecx, [eax+16]
and eax, -16
@@:
pcmpeqb xmm0, [eax+16]
pcmpeqb xmm1, [eax+32]
por xmm1, xmm0
add eax, 32
pmovmskb edx, xmm1
test edx, edx
jz @B
shl edx, 16
sub eax, ecx
pmovmskb ecx, xmm0
or edx, ecx
mov ecx, [esp-8]
bsf edx, edx
add eax, edx
jmp ecx
strlen64B endp



and LAME code:



align 16 ; jj2007, 13 March 2007, 82 bytes; 0.176 cycles/byte at 16k on Celeron M (0.3 on P4)
strlen32s proc src:DWORD ; with lots of inspiration from Lingo, NightWare and Agner Fog
pop eax ; trash the return address to pop...
pop eax ; ...the src pointer
movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
pxor xmm0, xmm0 ; zero for comparison (no longer needed for xmm1 - thanks, NightWare)
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we are done

mov edx, [esp-4] ; get pointer to string
and edx, -16 ; align initial pointer to 16-byte boundary
lea eax, [edx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)

@@: pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B
        sub eax, [esp-4] ; subtract original src pointer
push ecx ; all registers preserved, except edx and eax = return value
shl edx, 16 ; create space for the ecx bytes
pmovmskb ecx, xmm0 ; set byte mask in ecx (has to be repeated, sorry)
or edx, ecx ; combine xmm0 and xmm1 results
bsf edx, edx ; bit scan for the index
pop ecx
lea eax, [eax+edx-32] ; add scan index
Lt16:
jmp dword ptr [esp-8] ; Lingo style equivalent to ret 4 ;-)
strlen32s endp




"We don't live in a perfect world so we cannot garrantee that the forum will always be "idiot free"
but we do our best to keep it friendly..."
by Hutch  :lol








jj2007

Quote from: lingo on March 13, 2009, 02:07:30 PM
Quote
madjj is never tired to convert my code in lame code...

The only "conversion" I have ever made to your code is to pass the source over the stack, to make the benchmarks comparable to the others.

Really?
Do you want to read together?..."to make it comparable to the others"

Lingo, I had understood that you had accused me of modifying your algos to make them slower. Misunderstanding, sorry. I have never denied that I stole ideas from you. Where else should I get good ideas? :bg
(although some of your code looks suspiciously similar to what NightWare posted a long time ago ::))

Quote
" ; all registers preserved, except eax = return value"
Why except eax ?   You must preserve and eax too...to make the diagnosis pretty clear  :lol

Finally a glimpse of humour, voilà! NightWare preserves ecx and edx, too. But I dropped support for "safe edx". IMHO, ecx should be preserved because it is so often used as a counter.

Quote
"Although I don't quite understand why he wastes opcodes:"

I like your RichMasm editor and want to see more functionality in it..for instance: asm code highlighting,
so IMO will be better to spend your efforts to get there rather than to "understand why he wastes opcodes"... :lol


Automatic highlighting is a matter of taste. I prefer to see (mostly) plain text, so that I can highlight myself the few areas where I still have a problem to solve. And then, it seems as if there is a conflict between automatic highlighting and making full use of the RichEdit features, such as hyperlinks ("before this post; ", see below)

As to the wasted opcodes, check yourself (even with Notepad :wink); two bytes less, and one or two cycles faster.


PBrennick

Okay, Lingo,

A little help here. Who is the better coder, Lingo or Lingo?

Paul
The GeneSys Project is available from:
The Repository or My crappy website

lingo

"NightWare preserves ecx and edx, too..."
but Hutch doesn't... :lol

"And then, it seems as if there is a conflict between automatic highlighting and making full use of the RichEdit features, such as hyperlinks .."
It seems...  you prefer to steal other's ideas and algos (for example: from lesson 35, Iczelion) rather than to  use your own automatic highlighting algo for .RTF  files to resolve the problems. :wink


"and one or two cycles faster."
Read A.Fog: Which one is faster - jump to register or jump to memory

For me:  mov ecx, [esp-8] ; this instruction is for free!!! :lol
       .......   
               jmp  ecx

is faster than
   jmp dword ptr [esp-8]

If you disagree just ask herge or sinsi to make tests for you (due to archaic type of your  CPUs).  :lol
Theirs CPUs are OK.