New strlen algo using MMX(+), 15% faster than Agner Fog's for big strings:
Agner Fog's : 9406 cycles resulting in : 9999
MMX strlen : 7855 cycles resulting in : 9999
On an Athlon XP 1800+.
Test and optimize!!!
[attachment deleted by admin]
Agner Fog's : 8273 cycles resulting in : 9999
MMX strlen : 6290 cycles resulting in : 9999
Tested on a P4 2.66 GHz, OS: Win XP Home SP2
Now on a PIII 600 MHz:
Agner Fog's : 10385 cycles resulting in : 9999
MMX strlen : 3793 cycles resulting in : 9999
This time I'm really impressed. :eek
Results here on a P4 2.8 GHz Prescott.
Agner Fog's : 8435 cycles resulting in : 9999
MMX strlen : 6799 cycles resulting in : 9999
Ok, I have had a quick play with the Agner algo, replaced a couple of immediates with registers and unrolled the main loop by 4. These are the results I get with it on this Prescott.
Agner Fog's : 6888 cycles resulting in : 9999
MMX strlen : 6803 cycles resulting in : 9999
It has closed the gap up, but Eduardo's MMX version is still faster. The main advantage of the Agner algo is that it is still 486-compatible. It is unfortunate that later hardware has not kept supporting MMX at the lowest hardware level and its performance advantage is being wasted. Perhaps an XMM version will have some advantage.
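For what it's worth, here is a minimal sketch of the XMM idea, assuming SSE2, a 16-byte aligned buffer and enough padding that reads past the terminator cannot fault; the xmmstrlen name and labels are mine, not from any attached source:

xmmstrlen proc item:DWORD
    mov edx, item            ; edx = start of the string
    mov eax, edx
    pxor xmm1, xmm1          ; xmm1 = 16 zero bytes
@@:
    movdqa xmm0, [eax]       ; read 16 bytes (aligned)
    pcmpeqb xmm0, xmm1       ; FFh in every position that held a zero byte
    pmovmskb ecx, xmm0       ; collect one bit per byte
    add eax, 16
    test ecx, ecx
    jz @B                    ; no zero byte in this block, keep going
    sub eax, 16
    bsf ecx, ecx             ; index of the first zero byte in the block
    add eax, ecx
    sub eax, edx             ; length = end - start
    ret
xmmstrlen endp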
Compliments on a fast string length algo. :U
[attachment deleted by admin]
From test2.zip, running on a P3:
Agner Fog's : 8324 cycles resulting in : 9999
MMX strlen : 3796 cycles resulting in : 9999
Surely XMM can do better, otherwise it would seem that Intel has taken a step backwards.
Hmmm, the code does look similar to donkey's code.
There's another thread on strlen. I'd like to see someone who can beat lingo12's code. :toothy
Agner Fog's : 9417 cycles resulting in : 9999
MMX strlen : 7862 cycles resulting in : 9999
And here is what really should be considered, because most string searching isn't aligned (the same goes for compares in compression):
Agner Fog's : 10665 cycles resulting in : 9998
MMX strlen : 9119 cycles resulting in : 9998
starting the lookup at 403001 instead of 403000, on an Athlon XP 1700+.
And test2 from Hutch:
Agner Fog's : 8794 cycles resulting in : 9999
MMX strlen : 7859 cycles resulting in : 9999
And unaligned:
Agner Fog's : 9422 cycles resulting in : 9998
MMX strlen : 9119 cycles resulting in : 9998
And here a total mess: the string and the buffer of 0s both unaligned:
Agner Fog's : 9422 cycles resulting in : 9998
MMX strlen : 9798 cycles resulting in : 9998
From test2, running on an AMD Venice 3.2 GHz:
Agner: 5867
MMX: 3784
From test2.zip, running on an Intel Pentium 4 2.53 GHz:
Agner Fog's : 6899 cycles resulting in : 9999
MMX strlen : 6351 cycles resulting in : 9999
First, an easy save of a few cycles: I never use frame stuff unless I need to push values, and in that case I still manually code the push ebp / mov ebp,esp / ... / pop ebp / ret sequence myself.
I read the pushed value directly (skipping over the return address) with [ESP+4] (this gets hectic if you do need to push, though).
Switch this:
mmxlen proc item:DWORD
mov edx, item
with this:
mmxlen:
mov edx, [esp+4]
And switch this:
sub eax, item
mov result, eax
ret
mmxlen endp
with this:
sub eax, [esp+4]
ret 4
(You could reinsert the mov result,eax part; I just don't see why, in real-life practice rather than imaginary 9999+ character string testing.)
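Put together, a frameless routine looks like the sketch below. It is only a trivial byte scanner of my own, to show the shape (not the MMX routine), assuming a single DWORD argument and a caller that expects the callee to clean the stack:

bytelen_noframe:
    mov edx, [esp+4]         ; the argument sits right above the return address
    mov eax, edx
@@:
    cmp byte ptr [eax], 0    ; scan for the terminator
    je @F
    inc eax
    jmp @B
@@:
    sub eax, edx             ; length = end - start
    ret 4                    ; pop the single argument on return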
Now I have been thinking: a lot of assembly programmers are attracted to speed, but most of these routines (Agner's, the MMX len and some others) have some init penalty and only really start making up ground on strings exceeding 60-100+ characters. How often are your strings more than 256, or anywhere near 9999, characters long? So to some extent it's counterproductive. I noticed this when I wanted to do a hex string to DWORD conversion and made it so this code worked:
HASH_IT:
;ESI HAS STRING EAX=RETURN
XOR EDX,EDX
MOV DX,W[ESI]          ; first two hex characters as a word index
MOV ECX,[HH_TAB+EDX]   ; table lookup, the converted byte ends up in CL
MOV AH,CL
MOV DX,W[ESI+2]        ; characters 3-4
MOV ECX,[HH_TAB+EDX]
MOV AL,CL              ; AX = high word of the result
SHL EAX,16
MOV DX,W[ESI+4]        ; characters 5-6
MOV ECX,[HH_TAB+EDX]
MOV AH,CL
MOV DX,W[ESI+6]        ; characters 7-8
MOV ECX,[HH_TAB+EDX]
MOV AL,CL              ; EAX = the full DWORD
RET
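For context, a guess at how a lookup table like HH_TAB could be laid out and filled. I'm inferring from the code above that it is a 64K byte table indexed by the two ASCII characters read as a little-endian word; the InitHHTab name, the layout and the uppercase-only digits are my own assumptions, not the code from the thread:

.data
HexChars db "0123456789ABCDEF"
.data?
HH_TAB db 65536 dup (?)        ; one byte for every possible character pair
.code
InitHHTab proc uses ebx esi edi
    xor esi, esi               ; esi = value of the first digit (0..15)
FirstDigit:
    xor edi, edi               ; edi = value of the second digit (0..15)
SecondDigit:
    movzx eax, byte ptr [HexChars+esi]  ; ASCII of the first digit
    movzx ebx, byte ptr [HexChars+edi]  ; ASCII of the second digit
    shl ebx, 8
    or ebx, eax                ; index = second char * 256 + first char
    mov eax, esi
    shl eax, 4
    or eax, edi                ; entry = first digit * 16 + second digit
    mov byte ptr [HH_TAB+ebx], al
    inc edi
    cmp edi, 16
    jb SecondDigit
    inc esi
    cmp esi, 16
    jb FirstDigit
    ret
InitHHTab endp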
Of course I had to build a huge table, which takes an asinine amount of cycles; it's almost like the jump table debate in a window proc. I just want people to remember it's not about scanning huge arrays for 0s; after a while it's not even practical to keep data in string format that way.
With that in mind, the best one I could come up with was just a heavily unrolled byte counter (12 checks before re-gaining); it's the one that was quicker up to a point (around 140-152 characters). A few different string formats would have been better than a single 0 terminator. Dual zeros would allow you to skip every other byte until you hit a zero, then just check whether the previous byte was also zero to get your length (see the sketch after the snippet below).
I wanted to come up with a better algo and couldn't; everything I tried never caught the MMX or Agner's array scanners. I don't know why I even tried, to be honest; I never use calls to count strings, I just throw the pointer in EAX and...
DEC EAX
@:
INC EAX
CMP BYTE PTR [EAX],0
JNZ @B
SUB EAX,[original value]
Why? Because it's manageable, and up to a certain point even a call to a better routine would be a penalty.
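Coming back to the dual-zero idea above, this is a rough sketch of how I read that suggestion (it assumes the string really ends in two zero bytes; the DualZeroLen name is mine):

DualZeroLen proc item:DWORD
    mov eax, item
    cmp byte ptr [eax], 0      ; empty string?
    je Done
@@:
    add eax, 2                 ; test only every second byte
    cmp byte ptr [eax], 0
    jne @B
    cmp byte ptr [eax-1], 0    ; was the skipped byte the real terminator?
    jne Done
    dec eax
Done:
    sub eax, item              ; length = end - start
    ret
DualZeroLen endp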
Well, here's my best effort; it dominates on small strings. Laugh at me if you want, I know it's kind of kiddy, but it beat the hell out of doing things like small-scale bit-mask zero testing (like MOV AX,W[ECX]; AND (or TEST) AX,0808h); in fact that was slower than the single byte scan. It's in G-ASM format and all byte code to save space.
_PopulationA: ;INVOKE _PopulationA,ADDR of STRING
DB 8Bh,44h,24h,04h,80h,38h,00h
DB 74h,7Ah,80h,78h,01h,00h
DB 74h,73h,80h,78h,02h,00h
DB 74h,68h,80h,78h,03h,00h
DB 74h,5Dh,80h,78h,04h,00h
DB 74h,52h,80h,78h,05h,00h
DB 74h,47h,80h,78h,06h,00h
DB 74h,3Ch,80h,78h,07h,00h
DB 74h,31h,80h,78h,08h,00h
DB 74h,26h,80h,78h,09h,00h
DB 74h,1Bh,80h,78h,0Ah,00h
DB 74h,10h,80h,78h,0Bh,00h
DB 74h,05h,83h,0C0h,0Ch,0EBh,0B4h
DB 83h,0C0h,0Bh,0EBh,2Eh,83h,0C0h,0Ah
DB 0EBh,29h,83h,0C0h,09h,0EBh,24h
DB 83h,0C0h,08h,0EBh,1Fh,83h,0C0h,07h
DB 0EBh,1Ah,83h,0C0h,06h,0EBh,15h
DB 83h,0C0h,05h,0EBh,10h,83h,0C0h,04h
DB 0EBh,0Bh,83h,0C0h,03h,0EBh,06h
DB 83h,0C0h,02h,0EBh,01h,40h
DB 2Bh,44h,24h,04h
RET 4
I stopped at 12 checks before a re-gain because this stops shy of long jumps (and really long code).
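For anyone who would rather not decode the DB lines, this is roughly the shape of the routine in ordinary source form. It is a 4-way unrolled sketch of the same idea rather than a byte-for-byte disassembly (the 12-way version just repeats the pattern; the ByteLen4 name and labels are mine):

ByteLen4 proc item:DWORD
    mov eax, item
ScanBlock:
    cmp byte ptr [eax], 0      ; check four bytes per pass
    je Found0
    cmp byte ptr [eax+1], 0
    je Found1
    cmp byte ptr [eax+2], 0
    je Found2
    cmp byte ptr [eax+3], 0
    je Found3
    add eax, 4                 ; no terminator here, re-gain and go again
    jmp ScanBlock
Found3:
    inc eax
Found2:
    inc eax
Found1:
    inc eax
Found0:
    sub eax, item              ; length = end - start
    ret
ByteLen4 endp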
Maybe someday we will get lucky and they will put in a direct processor instruction to replace the slow string instructions (like SCUZB, scan until zero byte), but I wouldn't hold my breath.
Good job on the MMX routine, I would never have thought of it.
NOTE* Retested with the timers in the example...
I didn't try to optimize it any, since it's meant to be used kind of "AS IS".
In the first run (the short string) it leads:
Agner Fog's : 131 cycles resulting in : 107
MMX strlen : 130 cycles resulting in : 107
unrolled byte count : 125 cycles resulting in : 107
Press enter to exit...
Agner Fog's : 9674 cycles resulting in : 9999
MMX strlen : 8089 cycles resulting in : 9999
unrolled byte count : 10487 cycles resulting in : 9999
Press enter to exit...
UNALIGNED...
Agner Fog's : 10327 cycles resulting in : 9999
MMX strlen : 8091 cycles resulting in : 9999
unrolled byte count : 10488 cycles resulting in : 9999
Press enter to exit...
Jeez it did better than I thought it would...
-OS
OS, a strlen for big strings isn't very useful; it's just a way to practice.
Others: I found the other topic about strlen; donkey's algo is really very close to mine. I took some of the algos to join the test. I didn't read the full topic, so I don't know if I have the most recent versions.
I made another version yesterday, plus a version for XMM (included in the source).
I can't test the XMM one because I don't have SSE2 yet.
Test routines for correctness:
lszLenSSE 9999
Ratch 9999
szLength 9999
szLen 9999
Jens_fast 9999
mmxlen 9999
StrLen2 9999
Proc/Byte 9999
========= ====
Misaligned by 0 bytes:
lszLenSSE 5559
Ratch 8167
szLength 7318
szLen 14008
Jens_fast 8820
mmxlen 5038
StrLen2 8785
Misaligned by 1 bytes:
lszLenSSE 5813
Ratch 9102
szLength 7330
szLen 14010
Jens_fast 9448
mmxlen 5089
StrLen2 9411
Misaligned by 2 bytes:
lszLenSSE 5813
Ratch 9102
szLength 7332
szLen 14008
Jens_fast 9444
mmxlen 5094
StrLen2 9410
Misaligned by 3 bytes:
lszLenSSE 5812
Ratch 9103
szLength 7330
szLen 14009
Jens_fast 9446
mmxlen 5095
StrLen2 9409
Press enter to exit...
[attachment deleted by admin]
A small fix: in the code that aligns the data in the last algo I used the wrong register. It doesn't affect the result or timing for big strings, but it gives a wrong result for small ones.
Also, the XMM version is finished; please, someone test it for me.
[attachment deleted by admin]
On an A64 3200+ Venice (939P, 2.0 GHz):
Proc/Byte 9999
========= ====
Misaligned by 0 bytes:
lszLenSSE 3787
Ratch 7536
szLen 10283
szLength 6277
Jens_fast 8202
mmxlen 3158
StrLen2 5862
Misaligned by 1 bytes:
lszLenSSE 3788
Ratch 7535
szLen 10282
szLength 6281
Jens_fast 8284
mmxlen 3206
StrLen2 6284
Misaligned by 2 bytes:
lszLenSSE 3788
Ratch 7535
szLen 10284
szLength 6282
Jens_fast 8284
mmxlen 3203
StrLen2 6284
Misaligned by 3 bytes:
lszLenSSE 3788
Ratch 7536
szLen 10282
szLength 6284
Jens_fast 8284
mmxlen 3202
StrLen2 6284
Press enter to exit...
And with the XMM SSE2 version:
Proc/Byte 9999
========= ====
Misaligned by 0 bytes:
lszLenSSE 3787
Ratch 7553
szLen 10294
szLength 6275
Jens_fast 8207
mmxlen 3170
xmmlen 2220
StrLen2 5863
Misaligned by 1 bytes:
lszLenSSE 3793
Ratch 7548
szLen 10288
szLength 6281
Jens_fast 8296
mmxlen 3207
xmmlen 2239
StrLen2 6284
Misaligned by 2 bytes:
lszLenSSE 3800
Ratch 7536
szLen 10296
szLength 6287
Jens_fast 8289
mmxlen 3203
xmmlen 2240
StrLen2 6289
Misaligned by 3 bytes:
lszLenSSE 3788
Ratch 7543
szLen 10293
szLength 6291
Jens_fast 8296
mmxlen 3203
xmmlen 2231
StrLen2 6290
Press enter to exit...
Now it is fast :8)