New strlen algo using MMX(+), 15% faster than Agner Fog's for big strings:
Agner Fog's : 9406 cycles resulting in : 9999
MMX strlen : 7855 cycles resulting in : 9999
On an Athlon XP 1800+.
Test and optimize!!!
[attachment deleted by admin]
Agner Fog's : 8273 cycles resulting in : 9999
MMX strlen : 6290 cycles resulting in : 9999
Tested on a P4 2.66 GHz, OS: Win XP Home SP2
Now on a PIII 600 MHz:
Agner Fog's : 10385 cycles resulting in : 9999
MMX strlen : 3793 cycles resulting in : 9999
This time I'm really impressed. :eek
Results here on a P4 2.8 GHz Prescott.
Agner Fog's : 8435 cycles resulting in : 9999
MMX strlen : 6799 cycles resulting in : 9999
Ok, I have had a quick play with the Agner algo, replaced a couple of immediates with registers and unrolled the main loop by 4. These are the results I get with it on this Prescott.
Agner Fog's : 6888 cycles resulting in : 9999
MMX strlen : 6803 cycles resulting in : 9999
It has closed the gap up, but Eduardo's MMX version is still faster. The main advantage of the Agner algo is that it is still 486-compatible. It is unfortunate that later hardware has not kept supporting MMX at the lowest hardware level and its performance advantage is being wasted. Perhaps an XMM version will have some advantage.
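For what it's worth, here is a minimal sketch of the XMM idea, assuming SSE2, a 16-byte aligned buffer and enough padding that reads past the terminator cannot fault; the xmmstrlen name and labels are mine, not from any attached source:

xmmstrlen proc item:DWORD
    mov edx, item            ; edx = start of the string
    mov eax, edx
    pxor xmm1, xmm1          ; xmm1 = 16 zero bytes
@@:
    movdqa xmm0, [eax]       ; read 16 bytes (aligned)
    pcmpeqb xmm0, xmm1       ; FFh in every position that held a zero byte
    pmovmskb ecx, xmm0       ; collect one bit per byte
    add eax, 16
    test ecx, ecx
    jz @B                    ; no zero byte in this block, keep going
    sub eax, 16
    bsf ecx, ecx             ; index of the first zero byte in the block
    add eax, ecx
    sub eax, edx             ; length = end - start
    ret
xmmstrlen endp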
Compliments on a fast string length algo. :U
[attachment deleted by admin]
From test2.zip, running on a P3:
Agner Fog's : 8324 cycles resulting in : 9999
MMX strlen : 3796 cycles resulting in : 9999
Surely XMM can do better, otherwise it would seem that Intel has taken a step backwards.
Hmmm, the code does look similar to donkey's code.
There's another thread on strlen. I'd like to see someone who can beat lingo12's code. :toothy
Agner Fog's : 9417 cycles resulting in : 9999
MMX strlen : 7862 cycles resulting in : 9999
And here is what really should be considered, because most string searching isn't aligned (the same goes for compares in compression):
Agner Fog's : 10665 cycles resulting in : 9998
MMX strlen : 9119 cycles resulting in : 9998
starting the lookup at 403001 instead of 403000, on an Athlon XP 1700+.
And test2 from Hutch:
Agner Fog's : 8794 cycles resulting in : 9999
MMX strlen : 7859 cycles resulting in : 9999
And unaligned:
Agner Fog's : 9422 cycles resulting in : 9998
MMX strlen : 9119 cycles resulting in : 9998
And here a total mess: the string and the buffer of 0s both unaligned:
Agner Fog's : 9422 cycles resulting in : 9998
MMX strlen : 9798 cycles resulting in : 9998
From test2, running on an AMD Venice 3.2 GHz:
Agner: 5867
MMX: 3784
From test2.zip, running on an Intel Pentium 4 2.53 GHz:
Agner Fog's : 6899 cycles resulting in : 9999
MMX strlen : 6351 cycles resulting in : 9999
First, an easy save of a few cycles: I never use frame stuff unless I need to push values, and in that case I still manually code the push ebp / mov ebp,esp / ... / pop ebp / ret sequence myself.
I read the pushed value directly (skipping over the return address) with [ESP+4] (this gets hectic if you do need to push, though).
Switch this:
mmxlen proc item:DWORD
mov edx, item
with this:
mmxlen:
mov edx, [esp+4]
And switch this:
sub eax, item
mov result, eax
ret
mmxlen endp
with this:
sub eax, [esp+4]
ret 4
(You could reinsert the mov result,eax part; I just don't see why, in real-life practice rather than imaginary 9999+ character string testing.)
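Put together, a frameless routine looks like the sketch below. It is only a trivial byte scanner of my own, to show the shape (not the MMX routine), assuming a single DWORD argument and a caller that expects the callee to clean the stack:

bytelen_noframe:
    mov edx, [esp+4]         ; the argument sits right above the return address
    mov eax, edx
@@:
    cmp byte ptr [eax], 0    ; scan for the terminator
    je @F
    inc eax
    jmp @B
@@:
    sub eax, edx             ; length = end - start
    ret 4                    ; pop the single argument on return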
Now I have been thinking: a lot of assembly programmers are attracted to speed, but most of these routines (Agner's, the MMX len and some others) have some init penalty and only really start making up ground on strings exceeding 60-100+ characters. How often are your strings more than 256, or anywhere near 9999, characters long? So to some extent it's counterproductive. I noticed this when I wanted to do a hex string to DWORD conversion and made it so this code worked:
HASH_IT:
;ESI HAS STRING EAX=RETURN
XOR EDX,EDX
MOV DX,W[ESI]          ; first two hex characters as a word index
MOV ECX,[HH_TAB+EDX]   ; table lookup, the converted byte ends up in CL
MOV AH,CL
MOV DX,W[ESI+2]        ; characters 3-4
MOV ECX,[HH_TAB+EDX]
MOV AL,CL              ; AX = high word of the result
SHL EAX,16
MOV DX,W[ESI+4]        ; characters 5-6
MOV ECX,[HH_TAB+EDX]
MOV AH,CL
MOV DX,W[ESI+6]        ; characters 7-8
MOV ECX,[HH_TAB+EDX]
MOV AL,CL              ; EAX = the full DWORD
RET
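For context, a guess at how a lookup table like HH_TAB could be laid out and filled. I'm inferring from the code above that it is a 64K byte table indexed by the two ASCII characters read as a little-endian word; the InitHHTab name, the layout and the uppercase-only digits are my own assumptions, not the code from the thread:

.data
HexChars db "0123456789ABCDEF"
.data?
HH_TAB db 65536 dup (?)        ; one byte for every possible character pair
.code
InitHHTab proc uses ebx esi edi
    xor esi, esi               ; esi = value of the first digit (0..15)
FirstDigit:
    xor edi, edi               ; edi = value of the second digit (0..15)
SecondDigit:
    movzx eax, byte ptr [HexChars+esi]  ; ASCII of the first digit
    movzx ebx, byte ptr [HexChars+edi]  ; ASCII of the second digit
    shl ebx, 8
    or ebx, eax                ; index = second char * 256 + first char
    mov eax, esi
    shl eax, 4
    or eax, edi                ; entry = first digit * 16 + second digit
    mov byte ptr [HH_TAB+ebx], al
    inc edi
    cmp edi, 16
    jb SecondDigit
    inc esi
    cmp esi, 16
    jb FirstDigit
    ret
InitHHTab endp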
Of course I had to build a huge table, which takes an asinine amount of cycles; it's almost like the jump table debate in a window proc. I just want people to remember it's not about scanning huge arrays for 0s; after a while it's not even practical to keep data in string format that way.
With that in mind, the best one I could come up with was just a heavily unrolled byte counter (12 checks before re-gaining); it's the one that was quicker up to a point (around 140-152 characters). A few different string formats would have been better than a single 0 terminator. Dual zeros would allow you to skip every other byte until you hit a zero, then just check whether the previous byte was also zero to get your length (see the sketch after the snippet below).
I wanted to come up with a better algo and couldn't; everything I tried never caught the MMX or Agner's array scanners. I don't know why I even tried, to be honest; I never use calls to count strings, I just throw the pointer in EAX and...
DEC EAX
@:
INC EAX
CMP BYTE PTR [EAX],0
JNZ @B
SUB EAX,[original value]
Why? Because it's manageable, and up to a certain point even a call to a better routine would be a penalty.
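Coming back to the dual-zero idea above, this is a rough sketch of how I read that suggestion (it assumes the string really ends in two zero bytes; the DualZeroLen name is mine):

DualZeroLen proc item:DWORD
    mov eax, item
    cmp byte ptr [eax], 0      ; empty string?
    je Done
@@:
    add eax, 2                 ; test only every second byte
    cmp byte ptr [eax], 0
    jne @B
    cmp byte ptr [eax-1], 0    ; was the skipped byte the real terminator?
    jne Done
    dec eax
Done:
    sub eax, item              ; length = end - start
    ret
DualZeroLen endp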
Well, here's my best effort; it dominates on small strings. Laugh at me if you want, I know it's kind of kiddy, but it beat the hell out of doing things like small-scale bit-mask zero testing (like MOV AX,W[ECX]; AND (or TEST) AX,0808h); in fact that was slower than the single byte scan. It's in G-ASM format and all byte code to save space.
_PopulationA: ;INVOKE _PopulationA,ADDR of STRING
DB 8Bh,44h,24h,04h,80h,38h,00h
DB 74h,7Ah,80h,78h,01h,00h
DB 74h,73h,80h,78h,02h,00h
DB 74h,68h,80h,78h,03h,00h
DB 74h,5Dh,80h,78h,04h,00h
DB 74h,52h,80h,78h,05h,00h
DB 74h,47h,80h,78h,06h,00h
DB 74h,3Ch,80h,78h,07h,00h
DB 74h,31h,80h,78h,08h,00h
DB 74h,26h,80h,78h,09h,00h
DB 74h,1Bh,80h,78h,0Ah,00h
DB 74h,10h,80h,78h,0Bh,00h
DB 74h,05h,83h,0C0h,0Ch,0EBh,0B4h
DB 83h,0C0h,0Bh,0EBh,2Eh,83h,0C0h,0Ah
DB 0EBh,29h,83h,0C0h,09h,0EBh,24h
DB 83h,0C0h,08h,0EBh,1Fh,83h,0C0h,07h
DB 0EBh,1Ah,83h,0C0h,06h,0EBh,15h
DB 83h,0C0h,05h,0EBh,10h,83h,0C0h,04h
DB 0EBh,0Bh,83h,0C0h,03h,0EBh,06h
DB 83h,0C0h,02h,0EBh,01h,40h
DB 2Bh,44h,24h,04h
RET 4
I stopped at 12 checks before a re-gain because this stops shy of long jumps (and really long code).
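For anyone who would rather not decode the DB lines, this is roughly the shape of the routine in ordinary source form. It is a 4-way unrolled sketch of the same idea rather than a byte-for-byte disassembly (the 12-way version just repeats the pattern; the ByteLen4 name and labels are mine):

ByteLen4 proc item:DWORD
    mov eax, item
ScanBlock:
    cmp byte ptr [eax], 0      ; check four bytes per pass
    je Found0
    cmp byte ptr [eax+1], 0
    je Found1
    cmp byte ptr [eax+2], 0
    je Found2
    cmp byte ptr [eax+3], 0
    je Found3
    add eax, 4                 ; no terminator here, re-gain and go again
    jmp ScanBlock
Found3:
    inc eax
Found2:
    inc eax
Found1:
    inc eax
Found0:
    sub eax, item              ; length = end - start
    ret
ByteLen4 endp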
Maybe someday we will get lucky and they will put in a direct processor instruction to replace the slow string instructions (like SCUZB, scan until zero byte), but I wouldn't hold my breath.
Good job on the MMX routine, I would never have thought of it.
NOTE* Retested with the timers in the example...
I didn't try to optimize it any, since it's meant to be used kind of "AS IS".
In the first run (the short string) it leads:
Agner Fog's : 131 cycles resulting in : 107
MMX strlen : 130 cycles resulting in : 107
unrolled byte count : 125 cycles resulting in : 107
Press enter to exit...
Agner Fog's : 9674 cycles resulting in : 9999
MMX strlen : 8089 cycles resulting in : 9999
unrolled byte count : 10487 cycles resulting in : 9999
Press enter to exit...
UNALIGNED...
Agner Fog's : 10327 cycles resulting in : 9999
MMX strlen : 8091 cycles resulting in : 9999
unrolled byte count : 10488 cycles resulting in : 9999
Press enter to exit...
Jeez it did better than I thought it would...
-OS
OS, a strlen for big strings isn't very useful; it's just a way to practice.
Others: I found the other topic about strlen; donkey's algo is really very close to mine. I took some of the algos to join the test. I didn't read the full topic, so I don't know if I have the most recent versions.
I made another version yesterday, plus a version for XMM (included in the source).
I can't test the XMM one because I don't have SSE2 yet.
Test routines for correctness:
lszLenSSE 9999
Ratch 9999
szLength 9999
szLen 9999
Jens_fast 9999
mmxlen 9999
StrLen2 9999
Proc/Byte 9999
========= ====
Misaligned by 0 bytes:
lszLenSSE 5559
Ratch 8167
szLength 7318
szLen 14008
Jens_fast 8820
mmxlen 5038
StrLen2 8785
Misaligned by 1 bytes:
lszLenSSE 5813
Ratch 9102
szLength 7330
szLen 14010
Jens_fast 9448
mmxlen 5089
StrLen2 9411
Misaligned by 2 bytes:
lszLenSSE 5813
Ratch 9102
szLength 7332
szLen 14008
Jens_fast 9444
mmxlen 5094
StrLen2 9410
Misaligned by 3 bytes:
lszLenSSE 5812
Ratch 9103
szLength 7330
szLen 14009
Jens_fast 9446
mmxlen 5095
StrLen2 9409
Press enter to exit...
[attachment deleted by admin]
A small fix: in the code that aligns the data in the last algo I used the wrong register. It doesn't affect the result or timing for big strings, but it gives a wrong result for small ones.
Also, the XMM version is finished; please, someone test it for me.
[attachment deleted by admin]
On an A64 3200+ Venice (939P, 2.0 GHz):
Proc/Byte 9999
========= ====
Misaligned by 0 bytes:
lszLenSSE 3787
Ratch 7536
szLen 10283
szLength 6277
Jens_fast 8202
mmxlen 3158
StrLen2 5862
Misaligned by 1 bytes:
lszLenSSE 3788
Ratch 7535
szLen 10282
szLength 6281
Jens_fast 8284
mmxlen 3206
StrLen2 6284
Misaligned by 2 bytes:
lszLenSSE 3788
Ratch 7535
szLen 10284
szLength 6282
Jens_fast 8284
mmxlen 3203
StrLen2 6284
Misaligned by 3 bytes:
lszLenSSE 3788
Ratch 7536
szLen 10282
szLength 6284
Jens_fast 8284
mmxlen 3202
StrLen2 6284
Press enter to exit...
And with the XMM SSE2 version:
Proc/Byte 9999
========= ====
Misaligned by 0 bytes:
lszLenSSE 3787
Ratch 7553
szLen 10294
szLength 6275
Jens_fast 8207
mmxlen 3170
xmmlen 2220
StrLen2 5863
Misaligned by 1 bytes:
lszLenSSE 3793
Ratch 7548
szLen 10288
szLength 6281
Jens_fast 8296
mmxlen 3207
xmmlen 2239
StrLen2 6284
Misaligned by 2 bytes:
lszLenSSE 3800
Ratch 7536
szLen 10296
szLength 6287
Jens_fast 8289
mmxlen 3203
xmmlen 2240
StrLen2 6289
Misaligned by 3 bytes:
lszLenSSE 3788
Ratch 7543
szLen 10293
szLength 6291
Jens_fast 8296
mmxlen 3203
xmmlen 2231
StrLen2 6290
Press enter to exit...
Now it is fast :8)