News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

InString Speed Test!

Started by ecube, November 18, 2006, 06:05:17 AM

Previous topic - Next topic

hutch--

I had a quick play with the library version of InString by onrolling a couple of the loops and it is a bit faster but the algo properly needs a rewrite. An interesting factor is both the msvcrt and lingo's algos get faster with longer patterns where a byte scanner gets slower in its matching loop.

Here are the timings on my 2 PIVs.


Northwood code 2.8 gig PIV

719 timing msvcrt strstr
763 timing InStringx
480 timing InStringL
699 timing msvcrt strstr
768 timing InStringx
474 timing InStringL
696 timing msvcrt strstr
769 timing InStringx
465 timing InStringL
713 timing msvcrt strstr
764 timing InStringx
465 timing InStringL
698 timing msvcrt strstr
767 timing InStringx
470 timing InStringL
704 timing msvcrt strstr
763 timing InStringx
466 timing InStringL
716 timing msvcrt strstr
767 timing InStringx
462 timing InStringL
702 timing msvcrt strstr
767 timing InStringx
464 timing InStringL
----------------------
705 average msvcrt strstr
766 average InStringx
468 average InStringL
----------------------
Press any key to continue ...

Prescott core 3.0 gig PIV.

845 timing msvcrt strstr
848 timing InStringx
434 timing InStringL
846 timing msvcrt strstr
846 timing InStringx
438 timing InStringL
844 timing msvcrt strstr
845 timing InStringx
447 timing InStringL
845 timing msvcrt strstr
845 timing InStringx
464 timing InStringL
842 timing msvcrt strstr
844 timing InStringx
445 timing InStringL
844 timing msvcrt strstr
843 timing InStringx
450 timing InStringL
843 timing msvcrt strstr
847 timing InStringx
458 timing InStringL
847 timing msvcrt strstr
845 timing InStringx
470 timing InStringL
----------------------
844 average msvcrt strstr
845 average InStringx
450 average InStringL
----------------------
Press any key to continue ...


Lingo's algo has clearly got the legs on the other two.


[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

I have coded a new version of Instring "finstr" which has a lot less overhead than the library version. This version differs from the library version in that it does not use a preset length for either the source string or the search pattern but calculates them as zero terminated strings. The new version is faster on normal to short length strings than either Lingo's or the MSVCRT strstr but as the string gets longer, Lingos starts to become faster. The timings below are on my 2 PIVs but I did run a test on an AMD I own and Lingo's code better suits the AMD than either the MSVCRT or my own does.

These are the results on my 2 PIVs.



2.8 gig Northwood PIV

163 timing msvcrt strstr
70 timing finstr
118 timing InStringL
163 timing msvcrt strstr
71 timing finstr
127 timing InStringL
164 timing msvcrt strstr
71 timing finstr
120 timing InStringL
162 timing msvcrt strstr
70 timing finstr
120 timing InStringL
162 timing msvcrt strstr
71 timing finstr
119 timing InStringL
163 timing msvcrt strstr
71 timing finstr
119 timing InStringL
161 timing msvcrt strstr
72 timing finstr
119 timing InStringL
160 timing msvcrt strstr
71 timing finstr
118 timing InStringL
----------------------
162 average msvcrt strstr
70 average finstr
120 average InStringL
----------------------
Press any key to continue ...

3.0 gig Prescott PIV

257 timing msvcrt strstr
110 timing finstr
141 timing InStringL
259 timing msvcrt strstr
110 timing finstr
141 timing InStringL
258 timing msvcrt strstr
110 timing finstr
142 timing InStringL
253 timing msvcrt strstr
110 timing finstr
142 timing InStringL
257 timing msvcrt strstr
110 timing finstr
142 timing InStringL
257 timing msvcrt strstr
110 timing finstr
142 timing InStringL
257 timing msvcrt strstr
110 timing finstr
142 timing InStringL
257 timing msvcrt strstr
110 timing finstr
141 timing InStringL
----------------------
256 average msvcrt strstr
110 average finstr
141 average InStringL
----------------------
Press any key to continue ...

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

EduardoS

My AMD, K-8:
instr2:

813 timing msvcrt strstr
749 timing InStringx
313 timing InStringL
769 timing msvcrt strstr
749 timing InStringx
371 timing InStringL
768 timing msvcrt strstr
771 timing InStringx
318 timing InStringL
782 timing msvcrt strstr
749 timing InStringx
307 timing InStringL
768 timing msvcrt strstr
751 timing InStringx
310 timing InStringL
768 timing msvcrt strstr
751 timing InStringx
319 timing InStringL
766 timing msvcrt strstr
749 timing InStringx
318 timing InStringL
771 timing msvcrt strstr
750 timing InStringx
307 timing InStringL
----------------------
775 average msvcrt strstr
752 average InStringx
320 average InStringL
----------------------
Press any key to continue ...


instr3:

122 timing msvcrt strstr
102 timing finstr
87 timing InStringL
118 timing msvcrt strstr
90 timing finstr
91 timing InStringL
118 timing msvcrt strstr
99 timing finstr
89 timing InStringL
119 timing msvcrt strstr
31 timing finstr
56 timing InStringL
118 timing msvcrt strstr
87 timing finstr
92 timing InStringL
117 timing msvcrt strstr
89 timing finstr
89 timing InStringL
118 timing msvcrt strstr
89 timing finstr
91 timing InStringL
118 timing msvcrt strstr
89 timing finstr
91 timing InStringL
----------------------
118 average msvcrt strstr
84 average finstr
85 average InStringL
----------------------
Press any key to continue ...

lingo

Sorry Hutch,

I apologize... :(
just forgot about that...

Regards,
Lingo

hutch--

The source code for the "finstr" proc has been modified by removing the loop unrolls. I removed the comments and the timings are as below. The algo was designed with the large unroll to compensate for the extra testing for zero in both the scan loop and the match loop.


2.8 gig Northwood core PIV

160 timing msvcrt strstr
72 timing finstr
104 timing InStringL
162 timing msvcrt strstr
72 timing finstr
106 timing InStringL
164 timing msvcrt strstr
75 timing finstr
103 timing InStringL
163 timing msvcrt strstr
72 timing finstr
106 timing InStringL
162 timing msvcrt strstr
72 timing finstr
106 timing InStringL
157 timing msvcrt strstr
73 timing finstr
105 timing InStringL
163 timing msvcrt strstr
75 timing finstr
105 timing InStringL
159 timing msvcrt strstr
73 timing finstr
105 timing InStringL
----------------------
161 average msvcrt strstr
73 average finstr
105 average InStringL
----------------------
Press any key to continue ...

3.0 gig Prescott core PIV

257 timing msvcrt strstr
110 timing finstr
140 timing InStringL
257 timing msvcrt strstr
110 timing finstr
142 timing InStringL
257 timing msvcrt strstr
110 timing finstr
142 timing InStringL
255 timing msvcrt strstr
111 timing finstr
140 timing InStringL
257 timing msvcrt strstr
110 timing finstr
142 timing InStringL
257 timing msvcrt strstr
110 timing finstr
143 timing InStringL
257 timing msvcrt strstr
112 timing finstr
141 timing InStringL
258 timing msvcrt strstr
110 timing finstr
142 timing InStringL
----------------------
256 average msvcrt strstr
110 average finstr
141 average InStringL
----------------------
Press any key to continue ...


Later.

Tested with the longer string commented out in the test piece, Lingo's algo is clearly faster as the string length gets longer.


710 timing msvcrt strstr
526 timing finstr
343 timing InStringL
700 timing msvcrt strstr
525 timing finstr
349 timing InStringL
708 timing msvcrt strstr
526 timing finstr
341 timing InStringL
707 timing msvcrt strstr
524 timing finstr
350 timing InStringL
708 timing msvcrt strstr
530 timing finstr
334 timing InStringL
706 timing msvcrt strstr
525 timing finstr
334 timing InStringL
705 timing msvcrt strstr
522 timing finstr
354 timing InStringL
701 timing msvcrt strstr
524 timing finstr
352 timing InStringL
----------------------
705 average msvcrt strstr
525 average finstr
344 average InStringL
----------------------
Press any key to continue ...

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

ecube

114 timing msvcrt strstr
95 timing finstr
108 timing InStringL
114 timing msvcrt strstr
97 timing finstr
106 timing InStringL
115 timing msvcrt strstr
95 timing finstr
107 timing InStringL
115 timing msvcrt strstr
96 timing finstr
112 timing InStringL
113 timing msvcrt strstr
97 timing finstr
106 timing InStringL
115 timing msvcrt strstr
95 timing finstr
107 timing InStringL
113 timing msvcrt strstr
108 timing finstr
107 timing InStringL
113 timing msvcrt strstr
96 timing finstr
107 timing InStringL
----------------------
114 average msvcrt strstr
97 average finstr
107 average InStringL
----------------------
Press any key to continue ...

amd 64 3800+

Jimg

Hi Hutch-

Looking at your code, I don't understand why you duplicate the code inside the repeat one more time outside the repeat loop.

Isn't this-
scanloop:
  REPEAT 15
    add esi, ebp
    movzx edx, BYTE PTR [esi]
    cmp eax, edx
    je prematch
    test edx, edx                   ; test for source terminator
    jz no_match
  ENDM

    add esi, ebp
    movzx edx, BYTE PTR [esi]
    cmp eax, edx
    je prematch
    test edx, edx                   ; test for source terminator
    jnz scanloop
    jmp no_match

the same as this-
  REPEAT 16
    add esi, ebp
    movzx edx, BYTE PTR [esi]
    cmp eax, edx
    je prematch
    test edx, edx                   ; test for source terminator
    jz no_match
  ENDM
    jmp scanloop
?  At best it saves one cycle every 16 bytes tested for the most likely jump being to scanloop.

And if you were to get four bytes to test at once, you could replace 3 movzx's with one rol edx,16 like-
  REPEAT 4
    add esi, ebp
    mov edx,[esi]
    test dl, dl                   ; test for source terminator
    jz no_match
    cmp al, dl
    je prematch

    add esi,ebp
    test dh, dh                   ; test for source terminator
    jz no_match
    cmp al, dh
    je prematch
   
    add esi,ebp
    rol edx,16
    test dl, dl                   ; test for source terminator
    jz no_match
    cmp al, dl
    je prematch

    add esi,ebp
    test dh, dh                   ; test for source terminator
    jz no_match
    cmp al, dh
    je prematch
   
  ENDM

    jmp scanloop

which run slightly faster on my amd.

hutch--

Jim,

> Looking at your code, I don't understand why you duplicate the code inside the repeat one more time outside the repeat loop.

Habit, I usually do an unroll by repeating the first one less than the required unroll count and changing any necessary jumps.

I would be interested in seeing your alternate version with the DWORD read and the ROL. It would probably need to be aligned before the DWORD read which is easy enough to do. The only thing that would worry me is the stall on some older Intel hardware for doing byte reads after a DWORD write.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

P3:

129 timing msvcrt strstr
114 timing finstr
111 timing InStringL
129 timing msvcrt strstr
113 timing finstr
111 timing InStringL
129 timing msvcrt strstr
113 timing finstr
110 timing InStringL
129 timing msvcrt strstr
113 timing finstr
111 timing InStringL
129 timing msvcrt strstr
114 timing finstr
110 timing InStringL
129 timing msvcrt strstr
113 timing finstr
111 timing InStringL
128 timing msvcrt strstr
113 timing finstr
111 timing InStringL
129 timing msvcrt strstr
113 timing finstr
110 timing InStringL
----------------------
128 average msvcrt strstr
113 average finstr
110 average InStringL
----------------------
Press any key to continue ...
eschew obfuscation

zooba

P3 Mobile (probably 1400MHz, though maybe less, SpeedStep doesn't seem to work properly...):

93 timing msvcrt strstr
108 timing finstr
71 timing InStringL
96 timing msvcrt strstr
111 timing finstr
71 timing InStringL
96 timing msvcrt strstr
131 timing finstr
70 timing InStringL
93 timing msvcrt strstr
108 timing finstr
71 timing InStringL
95 timing msvcrt strstr
108 timing finstr
70 timing InStringL
94 timing msvcrt strstr
118 timing finstr
70 timing InStringL
93 timing msvcrt strstr
119 timing finstr
69 timing InStringL
93 timing msvcrt strstr
108 timing finstr
70 timing InStringL
----------------------
94 average msvcrt strstr
113 average finstr
70 average InStringL
----------------------
Press any key to continue ...


Interesting that finstr is the same for both mine and MichaelW's computers, though the others are approx 30-40 faster on mine (consistently). I guess I am running at 1400MHz :bg

Cheers,

Zooba :U

six_L

QuoteGenuineIntel  Family 6  Model 13  Stepping 6
Intel Brand String: Intel(R) Pentium(R) M processor
Features: FPU TSC CX8 CMOV CLFSH FXSR MMX SSE SSE2
Quote94 timing msvcrt strstr
107 timing finstr
71 timing InStringL
93 timing msvcrt strstr
122 timing finstr
71 timing InStringL
95 timing msvcrt strstr
129 timing finstr
71 timing InStringL
94 timing msvcrt strstr
127 timing finstr
71 timing InStringL
94 timing msvcrt strstr
122 timing finstr
71 timing InStringL
94 timing msvcrt strstr
107 timing finstr
71 timing InStringL
93 timing msvcrt strstr
123 timing finstr
71 timing InStringL
94 timing msvcrt strstr
107 timing finstr
71 timing InStringL
----------------------
93 average msvcrt strstr
118 average finstr
71 average InStringL
----------------------
Press any key to continue ...
regards

Jimg

#26
Hutch-

Quote from: hutch-- on November 21, 2006, 05:54:26 AM
>
I would be interested in seeing your alternate version with the DWORD read and the ROL. It would probably need to be aligned before the DWORD read which is easy enough to do. The only thing that would worry me is the stall on some older Intel hardware for doing byte reads after a DWORD write.

Look at the last block of code in my previous post.  All I did is replace your block with that one, and it reduced the run from 115 to 107 cycles on the shorter string.  Still much slower than Lingo on the longer string however.  But look carefully, I could have screwed up the endian thing.  I haven't done any masm since february and am a littly rusty.

edit.   Actually, SHR edx,16 would be better than the ROL, and move it up one line.

edit2.  I just dusted off my old celeron laptop and gave it a try.  No difference in speed so basically, forget everything I said.

lingo

Shorter substrings and rolling out?
OK, ... new InStringL with new results:  :lol
207 timing msvcrt strstr
111 timing finstr
70 timing InStringL
209 timing msvcrt strstr
117 timing finstr
71 timing InStringL
209 timing msvcrt strstr
111 timing finstr
70 timing InStringL
209 timing msvcrt strstr
112 timing finstr
70 timing InStringL
206 timing msvcrt strstr
113 timing finstr
70 timing InStringL
209 timing msvcrt strstr
113 timing finstr
70 timing InStringL
209 timing msvcrt strstr
111 timing finstr
70 timing InStringL
213 timing msvcrt strstr
110 timing finstr
70 timing InStringL
----------------------
208 average msvcrt strstr
112 average finstr
70 average InStringL
----------------------
Press any key to continue ...


Regards,
Lingo


[attachment deleted by admin]

ecube

114 timing msvcrt strstr
94 timing finstr
49 timing InStringL
113 timing msvcrt strstr
94 timing finstr
49 timing InStringL
112 timing msvcrt strstr
95 timing finstr
49 timing InStringL
113 timing msvcrt strstr
93 timing finstr
50 timing InStringL
113 timing msvcrt strstr
95 timing finstr
49 timing InStringL
113 timing msvcrt strstr
94 timing finstr
48 timing InStringL
113 timing msvcrt strstr
99 timing finstr
50 timing InStringL
112 timing msvcrt strstr
95 timing finstr
49 timing InStringL
----------------------
112 average msvcrt strstr
94 average finstr
49 average InStringL
----------------------
Press any key to continue ...

Seb

Very nice timings! :U


107 timing msvcrt strstr
90 timing finstr
47 timing InStringL
107 timing msvcrt strstr
90 timing finstr
47 timing InStringL
107 timing msvcrt strstr
89 timing finstr
46 timing InStringL
107 timing msvcrt strstr
89 timing finstr
47 timing InStringL
107 timing msvcrt strstr
90 timing finstr
47 timing InStringL
107 timing msvcrt strstr
89 timing finstr
47 timing InStringL
106 timing msvcrt strstr
102 timing finstr
47 timing InStringL
106 timing msvcrt strstr
90 timing finstr
46 timing InStringL
----------------------
106 average msvcrt strstr
91 average finstr
46 average InStringL
----------------------