szLen optimize...

Jimg · June 21, 2005, 01:18:32 PM

Phil-

Quote
I certainly agree that discussion is key to understanding what's happening here. I plugged your unrolled code into the test as szLength2 and didn't see much of a difference as indicated by the result. I'm testing on a 996 MHz P3 and the trials I see are very consistent on this machine.

The routine is definitely the fastest on my Athlon (other than the SSE code). I've seen these timing differences between the P3 and Athlons before....

ic2 · June 21, 2005, 02:05:01 PM

Here are two blocks of code that deserve serious notice. I hope they can be included in your (today's) advancements on this subject ... I found them through serious searching. Could they be included in your test? We all really need to see the results here.

Thank you

Code Select

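Jens_Duttke_StrLen PROC Source:DWORD

    mov ecx, Source

@@:
    mov eax, dword ptr [ecx]        ; load four bytes
    add ecx, 4

    lea edx, [eax - 01010101h]      ; subtract 1 from every byte
    xor eax, edx                    ; bits that changed
    and eax, 80808080h              ; keep bit 7 of each byte
    jz  @B                          ; no bit 7 changed: no zero byte here
    and eax, edx                    ; keep only bit 7s that went 0 -> 1
    jz  @B                          ; none: an 80h byte caused it, keep scanning

    bsf edx, eax                    ; bit index of the first zero byte (7, 15, 23 or 31)

    sub edx, 4
    shr edx, 3                      ; convert to byte index 0..3

    lea eax, [ecx + edx - 4]        ; address of the zero byte
    sub eax, Source                 ; length = address - start

    ret

Jens_Duttke_StrLen ENDP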

Code Select

Jens_fast_strlen PROC item:DWORD

    mov ecx, item

@@:
    mov eax, dword ptr [ecx]
    add ecx, 4

    lea edx, [eax - 01010101h]
    xor eax, edx
    and eax, 80808080h
    jz  @B
    and eax, edx
    jz  @B

    bsf edx, eax

    sub edx, 4
    shr edx, 3

    lea eax, [ecx + edx - 4]
    sub eax, item

    ret

Jens_fast_strlen ENDP
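
For anyone following along in C, here is a rough model of the routine posted above. It is only a sketch: the names `load_le32`, `lowest_set_bit` and `jens_strlen_model` are made up for illustration, the small loop stands in for BSF, and the `sub edx, 4` step is left out because shifting the bit index right by 3 already gives the byte index. Like the assembly, it can read up to three bytes past the terminator, so it assumes the buffer is readable that far.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load four bytes as a little-endian word, like `mov eax, [ecx]` on x86. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Index of the lowest set bit; stands in for BSF. m must be non-zero. */
static unsigned lowest_set_bit(uint32_t m)
{
    unsigned n = 0;
    while ((m & 1u) == 0) {
        m >>= 1;
        ++n;
    }
    return n;
}

/* C model of the posted routine (hypothetical name). */
size_t jens_strlen_model(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    for (;;) {
        uint32_t v = load_le32(p);
        p += 4;                                 /* add ecx, 4 */
        uint32_t d = v - 0x01010101u;           /* lea edx, [eax - 01010101h] */
        uint32_t m = (v ^ d) & 0x80808080u;     /* bytes whose bit 7 changed */
        if (m == 0)
            continue;                           /* first jz @B */
        m &= d;                                 /* bit 7 went from 0 to 1 */
        if (m == 0)
            continue;                           /* second jz @B */
        unsigned byte = lowest_set_bit(m) >> 3; /* bit 7/15/23/31 -> 0..3 */
        return (size_t)((p - 4 + byte) - (const unsigned char *)s);
    }
}
```

The lowest flagged byte is always a real zero, because a borrow can only come out of a byte that was zero to begin with. That is why scanning from the low end is safe even though higher bytes may be flagged by accident.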

Jimg · June 21, 2005, 02:24:02 PM

Code Select

Results on Athlon XP 3000+


Test routines for correctness:
lszLenSSE     0    1    2    3    5    8   13   21   34   55   89  144  233
FStrLen       0    1    2    3    5    8   13   21   34   55   89  144  233
Ratch         0    1    2    3    5    8   13   21   34   55   89  144  233
szLength      0    1    2    3    5    8   13   21   34   55   89  144  233
szLen         0    1    2    3    5    8   13   21   34   55   89  144  233
Jens_fast_    0    1    2    3    5    8   13   21   34   55   89  144  233

Strings aligned:
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    25   26   22   25   25   28   29   32   38   48   80  118  169
FStrLen       6    7   12   12   11   13   16   21   32   63   88  135  201
Ratch         7   12   12   15   14   14   20   29   39   77  101  142  220
szLength      9   10    9   10   11   15   16   26   34   49   90  132  198
szLen         7    8   13   13   18   23   28   38   54   91  140  207  323
Jens_fast_   20   20   20   20   21   27   29   36   46   69   99  145  217

Strings misaligned by 1 byte:
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    30   28   28   28   29   30   30   37   42   55   90  124  179
FStrLen       6    7    9    8   11   12   17   26   35   71  102  150  230
Ratch         8   10   11   15   18   16   23   31   40   86  108  156  241
szLength     13   14   14   15   15   18   21   28   37   56   96  137  207
szLen         9    9   12   13   18   22   29   39   54   93  140  207  322
Jens_fast_   21   21   20   20   28   29   33   42   50   77  107  156  234

Strings misaligned by 2 bytes:
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    28   28   28   28   28   30   30   37   41   55   90  126  178
FStrLen       6    7    9   10   10   12   19   25   35   72  102  150  229
Ratch         7   11   12   15   18   15   23   30   40   85  109  155  240
szLength     14   14   15   15   15   19   21   27   52   54   96  139  206
szLen         8    9   13   13   18   22   28   38   54   94  139  207  323
Jens_fast_   20   21   20   20   29   29   33   42   51   77  107  157  235

ic2 · June 21, 2005, 03:36:10 PM

Jimg, I made a mistake and posted the identical Jens Duttke code twice. Below is the one that was supposed to be slower. Funny that it gave slightly different results for the same code. Could that come from running them back to back? I guess it really doesn't matter, seeing that FStrLen is the fastest anyway. This is really great.

Also, I see you caught the flaw.

Thanks a lot for displaying the results quickly

Code Select

Jens_Duttke_StrLen PROC item:DWORD

    mov ecx, item

@@:
    mov eax, dword ptr [ecx]
    add ecx, 4

    lea edx, [eax - 01010101h]
    xor eax, edx
    and eax, 80808080h
    and eax, edx            ; no early-exit jz between the two ANDs
    jz  @B

    bsf edx, eax

    sub edx, 4
    shr edx, 3

    lea eax, [ecx + edx - 4]
    sub eax, item

    ret

Jens_Duttke_StrLen ENDP
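
The two postings differ only in the loop test, and both tests answer the same question. A small C sketch of the two predicates makes that easier to see; the function names are hypothetical and the check walks a spread of 32-bit values rather than all of them.

```c
#include <assert.h>
#include <stdint.h>

/* Loop test with two conditional jumps, as in the first posting. */
int zero_test_two_jumps(uint32_t v)
{
    uint32_t d = v - 0x01010101u;          /* lea edx, [eax - 01010101h] */
    uint32_t m = (v ^ d) & 0x80808080u;    /* xor, then and 80808080h */
    if (m == 0)
        return 0;                          /* first jz: early exit */
    return (m & d) != 0;                   /* and eax, edx / second jz */
}

/* Loop test with one conditional jump: the two ANDs run back to back. */
int zero_test_one_jump(uint32_t v)
{
    uint32_t d = v - 0x01010101u;
    return ((v ^ d) & 0x80808080u & d) != 0;
}

/* Assert that both tests agree over a spread of 32-bit values. */
void check_tests_agree(void)
{
    for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 65521u) {
        uint32_t v = (uint32_t)x;
        assert(zero_test_two_jumps(v) == zero_test_one_jump(v));
    }
}
```

So the extra jz never changes the result. It only decides whether the common "nothing interesting in this dword" case leaves the loop body one instruction earlier, which is a timing question the benchmarks have to settle.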

Jimg · June 21, 2005, 03:41:59 PM

Phil- Ok, here is a version that tests the string misalignment automatically. I also added a print to verify that the routines were working correctly, and I added a string with all the possible ASCII characters (the 999 string). As you can see, the FStrLen routine stops at the first ASCII character over 128, and so its cycle counts for that string are not correct.

Code Select

Test routines for correctness:
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
FStrLen      0    1    2    3    5    8   13   21   34   55   89  144  233  128
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength     0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLen        0    1    2    3    5    8   13   21   34   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999

Proc/Byte    0    1    2    3    5    8   13   21   34   55   89  144  233  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Misaligned by 0 bytes:
lszLenSSE   25   25   27   25   25   28   29   32   38   47   82  118  168  596
FStrLen      6    8    9    9   11   12   16   21   33   60   89  131  199  119
Ratch        9   10   12   15   14   14   20   29   39   77  101  142  220  857
szLength    10    8    9   10   11   15   18   26   33   64   90  133  199  784
szLen        6   11   12   15   19   23   30   61   80   93  140  208  323 1293
Jens_fast   20   20   20   21   21   26   29   36   44   69  100  146  217  923

Misaligned by 1 bytes:
lszLenSSE   28   28   28   28   28   30   31   33   43   55   92  123  179  623
FStrLen      6    8    9    9   11   12   17   25   35   71  101  149  229  135
Ratch        8   11   11   15   18   15   23   30   40   85  108  154  240  955
szLength    14   13   14   15   15   18   21   28   38   55   96  135  205  785
szLen        7   11   12   15   20   23   33   39   54   92  185  208  351 1293
Jens_fast   20   20   20   20   25   28   32   40   48   75  105  154  233 1001

Misaligned by 2 bytes:
lszLenSSE   27   29   27   28   28   31   30   36   41   56   89  127  179  621
FStrLen      6    7    9    9   12   12   17   25   35   73  102  149  228  136
Ratch        8   11   12   14   18   16   23   30   40   87  110  157  241  954
szLength    14   14   15   15   15   21   21   28   40   55   97  140  207  787
szLen        8   11   12   12   18   22   27   40   54   93  139  213  322 1291
Jens_fast   20   20   20   20   24   28   32   40   47   75  105  153  232  994

Misaligned by 3 bytes:
lszLenSSE   28   28   28   28   28   30   31   33   41   56   91  124  177  629
FStrLen      7    8    9    9   10   12   17   27   35   71  103  151  230  136
Ratch        8   11   13   15   18   15   23   31   40   85  110  156  242  953
szLength    14   15   16   15   17   20   24   31   40   55   98  140  207  792
szLen        8   11   12   15   20   23   28   61   69   92  148  208  321 1291
Jens_fast   20   20   21   20   24   28   32   41   47   76  104  154  234  998

Press enter to exit...

[attachment deleted by admin]
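
The 128 in the FStrLen row is what a 7-bit zero test would give. The C sketch below is an illustration under two assumptions: that FStrLen uses the common 7-bit form of the test (its source is not shown in this part of the thread), and that the 999 string cycles through the byte values 1 to 255 in order (its exact contents are not shown either). All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load four bytes as a little-endian word. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* 7-bit test: flags a zero byte, but also any byte of 81h or above. */
static uint32_t flags_7bit(uint32_t v)
{
    return (v - 0x01010101u) & 0x80808080u;
}

/* 8-bit test: additionally requires bit 7 to have been clear beforehand. */
static uint32_t flags_8bit(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Word-at-a-time scan that stops at the lowest flagged byte. */
static size_t scan(const unsigned char *s, uint32_t (*flags)(uint32_t))
{
    assert(s != NULL);
    for (size_t i = 0;; i += 4) {
        uint32_t m = flags(load_le32(s + i));
        for (unsigned b = 0; b < 4 && m != 0; ++b)
            if (m & (0x80u << (8 * b)))
                return i + b;
    }
}

size_t len_7bit(const unsigned char *s) { return scan(s, flags_7bit); }
size_t len_8bit(const unsigned char *s) { return scan(s, flags_8bit); }

/* Assumed layout of the test string: values 1..255 cycling, then padding. */
void fill_all_values(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        buf[i] = (unsigned char)(i % 255 + 1);
    for (size_t i = len; i < len + 4; ++i)
        buf[i] = 0;
}
```

With that layout the byte at offset 128 is 81h, the first value the 7-bit test mistakes for a terminator, while the 8-bit test runs on to 999.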

Phil · June 21, 2005, 09:04:07 PM

Jimg: Thanks for automating ... Especially for the verification routine and string 999 that shows FStrLen's 7-bit shortcomings!

Here are the results for a 996 MHz P3:

Code Select

Test routines for correctness:
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
FStrLen      0    1    2    3    5    8   13   21   34   55   89  144  233  128
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength     0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLen        0    1    2    3    5    8   13   21   34   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999

Proc/Byte    0    1    2    3    5    8   13   21   34   55   89  144  233  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Misaligned by 0 bytes:
lszLenSSE   16   16   16   16   17   19   19   22   29   48   63   86  117  404
FStrLen      7    7   10    8   10   13   16   37   47   59   86  128  194  116
Ratch       18   25   32   39   29   25   35   51   66   95  105  144  227  872
szLength    19   19   19   19   23   25   30   37   47   81  116  173  261 1025
szLen        8   10   12   14   19   22   30   56   77  115  175  270  428 1767
Jens_fast   12   12   12   12   15   18   19   38   47   63   88  130  196  870

Misaligned by 1 bytes:
lszLenSSE   16   16   16   16   16   26   19   40   32   54   93  113  171  649
FStrLen      7    7   10    8   10   13   18   53   49   59   87  133  210  125
Ratch       18   25   32   39   29   25   35   74   77   96  124  182  286 1149
szLength    24   25   25   30   29   30   33   41   54   88  120  176  265 1032
szLen        8   10   12   14   19   22   30   56   77  115  175  270  428 1768
Jens_fast   12   12   12   12   15   18   19   63   51   66   99  145  224  981

Misaligned by 2 bytes:
lszLenSSE   16   17   16   16   16   26   19   40   32   54   93  113  251  648
FStrLen      7    7   10    8   10   13   16   53   49   59   87  133  288  124
Ratch       18   25   32   39   29   25   35   73   77   96  124  182  371 1149
szLength    25   25   30   30   29   31   33   41   58   88  120  177  265 1032
szLen        8   10   12   14   19   22   30   56   77  115  175  270  428 1770
Jens_fast   12   12   12   12   15   18   19   63   51   64   99  145  310  978

Misaligned by 3 bytes:
lszLenSSE   16   16   16   16   16   26   19   22   45   60   81  117  184  627
FStrLen      7    7   10    8   10   13   16   37   48   66   96  144  209  123
Ratch       18   25   32   39   29   25   35   51   73  110  135  191  287 1140
szLength    25   30   30   29   30   31   37   45   58   88  125  178  269 1033
szLen        8   10   12   14   19   22   30   59   76  115  176  270  428 1767
Jens_fast   12   12   12   12   15   18   19   38   53   77  106  155  227 1005

Press enter to exit...

Codewarp · June 21, 2005, 10:10:02 PM

Quote from: ic2 on June 21, 2005, 02:05:01 PM
Here are two blocks of code that deserve serious notice. I hope they can be included in your (today's) advancements on this subject ... I found them through serious searching. Could they be included in your test? We all really need to see the results here.

Thank you

ic2: Interesting algorithm, though it has some shortcomings:

(1) It seems to work only on 7-bit ASCII, not 8-bit.
(2) Its loop uses two jmps instead of one. I believe the first one is unnecessary.
(3) The BSR implementation has been tried and examined thoroughly. It looks so elegant... Too bad the BSR is such a dog; see szLength( ) for a better impl. of this tail-end part of the routine.
(4) No misalignment handling makes this method slow for long misaligned strings.

Codewarp · June 21, 2005, 10:16:04 PM

Quote from: hutch-- on June 21, 2005, 08:35:03 AM
Quote
This is in fact an interesting notion but I am wary of what is left as it will still depend on the opcode implementation from processor to processor which differ substantially over time and between different manufacturers. Usually the reference to a known code is more useful but this also has its limitations in that an algo that is fast on one machine can be slow on another if it's written to use a specific characteristic of one form of hardware.

Memory speed is of course a factor but on the same box testing two different routines, one known and the other developmental there is no advantage or disadvantage to either. What I am inclined to trust is algo comparison on a range of different boxes with different processors to see which works better on what box which is the basics of writing mixed model code that is general purpose.

Hutch --

This benchmarking thing really gets down to the heart of the matter, doesn't it? I agree with everything you have said, and it gets right down to what your code is written for. Code tends to stick around, but processors tend to fade away. There simply isn't any way to code something so that it runs the fastest on all CPUs. You have to pick and choose, and to know what your strategy is. Several strategies come to mind:

(1) Separate libraries for each processor
(2) An Intel library, and an AMD library
(3) Single library optimized for present-day hardware, but compatible back to the PII.
(4) Single library like (3), with dynamic inclusion of advanced CPU features (like SSE, etc.)
(5) Single library optimized with every trick from tomorrow's hardware.

Actually, all of these are desirable, each with serious benefits and baggage. However, clients on 5-year-old hardware don't tend to complain about software performance too much. It's the ones driving the shiny new XP-zazz that want all that speed. Do you really want to avoid MUL instructions, simply because somebody might run it on a P4? I think not, and as for my own effort, most of it goes in the direction of approach (3)--as in my szLength( ) routine, and in (4) when needed.
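
Strategy (4) can be sketched in C as a function pointer chosen on first use. This is a minimal illustration and not anyone's actual library: `strlen_baseline`, `strlen_sse` and `lib_strlen` are hypothetical names, both routines simply forward to the standard `strlen` as placeholders, and the CPU check relies on the GCC/Clang builtin `__builtin_cpu_supports`, falling back to the baseline routine on other compilers or non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholders for a PII-compatible routine and an SSE-tuned one. */
static size_t strlen_baseline(const char *s) { return strlen(s); }
static size_t strlen_sse(const char *s)      { return strlen(s); }

/* Non-zero when the running CPU reports SSE2 (GCC/Clang on x86 only). */
static int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;   /* unknown compiler or CPU: stay with the baseline */
#endif
}

typedef size_t (*strlen_fn)(const char *);
static strlen_fn chosen = NULL;

/* Public entry point: picks an implementation once, then reuses it. */
size_t lib_strlen(const char *s)
{
    assert(s != NULL);
    if (chosen == NULL)
        chosen = cpu_has_sse2() ? strlen_sse : strlen_baseline;
    return chosen(s);
}
```

The cost of this approach is one indirect call per use, which matters for a routine as small as strlen on short strings, so it is a trade-off rather than a free win.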

I've been so pleased with the szLength( ) results that I turned it into a killer memchr( ) implementation (faster than anything I had before). Memchr( ) is a much more useful function than strlen( ), and it can have a bigger impact on overall software speed. Should I post this as a new topic, or as further evolution in szLen( )?
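
Codewarp's memchr( ) is not posted here, so the following is only a guess at how the same dword trick usually carries over, written in C with hypothetical names. XORing each dword with the target byte repeated four times turns a matching byte into zero, after which the zero-byte test applies unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load four bytes as a little-endian word. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Sketch of a dword-at-a-time memchr; never reads beyond buf + n. */
void *memchr_dword_sketch(const void *buf, int c, size_t n)
{
    assert(buf != NULL || n == 0);
    const unsigned char *p = (const unsigned char *)buf;
    const unsigned char target = (unsigned char)c;
    const uint32_t pattern = 0x01010101u * target;  /* target in every byte */

    /* Whole dwords: a byte equal to target becomes zero after the XOR. */
    while (n >= 4) {
        uint32_t v = load_le32(p) ^ pattern;
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;              /* a match is somewhere in these four bytes */
        p += 4;
        n -= 4;
    }

    /* Finish byte by byte; this also covers the dword that held the match. */
    for (; n > 0; ++p, --n)
        if (*p == target)
            return (void *)p;
    return NULL;
}
```

Because memchr( ) is given a length, the dword loop can stop four bytes short of the end, so this version has no over-read to worry about.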

Phil · June 21, 2005, 11:07:04 PM

Quote from: Codewarp on June 21, 2005, 10:16:04 PM
I've been so pleased with the szLength( ) results that I turned it into a killer memchr( ) implementation (faster than anything I had before). Memchr( ) is a much more useful function than strlen( ), and it can have a bigger impact on overall software speed. Should I post this as a new topic, or as further evolution in szLen( )?

My vote would be a new topic. That would allow others to pick up the new discussion from the beginning. We already have a great deal of discussion going on here and a lot to be considered.

Quote from: Codewarp on June 21, 2005, 10:10:02 PM
Quote from: ic2 on June 21, 2005, 02:05:01 PM
Here are two blocks of code that deserve serious notice. I hope they can be included in your (today's) advancements on this subject ... I found them through serious searching. Could they be included in your test? We all really need to see the results here.

Thank you

ic2: Interesting algorithm, though it has some shortcomings:

(1) It seems to work only on 7-bit ASCII, not 8-bit.
(2) Its loop uses two jmps instead of one. I believe the first one is unnecessary.
(3) The BSR implementation has been tried and examined thoroughly. It looks so elegant... Too bad the BSR is such a dog; see szLength( ) for a better impl. of this tail-end part of the routine.
(4) No misalignment handling makes this method slow for long misaligned strings.

Thanks to JimG's validation it's clear that FStrLen is the only procedure with the 7-bit ASCII limitation. Also, on the P3 I am using, Jens_fast is quicker than szLength with all alignments. JimG's results show that szLength is quicker on an Athlon. I'm not sure if that is related to the BSR usage or not. Anyway, that's my two cents' worth for the moment.

Codewarp · June 22, 2005, 01:51:36 AM

Phil,

First of all, thank you for your response to all of this, along with everyone else too, of course.

I wanted to point out some things regarding (what I call) the DWORD search method, which is used by all of the faster strlen( ) implementations. Let's look at the logic of it:

[<fix alignment>]  optional misalignment fixup
<locate dword>     find the dword containing a zero
<locate byte>      find the first zero in the dword
<return len>       return the byte address - string base

You will notice that <fix alignment> is optional, but all other steps are mandatory--you cannot omit any to speed it up without breaking it.
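
The four steps can be laid out in C roughly as follows. This is a sketch with hypothetical names, not any of the routines under test. It uses the shift-and-test form of <locate byte> simply because it is the shortest to write, and it assumes that reading to the end of the aligned dword holding the terminator is allowed, as the assembly versions also do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load four bytes as a little-endian word. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

size_t strlen_dword_sketch(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    uint32_t m;

    /* [<fix alignment>] step byte by byte until p is 4-byte aligned */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == 0)
            return (size_t)(p - (const unsigned char *)s);
        ++p;
    }

    /* <locate dword> find the dword containing a zero */
    for (;;) {
        uint32_t v = load_le32(p);
        m = (v - 0x01010101u) & ~v & 0x80808080u;
        if (m != 0)
            break;
        p += 4;
    }

    /* <locate byte> find the first zero in the dword */
    while ((m & 0x80u) == 0) {
        m >>= 8;
        ++p;
    }

    /* <return len> byte address - string base */
    return (size_t)(p - (const unsigned char *)s);
}
```

Dropping the first loop still gives correct lengths on x86; it only means the dword loads may straddle alignment boundaries, which is where the misaligned timings come from.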

Now, the point of all this is that <locate byte> has a variety of implementations, some good, some not so good, but every call passes through it, so clocks saved here speed up every call :thumbu.

There are a number of methods for <locate byte>:

(1) inc, test and jz each byte (3 times)
(2) bsr div 8
(3) inc, shr 8 and jc each byte (3 times)
(4) separate upper/lower, add 1-bit7 to address

Ratch uses (3), szLength uses (4). I use (4) because substituting the other methods in anybody's implementation will increase clock counts (by 2-5), and because it requires fewer jmps. BSR would be perfect, if it were not so poorly ScotchTaped to the CPU as an afterthought :tdown--its performance is an extreme disappointment. BSR seems marginally useful when you have no idea where the bit of interest resides within the dword. If you know more than that, shifts and masks will be faster. Method (1) looks promising, because no shifts are involved, but both (1) and (3) suffer from having so many instructions.

So, for example, you could take Ratch, substitute its <locate byte> method (3) with (4), and voila, you shave 2 or 3 cycles off every call (for faster short strings). This is where my comments to ic2 came from--no method using BSR will ever beat method (4), unless a future CPU changes things.
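
To make the four <locate byte> methods concrete, here they are side by side in C. This is one possible reading of the list, not the actual szLength or Ratch code: the function names are hypothetical, the bit scan is written as a loop from the low end (which is what BSF does in the posted Jens code), and method (4) is interpreted as "choose the lower or upper half, then add 1 minus bit 7".

```c
#include <assert.h>
#include <stdint.h>

/* Bit 7 set in each zero byte of v; the lowest flag is always exact. */
uint32_t zero_flags(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* (1) test each byte of the dword in turn (at most 3 tests) */
unsigned locate_test_each(uint32_t v)
{
    unsigned i = 0;
    while (i < 3 && ((v >> (8 * i)) & 0xFFu) != 0)
        ++i;
    return i;
}

/* (2) bit scan, then divide by 8 (the loop stands in for the instruction) */
unsigned locate_bit_scan(uint32_t m)
{
    unsigned bit = 0;
    while (((m >> bit) & 1u) == 0)
        ++bit;
    return bit / 8;
}

/* (3) shift one byte out at a time and test its flag (at most 3 shifts) */
unsigned locate_shift_each(uint32_t m)
{
    unsigned i = 0;
    while ((m & 0x80u) == 0) {
        m >>= 8;
        ++i;
    }
    return i;
}

/* (4) separate upper/lower half, then add 1 - bit7 */
unsigned locate_halves(uint32_t m)
{
    unsigned i = 0;
    if ((m & 0x8080u) == 0) {   /* no flag in the lower two bytes */
        m >>= 16;
        i = 2;
    }
    return i + 1u - ((m >> 7) & 1u);
}

/* All four must agree for any dword that contains a zero byte. */
unsigned locate_checked(uint32_t v)
{
    uint32_t m = zero_flags(v);
    assert(m != 0);
    unsigned i = locate_halves(m);
    assert(i == locate_test_each(v));
    assert(i == locate_bit_scan(m));
    assert(i == locate_shift_each(m));
    return i;
}
```

Written this way, (4) has a single branch and a fixed amount of work, while (1) and (3) loop up to three times. Whether that wins on a given CPU is exactly what the timings in this thread are trying to establish.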

Phil · June 22, 2005, 02:49:54 AM

Quote from: Codewarp on June 22, 2005, 01:51:36 AM
Ratch uses (3), szLength uses (4). I use (4) because substituting the other methods in anybody's implementation will increase clock counts (by 2-5), and because it requires fewer jmps. BSR would be perfect, if it were not so poorly ScotchTaped to the CPU as an afterthought :tdown--its performance is an extreme disappointment. BSR seems marginally useful when you have no idea where the bit of interest resides within the dword. If you know more than that, shifts and masks will be faster. Method (1) looks promising, because no shifts are involved, but both (1) and (3) suffer from having so many instructions.

So, for example, you could take Ratch, substitute its <locate byte> method (3) with (4), and voila, you shave 2 or 3 cycles off every call (for faster short strings). This is where my comments to ic2 came from--no method using BSR will ever beat method (4), unless a future CPU changes things.

Thank you for your analysis. What you've said makes sense but it doesn't seem to flow with the results I'm seeing on this machine.

Please download the attached zip, browse thru the source to make sure I have incorporated your routine correctly, assemble if you like or run the included exe file and share the results on your machine with us. I bumped LOOP_COUNT back up to 1000000 and ran the test 3 times to make sure my results were consistent. They varied in some cases by 4 or 5 clocks but the trends are quite consistent. Again, for *some reason* Jens_fast is topping szLength in all cases on a 996 MHz P3. I removed the unnecessary jz as you and P1 suggested and it slowed it down considerably for mis-aligned strings. szLength is certainly least affected by the alignments as you can see from these results but all of the other procedures use BSF and Jens_fast is always slightly faster than szLength. The SBB instruction that you use is slower on this machine ... maybe that's the difference?

Code Select

Proc/Byte    0    1    2    3    5    8   13   21   34   55   89  144  233  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Misaligned by 0 bytes:
szLength    19   19   19   19   23   24   30   37   47   81  116  173  261 1026
Ratch       18   25   32   39   29   25   35   51   66   92  105  144  227  871
Jens_fast   12   12   12   12   15   18   19   41   47   63   88  130  196  870
Jens_slow   10   10   10   10   15   17   20   41   52   68   99  146  220  849

Misaligned by 1 bytes:
szLength    24   25   25   30   29   30   33   41   54   88  120  176  265 1033
Ratch       18   25   32   39   29   25   35   72   77   96  124  182  286 1150
Jens_fast   12   12   12   12   15   17   20   63   51   65   99  145  224  978
Jens_slow   10   10   10   10   15   17   20   56   62   75  113  181  283 1146

Misaligned by 2 bytes:
szLength    25   25   30   30   29   31   33   41   58   88  120  177  266 1033
Ratch       18   25   32   39   29   25   35   72   77   96  124  182  371 1150
Jens_fast   12   12   12   12   15   18   19   63   51   65   99  145  310  979
Jens_slow   10   10   10   10   15   17   20   53   62   74  113  181  362 1147

Misaligned by 3 bytes:
szLength    25   30   30   29   30   31   37   45   58   88  125  177  269 1032
Ratch       18   25   32   39   29   25   35   51   73  110  135  191  287 1141
Jens_fast   12   12   12   12   15   18   19   38   53   77  106  155  227 1005
Jens_slow   10   10   10   10   15   17   20   41   57   85  128  192  282 1140

To me, it's not about who's got the fastest procedure or algo here ... it's about understanding what some of the differences in our architectures or CPUs are that cause us to see things that don't fully make any sense until we understand why and what's happenin' :dance:

[attachment deleted by admin]
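
Since the attachment is gone, here is a rough C sketch of the shape such a test takes: a correctness pass over the same string lengths and misalignments, followed by a timing loop. It is an illustration only. The names are hypothetical, the candidate is a plain byte loop, and the portable `clock()` replaces whatever cycle counter the original test used, so its numbers would not be comparable with the tables in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define LOOP_COUNT 1000000L   /* the count mentioned in the post above */

typedef size_t (*strlen_fn)(const char *);

static const size_t test_lengths[] =
    {0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233};
#define N_LENGTHS (sizeof test_lengths / sizeof test_lengths[0])

/* Stand-in candidate: a plain byte loop. */
size_t byte_loop_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}

/* Place len 'x' bytes at a 4-byte-aligned base plus misalign, zero padded. */
static const char *make_string(size_t misalign, size_t len)
{
    static uint32_t storage[64];            /* 256 bytes, 4-byte aligned */
    char *base = (char *)storage;
    assert(misalign + len + 4 <= sizeof storage);
    memset(base, 0, sizeof storage);
    memset(base + misalign, 'x', len);
    return base + misalign;
}

/* Correctness pass: number of (misalignment, length) cases fn gets wrong. */
int count_failures(strlen_fn fn)
{
    int failures = 0;
    for (size_t mis = 0; mis < 4; ++mis)
        for (size_t i = 0; i < N_LENGTHS; ++i)
            if (fn(make_string(mis, test_lengths[i])) != test_lengths[i])
                ++failures;
    return failures;
}

/* Timing pass for one case: CPU seconds spent on LOOP_COUNT calls. */
double time_case(strlen_fn fn, size_t misalign, size_t len)
{
    const char *s = make_string(misalign, len);
    volatile size_t sink = 0;               /* keeps the calls from vanishing */
    clock_t start = clock();
    for (long i = 0; i < LOOP_COUNT; ++i)
        sink += fn(s);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Running the correctness pass before any timing is what exposed the FStrLen problem earlier, so it is worth keeping in front of every timing run.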

Codewarp · June 22, 2005, 04:39:02 AM

Phil --

I think we are talking about different things. At the moment, let me address the p3 issue... I love the p3, it has everything that is necessary, it's fast, and it doesn't heat up the room. But to go ever faster, the silicon guys had to start slanting things. Certain instructions, the basic ones like add, adc, and, or, not, mov, cmp, etc get the serious silicon, while others get the micro-coded put-on. By sticking to the basic set, your code fits into the groove that the CPU has been finely tuned to perform. Add to this, some careful instruction ordering to keep multiple execution units humming, and you have code that executes considerably faster on a contemporary CPU.

The p3 doesn't know how to take advantage of all that. If you want the fastest code on a p3, then--hands down--use a p3-only library and optimize the @#$%@% out of it :bdg! However, my interest is in code that runs the fastest on today's machines, but compatible all the way back to the PII. If that code ran really terrible on a p3 :red, a compromise might be in order--but that doesn't appear to be an issue in this case.

======================
By the way, there is actually another idea for an even faster szLength( ):

- start off with the 7-bit search
- when the "zero" is found, return if it really is a zero
- otherwise continue from there with an 8-bit search to completion

For the vast majority of arguments to strlen( ) which are 7-bit sz, the faster search will suffice. But as soon as bit7=1, it would switch over to 8-bit. The 7-bit search would be unrolled like the 8-bit search, so it would be faster than any of the 7-bit impl we have seen so far.
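
A minimal C sketch of that two-phase idea, under the same little-endian and padded-buffer assumptions as the other routines here. The names are hypothetical and nothing is unrolled; it only shows the control flow of switching from the 7-bit test to the 8-bit test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load four bytes as a little-endian word. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

size_t strlen_hybrid_sketch(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    uint32_t v, m;

    /* Phase 1: cheap 7-bit test; fires on a zero byte or a byte >= 81h. */
    for (;;) {
        v = load_le32(p);
        if ((v - 0x01010101u) & 0x80808080u)
            break;
        p += 4;
    }

    /* Phase 2: re-check the same dword with the exact 8-bit test. If the
       hit was a real zero this loop exits at once; otherwise it carries
       on with the 8-bit test for the rest of the string. */
    for (;;) {
        m = (v - 0x01010101u) & ~v & 0x80808080u;
        if (m != 0)
            break;
        p += 4;
        v = load_le32(p);
    }

    /* Locate the first zero byte inside the dword. */
    while ((m & 0x80u) == 0) {
        m >>= 8;
        ++p;
    }
    return (size_t)(p - (const unsigned char *)s);
}
```

For pure 7-bit text, phase 2 runs exactly once, so the extra cost over a 7-bit-only routine is a single additional test at the end.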

Phil · June 22, 2005, 05:51:24 AM

Codewarp: Thanks for the 7-bit to 8-bit suggestion. I've been considering ways to fix FStrLen so it can handle 8-bit ASCII.

I've also found this All About Strings link that was written by tenkey, roticv, and others. It also contains many algorithms that aren't in our tests yet.

To make sure we are talking about the same thing, can you download the test suite and post the results on your machine? You said earlier that it's not good to use BSR because it's slow but the routines that are using it in this test suite on the P3 I am using appear to be faster than the one that doesn't. I understand what you are saying about many non-crucial instructions being relegated to microcode and that can, in some instances, slow them down considerably. However, the bit instructions are crucial to many operating systems and the trace cache might just help make it fast enough in short loops like this that it might be okay to use. I'm just looking for results that confirm much of what you are saying. It seems that you are quite happy with szLength as it is and it is faster on the Athlon XP 3000+. I don't recall seeing any results for these recent tests from a PIV yet and I'm curious to know what the results would be. In trying to determine where the differences are I'm guessing that the SBB might be slowing your routine down on my machine ... but then, I think it is also slow on the PIV.

I'm going to play with a new test that incorporates some of the procedures described in the previous link and see if I can fix the FStrLen procedure so that it handles 8-bit ASCII. For me, this is all about learning more about the various architectures, limitations, and advantages and certainly what you have said has been quite helpful. Thanks again.

It's okay if you are using Linux and can't run the tests. It's, obviously, okay too if you just don't have the time or if you just don't want to. I had offered earlier to produce the results of your benchmark on this machine if you could zip it up and post it but I obviously can't do that if it's not Windows or DOS. It just helps to know some of the story behind the story sometimes. I am reading what you are saying, understanding, and learning as much as I can ... but without an apples to apples comparison of the same procedures in different orchards (various machine architectures) our words are just that. Food for thought.

I would also like to see a new thread for your memchr algorithm as well. I'm sure others would also be interested.

Codewarp · June 22, 2005, 08:30:38 AM

Phil --

What's happening is this: I thought your tests were not valid because Jens_fast and Jens_slow were both 7-bit routines. You were pitting 8-bit strlen( ) calls (i.e. szLength( ) and ratch( )) against 7-bit routines, then declaring the 7-bit routines the fastest--that's utter nonsense, I thought. But I had misread Jens as 7-bit when it is actually 8-bit, which created the confusion in my mind--my apologies, Phil. The only difference between the szLength loop and Jens (now) is that szLength uses NOT EDX, and Jens uses XOR EDX, ECX, for the same effect. The NOT is necessary in later processors to avoid a register dependency and subsequent slowdown.
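
The "same effect" can be checked with a few lines of C. The sketch assumes the szLength loop uses the usual NOT form of the test, since its source is not shown in this part of the thread, and it takes the XOR form from the Jens code as posted earlier. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask as the posted Jens code builds it: XOR, then AND with d. */
uint32_t mask_via_xor(uint32_t v)
{
    uint32_t d = v - 0x01010101u;
    return (v ^ d) & 0x80808080u & d;
}

/* Mask in the NOT form: bit 7 clear before the subtraction, set after. */
uint32_t mask_via_not(uint32_t v)
{
    uint32_t d = v - 0x01010101u;
    return ~v & d & 0x80808080u;
}

/* Bit by bit, (v ^ d) & d keeps a bit only where d is 1 and v is 0,
   which is exactly ~v & d, so the masks match for every input. */
void check_masks_match(void)
{
    for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 65521u)
        assert(mask_via_xor((uint32_t)x) == mask_via_not((uint32_t)x));
}
```

The results being identical, any speed difference between the two forms has to come from how the instructions schedule on a particular CPU, not from what they compute.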

Further, don't get hung up on one SBB instruction at the very end--the loop is where all the action is. BSR remains a poor choice, and you could speed up Jens a tiny amount using the byte locator from my code.

Phil · June 22, 2005, 08:42:54 AM

Codewarp: I certainly hope that you are not raving mad! Both Jens_fast and Jens_slow handle 8-bit extended ASCII. JimG put in the validation routine before the timing tests and added the 999 byte string with 8-bit ASCII. I removed the 7-bit FStrLen test.

It's okay, Bud. You can be right and have your cake too. I understand.
