szLen optimize...

roticv · June 26, 2005, 04:15:07 PM

Btw, even OllyDbg 1.10 did not recognise the SSE2 instruction and decoded it wrongly. It is sad that such things happen. Do remind me to use only up to SSE instructions next time :P

Well, it might be good or bad depending on how you look at it. It is not nice for someone to run a program and get an unknown-opcode error just because his/her processor does not support the instruction. Most probably he/she will not know what happened.

I think the programmer has to be proactive in ensuring that his target users have the instruction set before the code runs. I think I am a lousy programmer :toothy Haven't been coding in asm for quite some time. Coding mainly in C, solving programming questions.

Codewarp · June 26, 2005, 08:04:46 PM

Roticv--

While we have you in this vulnerable, contrite state, let me suggest another little change to your code, to make it faster on really short strings...

Code Select


roticv2 proc lpstring:dword
    ;int 3
        mov      eax, [esp+4]        ; removed your 1st test for zero, it's coming up soon most of the time anyway
        test     eax, 15             ; removed unnecessary code
        jz       aligned
    @@:
        cmp      byte ptr [eax], 0   ; inc eax afterward so it's ready now
        jz       done
        add      eax, 1
        test     eax, 15             ; simplified...
        jnz      @B
    aligned:
        pxor     xmm1, xmm1
        align 16
    @@:
        movdqa   xmm0, [eax]
        pcmpeqb  xmm0, xmm1
        add      eax, 16
        pmovmskb ecx, xmm0
        test     ecx, ecx
        jz       @B
        bsf      ecx, ecx            ; nice use of bsf !
        lea      eax, [eax+ecx-16]
    done:
        sub      eax, [esp+4]
        retn     4
roticv2 endp
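
For anyone who does not read MASM, here is a rough C rendering of the structure of the routine above: scan byte by byte until the pointer is 16-byte aligned, then look at one aligned 16-byte block per iteration. It is only a sketch. The name strlen_aligned16 is made up for illustration, and the inner loop is a plain-C stand-in for the movdqa/pcmpeqb/pmovmskb/bsf sequence, so it shows the control flow and not the speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; mirrors the control flow of the routine, not its speed. */
size_t strlen_aligned16(const char *s)
{
    const char *p = s;

    /* Prologue: byte by byte until p is 16-byte aligned (the "test eax, 15" loop). */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == 0)
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: one aligned 16-byte block per iteration. */
    for (;;) {
        /* Plain-C stand-in for movdqa + pcmpeqb + pmovmskb + bsf. */
        for (int i = 0; i < 16; i++) {
            if (p[i] == 0)
                return (size_t)(p + i - s);
        }
        p += 16;
    }
}

/* Compare against the C library at several starting misalignments. */
void strlen_aligned16_check(void)
{
    char buf[80];
    memset(buf, 'x', sizeof buf);
    buf[79] = 0;
    for (int off = 0; off < 20; off++)
        assert(strlen_aligned16(buf + off) == strlen(buf + off));
}
```

Unlike the SSE2 version, this stand-in stops at the terminator and never reads past it, so it needs no headroom after the string.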

Jimg · June 27, 2005, 12:01:57 AM

Codewarp-

Unless this was a joke and I just didn't get it, the code you just posted doesn't give the correct answers. I changed its name to roticv3 to avoid a conflict with roticv2, which I'm still looking at. The results:

Code Select

Test routines for correctness:
0 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticv3      0    1    2    3    5   17   22   22   39   55   98  144 1255  999
1 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticv3      0    1    2    3    5    8   13   22   48   64   98  144  239 1008
2 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticv3      0    1    2    3    5    8   13   31   48   64   98  144  239 1008
3 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticv3      0    1    2    3    5    8   13   31   48   64   98  144  239 1008

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
szLength     9    8    9    9   12   16   17   24   34   48   89  132  204  783
roticv3     26   26   27   26   26   28   28   29   30   35   47   74  370  301

1 byte misalignment
szLength    13   14   14   15   15   19   21   26   40   52   91  134  207  785
roticv3      7   11   16   21   32   46   84   78   87   88   95  112  153  365

2 byte misalignment
szLength    14   14   16   16   15   19   20   29   40   52   91  136  207  784
roticv3      6   11   16   20   31   46   84   78   82   85   95  111  153  364

3 byte misalignment
szLength    14   16   16   16   19   20   24   30   42   51   94  135  209  790
roticv3      7   11   16   21   31   46   72   75   78   90   93  106  150  360

Press enter to exit...

Codewarp · June 27, 2005, 05:03:59 AM

Jimg--

No joke, but neither did I attempt to fix the SSE2 issue (the movdqa instruction). I just tightened up the initial byte scan. Otherwise I don't see any errors in the code. I would never knowingly post bad code, but I might unknowingly do it... Are you using an SSE2-capable machine?

Jimg · June 27, 2005, 01:53:22 PM

Duh... Now I get it. Sorry. Even though Intel has 90% of the market to AMD's 10% or so, and it's interesting to test this stuff out here in the laboratory, I wouldn't think SSE2 would be a good choice for a general purpose routine just yet.

Jimg · June 27, 2005, 03:52:23 PM

I modified Jens_mmx a little, and am getting some incredible times on the longer strings. I've looked and can't find out how it's cheating on the rest of the routines. Is there something going on here I don't see?

Code Select

Test routines for correctness:
0 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999
1 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999
2 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999
3 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
szLength     8    8   10   10   12   15   17   24   34   63   89  131  203  778
Ratch        8   11   11   13   14   14   20   30   64   77  100  143  229  853
Jens_fast   20   20   20   20   21   26   29   36   57   69   99  145  219  923
roticvSSE    4   28   28   29   28   32   32   35   39   52   80  111  166  563
lszLenSSE   25   25   25   25   25   28   28   32   39   47   83  122  167  587
Jens_mmx2    7   33   31   30   37   43   47   55   79   46   92   60  123  286

1 byte misalignment
szLength    13   14   14   16   15   19   20   26   56   67   91  134  206  784
Ratch       19   11   11   15   18   17   23   31   69   85  111  156  255  955
Jens_fast   20   20   23   20   24   28   33   41   61   76  105  158  239 1003
roticvSSE    3    7   10   12   18   58   58   51   94   71   94  125  175  583
lszLenSSE   28   28   26   28   28   40   29   34   40   56   92  126  179  625
Jens_mmx2    6    8   12   17   25   73   59   65   85   62   96  122  143  294

2 byte misalignment
szLength    14   14   16   16   15   20   20   31   40   51   92  136  210  786
Ratch        8    2   12   15   18   17   23   32   69   85  110  154  253  953
Jens_fast   20   20   20   20   24   28   32   40   63   75  105  154  237  999
roticvSSE    6    7   12   12   19   57   52   51   59   67   90  120  194  582
lszLenSSE   28   28   28   28   28   30   29   32   41   55   91  123  176  625
Jens_mmx2    8    9   12   16   26   56   54   68   86   61   95  120  140  292

3 byte misalignment
szLength    14   16   16   15   19   20   24   29   40   50   93  135  208  786
Ratch        7   10   24   15   18   16   26   32   68   87  109  157  253  951
Jens_fast   20   20   20   20   24   28   32   40   62   76  104  156  238  998
roticvSSE    4    7   10   12   18   45   46   49   54   67   89  124  171  582
lszLenSSE   28   28   28   28   28   31   30   35   43   56   89  123  173  635
Jens_mmx2    6    8   12   13   26   53   62   71   86   63  113  125  143  294

Press enter to exit...

Of course, Jens_mmx is no good for a general purpose routine as it needs 64 bytes past the possible end of the test string, but if you are writing the program and can assure enough headroom, it's fast. For general purpose, I'll stick with szLength.

[attachment deleted by admin]

Mark Jones · June 27, 2005, 03:59:12 PM

Nice, I get similar results with AMD XP 2500+

roticv · June 27, 2005, 04:19:05 PM

Quote from: Jimg on June 27, 2005, 03:52:23 PM
I modified Jens_mmx a little, and am getting some incredible times on the longer strings. I've looked and can't find out how it's cheating on the rest of the routines. Is there something going on here I don't see?

Of course, Jens_mmx is no good for a general purpose routine as it needs 64 bytes past the possible end of the test string, but if you are writing the program and can assure enough headroom, it's fast. For general purpose, I'll stick with szLength.

There are a couple of reasons why it is possible to achieve such good speed.
1) Alignment. It ensures that strings are aligned to 8 bytes before getting into the main loop that scans using mmx registers. (Maybe we can make use of Codewarp's improvements to speed it up)
2) Unrolling of loops. It speeds up the routine because the loop is fully unrolled and still fits in the L1 code cache.
3) lea is not used in the main loop. It appears only in the second loop, which determines where the null terminator is. Maybe this could be improved by using pmovmskb.
4) Grouped read/compares and ors. (Rule no 2 of the advanced part of optimisation in Mark Larson's tips)
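
Points 2 to 4 can be sketched in plain C. This is not Jens_mmx itself, only an illustration of the idea under stated assumptions: the name strlen_grouped is hypothetical, 64-bit integer words stand in for the mmx registers, and the caller must guarantee about 32 readable bytes past the terminator (the same kind of headroom requirement as Jens_mmx). The main loop reads a group of words, ORs the "has a zero byte" results together and branches once per group; a second loop then locates the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one byte of x is zero (word-at-a-time test). */
static uint64_t has_zero_byte(uint64_t x)
{
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* Hypothetical helper. ASSUMPTION: at least 32 bytes after the
   terminator are readable, because whole 32-byte groups are loaded. */
size_t strlen_grouped(const char *buf)
{
    const char *p = buf;

    for (;;) {
        uint64_t w[4];
        memcpy(w, p, sizeof w);                     /* grouped reads: 32 bytes */
        uint64_t any = has_zero_byte(w[0]) | has_zero_byte(w[1])
                     | has_zero_byte(w[2]) | has_zero_byte(w[3]);  /* grouped ORs */
        if (any)                                    /* one branch per 32 bytes */
            break;
        p += 32;
    }

    while (*p)                                      /* second loop: find the exact byte */
        p++;
    return (size_t)(p - buf);
}

/* Small check with plenty of headroom after the terminator. */
void strlen_grouped_check(void)
{
    char buf[128];
    memset(buf, 'a', sizeof buf);
    buf[50] = 0;
    assert(strlen_grouped(buf) == 50);
}
```

The sketch skips the alignment step from point 1 (memcpy tolerates unaligned addresses), so it shows only why one test per group is cheaper than one test per word.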

PS: I don't think MichaelW's timing macros are as stable as I want them to be. Oh well.

Codewarp · June 28, 2005, 12:27:33 AM

All those reasons are ok, but the big one is this--it's damn hard and awkward to find a single byte at any alignment in any dword using the normal cpu instructions. But mmx is designed to operate on bigger chunks, and it gets right down to it. You can unroll, align, lea or not lea, and reorder instructions all you want (I did), but the tripling in speed is a different animal. It gets better with each extension (mmx --> sse --> sse2 --> sse3...), but most of the improvement can be implemented with just mmx. PMOVMSKB is a very useful instruction here, but it is SSE, not MMX. SSE requires a P3 or later, and PIIs are still around.

Also, the scan overshoot is not a problem for a nondestructive operation like strlen( ), as long as it doesn't go off the end of the 4k page. That problem is easily remedied by processing 32-byte chunks, with 32-byte alignment--goodbye page faults...
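
The page argument is just address arithmetic, and a few lines of C can check it. This is a sketch under the assumptions stated in the post (4 KB pages, 32-byte chunks with 32-byte alignment); the function names are made up for illustration. An aligned chunk never straddles a page boundary because the page size is a multiple of the chunk size, whereas an unaligned 32-byte read can.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumption: 4 KB pages, as in the post */
#define CHUNK     32u     /* chunk size and alignment suggested in the post */

/* Returns 1 if the aligned CHUNK-byte block containing addr lies in one page. */
int chunk_stays_in_page(uintptr_t addr)
{
    uintptr_t first = addr & ~(uintptr_t)(CHUNK - 1);  /* round down to chunk start */
    uintptr_t last  = first + CHUNK - 1;               /* last byte the load touches */
    return first / PAGE_SIZE == last / PAGE_SIZE;
}

/* For contrast: an unaligned CHUNK-byte read that starts exactly at addr. */
int unaligned_read_stays_in_page(uintptr_t addr)
{
    return addr / PAGE_SIZE == (addr + CHUNK - 1) / PAGE_SIZE;
}

/* Every address in the first three pages passes the aligned test. */
void page_check(void)
{
    for (uintptr_t a = 0; a < 3 * PAGE_SIZE; a++)
        assert(chunk_stays_in_page(a));
    assert(!unaligned_read_stays_in_page(PAGE_SIZE - 1));
}
```

So as long as the string's first byte is in a mapped page, every aligned chunk the scan touches before the terminator is in a page that also holds string bytes.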

Now, it seems to me that mmx is so standard that it could be used for "everyday" use without checking every time. However, its host library's start-up code should still abort if no mmx support exists. Can we consider the P1 and PPro dead, or are there other non-mmx pentiums out there?

roticv · June 28, 2005, 02:04:04 PM

Here's my 2 cents.

It is not right to take things for granted. It is better to first check whether CPUID exists by testing the ID bit in EFLAGS, then call CPUID. After that, set a flag for MMX/SSE/SSE2/SSE3 and from then on just compare against the flag. We only need to figure out once whether the processor supports a certain extension; then we can proceed to use the correct instruction set.

There's a reason why the MMX/SSE/SSE2/SSE3 instruction sets were invented :toothy

Let's declare Jens's MMX variant of strlen the winner.

Codewarp · June 28, 2005, 05:56:59 PM

Quote from: roticv on June 28, 2005, 02:04:04 PM

It is not right to take things for granted. It is better to first check whether CPUID exists by testing the ID bit in EFLAGS, then call CPUID. After that, set a flag for MMX/SSE/SSE2/SSE3 and from then on just compare against the flag. We only need to figure out once whether the processor supports a certain extension; then we can proceed to use the correct instruction set.

I tend to agree, however, anything that destroys performance on string lengths of a few bytes is dead on arrival. In "real" applications, the bulk of the clock cycles spent in strlen( ) is usually on the short strings--not on 1000 bytes+ strings. Therefore, no cpuid and no eflags is ok with me. Would you be doing all this real-time conditional coding in the memchr( ) and in the memmove( ), etc...? No, this has to be performed at application start-up time, not inside these low level routines. That way, the decision overhead is reduced to a single memory test instruction.
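
A minimal C sketch of the "decide once at start-up" idea follows. Everything named here is hypothetical: cpu_has_simd() is a placeholder that does not actually query CPUID, strlen_simd_placeholder stands in for an MMX/SSE routine, and the sketch uses a function pointer where the post talks about testing a flag in memory. The point is only that the low-level routine itself contains no feature detection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

/* Baseline routine that works on any CPU. */
static size_t strlen_plain(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

/* Stand-in for an MMX/SSE routine; here it simply defers to the C library. */
static size_t strlen_simd_placeholder(const char *s)
{
    return strlen(s);
}

/* Hypothetical detection hook. Real code would test the EFLAGS ID bit and
   run CPUID here; this placeholder always reports "not supported". */
static int cpu_has_simd(void)
{
    return 0;
}

static strlen_fn best_strlen = strlen_plain;

/* Call once at application start-up. */
void init_string_routines(void)
{
    best_strlen = cpu_has_simd() ? strlen_simd_placeholder : strlen_plain;
}

/* The hot path: no cpuid, no eflags, just one indirect call. */
size_t my_strlen(const char *s)
{
    return best_strlen(s);
}

void dispatch_check(void)
{
    init_string_routines();
    assert(my_strlen("abc") == 3);
}
```

Whether an indirect call or a flag test plus branch is cheaper on very short strings would have to be measured; the sketch does not settle that.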

Vortex · June 28, 2005, 06:36:53 PM

Tested on a P4 2.66 GHz

Code Select


Test routines for correctness:
0 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999
1 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999
2 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999
3 byte misalignment
szLength     0    1    2    3    5    8   13   22   39   55   89  144  239  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  239  999
roticvSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  239  999
Jens_mmx2    0    1    2    3    5    8   13   22   39   55   89  144  239  999

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
szLength    12    3    2    3   17   12   15   17   48   55   92  138  225  777
Ratch       -1    1   35    8    9    8   15   21   50   58   97  134  231  881
Jens_fast   12    0    0    0   16    6   36   41   68   74   92  130  200  930
roticvSSE    8   20   17   15   26   19   20   23   40   68   57  132  184  623
lszLenSSE   53   12   11   11   22   16   15   19   25   51   53  129  183  609
Jens_mmx2    8   33   23   23   38   33   33   39   62   39   72   65  132  416

1 byte misalignment
szLength    10   20   10   13    9   26   20   21   40   60  123  139  232  783
Ratch        1   12    6    7    7   16   12   21   42   69  125  185  356 1132
Jens_fast    3   10    1   -1    4   16   35   40   49  109  163  162  243 1087
roticvSSE   -2   12    7    7   13   53   40   45   59   91  100  149  206  664
lszLenSSE   11   22   12   12   11   26   15   19   25  113  120  158  214  729
Jens_mmx2   -3   12    3    9   17   70   73   70   85   78  102  122  140  445

2 byte misalignment
szLength     9   10   24   10   11   14   28   28   38   49   97  137  210  838
Ratch        1    1    5    8    7    5   39   20   54   58  124  185  296 1107
Jens_fast    3    0   10    2    3    5   79   43   57   60  115  168  278 1100
roticvSSE   -3    1   14   10   13   33   35   41   47   52   65  106  200  625
lszLenSSE   13   11   22   13   11   15   60   21   24   37   84  159  224  726
Jens_mmx2   -3    1   14   11   17   94   62   69  113   62  102  114  136  430

3 byte misalignment
szLength    11   16    9   20   16   11   21   38   66   82  122  137  209  788
Ratch        1    1    5    8    7    5   15   30   48   59  124  211  332 1073
Jens_fast    3   -1   -1   10    4    5   36   42   86   92  114  180  262 1080
roticvSSE   -2    1    5   11    9   59   37   70   79   85  103  145  201  631
lszLenSSE   15   14   11   22   13   15   15   30   39  114   73  184  241  746
Jens_mmx2   -2    1    3   20   18   48   85   72   99   60   95  118  164  424

Phil · June 28, 2005, 06:46:52 PM

Tested on 996 MHz P3 taken from timelen5.zip above. Only the timings are included here. I visually verified the correctness section and excluded it from the results.

Code Select

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
szLength    19   19   19   19   23   25   30   37   55   79  116  175  263 1023
Ratch       18   25   32   39   29   25   35   58   78   92  105  144  245  872
Jens_fast   12   12   12   12   15   18   19   38   48   63   88  130  199  871
roticvSSE    4   18   18   18   18   20   20   23   28   48   59   78  104  341
lszLenSSE   16   16   16   16   16   19   19   22   29   48   63   86  116  402
Jens_mmx2    4   31   30   31   35   38   44   63   73   45   85   60  121  287

1 byte misalignment
szLength    24   26   25   28   28   30   34   41   72   87  121  177  274 1033
Ratch       18   25   32   39   29   25   35   58   86  110  124  191  326 1149
Jens_fast   12   12   12   12   15   18   19   38   55   77   99  158  242  978
roticvSSE    4   10   13   16   22   48   49   50   59   78   93  108  143  387
lszLenSSE   16   16   16   16   16   19   19   22   32   60   92  117  185  650
Jens_mmx2    4    9   12   15   21   62   67   70  104   71  117  133  152  332

2 byte misalignment
szLength    25   25   28   28   28   31   35   45   72   87  119  177  274 1033
Ratch       18   25   32   39   29   25   47   81   85   97  135  183  303 1141
Jens_fast   12   12   12   12   15   18   30   62   57   66  107  148  231 1008
roticvSSE    4   10   13   16   22   46   46   50   57   75   87  106  133  374
lszLenSSE   16   16   16   15   17   26   26   40   36   54   82  114  171  630
Jens_mmx2    4    9   12   15   21   59   66   74  103   72  116  139  149  324

3 byte misalignment
szLength    25   28   28   28   30   31   39   45   88   89  128  178  272 1037
Ratch       18   25   32   39   29   25   35   58   86  110  125  192  327 1153
Jens_fast   12   12   12   12   15   18   19   42   55   77   99  158  243  982
roticvSSE    4   10   13   16   22   32   35   37   43   63   83   95  123  366
lszLenSSE   16   16   16   16   16   19   19   22   32   60   93  118  184  651
Jens_mmx2    4    9   12   15   21   45   56   59   88   55  101  115  136  305

roticv · July 15, 2005, 12:29:18 PM

Sorry to wake this dead thread but I just found another interesting strlen routine by r22.

Code Select

align 16
strLenAlign16SSE:
        mov ecx,[esp+4]
        movdqa xmm2,dqword[filled]
        lea eax,[ecx+16]
        movdqa xmm0,dqword[ecx]
    .lp:
        movdqa xmm1,xmm0
        pxor xmm0,xmm2     ;xor -1
        paddb xmm1,xmm2    ;sub 1
        movdqa xmm3,[eax]  ;used for unroll
        pand xmm0,xmm1
        pmovmskb edx,xmm0
        add eax,16
        test dx,-1 ;1111 1111 1111 1111b
        jnz .unrol
        movdqa xmm1,xmm3
        pxor xmm3,xmm2     ;xor -1
        paddb xmm1,xmm2    ;sub 1
        pand xmm3,xmm1
        movdqa xmm0,[eax]  ;back to first roll
        pmovmskb edx,xmm3
        add eax,16
        test dx,-1 ;1111 1111 1111 1111b
        jz .lp
     .unrol:
        add ecx,32
        sub eax,ecx
        xor ecx,ecx
        sub ecx,edx
        and edx,ecx
        CVTSI2SD xmm0,edx
        PEXTRW edx,xmm0,3
        shr dx,4
        add dx,0fc01h
        ;          bsf edx,edx replaced by crazy SSE version
        add eax,edx
        ret 4
align 16
filled dq 0FFFFFFFFFFFFFFFFh,0FFFFFFFFFFFFFFFFh
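
The "crazy SSE version" of bsf at the end of the routine works like this: isolate the lowest set bit of the mask (and edx, -edx), convert that power of two to a double (CVTSI2SD), then read the exponent field back out (PEXTRW of the top word, shr 4, add 0fc01h, which is -1023 in 16 bits). Here is a small C sketch of the same trick, assuming IEEE 754 doubles; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Index of the lowest set bit of a nonzero 16-bit mask, without bsf. */
unsigned lowest_bit_index(uint32_t mask)
{
    uint32_t lowest = mask & (0u - mask);   /* isolate lowest set bit: x & -x */
    double d = (double)lowest;              /* exact: the value is 2^i */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);         /* reinterpret the IEEE 754 encoding */
    return (unsigned)(bits >> 52) - 1023u;  /* biased exponent minus the bias = i */
}

/* Every single-bit mask must map back to its own bit position. */
void lowest_bit_index_check(void)
{
    for (unsigned i = 0; i < 16; i++)
        assert(lowest_bit_index(1u << i) == i);
}
```

The mask must be nonzero, as it is in the routine above, where this code is reached only after a zero byte has been found.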

lingo · August 04, 2005, 03:45:32 PM

Victor,
"Sorry to wake this dead thread but I just found another interesting strlen routine by r22."

I'm wondering what is so interesting... :bg
A lot of code in the main loop and a slow replacement for the bsf... :'(
It is not a big deal to create something faster with 128-bit instructions.

Here is the proof tested on my P4 3.6 GHz:

Code Select



Proc/Byte  0   1   2   3   5   8  13  22  39  55  89 144 239  999
0 byte misalignment
szLength  16  15  14  15  20  22  24  32  56  79 144 245 250  861
Ratch     12  17  20  22  19  21  25  35  59  80 135 180 268 1046
Jens_fast 18  20  20  19  21  26  35  73  84  99 127 168 243 1097
roticvSSE  9  32  31  32  31  36  36  42  55  68 111 235 325  919
lszLenSSE 30  28  28  30  30  32  32  37  51  65 101 193 275  978
Jens_mmx2  9  48  45  45  50  51  54  62  79  64  98  90 152  460
slenLingo 14  14  15  15  15  15  15  22  27  36  48  72  96  405

1 byte misalignment
szLength  20  20  18  25  23  26  29  35  51  63 210 178 262  869
Ratch     13  15  18  20  19  19  25  37  59  91 171 257 420 1666
Jens_fast 19  19  19  18  20  24  29  72  85 131 155 207 322 1450
roticvSSE  7  13  16  19  24  57  57  62  79  94 122 169 319  967
lszLenSSE 29  29  29  29  28  32  32  42  74 113 116 230 329 1095
Jens_mmx2  9  13  17  21  30  76  82  88 113  97 134 147 179  564
slenLingo 15  15  15  15  15  15  15  22  27  37  49  71 101  407

2 byte misalignment
szLength  21  18  22  21  21  27  26  37  51  63 131 176 255  965
Ratch     13  17  19  22  21  22  36  34  59  79 166 246 422 1491
Jens_fast 19  23  18  20  22  23  88 101 119 101 156 211 330 1355
roticvSSE  9  13  16  18  24  56  55  63  76  92 169 260 299  939
lszLenSSE 29  29  29  28  28  33  45  38  56  69 115 221 330 1195
Jens_mmx2  9  14  17  21  32  77  77  88 118  89 120 151 183  502
slenLingo 15  15  15  15  14  15  15  22  28  37  49  71 102  407

3 byte misalignment
szLength  20  24  23  23  26  25  32  38  92 105 135 176 253  880
Ratch     13  18  20  22  21  20  24  34  72  95 169 318 410 1434
Jens_fast 20  21  19  20  22  24  27  74 119 135 159 237 331 1474
roticvSSE  8  14  16  18  24  51  56  60  72  86 114 160 296 1090
lszLenSSE 29  30  28  29  28  32  31  38  59  76 119 247 343 1124
Jens_mmx2  8  11  17  21  30  66  81 141 192  85 119 136 170  484
slenLingo 14  14  14  15  15  14  15  22  28  36  48  72 103  406

Press enter to exit...

Regards,
Lingo

[attachment deleted by admin]
