News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Faster alternative to .While ... .Endw

Started by jj2007, December 27, 2009, 09:32:10 AM

Previous topic - Next topic

jj2007

Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box for a Christmas gift(hp Pavillion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.
rags,
my condolencies :thumbu
Just checked with my latest toy, an Olidata JumPC that cost me the horrendous sum of 99€ (142US$), Win XP included. It does the test in 3 cycles, claiming it has a Celeron M installed. Well...

2-Bit Chip

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
43      cycles for LoopDecAl
45      cycles for LoopDec
47      cycles for LoopWhile
43      cycles for LoopJmpAl
50      cycles for LoopJmp

31      cycles for LoopDecAl
41      cycles for LoopDec
34      cycles for LoopWhile
30      cycles for LoopJmpAl
29      cycles for LoopJmp

26      cycles for LoopDecAl
32      cycles for LoopDec
29      cycles for LoopWhile
24      cycles for LoopJmpAl
33      cycles for LoopJmp

17      cycles for LoopDecAl
20      cycles for LoopDec
19      cycles for LoopWhile
17      cycles for LoopJmpAl
19      cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---


MichaelW

P3:

☺☺☻♥ (SSE1)
15      cycles for LoopDecAl
15      cycles for LoopDecZx
16      cycles for LoopDec
21      cycles for LoopWhile
21      cycles for LoopJmpAl
21      cycles for LoopJmp

10      cycles for LoopDecAl
10      cycles for LoopDecZx
11      cycles for LoopDec
14      cycles for LoopWhile
14      cycles for LoopJmpAl
14      cycles for LoopJmp

10      cycles for LoopDecAl
9       cycles for LoopDecZx
11      cycles for LoopDec
11      cycles for LoopWhile
11      cycles for LoopJmpAl
11      cycles for LoopJmp

5       cycles for LoopDecAl
5       cycles for LoopDecZx
5       cycles for LoopDec
6       cycles for LoopWhile
6       cycles for LoopJmpAl
6       cycles for LoopJmp


☺☺☻♥ (SSE1)
21      cycles for LoopJmpAlInc
17      cycles for LoopJmpAlAdd
17      cycles for LoopJmpAlSub
16      cycles for LoopJmpZxInc
14      cycles for LoopJmpZxAdd
14      cycles for LoopJmpZxSub

14      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
12      cycles for LoopJmpAlSub
11      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

11      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
10      cycles for LoopJmpAlSub
8       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

6       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
4       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub


eschew obfuscation

jj2007

#18
As far as I can see, this is the winning algo, for Core & Celeron & P3 & P4:

SkipLeadingWhiteSpace proc ; pSrc$:DWORD
mov edx, [esp+4] ; get source string
dec edx
@@: inc edx
movzx eax, byte ptr [edx]
cmp al, "0"
je @B
cmp al, 0
je @F
cmp al, 32
jle @B
@@:
  ret 4 ; edx points to first non-"0" and non-white space char
SkipLeadingWhiteSpace endp


EDIT: Added check for zero byte - thanks Sinsi :U

dedndave


jj2007

Quote from: dedndave on December 27, 2009, 10:33:05 PM
a close race, eh Jochen ?
Very close indeed. On the other hand, it is that kind of loop that is typically run not even once, so why waste a single cycle on it? What remains useful from this exercise might be that a dec ptr/inc ptr combination is slightly more efficient than the jmp generated by .While - which confirmed my aversion against .While loops. In contrast, .Repeat ... .Until can't be beaten by a hand-coded loop.

sinsi

It doesn't seem to work too well for a string like "  0" though...no check for a terminating 00.
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on December 28, 2009, 12:26:56 AM
It doesn't seem to work too well for a string like "  0" though...no check for a terminating 00.
Assuming a string ends with a nullbyte, it would stop right there. By design :bg

dedndave

nah - it kinda keeps going, Jochen - lol
but that isn't a requirement for the tests - the tests showed us what we wanted to know

jj2007

Quote from: dedndave on December 28, 2009, 10:01:05 AM
nah - it kinda keeps going, Jochen - lol
So Sinsi was right :red
Corrected above. Surprisingly enough, it still runs the test in three cycles for db "This is a string", 0

dedndave

that doesn't sound right - lol
figure at LEAST one clock cycle per byte   :bg

EDIT - oh - lol
there are no bytes striped in that example   :bg

dedndave

i might be inclined to make the tests in this order:

SkipLeadingWhiteSpace proc          ;pSrc$:DWORD

        mov     edx,[esp+4]         ;get source string
        dec     edx

@@:     inc     edx
        movzx   eax,byte ptr [edx]
        or      al,al
        jz      @F

        cmp     al, 32
        jle     @B

        cmp     al, "0"
        je      @B

@@:     ret     4                   ;edx points to first non-"0" and non-white space char

SkipLeadingWhiteSpace endp

test for the most likely first, when practical
we have to test for null first so that it is culled out before the white space test
this order assumes white space is more likely than "0" - that may not be the case
if leading "0" is more likely, test for that before null (like you have it)

jj2007

Quote from: dedndave on December 28, 2009, 10:52:14 AM
if leading "0" is more likely, test for that before null (like you have it)
Excerpt from Windows.inc. The assumption is you used Instr for "equ", and added 4, so the string starts with "0":

ENM_SCROLLEVENTS                 equ 00000008h
ENM_DRAGDROPDONE                 equ 00000010h
ENM_PARAGRAPHEXPANDED            equ 00000020h
ENM_PAGECHANGE                   equ 00000040h
ENM_LANGCHANGE                   equ 01000000h
ENM_OBJECTPOSITIONS              equ 02000000h
ENM_LINK                         equ 04000000h
ENM_LOWFIRTF                     equ 08000000h
ES_NOOLEDRAGDROP                 equ 00000008h

dacid

First program:


AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3       cycles for LoopDecAl
5       cycles for LoopDec
14      cycles for LoopWhile
4       cycles for LoopJmpAl
16      cycles for LoopJmp

20      cycles for LoopDecAl
9       cycles for LoopDec
10      cycles for LoopWhile
10      cycles for LoopJmpAl
-2      cycles for LoopJmp

8       cycles for LoopDecAl
-3      cycles for LoopDec
18      cycles for LoopWhile
18      cycles for LoopJmpAl
8       cycles for LoopJmp

-5      cycles for LoopDecAl
16      cycles for LoopDec
-7      cycles for LoopWhile
4       cycles for LoopJmpAl
25      cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---


Second:


AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3       cycles for LoopDecAl
2       cycles for LoopDecZx
2       cycles for LoopDec
15      cycles for LoopWhile
14      cycles for LoopJmpAl
24      cycles for LoopJmp

19      cycles for LoopDecAl
24      cycles for LoopDecZx
-1      cycles for LoopDec
-1      cycles for LoopWhile
10      cycles for LoopJmpAl
10      cycles for LoopJmp

-2      cycles for LoopDecAl
6       cycles for LoopDecZx
18      cycles for LoopDec
30      cycles for LoopWhile
8       cycles for LoopJmpAl
8       cycles for LoopJmp

-4      cycles for LoopDecAl
6       cycles for LoopDecZx
-5      cycles for LoopDec
14      cycles for LoopWhile
14      cycles for LoopJmpAl
15      cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---


lingo

"Here are the times for Daves version on my quad."

Hutch, I received the same times: :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub
7       cycles for LoopJmpLingo

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub
5       cycles for LoopJmpLingo

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub
3       cycles for LoopJmpLingo

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub
2       cycles for LoopJmpLingo

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
43      LoopJmpLingo
--- ok ---