The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on December 27, 2009, 09:32:10 AM

Title: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 09:32:10 AM
A simple loop to get rid of leading white space and zeroes:
  mov edx, [esp+4] ; get address to source string
  .While byte ptr [edx]<=32 || byte ptr [edx]=="0"
    inc edx
  .Endw

This version is a little bit faster on my Celeron M ("Core", not Core 2) CPU:
mov edx, [esp+4]
dec edx
@@: inc edx
mov al, [edx]
cmp al, 32
jle @B
cmp al, "0"
je @B

Can somebody post timings for a P4 please?
Thanks, JJ

Intel(R) Celeron(R) M CPU
12      cycles for LoopDecAl   3 leading discardable chars
13      cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

8       cycles for LoopDecAl   2 chars
9       cycles for LoopDec
9       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp

6       cycles for LoopDecAl   1 char
7       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
7       cycles for LoopJmp

3       cycles for LoopDecAl   none
3       cycles for LoopDec
4       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 09:58:48 AM
JJ,

Try this on a PIV.


; mov al, [edx]
movzx eax, BYTE PTR [edx]


Also try timing ADD and SUB against INC and DEC, as I have found that they are still faster on this Core quad. Intel publishes the preference for ADD/SUB on a PIV, and as far as I have seen it's no slower on other hardware.
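A minimal sketch of what that suggestion might look like when applied to the loop from the first post. The proc name SkipLeadingAddZx is made up for illustration and is not one of the timed routines; the attached test code may well differ. The reasoning is that movzx writes all of eax, so the load does not depend on the register's previous contents, and that add/sub update all flags whereas inc/dec leave the carry flag untouched.

```masm
SkipLeadingAddZx proc           ; hypothetical name; pSrc:DWORD is on the stack
    mov edx, [esp+4]            ; get address of source string
    sub edx, 1                  ; sub instead of dec: writes all flags
@@: add edx, 1                  ; add instead of inc
    movzx eax, byte ptr [edx]   ; full 32-bit write instead of mov al, [edx]
    cmp al, 32
    jle @B                      ; space or below: keep skipping
    cmp al, "0"
    je @B                       ; leading "0": keep skipping
    ret 4                       ; edx points to the first remaining char
SkipLeadingAddZx endp
```

Each add/sub with an immediate is longer than the matching inc/dec, so this form trades a few bytes for the possible speed gain.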
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 10:53:57 AM
prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
32      cycles for LoopDecAl
35      cycles for LoopDec
21      cycles for LoopWhile
23      cycles for LoopJmpAl
22      cycles for LoopJmp

20      cycles for LoopDecAl
71      cycles for LoopDec
41      cycles for LoopWhile
17      cycles for LoopJmpAl
15      cycles for LoopJmp

16      cycles for LoopDecAl
19      cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

8       cycles for LoopDecAl
11      cycles for LoopDec
7       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 11:21:19 AM
Quote from: hutch-- on December 27, 2009, 09:58:48 AM
Try this on a PIV.


; mov al, [edx]
movzx eax, BYTE PTR [edx]


Also try and time ADD and SUB as against INC and DEC
Attached, as LoopDecZx. Not faster on my Celeron, but maybe on a PIV it helps.
@DednDave: Thanks. Inconsistent timings as always, the Prescott is difficult to time...
Intel(R) Celeron(R) M CPU
13      cycles for LoopDecAl
12      cycles for LoopDecZx
14      cycles for LoopDec
14      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

9       cycles for LoopDecAl
9       cycles for LoopDecZx
10      cycles for LoopDec
9       cycles for LoopWhile
10      cycles for LoopJmpAl
9       cycles for LoopJmp

7       cycles for LoopDecAl
6       cycles for LoopDecZx
7       cycles for LoopDec
8       cycles for LoopWhile
8       cycles for LoopJmpAl
8       cycles for LoopJmp

3       cycles for LoopDecAl
4       cycles for LoopDecZx
4       cycles for LoopDec
5       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 11:42:41 AM
GA Jochen

i took a different approach for the test - lol
i also changed to REALTIME to help get more consistent times on my prescott
for these brief tests - it shouldn't hurt anything
you are a lot faster than i am   :bg

                    IncAddSub.exe

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23      cycles for LoopJmpAlInc
21      cycles for LoopJmpAlAdd
20      cycles for LoopJmpAlSub
21      cycles for LoopJmpZxInc
24      cycles for LoopJmpZxAdd
19      cycles for LoopJmpZxSub

20      cycles for LoopJmpAlInc
16      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
14      cycles for LoopJmpZxInc
13      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

15      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
14      cycles for LoopJmpAlSub
13      cycles for LoopJmpZxInc
12      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

9       cycles for LoopJmpAlInc
10      cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
8       cycles for LoopJmpZxInc
11      cycles for LoopJmpZxAdd
11      cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub   <- edit - lol
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 11:45:42 AM
prescott...

              LoopDecWhile.exe (v2)

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23      cycles for LoopDecAl
20      cycles for LoopDecZx
23      cycles for LoopDec
21      cycles for LoopWhile
26      cycles for LoopJmpAl
23      cycles for LoopJmp

18      cycles for LoopDecAl
14      cycles for LoopDecZx
41      cycles for LoopDec
57      cycles for LoopWhile
20      cycles for LoopJmpAl
24      cycles for LoopJmp

16      cycles for LoopDecAl
11      cycles for LoopDecZx
14      cycles for LoopDec
15      cycles for LoopWhile
17      cycles for LoopJmpAl
13      cycles for LoopJmp

11      cycles for LoopDecAl
11      cycles for LoopDecZx
10      cycles for LoopDec
11      cycles for LoopWhile
10      cycles for LoopJmpAl
9       cycles for LoopJmp
Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 11:59:34 AM
Here are the times for Dave's version on my quad. Really ain't much in it. :)


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
--- ok ---
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 12:00:22 PM
Hmmm....
Dave's version yields virtually identical timings for all algos. The only thing that gains a cycle is to eliminate the first jump (my initial choice).

Intel(R) Celeron(R) M CPU
13      cycles for LoopJmpAlInc
13      cycles for LoopJmpAlAdd
13      cycles for LoopJmpAlSub
13      cycles for LoopJmpZxInc
12      cycles for LoopJmpZxIncJJ
13      cycles for LoopJmpZxAdd
13      cycles for LoopJmpZxSub

9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
8       cycles for LoopJmpZxIncJJ
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
6       cycles for LoopJmpZxIncJJ
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub

4       cycles for LoopJmpAlInc
4       cycles for LoopJmpAlAdd
4       cycles for LoopJmpAlSub
4       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxIncJJ
4       cycles for LoopJmpZxAdd
4       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
20      LoopJmpZxIncJJ
23      LoopJmpZxAdd
23      LoopJmpZxSub
Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 12:01:54 PM
Unrelated, who owns the SSE detect code. I would like to be able to port it to PowerBASIC.
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 12:03:07 PM
yes - i should have used that for all of them - saves a byte - lol
we have to stop giving Hutch opportunities to show off his quad   :P
that ShowCPU proc is Jochen's
if you call it with 0/1, you get terse/verbose display
not sure how he feels about PowerBASIC - lol
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 12:15:00 PM
Quote from: hutch-- on December 27, 2009, 12:01:54 PM
Unrelated, who owns the SSE detect code. I would like to be able to port it to PowerBASIC.
I derived it from various sources, most of all Wikipedia. For the history, search the forum for ShowCPU.
Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 12:24:01 PM
 :bg

So I can get away with blaming it on you.  :bdg
Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 12:32:09 PM
Here are the results on my PIV.


Genuine Intel(R) CPU 3.80GHz (SSE3)
23      cycles for LoopJmpAlInc
23      cycles for LoopJmpAlAdd
23      cycles for LoopJmpAlSub
23      cycles for LoopJmpZxInc
23      cycles for LoopJmpZxAdd
23      cycles for LoopJmpZxSub

17      cycles for LoopJmpAlInc
23      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
15      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

15      cycles for LoopJmpAlInc
15      cycles for LoopJmpAlAdd
14      cycles for LoopJmpAlSub
15      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

11      cycles for LoopJmpAlInc
11      cycles for LoopJmpAlAdd
11      cycles for LoopJmpAlSub
11      cycles for LoopJmpZxInc
11      cycles for LoopJmpZxAdd
6       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
--- ok ---
Title: Re: Faster alternative to .While ... .Endw
Post by: rags on December 27, 2009, 01:25:22 PM
I got a new box as a Christmas gift (HP Pavilion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.
I don't know if it's because of all the pre-installed sh*t that I haven't gotten around
to removing yet, Win 7 or something else.
It just seems to me that my timings should be better, given the processor.

AMD Athlon(tm) II X2 215 Processor (SSE3)
17      cycles for LoopDecAl
30      cycles for LoopDecZx
30      cycles for LoopDec
52      cycles for LoopWhile
48      cycles for LoopJmpAl
52      cycles for LoopJmp

20      cycles for LoopDecAl
55      cycles for LoopDecZx
27      cycles for LoopDec
52      cycles for LoopWhile
48      cycles for LoopJmpAl
55      cycles for LoopJmp

33      cycles for LoopDecAl
15      cycles for LoopDecZx
22      cycles for LoopDec
20      cycles for LoopWhile
20      cycles for LoopJmpAl
13      cycles for LoopJmp

15      cycles for LoopDecAl
16      cycles for LoopDecZx
21      cycles for LoopDec
11      cycles for LoopWhile
11      cycles for LoopJmpAl
12      cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---


Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 01:41:41 PM
i sometimes need to add a few lines of code to get consistent timings...

.
.
start:
   push 1
   call ShowCpu

      invoke GetCurrentProcess
      invoke SetProcessAffinityMask,eax,1

   ct = 0
.
.

that restricts execution to a single core

also, if the tests are brief, i change to REALTIME_PRIORITY_CLASS, rather than HIGH_PRIORITY_CLASS
that appears in each of the "counter_begin" macro calls

EDIT - i should also mention that these tests are in no way intended to benchmark your machine
they are intended to give you relative timings to compare one algorithm with another
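A minimal sketch of the two settings described above, written as direct API calls rather than through the timing macros. It assumes the MASM32 SDK in its default location; the restore step at the end is an addition for illustration and is not taken from the test programs in this thread.

```masm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK under \masm32

.code
start:
    invoke GetCurrentProcess                ; pseudo handle of this process in eax
    invoke SetProcessAffinityMask, eax, 1   ; mask with bit 0 set: first core only

    invoke GetCurrentProcess                ; eax was overwritten by the return value
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS

    ; ... the brief timed section would go here ...

    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS  ; restore before exiting
    exit
end start
```

At realtime priority the process can starve the rest of the system, so this only makes sense while the timed section stays short.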
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 06:57:51 PM
Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box as a Christmas gift (HP Pavilion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.
rags,
my condolences :thumbu
Just checked with my latest toy, an Olidata JumPC that cost me the horrendous sum of 99€ (142US$), Win XP included. It does the test in 3 cycles, claiming it has a Celeron M installed. Well...
Title: Re: Faster alternative to .While ... .Endw
Post by: 2-Bit Chip on December 27, 2009, 07:44:01 PM
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
43      cycles for LoopDecAl
45      cycles for LoopDec
47      cycles for LoopWhile
43      cycles for LoopJmpAl
50      cycles for LoopJmp

31      cycles for LoopDecAl
41      cycles for LoopDec
34      cycles for LoopWhile
30      cycles for LoopJmpAl
29      cycles for LoopJmp

26      cycles for LoopDecAl
32      cycles for LoopDec
29      cycles for LoopWhile
24      cycles for LoopJmpAl
33      cycles for LoopJmp

17      cycles for LoopDecAl
20      cycles for LoopDec
19      cycles for LoopWhile
17      cycles for LoopJmpAl
19      cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: MichaelW on December 27, 2009, 07:50:35 PM
P3:

☺☺☻♥ (SSE1)
15      cycles for LoopDecAl
15      cycles for LoopDecZx
16      cycles for LoopDec
21      cycles for LoopWhile
21      cycles for LoopJmpAl
21      cycles for LoopJmp

10      cycles for LoopDecAl
10      cycles for LoopDecZx
11      cycles for LoopDec
14      cycles for LoopWhile
14      cycles for LoopJmpAl
14      cycles for LoopJmp

10      cycles for LoopDecAl
9       cycles for LoopDecZx
11      cycles for LoopDec
11      cycles for LoopWhile
11      cycles for LoopJmpAl
11      cycles for LoopJmp

5       cycles for LoopDecAl
5       cycles for LoopDecZx
5       cycles for LoopDec
6       cycles for LoopWhile
6       cycles for LoopJmpAl
6       cycles for LoopJmp


☺☺☻♥ (SSE1)
21      cycles for LoopJmpAlInc
17      cycles for LoopJmpAlAdd
17      cycles for LoopJmpAlSub
16      cycles for LoopJmpZxInc
14      cycles for LoopJmpZxAdd
14      cycles for LoopJmpZxSub

14      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
12      cycles for LoopJmpAlSub
11      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

11      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
10      cycles for LoopJmpAlSub
8       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

6       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
4       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub


Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 09:58:28 PM
As far as I can see, this is the winning algo, for Core & Celeron & P3 & P4:

SkipLeadingWhiteSpace proc ; pSrc$:DWORD
mov edx, [esp+4] ; get source string
dec edx
@@: inc edx
movzx eax, byte ptr [edx]
cmp al, "0"
je @B
cmp al, 0
je @F
cmp al, 32
jle @B
@@:
  ret 4 ; edx points to first non-"0" and non-white space char
SkipLeadingWhiteSpace endp


EDIT: Added check for zero byte - thanks Sinsi :U
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 10:33:05 PM
a close race, eh Jochen ?
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 10:41:40 PM
Quote from: dedndave on December 27, 2009, 10:33:05 PM
a close race, eh Jochen ?
Very close indeed. On the other hand, it is that kind of loop that is typically run not even once, so why waste a single cycle on it? What remains useful from this exercise might be that a dec ptr/inc ptr combination is slightly more efficient than the jmp generated by .While, which confirmed my aversion to .While loops. In contrast, .Repeat ... .Until can't be beaten by a hand-coded loop.
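A rough sketch of what the .While loop from the first post presumably expands to, written out by hand. The proc and label names are invented for illustration (the assembler generates its own labels), and the exact expansion is an assumption, not a listing from the attached test code. As far as I know, the high-level syntax compares unsigned unless an operand is declared signed, hence jbe here where the hand-coded versions use the signed jle; the two behave differently for bytes above 127.

```masm
SkipWhileExpanded proc          ; hypothetical name; pSrc:DWORD is on the stack
    mov edx, [esp+4]            ; get address of source string
    jmp TestCond                ; the extra jump: .While tests before the first pass
Body:
    inc edx                     ; loop body
TestCond:
    cmp byte ptr [edx], 32
    jbe Body                    ; unsigned "<= 32"
    cmp byte ptr [edx], "0"
    je Body                     ; second half of the || condition
    ret 4                       ; edx points to the first remaining char
SkipWhileExpanded endp
```

The short jmp takes two bytes where dec edx takes one, which would fit the 20 versus 19 bytes in the size tables.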
Title: Re: Faster alternative to .While ... .Endw
Post by: sinsi on December 28, 2009, 12:26:56 AM
It doesn't seem to work too well for a string like "  0" though...no check for a terminating 00.
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 28, 2009, 08:07:12 AM
Quote from: sinsi on December 28, 2009, 12:26:56 AM
It doesn't seem to work too well for a string like "  0" though...no check for a terminating 00.
Assuming a string ends with a nullbyte, it would stop right there. By design :bg
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 28, 2009, 10:01:05 AM
nah - it kinda keeps going, Jochen - lol
but that isn't a requirement for the tests - the tests showed us what we wanted to know
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 28, 2009, 10:36:36 AM
Quote from: dedndave on December 28, 2009, 10:01:05 AM
nah - it kinda keeps going, Jochen - lol
So Sinsi was right :red
Corrected above. Surprisingly enough, it still runs the test in three cycles for db "This is a string", 0
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 28, 2009, 10:43:01 AM
that doesn't sound right - lol
figure at LEAST one clock cycle per byte   :bg

EDIT - oh - lol
there are no bytes stripped in that example   :bg
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 28, 2009, 10:52:14 AM
i might be inclined to make the tests in this order:

SkipLeadingWhiteSpace proc          ;pSrc$:DWORD

        mov     edx,[esp+4]         ;get source string
        dec     edx

@@:     inc     edx
        movzx   eax,byte ptr [edx]
        or      al,al
        jz      @F

        cmp     al, 32
        jle     @B

        cmp     al, "0"
        je      @B

@@:     ret     4                   ;edx points to first non-"0" and non-white space char

SkipLeadingWhiteSpace endp

test for the most likely first, when practical
we have to test for null first so that it is culled out before the white space test
this order assumes white space is more likely than "0" - that may not be the case
if leading "0" is more likely, test for that before null (like you have it)
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 28, 2009, 11:08:57 AM
Quote from: dedndave on December 28, 2009, 10:52:14 AM
if leading "0" is more likely, test for that before null (like you have it)
Excerpt from Windows.inc. The assumption is you used Instr for "equ", and added 4, so the string starts with "0":

ENM_SCROLLEVENTS                 equ 00000008h
ENM_DRAGDROPDONE                 equ 00000010h
ENM_PARAGRAPHEXPANDED            equ 00000020h
ENM_PAGECHANGE                   equ 00000040h
ENM_LANGCHANGE                   equ 01000000h
ENM_OBJECTPOSITIONS              equ 02000000h
ENM_LINK                         equ 04000000h
ENM_LOWFIRTF                     equ 08000000h
ES_NOOLEDRAGDROP                 equ 00000008h
Title: Re: Faster alternative to .While ... .Endw
Post by: dacid on February 10, 2010, 11:01:46 PM
First program:


AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3       cycles for LoopDecAl
5       cycles for LoopDec
14      cycles for LoopWhile
4       cycles for LoopJmpAl
16      cycles for LoopJmp

20      cycles for LoopDecAl
9       cycles for LoopDec
10      cycles for LoopWhile
10      cycles for LoopJmpAl
-2      cycles for LoopJmp

8       cycles for LoopDecAl
-3      cycles for LoopDec
18      cycles for LoopWhile
18      cycles for LoopJmpAl
8       cycles for LoopJmp

-5      cycles for LoopDecAl
16      cycles for LoopDec
-7      cycles for LoopWhile
4       cycles for LoopJmpAl
25      cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---


Second:


AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3       cycles for LoopDecAl
2       cycles for LoopDecZx
2       cycles for LoopDec
15      cycles for LoopWhile
14      cycles for LoopJmpAl
24      cycles for LoopJmp

19      cycles for LoopDecAl
24      cycles for LoopDecZx
-1      cycles for LoopDec
-1      cycles for LoopWhile
10      cycles for LoopJmpAl
10      cycles for LoopJmp

-2      cycles for LoopDecAl
6       cycles for LoopDecZx
18      cycles for LoopDec
30      cycles for LoopWhile
8       cycles for LoopJmpAl
8       cycles for LoopJmp

-4      cycles for LoopDecAl
6       cycles for LoopDecZx
-5      cycles for LoopDec
14      cycles for LoopWhile
14      cycles for LoopJmpAl
15      cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: lingo on February 11, 2010, 03:26:57 AM
"Here are the times for Daves version on my quad."

Hutch, I received the same times: :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub
7       cycles for LoopJmpLingo

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub
5       cycles for LoopJmpLingo

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub
3       cycles for LoopJmpLingo

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub
2       cycles for LoopJmpLingo

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
43      LoopJmpLingo
--- ok ---
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 04:34:43 AM
nice machine, Lingo

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
22      cycles for LoopJmpAlInc
21      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
20      cycles for LoopJmpZxInc
20      cycles for LoopJmpZxAdd
27      cycles for LoopJmpZxSub
21      cycles for LoopJmpLingo

16      cycles for LoopJmpAlInc
20      cycles for LoopJmpAlAdd
17      cycles for LoopJmpAlSub
14      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub
16      cycles for LoopJmpLingo

13      cycles for LoopJmpAlInc
15      cycles for LoopJmpAlAdd
15      cycles for LoopJmpAlSub
12      cycles for LoopJmpZxInc
14      cycles for LoopJmpZxAdd
13      cycles for LoopJmpZxSub
11      cycles for LoopJmpLingo

7       cycles for LoopJmpAlInc
8       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
8       cycles for LoopJmpZxAdd
8       cycles for LoopJmpZxSub
6       cycles for LoopJmpLingo

nice to finally get some repeatable numbers from my prescott   :P
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on February 11, 2010, 08:18:39 AM
Compliments, Lingo! I have added a shortened version of your algo, 22 instead of 34 bytes, with almost identical timings.
EDIT: Since finding leading white space is not a frequent task, "inlining" instead of calling a proc might be more appropriate. So I added two inline versions. It turns out that the align 4 version is a shade faster.
EDIT(2): Two more inline versions added.
mov eax, offset Src
mov ecx, "00"
dec eax
align 4 ; align may change flags in Masm
@@: inc eax
mov cl, ch
sub cl, [eax] ; for [eax]==48, cl=0
je @B
cmp cl, 16 ; for [eax]==32, cl=48-32=16
jge @B


align 16
LoopJmpLingo_proc:      ; the original Lingo algo
LoopLingo:
   add   eax, 1
   mov   cl,   ch
   add   cl,   [eax]
   je      LoopLingo
   add   cl,   10h
   jle      LoopLingo
   jmp   edx
align 16
LoopJmpLingo    proc
   pop   edx
   mov   ecx,   0D0D0h
   pop   eax
   add   cl,   [eax]      ; for [eax]==48, cl=208+48=256 aka zero
   je      LoopLingo
   add   cl,   10h         ; for [eax]==32, cl=208+32+16=256 aka zero
   jle      LoopLingo
   jmp   edx
LoopJmpLingo    endp
LoopJmpLingo_endp:

align 16
LoopJmpLingoJ_proc:
LoopJmpLingoJ proc      ; variant to Lingo's code (http://www.masm32.com/board/index.php?topic=12984.msg104011#msg104011)
   pop edx      ; the return address
   mov ecx, 0D0D0h
   pop eax
   dec eax
@@:   inc eax
   mov cl, ch
   add cl, [eax]   ; for [eax]==48, cl=208+48=256 aka zero
   je @B
   add cl, 10h   ; for [eax]==32, cl=208+32+16=256 aka zero
   jle @B
   jmp edx
LoopJmpLingoJ endp
LoopJmpLingoJ_endp:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
8       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
6       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

7       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
3       cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

1       cycles for inline loop, add, align 16
1       cycles for inline loop, sub, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
34      LoopJmpLingo
22      LoopJmpLingoJ
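The bias arithmetic in the add-based listings can be checked by hand. The sketch below restates the loop as an ordinary stdcall proc with the byte values worked out in the comments; the name SkipBiased is hypothetical, and this is only an annotated restatement, not a new timed variant. One consequence of the arithmetic, as far as I can tell, is that a zero byte is treated like any other char below 32, so this form also runs past a terminating null.

```masm
SkipBiased proc                 ; hypothetical name; pSrc:DWORD is on the stack
    mov eax, [esp+4]            ; get address of source string
    mov ecx, 0D0D0h             ; ch = 0D0h = 256 - "0", kept as the bias
    dec eax
@@: inc eax
    mov cl, ch                  ; reload the bias for every char
    add cl, [eax]               ; "0" (30h): 0D0h+30h wraps to 0, ZF set
    je @B
    add cl, 10h                 ; " " (20h): 0D0h+20h+10h wraps to 0, ZF set
    jle @B                      ; below 20h: result 0E0h..0FFh is negative, taken
                                ; this includes a 0 byte, which gives 0E0h
    ret 4                       ; eax points to the first remaining char
SkipBiased endp
```

For chars 33 to 47 the second add yields 1 to 15, and for 49 to 127 it yields 17 to 95, so neither jump is taken and the loop stops there.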
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 10:59:05 AM
not so good on a prescott, Jochen

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
15      cycles for inline loop, add, align 16
14      cycles for inline loop, sub, align 4
20      cycles for LoopJmpLingo
44      cycles for LoopJmpLingoJ

13      cycles for inline loop, add, align 16
11      cycles for inline loop, sub, align 4
18      cycles for LoopJmpLingo
27      cycles for LoopJmpLingoJ

9       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
11      cycles for LoopJmpLingo
14      cycles for LoopJmpLingoJ

5       cycles for inline loop, add, align 16
4       cycles for inline loop, sub, align 4
9       cycles for LoopJmpLingo
10      cycles for LoopJmpLingoJ
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on February 11, 2010, 11:06:57 AM
Dave,
Can you try the new version, please? The inline algos seem to perform well, and the "two byte immediates" variant is pretty short, too. If the mov eax, offset src happens to be two bytes later, the size is only 12 bytes.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
7       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
5       cycles for inline loop, cmp two byte regs, align 4
5       cycles for inline loop, cmp two immediates, align 4
7       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

7       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
4       cycles for inline loop, cmp two byte regs, align 4
5       cycles for inline loop, cmp two immediates, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
2       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
3       cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

1       cycles for inline loop, add, align 16
1       cycles for inline loop, sub, align 4
1       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
20      inline, cmp, two byte regs
14      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 11:15:01 AM
prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
16      cycles for inline loop, add, align 16
14      cycles for inline loop, sub, align 4
17      cycles for inline loop, cmp two byte regs, align 4
19      cycles for inline loop, cmp two immediates, align 4
19      cycles for LoopJmpLingo
41      cycles for LoopJmpLingoJ

13      cycles for inline loop, add, align 16
12      cycles for inline loop, sub, align 4
10      cycles for inline loop, cmp two byte regs, align 4
14      cycles for inline loop, cmp two immediates, align 4
18      cycles for LoopJmpLingo
31      cycles for LoopJmpLingoJ

8       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
9       cycles for inline loop, cmp two byte regs, align 4
7       cycles for inline loop, cmp two immediates, align 4
10      cycles for LoopJmpLingo
14      cycles for LoopJmpLingoJ

5       cycles for inline loop, add, align 16
5       cycles for inline loop, sub, align 4
5       cycles for inline loop, cmp two byte regs, align 4
4       cycles for inline loop, cmp two immediates, align 4
7       cycles for LoopJmpLingo
11      cycles for LoopJmpLingoJ

i don't think alignment is that critical for smaller loops, Jochen - at least not on a P4
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 11:25:32 AM
prescott - i have removed all align's

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
14      cycles for inline loop, add, no align
14      cycles for inline loop, sub, no align
11      cycles for inline loop, cmp two byte regs, no align
17      cycles for inline loop, cmp two immediates, no align
19      cycles for LoopJmpLingo
23      cycles for LoopJmpLingoJ

13      cycles for inline loop, add, no align
12      cycles for inline loop, sub, no align
11      cycles for inline loop, cmp two byte regs, no align
12      cycles for inline loop, cmp two immediates, no align
17      cycles for LoopJmpLingo
27      cycles for LoopJmpLingoJ

8       cycles for inline loop, add, no align
7       cycles for inline loop, sub, no align
9       cycles for inline loop, cmp two byte regs, no align
8       cycles for inline loop, cmp two immediates, no align
9       cycles for LoopJmpLingo
14      cycles for LoopJmpLingoJ

3       cycles for inline loop, add, no align
4       cycles for inline loop, sub, no align
4       cycles for inline loop, cmp two byte regs, no align
3       cycles for inline loop, cmp two immediates, no align
6       cycles for LoopJmpLingo
10      cycles for LoopJmpLingoJ

Sizes:
18      inline, add
18      inline, sub
20      inline, cmp, two byte regs
12      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ
Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on February 11, 2010, 11:47:26 AM
Thanks, Dave. The timings look a little bit inconsistent, also on my machine, but it seems we can safely vote for the shortest version ;-)

mov eax, offset Src
dec eax
@@: inc eax
cmp byte ptr [eax], 48 ; skip "0"...
je @B
cmp byte ptr [eax], 32 ; and anything from space downwards
jle @B


EDIT: And I forgot the end of string case...!

; mov eax, offset Src
dec eax ; no align before the loop - it's slower
@@: inc eax
cmp byte ptr [eax], 48 ; "0"
je @B
cmp byte ptr [eax], 0 ; zero delimiter?
je @F
cmp byte ptr [eax], 32 ; " " or less
jle @B
@@:


17 bytes starting with dec eax, 1 cycle for the default case (no 0 or space at string start).
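
One thing to watch in the loop above: jle is a signed comparison, so bytes 80h to FFh (negative when read as signed bytes) get skipped along with the control characters. If that is not wanted, an unsigned variant could look like the sketch below. It has not been timed on any machine, and the name SkipLead is made up for illustration.

```masm
; Hypothetical helper, name not from this thread.
; in:  eax -> zero-terminated string
; out: eax -> first byte that is neither "0" nor in 01h..20h,
;             or the zero terminator if nothing else is found
SkipLead proc
    dec eax
@@: inc eax
    cmp byte ptr [eax], "0"   ; skip leading zeroes
    je @B
    cmp byte ptr [eax], 0     ; zero delimiter: stop here
    je @F
    cmp byte ptr [eax], 32    ; space or below
    jbe @B                    ; unsigned test, so 80h..FFh are kept
@@: ret
SkipLead endp
```

Usage would be mov eax, offset Src followed by call SkipLead. The short form of jbe is two bytes, same as jle, so the loop body stays at 17 bytes.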
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 12:06:33 PM
not so fast - lol
it would be nice to see what difference, if any, alignment has on some other processors
from my testing on a P4, if it is a short jump to get back to the top of the loop, then alignment does little good
that may not be so for some of the more modern cores   :U

EDIT - maybe we need a new thread to get some tests run in the purely "alignment/timing" category
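
For anyone who wants to try the alignment question on their own CPU, the test could be as simple as timing the loop twice, once as posted and once with an align directive in front of the loop label. A sketch only: the value 4 is just one of the alignments used in the test names in this thread, and Src is the string from the post above.

```masm
    mov eax, offset Src
    dec eax
    align 4                   ; padding goes here, executed once before the loop
@@: inc eax                   ; loop top now sits on a 4-byte boundary
    cmp byte ptr [eax], 48    ; "0"
    je @B
    cmp byte ptr [eax], 0     ; zero delimiter?
    je @F
    cmp byte ptr [eax], 32    ; " " or less
    jle @B
@@:
```

The padding sits between dec eax and the label, so it costs something on entry but nothing per iteration. Whether the aligned loop top helps at all is exactly what would have to be measured.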
Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 12:23:07 AM
Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box for a Christmas gift(hp Pavillion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.

I got the same story on my new Q7 - I suspect some of the new processors are designed for data streaming and not computation. So we now have shitty computers but great televisions!
Congrats to us... grumble grumble... sigh, two years until the wife will let me replace it.
Title: Re: Faster alternative to .While ... .Endw
Post by: BlackVortex on February 23, 2010, 06:21:43 AM
What CPU is that Q7? (Atom-based?)

Anyway, never trust laptop CPUs. Still, I'm sure the general performance is good, so don't let some specific timings discourage you guys.
Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 11:48:57 AM
Sorry, my bad. Meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.
Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 23, 2010, 01:16:33 PM
WryBugz
these timing tests are not intended to benchmark your machine
comparing clock cycles on one machine to clock cycles on another machine is a little like comparing apples to oranges
from what i know, the i7 is a good performer - evidenced by the fact that you are happy with the overall performance

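the information that is meaningful is the performance ratio of one method to another on any given machine
for example, my prescott run above shows 11 cycles for "cmp two byte regs" against 19 for LoopJmpLingo, so the inline loop is roughly 1.7 times faster on that machine - that ratio is the number to look at, not the raw counts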
Title: Re: Faster alternative to .While ... .Endw
Post by: BlackVortex on February 23, 2010, 01:52:19 PM
Quote from: WryBugz on February 23, 2010, 11:48:57 AM
Sorry, my bad. Meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.
Could you post the timings with the exe posted in this thread? I'm curious about something.
http://www.masm32.com/board/index.php?topic=13385.0
Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 10:06:41 PM
Intel Core i7

10      cycles for LoopDecAl
10      cycles for LoopDec
15      cycles for LoopWhile
17      cycles for LoopJmpAl
17      cycles for LoopJmp

6       cycles for LoopDecAl
6       cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
12      cycles for LoopJmp

3       cycles for LoopDecAl
3       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
6       cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---
There you go.
I guess I don't understand the processor differences well enough, Dave. I thought that was the reason for the machine citation.
I am just a hobbyist, and while I have put in a lot of time, my knowledge is pretty erratic. For instance, I just realized this past week that the registers eax through edx are not all equal.
Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 10:12:44 PM
The other one....

Loop
991     clock cycles
1054    clock cycles
1045    clock cycles
Dec ECX
505     clock cycles
505     clock cycles
513     clock cycles
Press any key to continue ...


From the link.
Title: Re: Faster alternative to .While ... .Endw
Post by: Gunner on February 23, 2010, 10:40:22 PM
Here is mine...
Intel(R) Pentium(R) 4 CPU 2.40GHz (SSE2)
6       cycles for inline loop, add, no align
4       cycles for inline loop, sub, no align
7       cycles for inline loop, cmp two byte regs, no align
3       cycles for inline loop, cmp two immediates, no align
5       cycles for LoopJmpLingo
9       cycles for LoopJmpLingoJ

4       cycles for inline loop, add, no align
8       cycles for inline loop, sub, no align
2       cycles for inline loop, cmp two byte regs, no align
2       cycles for inline loop, cmp two immediates, no align
5       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

-2      cycles for inline loop, add, no align
-2      cycles for inline loop, sub, no align
6       cycles for inline loop, cmp two byte regs, no align
-4      cycles for inline loop, cmp two immediates, no align
0       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

-7      cycles for inline loop, add, no align
5       cycles for inline loop, sub, no align
-2      cycles for inline loop, cmp two byte regs, no align
-7      cycles for inline loop, cmp two immediates, no align
-2      cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

Sizes:
18      inline, add
18      inline, sub
20      inline, cmp, two byte regs
12      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ
--- ok ---
Title: Re: Faster alternative to .While ... .Endw
Post by: BlackVortex on February 24, 2010, 06:15:19 AM
Quote from: WryBugz on February 23, 2010, 10:12:44 PM
The other one....

Loop
991     clock cycles
1054    clock cycles
1045    clock cycles
Dec ECX
505     clock cycles
505     clock cycles
513     clock cycles
Press any key to continue ...


From the link.
Why does it take twice as many cycles as mine? Not very efficient.
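
Assuming the exe from that thread times the LOOP instruction against a dec ecx / jnz pair (the two labels in the output suggest this, but the source is not shown here), the two forms being compared would look roughly like the sketch below. The label names and the count of 1000 are made up for illustration.

```masm
; variant 1: LOOP decrements ecx and jumps while ecx != 0, flags untouched
    mov ecx, 1000
LoopA:
    loop LoopA

; variant 2: explicit decrement plus conditional jump, dec changes the flags
    mov ecx, 1000
LoopB:
    dec ecx
    jnz LoopB
```

Both run the same number of iterations, so any difference in the cycle counts comes from how a given CPU handles the LOOP instruction compared with the two-instruction pair.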
Title: Re: Faster alternative to .While ... .Endw
Post by: FairLight on March 11, 2010, 06:08:00 AM
...more data...


Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz (SSE4)
7       cycles for LoopDecAl
7       cycles for LoopDecZx
9       cycles for LoopDec
9       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp

5       cycles for LoopDecAl
5       cycles for LoopDecZx
5       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
7       cycles for LoopJmp

3       cycles for LoopDecAl
3       cycles for LoopDecZx
3       cycles for LoopDec
5       cycles for LoopWhile
5       cycles for LoopJmpAl
5       cycles for LoopJmp

1       cycles for LoopDecAl
1       cycles for LoopDecZx
1       cycles for LoopDec
3       cycles for LoopWhile
3       cycles for LoopJmpAl
3       cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub
7       cycles for LoopJmpLingo

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub
5       cycles for LoopJmpLingo

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub
3       cycles for LoopJmpLingo

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub
2       cycles for LoopJmpLingo

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
43      LoopJmpLingo
--- ok ---

Intel(R) Core(TM)2 Duo CPU     E6850  @ 3.00GHz (SSE4)
6       cycles for inline loop, add, no align
9       cycles for inline loop, sub, no align
3       cycles for inline loop, cmp two byte regs, no align
2       cycles for inline loop, cmp two immediates, no align
8       cycles for LoopJmpLingo
11      cycles for LoopJmpLingoJ

23      cycles for inline loop, add, no align
22      cycles for inline loop, sub, no align
1       cycles for inline loop, cmp two byte regs, no align
1       cycles for inline loop, cmp two immediates, no align
6       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

20      cycles for inline loop, add, no align
10      cycles for inline loop, sub, no align
0       cycles for inline loop, cmp two byte regs, no align
0       cycles for inline loop, cmp two immediates, no align
4       cycles for LoopJmpLingo
5       cycles for LoopJmpLingoJ

0       cycles for inline loop, add, no align
0       cycles for inline loop, sub, no align
0       cycles for inline loop, cmp two byte regs, no align
0       cycles for inline loop, cmp two immediates, no align
1       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
18      inline, add
18      inline, sub
20      inline, cmp, two byte regs
12      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ
--- ok ---

Loop
1294    clock cycles
1294    clock cycles
1294    clock cycles
Dec ECX
279     clock cycles
279     clock cycles
279     clock cycles
Press any key to continue ...
Title: Re: Faster alternative to .While ... .Endw
Post by: joemc on March 30, 2010, 04:20:37 AM

Intel(R) Core(TM)2 CPU         T5600  @ 1.83GHz (SSE4)
8       cycles for LoopDecAl
10      cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

6       cycles for LoopDecAl
6       cycles for LoopDec
9       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp

4       cycles for LoopDecAl
4       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
7       cycles for LoopJmp

2       cycles for LoopDecAl
2       cycles for LoopDec
4       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp

-------------

3       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
3       cycles for inline loop, cmp two byte regs, align 4
3       cycles for inline loop, cmp two immediates, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
1       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
5       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

0       cycles for inline loop, add, align 16
0       cycles for inline loop, sub, align 4
0       cycles for inline loop, cmp two byte regs, align 4
0       cycles for inline loop, cmp two immediates, align 4
2       cycles for LoopJmpLingo
3       cycles for LoopJmpLingoJ

0       cycles for inline loop, add, align 16
0       cycles for inline loop, sub, align 4
0       cycles for inline loop, cmp two byte regs, align 4
0       cycles for inline loop, cmp two immediates, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
20      inline, cmp, two byte regs
14      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ
--- ok ---
Title: Re: Faster alternative to .While ... .Endw
Post by: Greenhorn__ on April 13, 2010, 12:00:23 AM
Hi,

here are my results ...

LoopDecWhile.exe (2nd version)

AMD Phenom(tm) II X4 955 Processor (SSE3)
16 cycles for LoopDecAl
16 cycles for LoopDecZx
9 cycles for LoopDec
15 cycles for LoopWhile
14 cycles for LoopJmpAl
15 cycles for LoopJmp


8 cycles for LoopDecAl
17 cycles for LoopDecZx
8 cycles for LoopDec
26 cycles for LoopWhile
29 cycles for LoopJmpAl
26 cycles for LoopJmp


7 cycles for LoopDecAl
7 cycles for LoopDecZx
7 cycles for LoopDec
6 cycles for LoopWhile
6 cycles for LoopJmpAl
6 cycles for LoopJmp


4 cycles for LoopDecAl
4 cycles for LoopDecZx
6 cycles for LoopDec
2 cycles for LoopWhile
2 cycles for LoopJmpAl
2 cycles for LoopJmp

Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---


... and for IncAddSub.exe

AMD Phenom(tm) II X4 955 Processor (SSE3)
14 cycles for LoopJmpAlInc
15 cycles for LoopJmpAlAdd
15 cycles for LoopJmpAlSub
15 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub


27 cycles for LoopJmpAlInc
29 cycles for LoopJmpAlAdd
27 cycles for LoopJmpAlSub
29 cycles for LoopJmpZxInc
27 cycles for LoopJmpZxAdd
27 cycles for LoopJmpZxSub


6 cycles for LoopJmpAlInc
6 cycles for LoopJmpAlAdd
6 cycles for LoopJmpAlSub
6 cycles for LoopJmpZxInc
6 cycles for LoopJmpZxAdd
6 cycles for LoopJmpZxSub


2 cycles for LoopJmpAlInc
2 cycles for LoopJmpAlAdd
2 cycles for LoopJmpAlSub
2 cycles for LoopJmpZxInc
2 cycles for LoopJmpZxAdd
2 cycles for LoopJmpZxSub

Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
--- ok ---


Regards
Greenhorn