Title: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 09:32:10 AM

A simple loop to get rid of leading white space and zeroes:

  mov edx, [esp+4]	; get address to source string
  .While byte ptr [edx]<=32 || byte ptr [edx]=="0"
	inc edx
  .Endw

This version is a little bit faster on my Celeron M ("Core", not Core 2) CPU:

Code:

	mov edx, [esp+4]
	dec edx
@@:	inc edx
	mov al, [edx]
	cmp al, 32
	jle @B
	cmp al, "0"
	je @B
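
For readers who want to check the logic outside MASM, the two loop shapes can be modelled in C. This is only a sketch: the function names are made up, it models control flow and not cycle counts, and it assumes 7-bit ASCII input that contains at least one character the loop stops on (neither version checks for the end of the string). Note that jle is a signed comparison, so the hand-coded loop would also skip bytes of 128 and above; the model ignores that.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the .While version: test first, then step. */
size_t skip_with_while(const char *s)
{
    size_t i = 0;
    assert(s != NULL);
    while ((unsigned char)s[i] <= 32 || s[i] == '0')
        i++;
    return i;                      /* index of the first character that is kept */
}

/* Shape of the hand-coded version: back up one position, then
   step at the top of the loop and test at the bottom. */
size_t skip_with_dec_inc(const char *s)
{
    ptrdiff_t i = -1;              /* models "dec edx" */
    unsigned char c;
    assert(s != NULL);
    do {
        i++;                       /* models "@@: inc edx" */
        c = (unsigned char)s[i];   /* models "mov al, [edx]" */
    } while (c <= 32 || c == '0');
    return (size_t)i;
}
```

Both functions return the same index for the same input; the second form simply has no need to jump to the test before the first pass.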

Can somebody post timings for a P4 please?
Thanks, JJ

Code:

Intel(R) Celeron(R) M CPU
12      cycles for LoopDecAl   3 leading discardable chars
13      cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

8       cycles for LoopDecAl   2 chars
9       cycles for LoopDec
9       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp

6       cycles for LoopDecAl   1 char
7       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
7       cycles for LoopJmp

3       cycles for LoopDecAl   none
3       cycles for LoopDec
4       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp

Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 09:58:48 AM

JJ,

Try this on a PIV.

Code:


; mov al, [edx]
movzx eax, BYTE PTR [edx]

Also try timing ADD and SUB against INC and DEC, as I have found that ADD and SUB are still faster on this Core quad. Intel has published its preference for ADD/SUB on a PIV, and as far as I have seen it's no slower on other hardware.

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 10:53:57 AM

prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
32 cycles for LoopDecAl
35 cycles for LoopDec
21 cycles for LoopWhile
23 cycles for LoopJmpAl
22 cycles for LoopJmp

20 cycles for LoopDecAl
71 cycles for LoopDec
41 cycles for LoopWhile
17 cycles for LoopJmpAl
15 cycles for LoopJmp

16 cycles for LoopDecAl
19 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
13 cycles for LoopJmp

8 cycles for LoopDecAl
11 cycles for LoopDec
7 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 11:21:19 AM

Quote from: hutch-- on December 27, 2009, 09:58:48 AM
Try this on a PIV.

Code:
; mov al, [edx]
movzx eax, BYTE PTR [edx]

Also try and time ADD and SUB as against INC and DEC

Attached, as LoopDecZx. Not faster on my Celeron, but maybe on a PIV it helps.
@DednDave: Thanks. Inconsistent timings as always, the Prescott is difficult to time...

Code:

Intel(R) Celeron(R) M CPU
13      cycles for LoopDecAl
12      cycles for LoopDecZx
14      cycles for LoopDec
14      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

9       cycles for LoopDecAl
9       cycles for LoopDecZx
10      cycles for LoopDec
9       cycles for LoopWhile
10      cycles for LoopJmpAl
9       cycles for LoopJmp

7       cycles for LoopDecAl
6       cycles for LoopDecZx
7       cycles for LoopDec
8       cycles for LoopWhile
8       cycles for LoopJmpAl
8       cycles for LoopJmp

3       cycles for LoopDecAl
4       cycles for LoopDecZx
4       cycles for LoopDec
5       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 11:42:41 AM

GA Jochen

i took a different approach for the test - lol
i also changed to REALTIME to help get more consistent times on my prescott
for these brief tests - it shouldn't hurt anything
you are a lot faster than i am :bg

IncAddSub.exe

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23 cycles for LoopJmpAlInc
21 cycles for LoopJmpAlAdd
20 cycles for LoopJmpAlSub
21 cycles for LoopJmpZxInc
24 cycles for LoopJmpZxAdd
19 cycles for LoopJmpZxSub

20 cycles for LoopJmpAlInc
16 cycles for LoopJmpAlAdd
21 cycles for LoopJmpAlSub
14 cycles for LoopJmpZxInc
13 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub

15 cycles for LoopJmpAlInc
12 cycles for LoopJmpAlAdd
14 cycles for LoopJmpAlSub
13 cycles for LoopJmpZxInc
12 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub

9 cycles for LoopJmpAlInc
10 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
8 cycles for LoopJmpZxInc
11 cycles for LoopJmpZxAdd
11 cycles for LoopJmpZxSub

Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub <- edit - lol

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 11:45:42 AM

prescott...

LoopDecWhile.exe (v2)

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23 cycles for LoopDecAl
20 cycles for LoopDecZx
23 cycles for LoopDec
21 cycles for LoopWhile
26 cycles for LoopJmpAl
23 cycles for LoopJmp

18 cycles for LoopDecAl
14 cycles for LoopDecZx
41 cycles for LoopDec
57 cycles for LoopWhile
20 cycles for LoopJmpAl
24 cycles for LoopJmp

16 cycles for LoopDecAl
11 cycles for LoopDecZx
14 cycles for LoopDec
15 cycles for LoopWhile
17 cycles for LoopJmpAl
13 cycles for LoopJmp

11 cycles for LoopDecAl
11 cycles for LoopDecZx
10 cycles for LoopDec
11 cycles for LoopWhile
10 cycles for LoopJmpAl
9 cycles for LoopJmp

Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 11:59:34 AM

Here are the times for Dave's version on my quad. Really ain't much in it. :)

Code:


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 12:00:22 PM

Hmmm....
Dave's version yields virtually identical timings for all algos. The only change that gains a cycle is eliminating the first jump (my initial choice).

Code:

Intel(R) Celeron(R) M CPU
13      cycles for LoopJmpAlInc
13      cycles for LoopJmpAlAdd
13      cycles for LoopJmpAlSub
13      cycles for LoopJmpZxInc
12      cycles for LoopJmpZxIncJJ
13      cycles for LoopJmpZxAdd
13      cycles for LoopJmpZxSub

9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
8       cycles for LoopJmpZxIncJJ
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
6       cycles for LoopJmpZxIncJJ
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub

4       cycles for LoopJmpAlInc
4       cycles for LoopJmpAlAdd
4       cycles for LoopJmpAlSub
4       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxIncJJ
4       cycles for LoopJmpZxAdd
4       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
20      LoopJmpZxIncJJ
23      LoopJmpZxAdd
23      LoopJmpZxSub

Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 12:01:54 PM

Unrelated: who owns the SSE detect code? I would like to be able to port it to PowerBASIC.

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 12:03:07 PM

yes - i should have used that for all of them - saves a byte - lol
we have to stop giving Hutch opportunities to show off his quad :P
that ShowCPU proc is Jochen's
if you call it with 0/1, you get terse/verbose display
not sure how he feels about PowerBASIC - lol

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 12:15:00 PM

Quote from: hutch-- on December 27, 2009, 12:01:54 PM
Unrelated: who owns the SSE detect code? I would like to be able to port it to PowerBASIC.

I derived it from various sources, most of all Wikipedia. For the history, search the forum for ShowCPU.

Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 12:24:01 PM

:bg

So I can get away with blaming it on you. :bdg

Title: Re: Faster alternative to .While ... .Endw
Post by: hutch-- on December 27, 2009, 12:32:09 PM

Here are the results on my PIV.

Code:


Genuine Intel(R) CPU 3.80GHz (SSE3)
23      cycles for LoopJmpAlInc
23      cycles for LoopJmpAlAdd
23      cycles for LoopJmpAlSub
23      cycles for LoopJmpZxInc
23      cycles for LoopJmpZxAdd
23      cycles for LoopJmpZxSub

17      cycles for LoopJmpAlInc
23      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
15      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

15      cycles for LoopJmpAlInc
15      cycles for LoopJmpAlAdd
14      cycles for LoopJmpAlSub
15      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

11      cycles for LoopJmpAlInc
11      cycles for LoopJmpAlAdd
11      cycles for LoopJmpAlSub
11      cycles for LoopJmpZxInc
11      cycles for LoopJmpZxAdd
6       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: rags on December 27, 2009, 01:25:22 PM

I got a new box as a Christmas gift (HP Pavilion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.
I don't know if it's because of all the pre-installed sh*t that I haven't gotten around
to removing yet, Win 7, or something else.
It just seems to me that my timings should be better, given the processor.

Code:


AMD Athlon(tm) II X2 215 Processor (SSE3)
17      cycles for LoopDecAl
30      cycles for LoopDecZx
30      cycles for LoopDec
52      cycles for LoopWhile
48      cycles for LoopJmpAl
52      cycles for LoopJmp

20      cycles for LoopDecAl
55      cycles for LoopDecZx
27      cycles for LoopDec
52      cycles for LoopWhile
48      cycles for LoopJmpAl
55      cycles for LoopJmp

33      cycles for LoopDecAl
15      cycles for LoopDecZx
22      cycles for LoopDec
20      cycles for LoopWhile
20      cycles for LoopJmpAl
13      cycles for LoopJmp

15      cycles for LoopDecAl
16      cycles for LoopDecZx
21      cycles for LoopDec
11      cycles for LoopWhile
11      cycles for LoopJmpAl
12      cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 01:41:41 PM

i sometimes need to add a few lines of code to get consistent timings...

.
.
start:
   push 1
   call ShowCpu

invoke GetCurrentProcess
invoke SetProcessAffinityMask,eax,1

   ct = 0
.
.

that restricts execution to a single core

also, if the tests are brief, i change to REALTIME_PRIORITY_CLASS, rather than HIGH_PRIORITY_CLASS
that appears in each of the "counter_begin" macro calls
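
In C, the same two steps might look like the sketch below. SetProcessAffinityMask and SetPriorityClass are the Win32 calls involved; the wrapper name prepare_for_timing, its cpu parameter, and its return convention are made up for this example, and the non-Windows branch exists only so the file compiles elsewhere.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Pins the process to one logical CPU and raises its priority class.
   Returns 0 on success, -1 if a call failed or this is not Windows. */
int prepare_for_timing(unsigned cpu)
{
    assert(cpu < 32);                       /* keeps the shift below well defined */
#ifdef _WIN32
    {
        HANDLE self = GetCurrentProcess();  /* pseudo-handle, needs no CloseHandle */
        DWORD_PTR mask = (DWORD_PTR)1 << cpu;
        if (!SetProcessAffinityMask(self, mask))
            return -1;
        if (!SetPriorityClass(self, REALTIME_PRIORITY_CLASS))
            return -1;
        return 0;
    }
#else
    (void)cpu;
    return -1;
#endif
}
```

Calling prepare_for_timing(0) corresponds to the mask of 1 used above. Real-time priority can starve everything else on the machine, so it is best kept to brief tests.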

EDIT - i should also mention that these tests are in no way intended to benchmark your machine
they are intended to give you relative timings to compare one algorithm with another

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 06:57:51 PM

Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box as a Christmas gift (HP Pavilion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.

rags,
my condolences :thumbu
Just checked with my latest toy, an Olidata JumPC that cost me the horrendous sum of 99€ (142US$), Win XP included. It does the test in 3 cycles, claiming it has a Celeron M installed. Well...

Title: Re: Faster alternative to .While ... .Endw
Post by: 2-Bit Chip on December 27, 2009, 07:44:01 PM

Code:

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
43      cycles for LoopDecAl
45      cycles for LoopDec
47      cycles for LoopWhile
43      cycles for LoopJmpAl
50      cycles for LoopJmp

31      cycles for LoopDecAl
41      cycles for LoopDec
34      cycles for LoopWhile
30      cycles for LoopJmpAl
29      cycles for LoopJmp

26      cycles for LoopDecAl
32      cycles for LoopDec
29      cycles for LoopWhile
24      cycles for LoopJmpAl
33      cycles for LoopJmp

17      cycles for LoopDecAl
20      cycles for LoopDec
19      cycles for LoopWhile
17      cycles for LoopJmpAl
19      cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: MichaelW on December 27, 2009, 07:50:35 PM

P3:

Code:


(CPU brand string unreadable) (SSE1)
15      cycles for LoopDecAl
15      cycles for LoopDecZx
16      cycles for LoopDec
21      cycles for LoopWhile
21      cycles for LoopJmpAl
21      cycles for LoopJmp

10      cycles for LoopDecAl
10      cycles for LoopDecZx
11      cycles for LoopDec
14      cycles for LoopWhile
14      cycles for LoopJmpAl
14      cycles for LoopJmp

10      cycles for LoopDecAl
9       cycles for LoopDecZx
11      cycles for LoopDec
11      cycles for LoopWhile
11      cycles for LoopJmpAl
11      cycles for LoopJmp

5       cycles for LoopDecAl
5       cycles for LoopDecZx
5       cycles for LoopDec
6       cycles for LoopWhile
6       cycles for LoopJmpAl
6       cycles for LoopJmp

Code:


(CPU brand string unreadable) (SSE1)
21      cycles for LoopJmpAlInc
17      cycles for LoopJmpAlAdd
17      cycles for LoopJmpAlSub
16      cycles for LoopJmpZxInc
14      cycles for LoopJmpZxAdd
14      cycles for LoopJmpZxSub

14      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
12      cycles for LoopJmpAlSub
11      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

11      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
10      cycles for LoopJmpAlSub
8       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

6       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
4       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 09:58:28 PM

As far as I can see, this is the winning algo, for Core & Celeron & P3 & P4:

Code:

SkipLeadingWhiteSpace proc	; pSrc$:DWORD
	mov edx, [esp+4]	; get source string
	dec edx
@@:	inc edx
	movzx eax, byte ptr [edx]
	cmp al, "0"
	je @B
	cmp al, 0
	je @F
	cmp al, 32
	jle @B
@@:
  ret 4	; edx points to first non-"0" and non-white space char
SkipLeadingWhiteSpace endp

EDIT: Added check for zero byte - thanks Sinsi :U
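
A rough C model of the routine, handy for checking edge cases such as " 0" without an assembler. The name skip_leading is invented for this sketch; the order of the three tests follows the code above, and the signed char mirrors the fact that jle compares AL as a signed byte.

```c
#include <assert.h>

/* Returns a pointer to the first character that is neither "0" nor
   white space, or to the terminating zero byte if there is none. */
const char *skip_leading(const char *s)
{
    assert(s != 0);
    for (;; s++) {                          /* s++ models "@@: inc edx" */
        signed char c = (signed char)*s;    /* movzx eax, byte ptr [edx] */
        if (c == '0')
            continue;                       /* cmp al, "0" / je @B  */
        if (c == 0)
            break;                          /* cmp al, 0   / je @F  */
        if (c <= 32)
            continue;                       /* cmp al, 32  / jle @B */
        break;
    }
    return s;
}
```

For " 0" the returned pointer addresses the terminator, so the caller sees an empty string and no read happens past the end.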

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 27, 2009, 10:33:05 PM

a close race, eh Jochen ?

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 27, 2009, 10:41:40 PM

Quote from: dedndave on December 27, 2009, 10:33:05 PM
a close race, eh Jochen ?

Very close indeed. On the other hand, this is the kind of loop that typically does not iterate even once, so why waste a single cycle on it? What remains useful from this exercise might be that a dec ptr/inc ptr combination is slightly more efficient than the jmp generated by .While, which confirms my aversion to .While loops. In contrast, .Repeat ... .Until can't be beaten by a hand-coded loop.

Title: Re: Faster alternative to .While ... .Endw
Post by: sinsi on December 28, 2009, 12:26:56 AM

It doesn't seem to work too well for a string like " 0" though...no check for a terminating 00.
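
The issue can be shown in C without actually reading past a buffer. In this sketch (all names are hypothetical) the condition without a terminator check accepts the zero byte, because 0 <= 32, so a string made up only of white space and "0" gives the loop no reason to stop.

```c
#include <assert.h>
#include <stdbool.h>

/* Loop condition without a terminator check: true for the zero byte too. */
bool skipped_without_check(unsigned char c)
{
    return c <= 32 || c == '0';
}

/* Same condition with the terminator excluded. */
bool skipped_with_check(unsigned char c)
{
    return c != 0 && (c <= 32 || c == '0');
}

/* True if the unchecked loop would walk past the end of s, i.e. every
   byte of the string, terminator included, counts as skippable. */
bool runs_past_end(const char *s)
{
    assert(s != 0);
    while (*s != '\0') {
        if (!skipped_without_check((unsigned char)*s))
            return false;       /* the loop stops inside the string */
        s++;
    }
    return true;                /* the terminator is skipped as well */
}
```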

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 28, 2009, 08:07:12 AM

Quote from: sinsi on December 28, 2009, 12:26:56 AM
It doesn't seem to work too well for a string like " 0" though...no check for a terminating 00.

Assuming a string ends with a nullbyte, it would stop right there. By design :bg

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 28, 2009, 10:01:05 AM

nah - it kinda keeps going, Jochen - lol
but that isn't a requirement for the tests - the tests showed us what we wanted to know

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 28, 2009, 10:36:36 AM

Quote from: dedndave on December 28, 2009, 10:01:05 AM
nah - it kinda keeps going, Jochen - lol

So Sinsi was right :red
Corrected above. Surprisingly enough, it still runs the test in three cycles for db "This is a string", 0

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 28, 2009, 10:43:01 AM

that doesn't sound right - lol
figure at LEAST one clock cycle per byte :bg

EDIT - oh - lol
there are no bytes stripped in that example :bg

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on December 28, 2009, 10:52:14 AM

i might be inclined to make the tests in this order:

SkipLeadingWhiteSpace proc ;pSrc$:DWORD

mov edx,[esp+4] ;get source string
dec edx

@@: inc edx
movzx eax,byte ptr [edx]
or al,al
jz @F

cmp al, 32
jle @B

cmp al, "0"
je @B

@@: ret 4 ;edx points to first non-"0" and non-white space char

SkipLeadingWhiteSpace endp

test for the most likely first, when practical
we have to test for null first so that it is culled out before the white space test
this order assumes white space is more likely than "0" - that may not be the case
if leading "0" is more likely, test for that before null (like you have it)
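
The effect of the test order can be put into numbers with a small C model that counts tests instead of cycles. This is an illustration only: the function names are invented, 7-bit ASCII is assumed, and a test count says nothing about branch prediction or real timings.

```c
#include <assert.h>

/* Tests spent on byte c with the order "0" -> zero byte -> white space. */
int tests_zero_digit_first(unsigned char c)
{
    if (c == '0') return 1;
    if (c == 0)   return 2;
    return 3;                   /* the third test decides either way */
}

/* Tests spent on byte c with the order zero byte -> white space -> "0". */
int tests_null_first(unsigned char c)
{
    if (c == 0)  return 1;
    if (c <= 32) return 2;
    return 3;
}

/* Total tests needed to scan s up to and including the byte that stops the loop. */
int total_tests(const char *s, int null_first)
{
    int total = 0;
    assert(s != 0);
    for (;; s++) {
        unsigned char c = (unsigned char)*s;
        total += null_first ? tests_null_first(c) : tests_zero_digit_first(c);
        if (c == 0 || (c > 32 && c != '0'))
            return total;       /* both orders stop on the same byte */
    }
}
```

For "   7" the null-first order needs 9 tests against 12; for "0007" it is the other way round, 12 against 6. Which order wins depends on the input.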

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on December 28, 2009, 11:08:57 AM

Quote from: dedndave on December 28, 2009, 10:52:14 AM
if leading "0" is more likely, test for that before null (like you have it)

Excerpt from Windows.inc. The assumption is you used Instr for "equ", and added 4, so the string starts with "0":

Code:

ENM_SCROLLEVENTS                 equ 00000008h
ENM_DRAGDROPDONE                 equ 00000010h
ENM_PARAGRAPHEXPANDED            equ 00000020h
ENM_PAGECHANGE                   equ 00000040h
ENM_LANGCHANGE                   equ 01000000h
ENM_OBJECTPOSITIONS              equ 02000000h
ENM_LINK                         equ 04000000h
ENM_LOWFIRTF                     equ 08000000h
ES_NOOLEDRAGDROP                 equ 00000008h
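
A C sketch of that use case, with strstr standing in for Instr. The helper name equ_value is made up, and it searches for "equ " including the trailing space so that adding 4 always stays inside the line; lines whose names happen to contain "equ " are not handled.

```c
#include <assert.h>
#include <string.h>

/* Finds the value on an "equ" line and drops its leading zeros and
   white space.  Returns NULL if the line contains no "equ ". */
const char *equ_value(const char *line)
{
    const char *p;
    assert(line != NULL);
    p = strstr(line, "equ ");       /* stands in for Instr */
    if (p == NULL)
        return NULL;
    p += 4;                         /* step over "equ " */
    while (*p == '0' || (*p != '\0' && (unsigned char)*p <= 32))
        p++;                        /* stops at the terminator at the latest */
    return p;
}
```

With the ENM_SCROLLEVENTS line from the excerpt this yields "8h"; with ENM_LANGCHANGE it yields "1000000h". In both cases the very first character after "equ " is a "0".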

Title: Re: Faster alternative to .While ... .Endw
Post by: dacid on February 10, 2010, 11:01:46 PM

First program:

Code:


AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3       cycles for LoopDecAl
5       cycles for LoopDec
14      cycles for LoopWhile
4       cycles for LoopJmpAl
16      cycles for LoopJmp

20      cycles for LoopDecAl
9       cycles for LoopDec
10      cycles for LoopWhile
10      cycles for LoopJmpAl
-2      cycles for LoopJmp

8       cycles for LoopDecAl
-3      cycles for LoopDec
18      cycles for LoopWhile
18      cycles for LoopJmpAl
8       cycles for LoopJmp

-5      cycles for LoopDecAl
16      cycles for LoopDec
-7      cycles for LoopWhile
4       cycles for LoopJmpAl
25      cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Second:

Code:


AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3       cycles for LoopDecAl
2       cycles for LoopDecZx
2       cycles for LoopDec
15      cycles for LoopWhile
14      cycles for LoopJmpAl
24      cycles for LoopJmp

19      cycles for LoopDecAl
24      cycles for LoopDecZx
-1      cycles for LoopDec
-1      cycles for LoopWhile
10      cycles for LoopJmpAl
10      cycles for LoopJmp

-2      cycles for LoopDecAl
6       cycles for LoopDecZx
18      cycles for LoopDec
30      cycles for LoopWhile
8       cycles for LoopJmpAl
8       cycles for LoopJmp

-4      cycles for LoopDecAl
6       cycles for LoopDecZx
-5      cycles for LoopDec
14      cycles for LoopWhile
14      cycles for LoopJmpAl
15      cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: lingo on February 11, 2010, 03:26:57 AM

"Here are the times for Daves version on my quad."

Hutch, I received the same times: :wink

Code:


Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub
7       cycles for LoopJmpLingo

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub
5       cycles for LoopJmpLingo

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub
3       cycles for LoopJmpLingo

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub
2       cycles for LoopJmpLingo

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
43      LoopJmpLingo
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 04:34:43 AM

nice machine, Lingo

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
22 cycles for LoopJmpAlInc
21 cycles for LoopJmpAlAdd
21 cycles for LoopJmpAlSub
20 cycles for LoopJmpZxInc
20 cycles for LoopJmpZxAdd
27 cycles for LoopJmpZxSub
21 cycles for LoopJmpLingo

16 cycles for LoopJmpAlInc
20 cycles for LoopJmpAlAdd
17 cycles for LoopJmpAlSub
14 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
16 cycles for LoopJmpLingo

13 cycles for LoopJmpAlInc
15 cycles for LoopJmpAlAdd
15 cycles for LoopJmpAlSub
12 cycles for LoopJmpZxInc
14 cycles for LoopJmpZxAdd
13 cycles for LoopJmpZxSub
11 cycles for LoopJmpLingo

7 cycles for LoopJmpAlInc
8 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
8 cycles for LoopJmpZxAdd
8 cycles for LoopJmpZxSub
6 cycles for LoopJmpLingo

nice to finally get some repeatable numbers from my prescott :P

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on February 11, 2010, 08:18:39 AM

Compliments, Lingo! I have added a shortened version of your algo, 22 instead of 34 bytes, with almost identical timings.
EDIT: Since finding leading white space is not a frequent task, "inlining" instead of calling a proc might be more appropriate. So I added two inline versions. It turns out that the align 4 version is a shade faster.
EDIT(2): Two more inline versions added.

Code:

	mov eax, offset Src
	mov ecx, "00"
	dec eax
	align 4		; align may change flags in Masm
@@:	inc eax
	mov cl, ch
	sub cl, [eax]	; for [eax]==48, cl=0
	je @B
	cmp cl, 16	; for [eax]==32, cl=48-32=16
	jge @B

Quote
align 16
LoopJmpLingo_proc:      ; the original Lingo algo
LoopLingo:
   add   eax, 1
   mov   cl,   ch
   add   cl,   [eax]
   je      LoopLingo
   add   cl,   10h
   jle      LoopLingo
   jmp   edx
align 16
LoopJmpLingo    proc
   pop   edx
   mov   ecx,   0D0D0h
   pop   eax
   add   cl,   [eax]      ; for [eax]==48, cl=208+48=256 aka zero
   je      LoopLingo
   add   cl,   10h         ; for [eax]==32, cl=208+32+16=256 aka zero
   jle      LoopLingo
   jmp   edx
LoopJmpLingo    endp
LoopJmpLingo_endp:

align 16
LoopJmpLingoJ_proc:
LoopJmpLingoJ proc      ; variant to Lingo's code (http://www.masm32.com/board/index.php?topic=12984.msg104011#msg104011)
   pop edx      ; the return address
   mov ecx, 0D0D0h
   pop eax
   dec eax
@@:   inc eax
   mov cl, ch
   add cl, [eax]   ; for [eax]==48, cl=208+48=256 aka zero
   je @B
   add cl, 10h   ; for [eax]==32, cl=208+32+16=256 aka zero
   jle @B
   jmp edx
LoopJmpLingoJ endp
LoopJmpLingoJ_endp:
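
The add-based trick relies on 8-bit wraparound and on the flags that jle reads, which is easy to get wrong by eye. The C sketch below models just that part: add8 computes ZF, SF and OF for an 8-bit add, and lingo_skips replays the two adds with CH = 0D0h (208). All names are invented, and the check covers 7-bit characters only; bytes of 128 and above are not examined here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result and flags of an 8-bit "add a, b", limited to what je/jle read. */
struct flags8 { uint8_t result; bool zf, sf, of; };

struct flags8 add8(uint8_t a, uint8_t b)
{
    struct flags8 f;
    f.result = (uint8_t)(a + b);
    f.zf = f.result == 0;
    f.sf = (f.result & 0x80) != 0;
    /* signed overflow: both inputs share a sign that the result lacks */
    f.of = ((a ^ f.result) & (b ^ f.result) & 0x80) != 0;
    return f;
}

/* True if the loop jumps back for byte c, i.e. the byte is skipped. */
bool lingo_skips(uint8_t c)
{
    struct flags8 f = add8(0xD0, c);    /* mov cl, ch / add cl, [eax] */
    if (f.zf)
        return true;                    /* je: c == 48, i.e. "0" */
    f = add8(f.result, 0x10);           /* add cl, 10h */
    return f.zf || (f.sf != f.of);      /* jle: ZF=1 or SF<>OF */
}

/* Compares the trick with the plain test for every 7-bit character. */
void check_lingo_on_ascii(void)
{
    unsigned c;
    for (c = 0; c < 128; c++)
        assert(lingo_skips((uint8_t)c) == (c <= 32 || c == '0'));
}
```

Like the plain test, the trick also treats the zero byte as skippable, so a terminator check is still needed.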

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
8       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
6       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

7       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
3       cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

1       cycles for inline loop, add, align 16
1       cycles for inline loop, sub, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
34      LoopJmpLingo
22      LoopJmpLingoJ

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 10:59:05 AM

not so good on a prescott, Jochen

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
15 cycles for inline loop, add, align 16
14 cycles for inline loop, sub, align 4
20 cycles for LoopJmpLingo
44 cycles for LoopJmpLingoJ

13 cycles for inline loop, add, align 16
11 cycles for inline loop, sub, align 4
18 cycles for LoopJmpLingo
27 cycles for LoopJmpLingoJ

9 cycles for inline loop, add, align 16
8 cycles for inline loop, sub, align 4
11 cycles for LoopJmpLingo
14 cycles for LoopJmpLingoJ

5 cycles for inline loop, add, align 16
4 cycles for inline loop, sub, align 4
9 cycles for LoopJmpLingo
10 cycles for LoopJmpLingoJ

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on February 11, 2010, 11:06:57 AM

Dave,
Can you try the new version, please? The inline algos seem to perform well, and the "two byte immediates" variant is pretty short, too. If the mov eax, offset src happens to be two bytes later, the size is only 12 bytes.

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
7       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
5       cycles for inline loop, cmp two byte regs, align 4
5       cycles for inline loop, cmp two immediates, align 4
7       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

7       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
4       cycles for inline loop, cmp two byte regs, align 4
5       cycles for inline loop, cmp two immediates, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
2       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
3       cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

1       cycles for inline loop, add, align 16
1       cycles for inline loop, sub, align 4
1       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
20      inline, cmp, two byte regs
14      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 11:15:01 AM

prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
16 cycles for inline loop, add, align 16
14 cycles for inline loop, sub, align 4
17 cycles for inline loop, cmp two byte regs, align 4
19 cycles for inline loop, cmp two immediates, align 4
19 cycles for LoopJmpLingo
41 cycles for LoopJmpLingoJ

13 cycles for inline loop, add, align 16
12 cycles for inline loop, sub, align 4
10 cycles for inline loop, cmp two byte regs, align 4
14 cycles for inline loop, cmp two immediates, align 4
18 cycles for LoopJmpLingo
31 cycles for LoopJmpLingoJ

8 cycles for inline loop, add, align 16
8 cycles for inline loop, sub, align 4
9 cycles for inline loop, cmp two byte regs, align 4
7 cycles for inline loop, cmp two immediates, align 4
10 cycles for LoopJmpLingo
14 cycles for LoopJmpLingoJ

5 cycles for inline loop, add, align 16
5 cycles for inline loop, sub, align 4
5 cycles for inline loop, cmp two byte regs, align 4
4 cycles for inline loop, cmp two immediates, align 4
7 cycles for LoopJmpLingo
11 cycles for LoopJmpLingoJ

i don't think alignment is that critical for smaller loops, Jochen - at least not on a P4

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 11:25:32 AM

prescott - i have removed all align's

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
14 cycles for inline loop, add, no align
14 cycles for inline loop, sub, no align
11 cycles for inline loop, cmp two byte regs, no align
17 cycles for inline loop, cmp two immediates, no align
19 cycles for LoopJmpLingo
23 cycles for LoopJmpLingoJ

13 cycles for inline loop, add, no align
12 cycles for inline loop, sub, no align
11 cycles for inline loop, cmp two byte regs, no align
12 cycles for inline loop, cmp two immediates, no align
17 cycles for LoopJmpLingo
27 cycles for LoopJmpLingoJ

8 cycles for inline loop, add, no align
7 cycles for inline loop, sub, no align
9 cycles for inline loop, cmp two byte regs, no align
8 cycles for inline loop, cmp two immediates, no align
9 cycles for LoopJmpLingo
14 cycles for LoopJmpLingoJ

3 cycles for inline loop, add, no align
4 cycles for inline loop, sub, no align
4 cycles for inline loop, cmp two byte regs, no align
3 cycles for inline loop, cmp two immediates, no align
6 cycles for LoopJmpLingo
10 cycles for LoopJmpLingoJ

Sizes:
18 inline, add
18 inline, sub
20 inline, cmp, two byte regs
12 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ

Title: Re: Faster alternative to .While ... .Endw
Post by: jj2007 on February 11, 2010, 11:47:26 AM

Thanks, Dave. The timings look a little bit inconsistent, also on my machine, but it seems we can safely vote for the shortest version ;-)

Code:

	mov eax, offset Src
	dec eax
@@:	inc eax
	cmp byte ptr [eax], 48		; skip "0"...
	je @B
	cmp byte ptr [eax], 32		; and anything from space downwards
	jle @B

EDIT: And I forgot the end of string case...!

Code:

;	mov eax, offset Src
	dec eax		; no align before the loop - it's slower
@@:	inc eax
	cmp byte ptr [eax], 48	; "0"
	je @B
	cmp byte ptr [eax], 0	; zero delimiter?
	je @F
	cmp byte ptr [eax], 32	; " " or less
	jle @B
@@:

17 bytes starting with dec eax, 1 cycle for the default case (no 0 or space at string start).
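
The 17 bytes can be checked by writing out the encoding. The table below was assembled by hand for this note and not taken from the attachment; it assumes 32-bit code and that the assembler picks the short (rel8) form for all three jumps. The helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-assembled encoding of the loop, starting at "dec eax". */
const unsigned char skip_loop_code[] = {
    0x48,                   /*      dec eax                    1 byte  */
    0x40,                   /* @@:  inc eax                    1 byte  */
    0x80, 0x38, 0x30,       /*      cmp byte ptr [eax], 48     3 bytes */
    0x74, 0xFA,             /*      je  @B  (back 6)           2 bytes */
    0x80, 0x38, 0x00,       /*      cmp byte ptr [eax], 0      3 bytes */
    0x74, 0x05,             /*      je  @F  (forward 5)        2 bytes */
    0x80, 0x38, 0x20,       /*      cmp byte ptr [eax], 32     3 bytes */
    0x7E, 0xF0              /*      jle @B  (back 16)          2 bytes */
};

size_t skip_loop_size(void)
{
    return sizeof skip_loop_code;
}

/* Offset that the rel8 jump starting at offset pos lands on. */
int jump_target(int pos)
{
    assert(pos >= 0 && (size_t)pos + 1 < sizeof skip_loop_code);
    return pos + 2 + (signed char)skip_loop_code[pos + 1];
}
```

The sizes add up as 1 + 1 + 3 + 2 + 3 + 2 + 3 + 2 = 17. Dropping the zero-delimiter test (3 + 2 bytes) gives the 12 bytes listed for the version without the end-of-string check.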

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 11, 2010, 12:06:33 PM

not so fast - lol
it would be nice to see what difference, if any, alignment has on some other processors
from my testing on a P4, if it is a short jump to get back to the top of the loop, then alignment does little good
that may not be so for some of the more modern cores :U

EDIT - maybe we need a new thread to get some tests run in the purely "alignment/timing" category

Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 12:23:07 AM

Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box for a Christmas gift (HP Pavilion) with Win 7 and an Athlon II X2 250.
My timings seem horrible, using Jochen's original algo.

I got the same story on my new Q7 - I suspect some of the new processors are designed for data streaming and not computation. So we now have shitty computers but great televisions!
Congratz to us... grumble grumble... sigh two years til wife will let me replace.

Title: Re: Faster alternative to .While ... .Endw
Post by: BlackVortex on February 23, 2010, 06:21:43 AM

What CPU is that Q7? (Atom-based?)

Anyway, never trust laptop CPUs. Still, I'm sure the general performance is good; don't let some specific timings discourage you guys.

Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 11:48:57 AM

Sorry, my bad. Meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.

Title: Re: Faster alternative to .While ... .Endw
Post by: dedndave on February 23, 2010, 01:16:33 PM

WryBugz
these timing tests are not intended to benchmark your machine
comparing clock cycles on one machine to clock cycles on another machine is a little like comparing apples to oranges
from what i know, the i7 is a good performer - evidenced by the fact that you are happy with the overall performance

the information that is meaningful is the performance ratio of one method to another on any given machine
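
As a rough illustration of that point, the sketch below takes the first block of timings posted earlier in this thread for the Celeron M and the Prescott P4 and compares LoopDecAl against LoopWhile on each machine. The `timings` table and the `ratio` helper are hypothetical names used only for this example; the cycle counts are the ones from those two posts.

```python
# Cycle counts from the first timing block of two earlier posts in this thread.
timings = {
    "Celeron M":   {"LoopDecAl": 12, "LoopWhile": 13},
    "P4 Prescott": {"LoopDecAl": 32, "LoopWhile": 21},
}


def ratio(machine, method, baseline="LoopWhile"):
    """Cycles of `method` relative to `baseline` on the same machine."""
    t = timings[machine]
    return t[method] / t[baseline]


for machine in timings:
    r = ratio(machine, "LoopDecAl")
    print(f"{machine}: LoopDecAl / LoopWhile = {r:.2f}")
```

A ratio below 1 means LoopDecAl is the faster of the two on that machine, above 1 means it is slower. On these numbers it is slightly faster on the Celeron M and clearly slower on the Prescott, which is something the raw cycle counts of two different machines cannot tell you when set side by side.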

Title: Re: Faster alternative to .While ... .Endw
Post by: BlackVortex on February 23, 2010, 01:52:19 PM

Quote from: WryBugz on February 23, 2010, 11:48:57 AM
Sorry, my bad. Meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.

Could you post the timings with the exe posted in this thread? I'm curious about something.
http://www.masm32.com/board/index.php?topic=13385.0

Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 10:06:41 PM

Intel Core i7

10 cycles for LoopDecAl
10 cycles for LoopDec
15 cycles for LoopWhile
17 cycles for LoopJmpAl
17 cycles for LoopJmp

6 cycles for LoopDecAl
6 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
12 cycles for LoopJmp

3 cycles for LoopDecAl
3 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
6 cycles for LoopJmp

Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
There you go.
I guess I don't understand the processor differences enough, Dave. I thought that was the reason for the machine citation.
I am just a hobbyist and while having put in a lot of time, my knowledge is pretty erratic. For instance, I just realized this past week that eax - edx are not all equal.

Title: Re: Faster alternative to .While ... .Endw
Post by: WryBugz on February 23, 2010, 10:12:44 PM

The other one....

Loop
991 clock cycles
1054 clock cycles
1045 clock cycles
Dec ECX
505 clock cycles
505 clock cycles
513 clock cycles
Press any key to continue ...

From the link.

Title: Re: Faster alternative to .While ... .Endw
Post by: Gunner on February 23, 2010, 10:40:22 PM

Here is mine...
Intel(R) Pentium(R) 4 CPU 2.40GHz (SSE2)
6 cycles for inline loop, add, no align
4 cycles for inline loop, sub, no align
7 cycles for inline loop, cmp two byte regs, no align
3 cycles for inline loop, cmp two immediates, no align
5 cycles for LoopJmpLingo
9 cycles for LoopJmpLingoJ

4 cycles for inline loop, add, no align
8 cycles for inline loop, sub, no align
2 cycles for inline loop, cmp two byte regs, no align
2 cycles for inline loop, cmp two immediates, no align
5 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ

-2 cycles for inline loop, add, no align
-2 cycles for inline loop, sub, no align
6 cycles for inline loop, cmp two byte regs, no align
-4 cycles for inline loop, cmp two immediates, no align
0 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ

-7 cycles for inline loop, add, no align
5 cycles for inline loop, sub, no align
-2 cycles for inline loop, cmp two byte regs, no align
-7 cycles for inline loop, cmp two immediates, no align
-2 cycles for LoopJmpLingo
4 cycles for LoopJmpLingoJ

Sizes:
18 inline, add
18 inline, sub
20 inline, cmp, two byte regs
12 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: BlackVortex on February 24, 2010, 06:15:19 AM

Quote from: WryBugz on February 23, 2010, 10:12:44 PM
The other one....

Loop
991 clock cycles
1054 clock cycles
1045 clock cycles
Dec ECX
505 clock cycles
505 clock cycles
513 clock cycles
Press any key to continue ...

From the link.

Why does it take twice as many cycles as mine? Not very efficient.

Title: Re: Faster alternative to .While ... .Endw
Post by: FairLight on March 11, 2010, 06:08:00 AM

...more data...

Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz (SSE4)
7 cycles for LoopDecAl
7 cycles for LoopDecZx
9 cycles for LoopDec
9 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp

5 cycles for LoopDecAl
5 cycles for LoopDecZx
5 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
7 cycles for LoopJmp

3 cycles for LoopDecAl
3 cycles for LoopDecZx
3 cycles for LoopDec
5 cycles for LoopWhile
5 cycles for LoopJmpAl
5 cycles for LoopJmp

1 cycles for LoopDecAl
1 cycles for LoopDecZx
1 cycles for LoopDec
3 cycles for LoopWhile
3 cycles for LoopJmpAl
3 cycles for LoopJmp

Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---

Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz (SSE4)
9 cycles for LoopJmpAlInc
9 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
9 cycles for LoopJmpZxInc
9 cycles for LoopJmpZxAdd
9 cycles for LoopJmpZxSub
7 cycles for LoopJmpLingo

7 cycles for LoopJmpAlInc
7 cycles for LoopJmpAlAdd
7 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
7 cycles for LoopJmpZxAdd
7 cycles for LoopJmpZxSub
5 cycles for LoopJmpLingo

5 cycles for LoopJmpAlInc
5 cycles for LoopJmpAlAdd
5 cycles for LoopJmpAlSub
5 cycles for LoopJmpZxInc
5 cycles for LoopJmpZxAdd
5 cycles for LoopJmpZxSub
3 cycles for LoopJmpLingo

3 cycles for LoopJmpAlInc
3 cycles for LoopJmpAlAdd
3 cycles for LoopJmpAlSub
3 cycles for LoopJmpZxInc
3 cycles for LoopJmpZxAdd
3 cycles for LoopJmpZxSub
2 cycles for LoopJmpLingo

Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
43 LoopJmpLingo
--- ok ---

Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz (SSE4)
6 cycles for inline loop, add, no align
9 cycles for inline loop, sub, no align
3 cycles for inline loop, cmp two byte regs, no align
2 cycles for inline loop, cmp two immediates, no align
8 cycles for LoopJmpLingo
11 cycles for LoopJmpLingoJ

23 cycles for inline loop, add, no align
22 cycles for inline loop, sub, no align
1 cycles for inline loop, cmp two byte regs, no align
1 cycles for inline loop, cmp two immediates, no align
6 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ

20 cycles for inline loop, add, no align
10 cycles for inline loop, sub, no align
0 cycles for inline loop, cmp two byte regs, no align
0 cycles for inline loop, cmp two immediates, no align
4 cycles for LoopJmpLingo
5 cycles for LoopJmpLingoJ

0 cycles for inline loop, add, no align
0 cycles for inline loop, sub, no align
0 cycles for inline loop, cmp two byte regs, no align
0 cycles for inline loop, cmp two immediates, no align
1 cycles for LoopJmpLingo
2 cycles for LoopJmpLingoJ

Sizes:
18 inline, add
18 inline, sub
20 inline, cmp, two byte regs
12 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
--- ok ---

Loop
1294 clock cycles
1294 clock cycles
1294 clock cycles
Dec ECX
279 clock cycles
279 clock cycles
279 clock cycles
Press any key to continue ...

Title: Re: Faster alternative to .While ... .Endw
Post by: joemc on March 30, 2010, 04:20:37 AM

Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz (SSE4)
8 cycles for LoopDecAl
10 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
13 cycles for LoopJmp

6 cycles for LoopDecAl
6 cycles for LoopDec
9 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp

4 cycles for LoopDecAl
4 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
7 cycles for LoopJmp

2 cycles for LoopDecAl
2 cycles for LoopDec
4 cycles for LoopWhile
4 cycles for LoopJmpAl
4 cycles for LoopJmp

Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp

-------------

3 cycles for inline loop, add, align 16
6 cycles for inline loop, sub, align 4
3 cycles for inline loop, cmp two byte regs, align 4
3 cycles for inline loop, cmp two immediates, align 4
6 cycles for LoopJmpLingo
7 cycles for LoopJmpLingoJ

3 cycles for inline loop, add, align 16
2 cycles for inline loop, sub, align 4
1 cycles for inline loop, cmp two byte regs, align 4
1 cycles for inline loop, cmp two immediates, align 4
5 cycles for LoopJmpLingo
7 cycles for LoopJmpLingoJ

0 cycles for inline loop, add, align 16
0 cycles for inline loop, sub, align 4
0 cycles for inline loop, cmp two byte regs, align 4
0 cycles for inline loop, cmp two immediates, align 4
2 cycles for LoopJmpLingo
3 cycles for LoopJmpLingoJ

0 cycles for inline loop, add, align 16
0 cycles for inline loop, sub, align 4
0 cycles for inline loop, cmp two byte regs, align 4
0 cycles for inline loop, cmp two immediates, align 4
2 cycles for LoopJmpLingo
2 cycles for LoopJmpLingoJ

Sizes:
23 inline, add
19 inline, sub
20 inline, cmp, two byte regs
14 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
--- ok ---

Title: Re: Faster alternative to .While ... .Endw
Post by: Greenhorn__ on April 13, 2010, 12:00:23 AM

Hi,

here are my results ...

LoopDecWhile.exe (2nd version)

Code Select


AMD Phenom(tm) II X4 955 Processor (SSE3)
16	cycles for LoopDecAl
16	cycles for LoopDecZx
9	cycles for LoopDec
15	cycles for LoopWhile
14	cycles for LoopJmpAl
15	cycles for LoopJmp


8	cycles for LoopDecAl
17	cycles for LoopDecZx
8	cycles for LoopDec
26	cycles for LoopWhile
29	cycles for LoopJmpAl
26	cycles for LoopJmp


7	cycles for LoopDecAl
7	cycles for LoopDecZx
7	cycles for LoopDec
6	cycles for LoopWhile
6	cycles for LoopJmpAl
6	cycles for LoopJmp


4	cycles for LoopDecAl
4	cycles for LoopDecZx
6	cycles for LoopDec
2	cycles for LoopWhile
2	cycles for LoopJmpAl
2	cycles for LoopJmp

Sizes:
19	LoopDecAl
26	LoopDecZx
19	LoopDec
20	LoopWhile
20	LoopJmpAl
20	LoopJmp
--- ok ---

... and for IncAddSub.exe

Code Select


AMD Phenom(tm) II X4 955 Processor (SSE3)
14	cycles for LoopJmpAlInc
15	cycles for LoopJmpAlAdd
15	cycles for LoopJmpAlSub
15	cycles for LoopJmpZxInc
15	cycles for LoopJmpZxAdd
15	cycles for LoopJmpZxSub


27	cycles for LoopJmpAlInc
29	cycles for LoopJmpAlAdd
27	cycles for LoopJmpAlSub
29	cycles for LoopJmpZxInc
27	cycles for LoopJmpZxAdd
27	cycles for LoopJmpZxSub


6	cycles for LoopJmpAlInc
6	cycles for LoopJmpAlAdd
6	cycles for LoopJmpAlSub
6	cycles for LoopJmpZxInc
6	cycles for LoopJmpZxAdd
6	cycles for LoopJmpZxSub


2	cycles for LoopJmpAlInc
2	cycles for LoopJmpAlAdd
2	cycles for LoopJmpAlSub
2	cycles for LoopJmpZxInc
2	cycles for LoopJmpZxAdd
2	cycles for LoopJmpZxSub

Sizes:
20	LoopJmpAlInc
22	LoopJmpAlAdd
22	LoopJmpAlSub
21	LoopJmpZxInc
23	LoopJmpZxAdd
23	LoopJmpZxSub
--- ok ---

Regards
Greenhorn

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on December 27, 2009, 09:32:10 AM