Faster alternative to .While ... .Endw

Started by jj2007, December 27, 2009, 09:32:10 AM

dedndave

nice machine, Lingo

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
22      cycles for LoopJmpAlInc
21      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
20      cycles for LoopJmpZxInc
20      cycles for LoopJmpZxAdd
27      cycles for LoopJmpZxSub
21      cycles for LoopJmpLingo

16      cycles for LoopJmpAlInc
20      cycles for LoopJmpAlAdd
17      cycles for LoopJmpAlSub
14      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub
16      cycles for LoopJmpLingo

13      cycles for LoopJmpAlInc
15      cycles for LoopJmpAlAdd
15      cycles for LoopJmpAlSub
12      cycles for LoopJmpZxInc
14      cycles for LoopJmpZxAdd
13      cycles for LoopJmpZxSub
11      cycles for LoopJmpLingo

7       cycles for LoopJmpAlInc
8       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
8       cycles for LoopJmpZxAdd
8       cycles for LoopJmpZxSub
6       cycles for LoopJmpLingo

nice to finally get some repeatable numbers from my prescott   :P

jj2007

#31
Compliments, Lingo! I have added a shortened version of your algo, 22 instead of 34 bytes, with almost identical timings.
EDIT: Since finding leading white space is not a frequent task, "inlining" instead of calling a proc might be more appropriate. So I added two inline versions. It turns out that the align 4 version is slightly faster.
EDIT(2): Two more inline versions added.
mov eax, offset Src
mov ecx, "00"
dec eax
align 4 ; align may change flags in Masm
@@: inc eax
mov cl, ch
sub cl, [eax] ; for [eax]==48, cl=0
je @B
cmp cl, 16 ; for [eax]==32, cl=48-32=16
jge @B
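
To check the arithmetic in the comments above, here is a minimal Python model of one pass through the inline "sub" loop. It is only a sketch: `to_signed8`, `loops_back_sub` and `first_kept_index` are hypothetical helper names, not part of the posted code, and the model assumes `ch` holds 48 (from `mov ecx, "00"`) and that `jge` after `cmp` behaves as a signed byte comparison.

```python
def to_signed8(value: int) -> int:
    """Interpret the low 8 bits of value as a signed byte (-128..127)."""
    value &= 0xFF
    return value - 256 if value >= 0x80 else value


def loops_back_sub(byte: int) -> bool:
    """Model one pass of the inline 'sub' loop for the byte at [eax].

    Returns True when either je @B or jge @B would be taken.
    """
    cl = (0x30 - byte) & 0xFF        # mov cl, ch / sub cl, [eax], with ch = 48
    if cl == 0:                      # je @B: the byte was "0"
        return True
    return to_signed8(cl) >= 16      # cmp cl, 16 / jge @B (signed compare)


def first_kept_index(data: bytes) -> int:
    """Index of the first byte the loop would stop on (hypothetical helper)."""
    for index, byte in enumerate(data):
        if not loops_back_sub(byte):
            return index
    return len(data)
```

In this model a zero byte also loops back (48 - 0 = 48, which is >= 16), so a string holding only blanks and zeros would be scanned past its terminator.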


Quote
align 16
LoopJmpLingo_proc:      ; the original Lingo algo
LoopLingo:
   add   eax, 1
   mov   cl,   ch
   add   cl,   [eax]
   je      LoopLingo
   add   cl,   10h
   jle      LoopLingo
   jmp   edx
align 16
LoopJmpLingo    proc
   pop   edx
   mov   ecx,   0D0D0h
   pop   eax
   add   cl,   [eax]      ; for [eax]==48, cl=208+48=256 aka zero
   je      LoopLingo
   add   cl,   10h         ; for [eax]==32, cl=208+32+16=256 aka zero
   jle      LoopLingo
   jmp   edx
LoopJmpLingo    endp
LoopJmpLingo_endp:

align 16
LoopJmpLingoJ_proc:
LoopJmpLingoJ proc      ; variant to Lingo's code
   pop edx      ; the return address
   mov ecx, 0D0D0h
   pop eax
   dec eax
@@:   inc eax
   mov cl, ch
   add cl, [eax]   ; for [eax]==48, cl=208+48=256 aka zero
   je @B
   add cl, 10h   ; for [eax]==32, cl=208+32+16=256 aka zero
   jle @B
   jmp edx
LoopJmpLingoJ endp
LoopJmpLingoJ_endp:
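
The wraparound trick in the "add" variants (208+48=256 "aka zero") can be checked with a small Python model. This is a sketch under stated assumptions: `to_signed8` and `loops_back_add` are hypothetical names, `ch` is assumed to hold 0D0h, and `jle` after `add` is modelled as "the exact signed sum is less than or equal to zero", which is what ZF=1 or SF<>OF amounts to for these operands.

```python
def to_signed8(value: int) -> int:
    """Interpret the low 8 bits of value as a signed byte (-128..127)."""
    value &= 0xFF
    return value - 256 if value >= 0x80 else value


def loops_back_add(byte: int) -> bool:
    """Model one pass of the 'add' loop for the byte at [eax].

    Returns True when either je or jle would jump back to the loop top.
    """
    cl = (0xD0 + byte) & 0xFF            # mov cl, ch / add cl, [eax], with ch = 208
    if cl == 0:                          # je: 208 + 48 wraps to zero for "0"
        return True
    exact_sum = to_signed8(cl) + 0x10    # add cl, 10h, before truncation to 8 bits
    return exact_sum <= 0                # jle: zero or negative signed result


# Bytes in the ASCII range that the loop skips, according to this model.
SKIPPED_ASCII = [b for b in range(128) if loops_back_add(b)]
```

For the ASCII range the model skips exactly the bytes 0 to 32 plus 48, which matches the comments in the listings; note that the zero terminator is in that set.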

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
8       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
6       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

7       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
3       cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

1       cycles for inline loop, add, align 16
1       cycles for inline loop, sub, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
34      LoopJmpLingo
22      LoopJmpLingoJ

dedndave

not so good on a prescott, Jochen

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
15      cycles for inline loop, add, align 16
14      cycles for inline loop, sub, align 4
20      cycles for LoopJmpLingo
44      cycles for LoopJmpLingoJ

13      cycles for inline loop, add, align 16
11      cycles for inline loop, sub, align 4
18      cycles for LoopJmpLingo
27      cycles for LoopJmpLingoJ

9       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
11      cycles for LoopJmpLingo
14      cycles for LoopJmpLingoJ

5       cycles for inline loop, add, align 16
4       cycles for inline loop, sub, align 4
9       cycles for LoopJmpLingo
10      cycles for LoopJmpLingoJ

jj2007

Dave,
Can you try the new version, please? The inline algos seem to perform well, and the "two byte immediates" variant is pretty short, too. If the mov eax, offset src happens to be two bytes later, the size is only 12 bytes.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
7       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
5       cycles for inline loop, cmp two byte regs, align 4
5       cycles for inline loop, cmp two immediates, align 4
7       cycles for LoopJmpLingo
8       cycles for LoopJmpLingoJ

7       cycles for inline loop, add, align 16
6       cycles for inline loop, sub, align 4
4       cycles for inline loop, cmp two byte regs, align 4
5       cycles for inline loop, cmp two immediates, align 4
6       cycles for LoopJmpLingo
7       cycles for LoopJmpLingoJ

3       cycles for inline loop, add, align 16
2       cycles for inline loop, sub, align 4
2       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
3       cycles for LoopJmpLingo
4       cycles for LoopJmpLingoJ

1       cycles for inline loop, add, align 16
1       cycles for inline loop, sub, align 4
1       cycles for inline loop, cmp two byte regs, align 4
1       cycles for inline loop, cmp two immediates, align 4
2       cycles for LoopJmpLingo
2       cycles for LoopJmpLingoJ

Sizes:
23      inline, add
19      inline, sub
20      inline, cmp, two byte regs
14      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ

dedndave

prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
16      cycles for inline loop, add, align 16
14      cycles for inline loop, sub, align 4
17      cycles for inline loop, cmp two byte regs, align 4
19      cycles for inline loop, cmp two immediates, align 4
19      cycles for LoopJmpLingo
41      cycles for LoopJmpLingoJ

13      cycles for inline loop, add, align 16
12      cycles for inline loop, sub, align 4
10      cycles for inline loop, cmp two byte regs, align 4
14      cycles for inline loop, cmp two immediates, align 4
18      cycles for LoopJmpLingo
31      cycles for LoopJmpLingoJ

8       cycles for inline loop, add, align 16
8       cycles for inline loop, sub, align 4
9       cycles for inline loop, cmp two byte regs, align 4
7       cycles for inline loop, cmp two immediates, align 4
10      cycles for LoopJmpLingo
14      cycles for LoopJmpLingoJ

5       cycles for inline loop, add, align 16
5       cycles for inline loop, sub, align 4
5       cycles for inline loop, cmp two byte regs, align 4
4       cycles for inline loop, cmp two immediates, align 4
7       cycles for LoopJmpLingo
11      cycles for LoopJmpLingoJ

i don't think alignment is that critical for smaller loops, Jochen - at least not on a P4

dedndave

prescott - i have removed all align's

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
14      cycles for inline loop, add, no align
14      cycles for inline loop, sub, no align
11      cycles for inline loop, cmp two byte regs, no align
17      cycles for inline loop, cmp two immediates, no align
19      cycles for LoopJmpLingo
23      cycles for LoopJmpLingoJ

13      cycles for inline loop, add, no align
12      cycles for inline loop, sub, no align
11      cycles for inline loop, cmp two byte regs, no align
12      cycles for inline loop, cmp two immediates, no align
17      cycles for LoopJmpLingo
27      cycles for LoopJmpLingoJ

8       cycles for inline loop, add, no align
7       cycles for inline loop, sub, no align
9       cycles for inline loop, cmp two byte regs, no align
8       cycles for inline loop, cmp two immediates, no align
9       cycles for LoopJmpLingo
14      cycles for LoopJmpLingoJ

3       cycles for inline loop, add, no align
4       cycles for inline loop, sub, no align
4       cycles for inline loop, cmp two byte regs, no align
3       cycles for inline loop, cmp two immediates, no align
6       cycles for LoopJmpLingo
10      cycles for LoopJmpLingoJ

Sizes:
18      inline, add
18      inline, sub
20      inline, cmp, two byte regs
12      inline, cmp, two byte immediates
34      LoopJmpLingo
22      LoopJmpLingoJ

jj2007

#36
Thanks, Dave. The timings look a little bit inconsistent, also on my machine, but it seems we can safely vote for the shortest version ;-)

mov eax, offset Src
dec eax
@@: inc eax
cmp byte ptr [eax], 48 ; skip "0"...
je @B
cmp byte ptr [eax], 32 ; and anything from space downwards
jle @B


EDIT: And I forgot the end of string case...!

; mov eax, offset Src
dec eax ; no align before the loop - it's slower
@@: inc eax
cmp byte ptr [eax], 48 ; "0"
je @B
cmp byte ptr [eax], 0 ; zero delimiter?
je @F
cmp byte ptr [eax], 32 ; " " or less
jle @B
@@:


17 bytes starting with dec eax, 1 cycle for the default case (no 0 or space at string start).
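
A rough Python model of the version with the end-of-string test, plus a tally of the code size, may help when checking the numbers. It is a sketch only: `skip_leading` and `SIZES` are hypothetical names, the input is assumed to be zero-terminated, and the byte counts assume MASM picks the imm8 form of `cmp byte ptr [eax]` (3 bytes) and short jumps (2 bytes).

```python
def to_signed8(value: int) -> int:
    """Interpret the low 8 bits of value as a signed byte (-128..127)."""
    value &= 0xFF
    return value - 256 if value >= 0x80 else value


def skip_leading(src: bytes) -> int:
    """Return the offset where the loop exits; src must end with a zero byte."""
    i = 0                                # eax after the first inc eax
    while True:
        b = src[i]
        if b == 48:                      # cmp byte ptr [eax], 48 / je @B
            i += 1
            continue
        if b == 0:                       # cmp byte ptr [eax], 0 / je @F
            return i
        if to_signed8(b) <= 32:          # cmp byte ptr [eax], 32 / jle @B (signed)
            i += 1
            continue
        return i


# Assumed encodings: one-byte inc/dec, imm8 compares, short conditional jumps.
SIZES = [
    ("dec eax", 1),
    ("inc eax", 1),
    ("cmp byte ptr [eax], 48", 3),
    ("je @B", 2),
    ("cmp byte ptr [eax], 0", 3),
    ("je @F", 2),
    ("cmp byte ptr [eax], 32", 3),
    ("jle @B", 2),
]
TOTAL_BYTES = sum(size for _, size in SIZES)
```

Because `jle` is a signed test, bytes of 80h and above also count as "space or less" in this model; that should not matter for plain ASCII input.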

dedndave

not so fast - lol
it would be nice to see what difference, if any, alignment has on some other processors
from my testing on a P4, if it is a short jump to get back to the top of the loop, then alignment does little good
that may not be so for some of the more modern cores   :U

EDIT - maybe we need a new thread to get some tests run in the purely "alignment/timing" category

WryBugz

Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box for a Christmas gift(hp Pavillion) with Win 7 and an Athlon II x2 250.
My timings seem horrible, using Jochen's original algo.

I got the same story on my new Q7 - I suspect some of the new processors are designed for data streaming and not computation. So we now have shitty computers but great televisions!
Congratz to us... grumble grumble... sigh two years til wife will let me replace.

BlackVortex

What cpu is that Q7 ? (atom-based?)

Anyway, never trust laptop cpus. Still I'm sure the general performance is good, don't let some specific timings discourage you guys.

WryBugz

Sorry, my bad. Meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.

dedndave

WryBugz
these timing tests are not intended to benchmark your machine
comparing clock cycles on one machine to clock cycles on another machine is a little like comparing apples to oranges
from what i know, the i7 is a good performer - evidenced by the fact that you are happy with the overall performance

the information that is meaningful is the performance ratio of one method to another on any given machine
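
As a small illustration of comparing ratios rather than raw cycles, the Python sketch below takes the first timing block of each machine from the six-algorithm runs posted above and scales it to LoopJmpLingo on the same machine. The `relative` helper and the dictionary names are hypothetical; only the cycle counts come from the thread.

```python
# First timing block from each machine, cycles as posted in the thread.
CELERON_M = {
    "inline add, align 16": 7,
    "inline sub, align 4": 8,
    "cmp two byte regs": 5,
    "cmp two immediates": 5,
    "LoopJmpLingo": 7,
    "LoopJmpLingoJ": 8,
}
PRESCOTT_P4 = {
    "inline add, align 16": 16,
    "inline sub, align 4": 14,
    "cmp two byte regs": 17,
    "cmp two immediates": 19,
    "LoopJmpLingo": 19,
    "LoopJmpLingoJ": 41,
}


def relative(timings: dict, baseline: str) -> dict:
    """Scale every timing to the baseline algorithm on the same machine."""
    base = timings[baseline]
    return {name: cycles / base for name, cycles in timings.items()}


if __name__ == "__main__":
    for label, timings in (("Celeron M", CELERON_M), ("Prescott P4", PRESCOTT_P4)):
        print(label)
        for name, ratio in relative(timings, "LoopJmpLingo").items():
            print(f"  {ratio:5.2f}  {name}")
```

Read this way, LoopJmpLingoJ costs about 8/7 of LoopJmpLingo on the Celeron M but 41/19 on the Prescott, which says more than the fact that 41 is bigger than 8.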

BlackVortex

Quote from: WryBugz on February 23, 2010, 11:48:57 AM
Sorry, my bad. Meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.
Could you post the timings with the exe posted at this thread ? I'm curious about something.
http://www.masm32.com/board/index.php?topic=13385.0

WryBugz

Intel Core I7

10      cycles for LoopDecAl
10      cycles for LoopDec
15      cycles for LoopWhile
17      cycles for LoopJmpAl
17      cycles for LoopJmp

6       cycles for LoopDecAl
6       cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
12      cycles for LoopJmp

3       cycles for LoopDecAl
3       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
6       cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---
There you go.
I guess I don't understand the processor differences enough, Dave. I thought that was the reason for the machine citation.
I am just a hobbyist and while having put in a lot of time, my knowledge is pretty erratic. For instance, I just realized this past week that eax - edx are not all equal.

WryBugz

The other one....

Loop
991     clock cycles
1054    clock cycles
1045    clock cycles
Dec ECX
505     clock cycles
505     clock cycles
513     clock cycles
Press any key to continue ...


From the link.