Faster alternative to .While ... .Endw

Started by jj2007, December 27, 2009, 09:32:10 AM

jj2007

A simple loop to get rid of leading white space and zeroes:
  mov edx, [esp+4] ; get address of source string
  .While byte ptr [edx]<=32 || byte ptr [edx]=="0"
    inc edx
  .Endw

This version is a little bit faster on my Celeron M ("Core", not Core 2) CPU:
  mov edx, [esp+4]
  dec edx          ; compensate for the first inc
@@: inc edx
  mov al, [edx]
  cmp al, 32
  jle @B           ; signed compare, so bytes above 127 are skipped too
  cmp al, "0"
  je @B

Can somebody post timings for a P4 please?
Thanks, JJ

Intel(R) Celeron(R) M CPU
12      cycles for LoopDecAl   3 leading discardable chars
13      cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

8       cycles for LoopDecAl   2 chars
9       cycles for LoopDec
9       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp

6       cycles for LoopDecAl   1 char
7       cycles for LoopDec
7       cycles for LoopWhile
7       cycles for LoopJmpAl
7       cycles for LoopJmp

3       cycles for LoopDecAl   none
3       cycles for LoopDec
4       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp

hutch--

JJ,

Try this on a PIV.


; mov al, [edx]
movzx eax, BYTE PTR [edx]


Also try timing ADD and SUB against INC and DEC, as I have found that they are still faster on this Core quad. Intel has published its preference for ADD/SUB on a PIV, and as far as I have seen it's no slower on other hardware.
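
For reference, a rough sketch of how both suggestions might look when applied to the loop from the first post. It is an illustration only, not the source of any routine timed in this thread. The unsigned branch (jbe) is an assumption of the sketch: after movzx the byte sits in eax as a value from 0 to 255, so a signed jle would no longer skip bytes above 127 the way the mov al version does.

```asm
    mov edx, [esp+4]            ; address of source string
    sub edx, 1                  ; SUB instead of DEC
@@: add edx, 1                  ; ADD instead of INC
    movzx eax, BYTE PTR [edx]   ; zero-extend the byte into the full register
    cmp eax, 32
    jbe @B                      ; skip space and control chars (unsigned compare)
    cmp eax, "0"
    je @B                       ; skip leading zeroes
```

Like the original loop, this treats the zero terminator as discardable (0 <= 32), so it assumes the string holds at least one character that stops the scan.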
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

prescott

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
32      cycles for LoopDecAl
35      cycles for LoopDec
21      cycles for LoopWhile
23      cycles for LoopJmpAl
22      cycles for LoopJmp

20      cycles for LoopDecAl
71      cycles for LoopDec
41      cycles for LoopWhile
17      cycles for LoopJmpAl
15      cycles for LoopJmp

16      cycles for LoopDecAl
19      cycles for LoopDec
13      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

8       cycles for LoopDecAl
11      cycles for LoopDec
7       cycles for LoopWhile
9       cycles for LoopJmpAl
9       cycles for LoopJmp

jj2007

Quote from: hutch-- on December 27, 2009, 09:58:48 AM
Try this on a PIV.


; mov al, [edx]
movzx eax, BYTE PTR [edx]


Also try and time ADD and SUB as against INC and DEC
Attached, as LoopDecZx. Not faster on my Celeron, but maybe on a PIV it helps.
@DednDave: Thanks. Inconsistent timings as always, the Prescott is difficult to time...
Intel(R) Celeron(R) M CPU
13      cycles for LoopDecAl
12      cycles for LoopDecZx
14      cycles for LoopDec
14      cycles for LoopWhile
13      cycles for LoopJmpAl
13      cycles for LoopJmp

9       cycles for LoopDecAl
9       cycles for LoopDecZx
10      cycles for LoopDec
9       cycles for LoopWhile
10      cycles for LoopJmpAl
9       cycles for LoopJmp

7       cycles for LoopDecAl
6       cycles for LoopDecZx
7       cycles for LoopDec
8       cycles for LoopWhile
8       cycles for LoopJmpAl
8       cycles for LoopJmp

3       cycles for LoopDecAl
4       cycles for LoopDecZx
4       cycles for LoopDec
5       cycles for LoopWhile
4       cycles for LoopJmpAl
4       cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp

dedndave

GA Jochen

i took a different approach for the test - lol
i also changed to REALTIME to help get more consistent times on my prescott
for these brief tests - it shouldn't hurt anything
you are a lot faster than i am   :bg

                    IncAddSub.exe

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23      cycles for LoopJmpAlInc
21      cycles for LoopJmpAlAdd
20      cycles for LoopJmpAlSub
21      cycles for LoopJmpZxInc
24      cycles for LoopJmpZxAdd
19      cycles for LoopJmpZxSub

20      cycles for LoopJmpAlInc
16      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
14      cycles for LoopJmpZxInc
13      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

15      cycles for LoopJmpAlInc
12      cycles for LoopJmpAlAdd
14      cycles for LoopJmpAlSub
13      cycles for LoopJmpZxInc
12      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

9       cycles for LoopJmpAlInc
10      cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
8       cycles for LoopJmpZxInc
11      cycles for LoopJmpZxAdd
11      cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub   <- edit - lol

dedndave

prescott...

              LoopDecWhile.exe (v2)

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23      cycles for LoopDecAl
20      cycles for LoopDecZx
23      cycles for LoopDec
21      cycles for LoopWhile
26      cycles for LoopJmpAl
23      cycles for LoopJmp

18      cycles for LoopDecAl
14      cycles for LoopDecZx
41      cycles for LoopDec
57      cycles for LoopWhile
20      cycles for LoopJmpAl
24      cycles for LoopJmp

16      cycles for LoopDecAl
11      cycles for LoopDecZx
14      cycles for LoopDec
15      cycles for LoopWhile
17      cycles for LoopJmpAl
13      cycles for LoopJmp

11      cycles for LoopDecAl
11      cycles for LoopDecZx
10      cycles for LoopDec
11      cycles for LoopWhile
10      cycles for LoopJmpAl
9       cycles for LoopJmp

hutch--

Here are the times for Dave's version on my quad. Really ain't much in it. :)


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub

5       cycles for LoopJmpAlInc
5       cycles for LoopJmpAlAdd
5       cycles for LoopJmpAlSub
5       cycles for LoopJmpZxInc
5       cycles for LoopJmpZxAdd
5       cycles for LoopJmpZxSub

3       cycles for LoopJmpAlInc
3       cycles for LoopJmpAlAdd
3       cycles for LoopJmpAlSub
3       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxAdd
3       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
--- ok ---

jj2007

Hmmm....
Dave's version yields virtually identical timings for all algos. The only change that gains a cycle is eliminating the first jump (my initial choice).
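
To illustrate what the "first jump" refers to: a loop that tests at the bottom needs some way to reach the test on entry. Below is a sketch of the two entry styles. The labels SkipNext and SkipTest are hypothetical, and since the attached sources are not reproduced in the thread, the exact instruction sequences of LoopJmpZxInc and LoopJmpZxIncJJ are an assumption.

```asm
; entry by jump: go straight to the test, skipping the first inc
    mov edx, [esp+4]
    jmp short SkipTest          ; 2 bytes
SkipNext:
    inc edx
SkipTest:
    movzx eax, BYTE PTR [edx]
    cmp eax, 32
    jbe SkipNext
    cmp eax, "0"
    je SkipNext

; entry by fall-through: pre-decrement so the first inc cancels out
    mov edx, [esp+4]
    dec edx                     ; 1 byte
@@: inc edx
    movzx eax, BYTE PTR [edx]
    cmp eax, 32
    jbe @B
    cmp eax, "0"
    je @B
```

The second form trades a 2-byte short jump for a 1-byte dec, which is consistent with the one-byte difference in the size listing below.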

Intel(R) Celeron(R) M CPU
13      cycles for LoopJmpAlInc
13      cycles for LoopJmpAlAdd
13      cycles for LoopJmpAlSub
13      cycles for LoopJmpZxInc
12      cycles for LoopJmpZxIncJJ
13      cycles for LoopJmpZxAdd
13      cycles for LoopJmpZxSub

9       cycles for LoopJmpAlInc
9       cycles for LoopJmpAlAdd
9       cycles for LoopJmpAlSub
9       cycles for LoopJmpZxInc
8       cycles for LoopJmpZxIncJJ
9       cycles for LoopJmpZxAdd
9       cycles for LoopJmpZxSub

7       cycles for LoopJmpAlInc
7       cycles for LoopJmpAlAdd
7       cycles for LoopJmpAlSub
7       cycles for LoopJmpZxInc
6       cycles for LoopJmpZxIncJJ
7       cycles for LoopJmpZxAdd
7       cycles for LoopJmpZxSub

4       cycles for LoopJmpAlInc
4       cycles for LoopJmpAlAdd
4       cycles for LoopJmpAlSub
4       cycles for LoopJmpZxInc
3       cycles for LoopJmpZxIncJJ
4       cycles for LoopJmpZxAdd
4       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
20      LoopJmpZxIncJJ
23      LoopJmpZxAdd
23      LoopJmpZxSub

hutch--

Unrelated: who owns the SSE detect code? I would like to be able to port it to PowerBASIC.

dedndave

yes - i should have used that for all of them - saves a byte - lol
we have to stop giving Hutch opportunities to show off his quad   :P
that ShowCPU proc is Jochen's
if you call it with 0/1, you get terse/verbose display
not sure how he feels about PowerBASIC - lol

jj2007

Quote from: hutch-- on December 27, 2009, 12:01:54 PM
Unrelated: who owns the SSE detect code? I would like to be able to port it to PowerBASIC.
I derived it from various sources, most of all Wikipedia. For the history, search the forum for ShowCPU.

hutch--

 :bg

So I can get away with blaming it on you.  :bdg

hutch--

Here are the results on my PIV.


Genuine Intel(R) CPU 3.80GHz (SSE3)
23      cycles for LoopJmpAlInc
23      cycles for LoopJmpAlAdd
23      cycles for LoopJmpAlSub
23      cycles for LoopJmpZxInc
23      cycles for LoopJmpZxAdd
23      cycles for LoopJmpZxSub

17      cycles for LoopJmpAlInc
23      cycles for LoopJmpAlAdd
21      cycles for LoopJmpAlSub
15      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

15      cycles for LoopJmpAlInc
15      cycles for LoopJmpAlAdd
14      cycles for LoopJmpAlSub
15      cycles for LoopJmpZxInc
15      cycles for LoopJmpZxAdd
15      cycles for LoopJmpZxSub

11      cycles for LoopJmpAlInc
11      cycles for LoopJmpAlAdd
11      cycles for LoopJmpAlSub
11      cycles for LoopJmpZxInc
11      cycles for LoopJmpZxAdd
6       cycles for LoopJmpZxSub

Sizes:
20      LoopJmpAlInc
22      LoopJmpAlAdd
22      LoopJmpAlSub
21      LoopJmpZxInc
23      LoopJmpZxAdd
23      LoopJmpZxSub
--- ok ---

rags

I got a new box as a Christmas gift (HP Pavilion) with Win 7 and an Athlon II X2 250.
My timings seem horrible, using Jochen's original algo.
I don't know if it's because of all the pre-installed sh*t that I haven't gotten around
to removing yet, Win 7, or something else.
It just seems to me that my timings should be better, given the processor.

AMD Athlon(tm) II X2 215 Processor (SSE3)
17      cycles for LoopDecAl
30      cycles for LoopDecZx
30      cycles for LoopDec
52      cycles for LoopWhile
48      cycles for LoopJmpAl
52      cycles for LoopJmp

20      cycles for LoopDecAl
55      cycles for LoopDecZx
27      cycles for LoopDec
52      cycles for LoopWhile
48      cycles for LoopJmpAl
55      cycles for LoopJmp

33      cycles for LoopDecAl
15      cycles for LoopDecZx
22      cycles for LoopDec
20      cycles for LoopWhile
20      cycles for LoopJmpAl
13      cycles for LoopJmp

15      cycles for LoopDecAl
16      cycles for LoopDecZx
21      cycles for LoopDec
11      cycles for LoopWhile
11      cycles for LoopJmpAl
12      cycles for LoopJmp

Sizes:
19      LoopDecAl
26      LoopDecZx
19      LoopDec
20      LoopWhile
20      LoopJmpAl
20      LoopJmp
--- ok ---


God made Man, but the monkey applied the glue -DEVO

dedndave

i sometimes need to add a few lines of code to get consistent timings...

.
.
start:
   push 1
   call ShowCpu

      invoke GetCurrentProcess
      invoke SetProcessAffinityMask,eax,1

   ct = 0
.
.

that restricts execution to a single core

also, if the tests are brief, i change to REALTIME_PRIORITY_CLASS, rather than HIGH_PRIORITY_CLASS
that appears in each of the "counter_begin" macro calls

EDIT - i should also mention that these tests are in no way intended to benchmark your machine
they are intended to give you relative timings to compare one algorithm with another
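
A minimal sketch of both settings together, using the Win32 calls directly rather than the timing macros. It assumes a MASM32 project that already includes windows.inc and kernel32.inc; keeping the handle in ebx and restoring NORMAL_PRIORITY_CLASS afterwards are choices of this sketch, not something taken from the attached test programs.

```asm
    invoke GetCurrentProcess                             ; pseudo-handle returned in eax
    mov ebx, eax                                         ; keep it, eax is overwritten by each call
    invoke SetProcessAffinityMask, ebx, 1                ; mask bit 0 only: run on the first core
    invoke SetPriorityClass, ebx, REALTIME_PRIORITY_CLASS
    ; ... run the timed loops here ...
    invoke SetPriorityClass, ebx, NORMAL_PRIORITY_CLASS  ; restore afterwards
```

With realtime priority the test thread can starve the rest of the system while it runs, so this is only reasonable for brief tests like the ones in this thread.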