A simple loop to get rid of leading white space and zeroes:
mov edx, [esp+4] ; get address to source string
.While byte ptr [edx]<=32 || byte ptr [edx]=="0"
inc edx
.Endw
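For readers wondering where the jump discussed later in the thread comes from: as far as I know, MASM assembles a .While block by jumping first to the condition test, which it places at the bottom of the loop. The following is a hand-written approximation, not an actual listing. The proc name and label names are made up for illustration; MASM generates its own labels.

```asm
; Sketch only: approximates the code MASM emits for the .While loop above.
SkipWhileExpanded proc          ; hypothetical name; arg on stack: pSrc
    mov edx, [esp+4]            ; get address of source string
    jmp WhileTest               ; entry jump: go straight to the condition
WhileBody:
    inc edx                     ; loop body
WhileTest:
    cmp byte ptr [edx], 32
    jbe WhileBody               ; .While compares unsigned unless told otherwise
    cmp byte ptr [edx], "0"
    je  WhileBody
    ret 4                       ; edx points to first char that is neither <=32 nor "0"
SkipWhileExpanded endp
```

One side effect worth keeping in mind: jbe is an unsigned test, while the hand-coded loops in this thread use the signed jle, so bytes of 128 and above are skipped by the jle versions but stop the .While version.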
This version is a little bit faster on my Celeron M ("Core", not Core 2) CPU:
mov edx, [esp+4]
dec edx
@@: inc edx
mov al, [edx]
cmp al, 32
jle @B
cmp al, "0"
je @B
Can somebody post timings for a P4 please?
Thanks, JJ
Intel(R) Celeron(R) M CPU
12 cycles for LoopDecAl 3 leading discardable chars
13 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
13 cycles for LoopJmp
8 cycles for LoopDecAl 2 chars
9 cycles for LoopDec
9 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp
6 cycles for LoopDecAl 1 char
7 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
7 cycles for LoopJmp
3 cycles for LoopDecAl none
3 cycles for LoopDec
4 cycles for LoopWhile
4 cycles for LoopJmpAl
4 cycles for LoopJmp
Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
JJ,
Try this on a PIV.
; mov al, [edx]
movzx eax, BYTE PTR [edx]
Also try timing ADD and SUB against INC and DEC; I have found that ADD/SUB is still faster on this Core 2 Quad. Intel has published its preference for ADD/SUB on the PIV, and as far as I have seen it is no slower on other hardware.
prescott
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
32 cycles for LoopDecAl
35 cycles for LoopDec
21 cycles for LoopWhile
23 cycles for LoopJmpAl
22 cycles for LoopJmp
20 cycles for LoopDecAl
71 cycles for LoopDec
41 cycles for LoopWhile
17 cycles for LoopJmpAl
15 cycles for LoopJmp
16 cycles for LoopDecAl
19 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
13 cycles for LoopJmp
8 cycles for LoopDecAl
11 cycles for LoopDec
7 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp
Quote from: hutch-- on December 27, 2009, 09:58:48 AM
Try this on a PIV.
; mov al, [edx]
movzx eax, BYTE PTR [edx]
Also try and time ADD and SUB as against INC and DEC
Attached, as LoopDecZx. Not faster on my Celeron, but maybe on a PIV it helps.
@DednDave: Thanks. Inconsistent timings as always, the Prescott is difficult to time...
Intel(R) Celeron(R) M CPU
13 cycles for LoopDecAl
12 cycles for LoopDecZx
14 cycles for LoopDec
14 cycles for LoopWhile
13 cycles for LoopJmpAl
13 cycles for LoopJmp
9 cycles for LoopDecAl
9 cycles for LoopDecZx
10 cycles for LoopDec
9 cycles for LoopWhile
10 cycles for LoopJmpAl
9 cycles for LoopJmp
7 cycles for LoopDecAl
6 cycles for LoopDecZx
7 cycles for LoopDec
8 cycles for LoopWhile
8 cycles for LoopJmpAl
8 cycles for LoopJmp
3 cycles for LoopDecAl
4 cycles for LoopDecZx
4 cycles for LoopDec
5 cycles for LoopWhile
4 cycles for LoopJmpAl
4 cycles for LoopJmp
Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
GA Jochen
i took a different approach for the test - lol
i also changed to REALTIME to help get more consistent times on my prescott
for these brief tests - it shouldn't hurt anything
you are a lot faster than i am :bg
IncAddSub.exe
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23 cycles for LoopJmpAlInc
21 cycles for LoopJmpAlAdd
20 cycles for LoopJmpAlSub
21 cycles for LoopJmpZxInc
24 cycles for LoopJmpZxAdd
19 cycles for LoopJmpZxSub
20 cycles for LoopJmpAlInc
16 cycles for LoopJmpAlAdd
21 cycles for LoopJmpAlSub
14 cycles for LoopJmpZxInc
13 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
15 cycles for LoopJmpAlInc
12 cycles for LoopJmpAlAdd
14 cycles for LoopJmpAlSub
13 cycles for LoopJmpZxInc
12 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
9 cycles for LoopJmpAlInc
10 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
8 cycles for LoopJmpZxInc
11 cycles for LoopJmpZxAdd
11 cycles for LoopJmpZxSub
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub <- edit - lol
prescott...
LoopDecWhile.exe (v2)
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
23 cycles for LoopDecAl
20 cycles for LoopDecZx
23 cycles for LoopDec
21 cycles for LoopWhile
26 cycles for LoopJmpAl
23 cycles for LoopJmp
18 cycles for LoopDecAl
14 cycles for LoopDecZx
41 cycles for LoopDec
57 cycles for LoopWhile
20 cycles for LoopJmpAl
24 cycles for LoopJmp
16 cycles for LoopDecAl
11 cycles for LoopDecZx
14 cycles for LoopDec
15 cycles for LoopWhile
17 cycles for LoopJmpAl
13 cycles for LoopJmp
11 cycles for LoopDecAl
11 cycles for LoopDecZx
10 cycles for LoopDec
11 cycles for LoopWhile
10 cycles for LoopJmpAl
9 cycles for LoopJmp
Here are the times for Dave's version on my quad. Really ain't much in it. :)
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
9 cycles for LoopJmpAlInc
9 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
9 cycles for LoopJmpZxInc
9 cycles for LoopJmpZxAdd
9 cycles for LoopJmpZxSub
7 cycles for LoopJmpAlInc
7 cycles for LoopJmpAlAdd
7 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
7 cycles for LoopJmpZxAdd
7 cycles for LoopJmpZxSub
5 cycles for LoopJmpAlInc
5 cycles for LoopJmpAlAdd
5 cycles for LoopJmpAlSub
5 cycles for LoopJmpZxInc
5 cycles for LoopJmpZxAdd
5 cycles for LoopJmpZxSub
3 cycles for LoopJmpAlInc
3 cycles for LoopJmpAlAdd
3 cycles for LoopJmpAlSub
3 cycles for LoopJmpZxInc
3 cycles for LoopJmpZxAdd
3 cycles for LoopJmpZxSub
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
--- ok ---
Hmmm....
Dave's version yields virtually identical timings for all algos. The only thing that gains a cycle is eliminating the first jump (my initial choice).
Intel(R) Celeron(R) M CPU
13 cycles for LoopJmpAlInc
13 cycles for LoopJmpAlAdd
13 cycles for LoopJmpAlSub
13 cycles for LoopJmpZxInc
12 cycles for LoopJmpZxIncJJ
13 cycles for LoopJmpZxAdd
13 cycles for LoopJmpZxSub
9 cycles for LoopJmpAlInc
9 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
9 cycles for LoopJmpZxInc
8 cycles for LoopJmpZxIncJJ
9 cycles for LoopJmpZxAdd
9 cycles for LoopJmpZxSub
7 cycles for LoopJmpAlInc
7 cycles for LoopJmpAlAdd
7 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
6 cycles for LoopJmpZxIncJJ
7 cycles for LoopJmpZxAdd
7 cycles for LoopJmpZxSub
4 cycles for LoopJmpAlInc
4 cycles for LoopJmpAlAdd
4 cycles for LoopJmpAlSub
4 cycles for LoopJmpZxInc
3 cycles for LoopJmpZxIncJJ
4 cycles for LoopJmpZxAdd
4 cycles for LoopJmpZxSub
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
20 LoopJmpZxIncJJ
23 LoopJmpZxAdd
23 LoopJmpZxSub
Unrelated: who owns the SSE detect code? I would like to be able to port it to PowerBASIC.
yes - i should have used that for all of them - saves a byte - lol
we have to stop giving Hutch opportunities to show off his quad :P
that ShowCPU proc is Jochen's
if you call it with 0/1, you get terse/verbose display
not sure how he feels about PowerBASIC - lol
Quote from: hutch-- on December 27, 2009, 12:01:54 PM
Unrelated: who owns the SSE detect code? I would like to be able to port it to PowerBASIC.
I derived it from various sources, most of all Wikipedia. For the history, search the forum for ShowCPU.
:bg
So I can get away with blaming it on you. :bdg
Here are the results on my PIV.
Genuine Intel(R) CPU 3.80GHz (SSE3)
23 cycles for LoopJmpAlInc
23 cycles for LoopJmpAlAdd
23 cycles for LoopJmpAlSub
23 cycles for LoopJmpZxInc
23 cycles for LoopJmpZxAdd
23 cycles for LoopJmpZxSub
17 cycles for LoopJmpAlInc
23 cycles for LoopJmpAlAdd
21 cycles for LoopJmpAlSub
15 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
15 cycles for LoopJmpAlInc
15 cycles for LoopJmpAlAdd
14 cycles for LoopJmpAlSub
15 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
11 cycles for LoopJmpAlInc
11 cycles for LoopJmpAlAdd
11 cycles for LoopJmpAlSub
11 cycles for LoopJmpZxInc
11 cycles for LoopJmpZxAdd
6 cycles for LoopJmpZxSub
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
--- ok ---
I got a new box for a Christmas gift (HP Pavilion) with Win 7 and an Athlon II X2 250.
My timings seem horrible, using Jochen's original algo.
I don't know if it's because of all the pre-installed sh*t that I haven't gotten around
to removing yet, Win 7 or something else.
It just seems to me that my timings should be higher, given the processor.
AMD Athlon(tm) II X2 215 Processor (SSE3)
17 cycles for LoopDecAl
30 cycles for LoopDecZx
30 cycles for LoopDec
52 cycles for LoopWhile
48 cycles for LoopJmpAl
52 cycles for LoopJmp
20 cycles for LoopDecAl
55 cycles for LoopDecZx
27 cycles for LoopDec
52 cycles for LoopWhile
48 cycles for LoopJmpAl
55 cycles for LoopJmp
33 cycles for LoopDecAl
15 cycles for LoopDecZx
22 cycles for LoopDec
20 cycles for LoopWhile
20 cycles for LoopJmpAl
13 cycles for LoopJmp
15 cycles for LoopDecAl
16 cycles for LoopDecZx
21 cycles for LoopDec
11 cycles for LoopWhile
11 cycles for LoopJmpAl
12 cycles for LoopJmp
Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
i sometimes need to add a few lines of code to get consistent timings...
.
.
start:
push 1
call ShowCpu
invoke GetCurrentProcess
invoke SetProcessAffinityMask,eax,1
ct = 0
.
.
that restricts execution to a single core
also, if the tests are brief, i change to REALTIME_PRIORITY_CLASS, rather than HIGH_PRIORITY_CLASS
that appears in each of the "counter_begin" macro calls
EDIT - i should also mention that these tests are in no way intended to benchmark your machine
they are intended to give you relative timings to compare one algorithm with another
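For anyone who wants to apply the same two settings by hand rather than through the timing macros, here is a minimal sketch. It assumes a standard MASM32 install (masm32rt.inc pulls in windows.inc and kernel32) and uses only the documented Win32 calls GetCurrentProcess, SetProcessAffinityMask and SetPriorityClass. It is not the code from the attached test programs.

```asm
include \masm32\include\masm32rt.inc    ; assumes a standard MASM32 install

.code
start:
    invoke GetCurrentProcess                    ; pseudo-handle for this process
    invoke SetProcessAffinityMask, eax, 1       ; mask 1 = first core only
    invoke GetCurrentProcess                    ; eax was overwritten by the return value
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS

    ; ... the code to be timed goes here ...

    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS   ; drop back when done
    exit
end start
```

Keep the realtime section short: if the timed code hangs at that priority, the machine can become unresponsive.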
Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box for a Christmas gift (HP Pavilion) with Win 7 and an Athlon II X2 250.
My timings seem horrible, using Jochen's original algo.
rags,
my condolences :thumbu
Just checked with my latest toy, an Olidata JumPC that cost me the horrendous sum of 99€ (142US$), Win XP included. It does the test in 3 cycles, claiming it has a Celeron M installed. Well...
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
43 cycles for LoopDecAl
45 cycles for LoopDec
47 cycles for LoopWhile
43 cycles for LoopJmpAl
50 cycles for LoopJmp
31 cycles for LoopDecAl
41 cycles for LoopDec
34 cycles for LoopWhile
30 cycles for LoopJmpAl
29 cycles for LoopJmp
26 cycles for LoopDecAl
32 cycles for LoopDec
29 cycles for LoopWhile
24 cycles for LoopJmpAl
33 cycles for LoopJmp
17 cycles for LoopDecAl
20 cycles for LoopDec
19 cycles for LoopWhile
17 cycles for LoopJmpAl
19 cycles for LoopJmp
Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
P3:
☺☺☻♥ (SSE1)
15 cycles for LoopDecAl
15 cycles for LoopDecZx
16 cycles for LoopDec
21 cycles for LoopWhile
21 cycles for LoopJmpAl
21 cycles for LoopJmp
10 cycles for LoopDecAl
10 cycles for LoopDecZx
11 cycles for LoopDec
14 cycles for LoopWhile
14 cycles for LoopJmpAl
14 cycles for LoopJmp
10 cycles for LoopDecAl
9 cycles for LoopDecZx
11 cycles for LoopDec
11 cycles for LoopWhile
11 cycles for LoopJmpAl
11 cycles for LoopJmp
5 cycles for LoopDecAl
5 cycles for LoopDecZx
5 cycles for LoopDec
6 cycles for LoopWhile
6 cycles for LoopJmpAl
6 cycles for LoopJmp
☺☺☻♥ (SSE1)
21 cycles for LoopJmpAlInc
17 cycles for LoopJmpAlAdd
17 cycles for LoopJmpAlSub
16 cycles for LoopJmpZxInc
14 cycles for LoopJmpZxAdd
14 cycles for LoopJmpZxSub
14 cycles for LoopJmpAlInc
12 cycles for LoopJmpAlAdd
12 cycles for LoopJmpAlSub
11 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
11 cycles for LoopJmpAlInc
12 cycles for LoopJmpAlAdd
10 cycles for LoopJmpAlSub
8 cycles for LoopJmpZxInc
9 cycles for LoopJmpZxAdd
9 cycles for LoopJmpZxSub
6 cycles for LoopJmpAlInc
5 cycles for LoopJmpAlAdd
5 cycles for LoopJmpAlSub
4 cycles for LoopJmpZxInc
5 cycles for LoopJmpZxAdd
5 cycles for LoopJmpZxSub
As far as I can see, this is the winning algo, for Core & Celeron & P3 & P4:
SkipLeadingWhiteSpace proc ; pSrc$:DWORD
mov edx, [esp+4] ; get source string
dec edx
@@: inc edx
movzx eax, byte ptr [edx]
cmp al, "0"
je @B
cmp al, 0
je @F
cmp al, 32
jle @B
@@:
ret 4 ; edx points to first non-"0" and non-white space char
SkipLeadingWhiteSpace endp
EDIT: Added check for zero byte - thanks Sinsi :U
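A small usage sketch, in case somebody wants to try the proc outside the timing testbed. It assumes a standard MASM32 install; the test string Src is made up for the example, and the proc body is the one posted above.

```asm
include \masm32\include\masm32rt.inc    ; assumes a standard MASM32 install

.data
Src db "  007 rest", 0                  ; hypothetical test string

.code
SkipLeadingWhiteSpace proc              ; pSrc$:DWORD
    mov edx, [esp+4]                    ; get source string
    dec edx
@@: inc edx
    movzx eax, byte ptr [edx]
    cmp al, "0"
    je @B                               ; skip "0"
    cmp al, 0
    je @F                               ; stop at the terminating zero byte
    cmp al, 32
    jle @B                              ; skip space and anything below
@@:
    ret 4                               ; edx points to first remaining char
SkipLeadingWhiteSpace endp

start:
    push offset Src
    call SkipLeadingWhiteSpace          ; the proc cleans the stack itself (ret 4)
    print edx, 13, 10                   ; edx now points at "7 rest"
    exit
end start
```

With this input the two spaces and the two zeroes are skipped and the loop stops at the "7". For a string consisting only of white space and zeroes, edx ends up on the terminating zero byte.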
a close race, eh Jochen ?
Quote from: dedndave on December 27, 2009, 10:33:05 PM
a close race, eh Jochen ?
Very close indeed. On the other hand, it is the kind of loop that is typically run not even once, so why waste a single cycle on it? What remains useful from this exercise might be that a dec ptr/inc ptr combination is slightly more efficient than the jmp generated by .While - which confirmed my aversion to .While loops. In contrast, .Repeat ... .Until can't be beaten by a hand-coded loop.
It doesn't seem to work too well for a string like " 0" though...no check for a terminating 00.
Quote from: sinsi on December 28, 2009, 12:26:56 AM
It doesn't seem to work too well for a string like " 0" though...no check for a terminating 00.
Assuming a string ends with a nullbyte, it would stop right there. By design :bg
nah - it kinda keeps going, Jochen - lol
but that isn't a requirement for the tests - the tests showed us what we wanted to know
Quote from: dedndave on December 28, 2009, 10:01:05 AM
nah - it kinda keeps going, Jochen - lol
So Sinsi was right :red
Corrected above. Surprisingly enough, it still runs the test in three cycles for
db "This is a string", 0
that doesn't sound right - lol
figure at LEAST one clock cycle per byte :bg
EDIT - oh - lol
there are no bytes stripped in that example :bg
i might be inclined to make the tests in this order:
SkipLeadingWhiteSpace proc ;pSrc$:DWORD
mov edx,[esp+4] ;get source string
dec edx
@@: inc edx
movzx eax,byte ptr [edx]
or al,al
jz @F
cmp al, 32
jle @B
cmp al, "0"
je @B
@@: ret 4 ;edx points to first non-"0" and non-white space char
SkipLeadingWhiteSpace endp
test for the most likely first, when practical
we have to test for null first so that it is culled out before the white space test
this order assumes white space is more likely than "0" - that may not be the case
if leading "0" is more likely, test for that before null (like you have it)
Quote from: dedndave on December 28, 2009, 10:52:14 AM
if leading "0" is more likely, test for that before null (like you have it)
Excerpt from Windows.inc. The assumption is you used Instr for "equ", and added 4, so the string starts with "0":
ENM_SCROLLEVENTS equ 00000008h
ENM_DRAGDROPDONE equ 00000010h
ENM_PARAGRAPHEXPANDED equ 00000020h
ENM_PAGECHANGE equ 00000040h
ENM_LANGCHANGE equ 01000000h
ENM_OBJECTPOSITIONS equ 02000000h
ENM_LINK equ 04000000h
ENM_LOWFIRTF equ 08000000h
ES_NOOLEDRAGDROP equ 00000008h
First program:
AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3 cycles for LoopDecAl
5 cycles for LoopDec
14 cycles for LoopWhile
4 cycles for LoopJmpAl
16 cycles for LoopJmp
20 cycles for LoopDecAl
9 cycles for LoopDec
10 cycles for LoopWhile
10 cycles for LoopJmpAl
-2 cycles for LoopJmp
8 cycles for LoopDecAl
-3 cycles for LoopDec
18 cycles for LoopWhile
18 cycles for LoopJmpAl
8 cycles for LoopJmp
-5 cycles for LoopDecAl
16 cycles for LoopDec
-7 cycles for LoopWhile
4 cycles for LoopJmpAl
25 cycles for LoopJmp
Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
Second:
AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ (SSE3)
3 cycles for LoopDecAl
2 cycles for LoopDecZx
2 cycles for LoopDec
15 cycles for LoopWhile
14 cycles for LoopJmpAl
24 cycles for LoopJmp
19 cycles for LoopDecAl
24 cycles for LoopDecZx
-1 cycles for LoopDec
-1 cycles for LoopWhile
10 cycles for LoopJmpAl
10 cycles for LoopJmp
-2 cycles for LoopDecAl
6 cycles for LoopDecZx
18 cycles for LoopDec
30 cycles for LoopWhile
8 cycles for LoopJmpAl
8 cycles for LoopJmp
-4 cycles for LoopDecAl
6 cycles for LoopDecZx
-5 cycles for LoopDec
14 cycles for LoopWhile
14 cycles for LoopJmpAl
15 cycles for LoopJmp
Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
"Here are the times for Daves version on my quad."
Hutch, I received the same times: :wink
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
9 cycles for LoopJmpAlInc
9 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
9 cycles for LoopJmpZxInc
9 cycles for LoopJmpZxAdd
9 cycles for LoopJmpZxSub
7 cycles for LoopJmpLingo
7 cycles for LoopJmpAlInc
7 cycles for LoopJmpAlAdd
7 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
7 cycles for LoopJmpZxAdd
7 cycles for LoopJmpZxSub
5 cycles for LoopJmpLingo
5 cycles for LoopJmpAlInc
5 cycles for LoopJmpAlAdd
5 cycles for LoopJmpAlSub
5 cycles for LoopJmpZxInc
5 cycles for LoopJmpZxAdd
5 cycles for LoopJmpZxSub
3 cycles for LoopJmpLingo
3 cycles for LoopJmpAlInc
3 cycles for LoopJmpAlAdd
3 cycles for LoopJmpAlSub
3 cycles for LoopJmpZxInc
3 cycles for LoopJmpZxAdd
3 cycles for LoopJmpZxSub
2 cycles for LoopJmpLingo
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
43 LoopJmpLingo
--- ok ---
nice machine, Lingo
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
22 cycles for LoopJmpAlInc
21 cycles for LoopJmpAlAdd
21 cycles for LoopJmpAlSub
20 cycles for LoopJmpZxInc
20 cycles for LoopJmpZxAdd
27 cycles for LoopJmpZxSub
21 cycles for LoopJmpLingo
16 cycles for LoopJmpAlInc
20 cycles for LoopJmpAlAdd
17 cycles for LoopJmpAlSub
14 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
16 cycles for LoopJmpLingo
13 cycles for LoopJmpAlInc
15 cycles for LoopJmpAlAdd
15 cycles for LoopJmpAlSub
12 cycles for LoopJmpZxInc
14 cycles for LoopJmpZxAdd
13 cycles for LoopJmpZxSub
11 cycles for LoopJmpLingo
7 cycles for LoopJmpAlInc
8 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
8 cycles for LoopJmpZxAdd
8 cycles for LoopJmpZxSub
6 cycles for LoopJmpLingo
nice to finally get some repeatable numbers from my prescott :P
Compliments, Lingo! I have added a shortened version of your algo, 22 instead of 34 bytes, with almost identical timings.
EDIT: Since finding leading white space is not a frequent task, "inlining" instead of calling a proc might be more appropriate. So I added two inline versions. It turns out that the align 4 version is a shade faster.
EDIT(2): Two more inline versions added.
mov eax, offset Src
mov ecx, "00"
dec eax
align 4 ; align may change flags in Masm
@@: inc eax
mov cl, ch
sub cl, [eax] ; for [eax]==48, cl=0
je @B
cmp cl, 16 ; for [eax]==32, cl=48-32=16
jge @B
align 16
LoopJmpLingo_proc: ; the original Lingo algo
LoopLingo:
add eax, 1
mov cl, ch
add cl, [eax]
je LoopLingo
add cl, 10h
jle LoopLingo
jmp edx
align 16
LoopJmpLingo proc
pop edx
mov ecx, 0D0D0h
pop eax
add cl, [eax] ; for [eax]==48, cl=208+48=256 aka zero
je LoopLingo
add cl, 10h ; for [eax]==32, cl=208+32+16=256 aka zero
jle LoopLingo
jmp edx
LoopJmpLingo endp
LoopJmpLingo_endp:
align 16
LoopJmpLingoJ_proc:
LoopJmpLingoJ proc ; variant to Lingo's code (http://www.masm32.com/board/index.php?topic=12984.msg104011#msg104011)
pop edx ; the return address
mov ecx, 0D0D0h
pop eax
dec eax
@@: inc eax
mov cl, ch
add cl, [eax] ; for [eax]==48, cl=208+48=256 aka zero
je @B
add cl, 10h ; for [eax]==32, cl=208+32+16=256 aka zero
jle @B
jmp edx
LoopJmpLingoJ endp
LoopJmpLingoJ_endp:
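A note on the arithmetic, since it is easy to misread. In the "sub" loop cl becomes 48 minus the character, so "0" gives 0 (je) and a space gives 16 (jge); in the "add" loops cl becomes 208 plus the character, wrapped to a byte. In both forms a zero byte also passes the second test (sub: cl=48, which is >=16; add: 0D0h+10h=0E0h, which is negative), so as posted these loops run past the end of the string, the same problem Sinsi pointed out earlier. Below is a sketch of the "sub" loop with one extra compare for the terminator. The proc name is made up, and this variant has not been timed.

```asm
; Sketch only: the "sub" loop from above with an extra terminator test.
SkipSubTrick proc                   ; hypothetical name; arg on stack: pSrc
    mov eax, [esp+4]                ; get source string
    mov ecx, "00"                   ; ch = 48, used to reload cl each round
    dec eax
@@: inc eax
    mov cl, ch                      ; cl = 48
    sub cl, [eax]                   ; cl = 48 - char
    je  @B                          ; char = "0" (48): cl = 0, skip
    cmp cl, 48
    je  @F                          ; char = 0: cl = 48, end of string, stop
    cmp cl, 16
    jge @B                          ; char = 32: cl = 16; char = 9: cl = 39; skip
@@:                                 ; char = "7" (55): cl = -7, falls through
    ret 4                           ; eax points to first remaining char
SkipSubTrick endp
```

The extra cmp/je pair costs four bytes and puts a third branch in the loop, so whether it still beats the plain three-compare version would have to be measured.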
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
8 cycles for inline loop, add, align 16
8 cycles for inline loop, sub, align 4
6 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ
7 cycles for inline loop, add, align 16
6 cycles for inline loop, sub, align 4
6 cycles for LoopJmpLingo
7 cycles for LoopJmpLingoJ
3 cycles for inline loop, add, align 16
2 cycles for inline loop, sub, align 4
3 cycles for LoopJmpLingo
4 cycles for LoopJmpLingoJ
1 cycles for inline loop, add, align 16
1 cycles for inline loop, sub, align 4
2 cycles for LoopJmpLingo
2 cycles for LoopJmpLingoJ
Sizes:
23 inline, add
19 inline, sub
34 LoopJmpLingo
22 LoopJmpLingoJ
not so good on a prescott, Jochen
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
15 cycles for inline loop, add, align 16
14 cycles for inline loop, sub, align 4
20 cycles for LoopJmpLingo
44 cycles for LoopJmpLingoJ
13 cycles for inline loop, add, align 16
11 cycles for inline loop, sub, align 4
18 cycles for LoopJmpLingo
27 cycles for LoopJmpLingoJ
9 cycles for inline loop, add, align 16
8 cycles for inline loop, sub, align 4
11 cycles for LoopJmpLingo
14 cycles for LoopJmpLingoJ
5 cycles for inline loop, add, align 16
4 cycles for inline loop, sub, align 4
9 cycles for LoopJmpLingo
10 cycles for LoopJmpLingoJ
Dave,
Can you try the new version, please? The inline algos seem to perform well, and the "two byte immediates" variant is pretty short, too. If the mov eax, offset src happens to be two bytes later, the size is only 12 bytes.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
7 cycles for inline loop, add, align 16
8 cycles for inline loop, sub, align 4
5 cycles for inline loop, cmp two byte regs, align 4
5 cycles for inline loop, cmp two immediates, align 4
7 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ
7 cycles for inline loop, add, align 16
6 cycles for inline loop, sub, align 4
4 cycles for inline loop, cmp two byte regs, align 4
5 cycles for inline loop, cmp two immediates, align 4
6 cycles for LoopJmpLingo
7 cycles for LoopJmpLingoJ
3 cycles for inline loop, add, align 16
2 cycles for inline loop, sub, align 4
2 cycles for inline loop, cmp two byte regs, align 4
1 cycles for inline loop, cmp two immediates, align 4
3 cycles for LoopJmpLingo
4 cycles for LoopJmpLingoJ
1 cycles for inline loop, add, align 16
1 cycles for inline loop, sub, align 4
1 cycles for inline loop, cmp two byte regs, align 4
1 cycles for inline loop, cmp two immediates, align 4
2 cycles for LoopJmpLingo
2 cycles for LoopJmpLingoJ
Sizes:
23 inline, add
19 inline, sub
20 inline, cmp, two byte regs
14 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
prescott
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
16 cycles for inline loop, add, align 16
14 cycles for inline loop, sub, align 4
17 cycles for inline loop, cmp two byte regs, align 4
19 cycles for inline loop, cmp two immediates, align 4
19 cycles for LoopJmpLingo
41 cycles for LoopJmpLingoJ
13 cycles for inline loop, add, align 16
12 cycles for inline loop, sub, align 4
10 cycles for inline loop, cmp two byte regs, align 4
14 cycles for inline loop, cmp two immediates, align 4
18 cycles for LoopJmpLingo
31 cycles for LoopJmpLingoJ
8 cycles for inline loop, add, align 16
8 cycles for inline loop, sub, align 4
9 cycles for inline loop, cmp two byte regs, align 4
7 cycles for inline loop, cmp two immediates, align 4
10 cycles for LoopJmpLingo
14 cycles for LoopJmpLingoJ
5 cycles for inline loop, add, align 16
5 cycles for inline loop, sub, align 4
5 cycles for inline loop, cmp two byte regs, align 4
4 cycles for inline loop, cmp two immediates, align 4
7 cycles for LoopJmpLingo
11 cycles for LoopJmpLingoJ
i don't think alignment is that critical for smaller loops, Jochen - at least not on a P4
prescott - i have removed all align's
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
14 cycles for inline loop, add, no align
14 cycles for inline loop, sub, no align
11 cycles for inline loop, cmp two byte regs, no align
17 cycles for inline loop, cmp two immediates, no align
19 cycles for LoopJmpLingo
23 cycles for LoopJmpLingoJ
13 cycles for inline loop, add, no align
12 cycles for inline loop, sub, no align
11 cycles for inline loop, cmp two byte regs, no align
12 cycles for inline loop, cmp two immediates, no align
17 cycles for LoopJmpLingo
27 cycles for LoopJmpLingoJ
8 cycles for inline loop, add, no align
7 cycles for inline loop, sub, no align
9 cycles for inline loop, cmp two byte regs, no align
8 cycles for inline loop, cmp two immediates, no align
9 cycles for LoopJmpLingo
14 cycles for LoopJmpLingoJ
3 cycles for inline loop, add, no align
4 cycles for inline loop, sub, no align
4 cycles for inline loop, cmp two byte regs, no align
3 cycles for inline loop, cmp two immediates, no align
6 cycles for LoopJmpLingo
10 cycles for LoopJmpLingoJ
Sizes:
18 inline, add
18 inline, sub
20 inline, cmp, two byte regs
12 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
Thanks, Dave. The timings look a little bit inconsistent, also on my machine, but it seems we can safely vote for the shortest version ;-)
mov eax, offset Src
dec eax
@@: inc eax
cmp byte ptr [eax], 48 ; skip "0"...
je @B
cmp byte ptr [eax], 32 ; and anything from space downwards
jle @B
EDIT: And I forgot the end of string case...!
; mov eax, offset Src
dec eax ; no align before the loop - it's slower
@@: inc eax
cmp byte ptr [eax], 48 ; "0"
je @B
cmp byte ptr [eax], 0 ; zero delimiter?
je @F
cmp byte ptr [eax], 32 ; " " or less
jle @B
@@:
17 bytes starting with dec eax, 1 cycle for the default case (no 0 or space at string start).
not so fast - lol
it would be nice to see what difference, if any, alignment has on some other processors
from my testing on a P4, if it is a short jump to get back to the top of the loop, then alignment does little good
that may not be so for some of the more modern cores :U
EDIT - maybe we need a new thread to get some tests run in the purely "alignment/timing" category
Quote from: rags on December 27, 2009, 01:25:22 PM
I got a new box for a Christmas gift (HP Pavilion) with Win 7 and an Athlon II X2 250.
My timings seem horrible, using Jochen's original algo.
I got the same story on my new Q7 - I suspect some of the new processors are designed for data streaming and not computation. So we now have shitty computers but great televisions!
Congratz to us... grumble grumble... sigh two years til wife will let me replace.
What cpu is that Q7 ? (atom-based?)
Anyway, never trust laptop cpus. Still I'm sure the general performance is good, don't let some specific timings discourage you guys.
Sorry, my bad. I meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.
WryBugz
these timing tests are not intended to benchmark your machine
comparing clock cycles on one machine to clock cycles on another machine is a little like comparing apples to oranges
from what i know, the i7 is a good performer - evidenced by the fact that you are happy with the overall performance
the information that is meaningful is the performance ratio of one method to another on any given machine
Quote from: WryBugz on February 23, 2010, 11:48:57 AM
Sorry, my bad. I meant Intel Core i7 @ 2.67 GHz. I score badly on all the tests I have run here. But it does load and run movies slick, though that was not exactly what I wanted.
Could you post the timings with the exe posted at this thread ? I'm curious about something.
http://www.masm32.com/board/index.php?topic=13385.0
Intel Core I7
10 cycles for LoopDecAl
10 cycles for LoopDec
15 cycles for LoopWhile
17 cycles for LoopJmpAl
17 cycles for LoopJmp
6 cycles for LoopDecAl
6 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
12 cycles for LoopJmp
3 cycles for LoopDecAl
3 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
6 cycles for LoopJmp
Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
There you go.
I guess I don't understand the processor differences enough Dave. I thought that was the reason for the machine citation.
I am just a hobbyist and while having put in a lot of time, my knowledge is pretty erratic. For instance, I just realized this past week that eax - edx are not all equal.
The other one....
Loop
991 clock cycles
1054 clock cycles
1045 clock cycles
Dec ECX
505 clock cycles
505 clock cycles
513 clock cycles
Press any key to continue ...
From the link.
Here is mine...
Intel(R) Pentium(R) 4 CPU 2.40GHz (SSE2)
6 cycles for inline loop, add, no align
4 cycles for inline loop, sub, no align
7 cycles for inline loop, cmp two byte regs, no align
3 cycles for inline loop, cmp two immediates, no align
5 cycles for LoopJmpLingo
9 cycles for LoopJmpLingoJ
4 cycles for inline loop, add, no align
8 cycles for inline loop, sub, no align
2 cycles for inline loop, cmp two byte regs, no align
2 cycles for inline loop, cmp two immediates, no align
5 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ
-2 cycles for inline loop, add, no align
-2 cycles for inline loop, sub, no align
6 cycles for inline loop, cmp two byte regs, no align
-4 cycles for inline loop, cmp two immediates, no align
0 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ
-7 cycles for inline loop, add, no align
5 cycles for inline loop, sub, no align
-2 cycles for inline loop, cmp two byte regs, no align
-7 cycles for inline loop, cmp two immediates, no align
-2 cycles for LoopJmpLingo
4 cycles for LoopJmpLingoJ
Sizes:
18 inline, add
18 inline, sub
20 inline, cmp, two byte regs
12 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
--- ok ---
Quote from: WryBugz on February 23, 2010, 10:12:44 PM
The other one....
Loop
991 clock cycles
1054 clock cycles
1045 clock cycles
Dec ECX
505 clock cycles
505 clock cycles
513 clock cycles
Press any key to continue ...
From the link.
Why does it take twice as many cycles as mine ? Not very efficient.
...more data...
Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz (SSE4)
7 cycles for LoopDecAl
7 cycles for LoopDecZx
9 cycles for LoopDec
9 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp
5 cycles for LoopDecAl
5 cycles for LoopDecZx
5 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
7 cycles for LoopJmp
3 cycles for LoopDecAl
3 cycles for LoopDecZx
3 cycles for LoopDec
5 cycles for LoopWhile
5 cycles for LoopJmpAl
5 cycles for LoopJmp
1 cycles for LoopDecAl
1 cycles for LoopDecZx
1 cycles for LoopDec
3 cycles for LoopWhile
3 cycles for LoopJmpAl
3 cycles for LoopJmp
Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz (SSE4)
9 cycles for LoopJmpAlInc
9 cycles for LoopJmpAlAdd
9 cycles for LoopJmpAlSub
9 cycles for LoopJmpZxInc
9 cycles for LoopJmpZxAdd
9 cycles for LoopJmpZxSub
7 cycles for LoopJmpLingo
7 cycles for LoopJmpAlInc
7 cycles for LoopJmpAlAdd
7 cycles for LoopJmpAlSub
7 cycles for LoopJmpZxInc
7 cycles for LoopJmpZxAdd
7 cycles for LoopJmpZxSub
5 cycles for LoopJmpLingo
5 cycles for LoopJmpAlInc
5 cycles for LoopJmpAlAdd
5 cycles for LoopJmpAlSub
5 cycles for LoopJmpZxInc
5 cycles for LoopJmpZxAdd
5 cycles for LoopJmpZxSub
3 cycles for LoopJmpLingo
3 cycles for LoopJmpAlInc
3 cycles for LoopJmpAlAdd
3 cycles for LoopJmpAlSub
3 cycles for LoopJmpZxInc
3 cycles for LoopJmpZxAdd
3 cycles for LoopJmpZxSub
2 cycles for LoopJmpLingo
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
43 LoopJmpLingo
--- ok ---
Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz (SSE4)
6 cycles for inline loop, add, no align
9 cycles for inline loop, sub, no align
3 cycles for inline loop, cmp two byte regs, no align
2 cycles for inline loop, cmp two immediates, no align
8 cycles for LoopJmpLingo
11 cycles for LoopJmpLingoJ
23 cycles for inline loop, add, no align
22 cycles for inline loop, sub, no align
1 cycles for inline loop, cmp two byte regs, no align
1 cycles for inline loop, cmp two immediates, no align
6 cycles for LoopJmpLingo
8 cycles for LoopJmpLingoJ
20 cycles for inline loop, add, no align
10 cycles for inline loop, sub, no align
0 cycles for inline loop, cmp two byte regs, no align
0 cycles for inline loop, cmp two immediates, no align
4 cycles for LoopJmpLingo
5 cycles for LoopJmpLingoJ
0 cycles for inline loop, add, no align
0 cycles for inline loop, sub, no align
0 cycles for inline loop, cmp two byte regs, no align
0 cycles for inline loop, cmp two immediates, no align
1 cycles for LoopJmpLingo
2 cycles for LoopJmpLingoJ
Sizes:
18 inline, add
18 inline, sub
20 inline, cmp, two byte regs
12 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
--- ok ---
Loop
1294 clock cycles
1294 clock cycles
1294 clock cycles
Dec ECX
279 clock cycles
279 clock cycles
279 clock cycles
Press any key to continue ...
Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz (SSE4)
8 cycles for LoopDecAl
10 cycles for LoopDec
13 cycles for LoopWhile
13 cycles for LoopJmpAl
13 cycles for LoopJmp
6 cycles for LoopDecAl
6 cycles for LoopDec
9 cycles for LoopWhile
9 cycles for LoopJmpAl
9 cycles for LoopJmp
4 cycles for LoopDecAl
4 cycles for LoopDec
7 cycles for LoopWhile
7 cycles for LoopJmpAl
7 cycles for LoopJmp
2 cycles for LoopDecAl
2 cycles for LoopDec
4 cycles for LoopWhile
4 cycles for LoopJmpAl
4 cycles for LoopJmp
Sizes:
19 LoopDecAl
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
-------------
3 cycles for inline loop, add, align 16
6 cycles for inline loop, sub, align 4
3 cycles for inline loop, cmp two byte regs, align 4
3 cycles for inline loop, cmp two immediates, align 4
6 cycles for LoopJmpLingo
7 cycles for LoopJmpLingoJ
3 cycles for inline loop, add, align 16
2 cycles for inline loop, sub, align 4
1 cycles for inline loop, cmp two byte regs, align 4
1 cycles for inline loop, cmp two immediates, align 4
5 cycles for LoopJmpLingo
7 cycles for LoopJmpLingoJ
0 cycles for inline loop, add, align 16
0 cycles for inline loop, sub, align 4
0 cycles for inline loop, cmp two byte regs, align 4
0 cycles for inline loop, cmp two immediates, align 4
2 cycles for LoopJmpLingo
3 cycles for LoopJmpLingoJ
0 cycles for inline loop, add, align 16
0 cycles for inline loop, sub, align 4
0 cycles for inline loop, cmp two byte regs, align 4
0 cycles for inline loop, cmp two immediates, align 4
2 cycles for LoopJmpLingo
2 cycles for LoopJmpLingoJ
Sizes:
23 inline, add
19 inline, sub
20 inline, cmp, two byte regs
14 inline, cmp, two byte immediates
34 LoopJmpLingo
22 LoopJmpLingoJ
--- ok ---
Hi,
here are my results ...
LoopDecWhile.exe (2nd version)
AMD Phenom(tm) II X4 955 Processor (SSE3)
16 cycles for LoopDecAl
16 cycles for LoopDecZx
9 cycles for LoopDec
15 cycles for LoopWhile
14 cycles for LoopJmpAl
15 cycles for LoopJmp
8 cycles for LoopDecAl
17 cycles for LoopDecZx
8 cycles for LoopDec
26 cycles for LoopWhile
29 cycles for LoopJmpAl
26 cycles for LoopJmp
7 cycles for LoopDecAl
7 cycles for LoopDecZx
7 cycles for LoopDec
6 cycles for LoopWhile
6 cycles for LoopJmpAl
6 cycles for LoopJmp
4 cycles for LoopDecAl
4 cycles for LoopDecZx
6 cycles for LoopDec
2 cycles for LoopWhile
2 cycles for LoopJmpAl
2 cycles for LoopJmp
Sizes:
19 LoopDecAl
26 LoopDecZx
19 LoopDec
20 LoopWhile
20 LoopJmpAl
20 LoopJmp
--- ok ---
... and for IncAddSub.exe
AMD Phenom(tm) II X4 955 Processor (SSE3)
14 cycles for LoopJmpAlInc
15 cycles for LoopJmpAlAdd
15 cycles for LoopJmpAlSub
15 cycles for LoopJmpZxInc
15 cycles for LoopJmpZxAdd
15 cycles for LoopJmpZxSub
27 cycles for LoopJmpAlInc
29 cycles for LoopJmpAlAdd
27 cycles for LoopJmpAlSub
29 cycles for LoopJmpZxInc
27 cycles for LoopJmpZxAdd
27 cycles for LoopJmpZxSub
6 cycles for LoopJmpAlInc
6 cycles for LoopJmpAlAdd
6 cycles for LoopJmpAlSub
6 cycles for LoopJmpZxInc
6 cycles for LoopJmpZxAdd
6 cycles for LoopJmpZxSub
2 cycles for LoopJmpAlInc
2 cycles for LoopJmpAlAdd
2 cycles for LoopJmpAlSub
2 cycles for LoopJmpZxInc
2 cycles for LoopJmpZxAdd
2 cycles for LoopJmpZxSub
Sizes:
20 LoopJmpAlInc
22 LoopJmpAlAdd
22 LoopJmpAlSub
21 LoopJmpZxInc
23 LoopJmpZxAdd
23 LoopJmpZxSub
--- ok ---
Regards
Greenhorn