Hi All:
push eax
and eax,80000000h ; is high bit on?
cmp eax,80000000h
pop eax
jnz @f
neg eax ; is Negative so flip it!
@@:
Is there an easy way to test whether a signed number
is negative? Or is there a conditional jump
that does it?
test eax, 080000000h possibly?
or eax,eax
jns @f
neg eax
@@:
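For anyone who wants to check the logic outside the assembler, here is a minimal C sketch of the test-and-negate sequence above. It models eax as a uint32_t so that the negation wraps modulo 2^32 the way neg does. The function names are mine, not from the thread, and a compiler is free to emit different instructions than the ones shown in the comments.

```c
#include <stdint.h>

/* The sign flag after "or eax,eax" or "test eax,eax" mirrors bit 31. */
int is_negative_u32(uint32_t eax)
{
    return (eax & 0x80000000u) != 0;
}

/* Model of: or eax,eax / jns @F / neg eax / @@: */
uint32_t abs_branch(uint32_t eax)
{
    if (is_negative_u32(eax))   /* jns skips the neg when bit 31 is clear */
        eax = 0u - eax;         /* neg eax, wrapping modulo 2^32 */
    return eax;
}
```

Note that 80000000h comes back unchanged, because its negation does not fit in a signed 32-bit value.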
Hi Jimg:
It works for me!
Thank you.
Treating eax as an unsigned integer:
IsNeg EQU 80000000h ; eax>=IsNeg means: eax is negative
IsNegW EQU 8000h ; GetKeyState needs a WORD
.if eax>=IsNeg
; negative
.endif
.if eax<IsNeg
; zero or positive
.endif
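The unsigned comparison above works because every bit pattern with the top bit set is, read as unsigned, at least 80000000h. A small C sketch of the same idea for 32-bit and 16-bit values follows. The constant and function names are my own, chosen to echo IsNeg and IsNegW.

```c
#include <stdint.h>

#define IS_NEG   0x80000000u   /* mirrors "IsNeg EQU 80000000h" */
#define IS_NEG_W 0x8000u       /* mirrors "IsNegW EQU 8000h" */

/* Unsigned compare: true exactly when bit 31 is set. */
int is_negative_by_compare(uint32_t eax)
{
    return eax >= IS_NEG;
}

/* Same test for a 16-bit value: true exactly when bit 15 is set. */
int is_negative_word(uint16_t ax)
{
    return ax >= IS_NEG_W;
}
```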
What about
; input in eax
mov ebx, eax
sar eax, 31
add ebx, eax
xor ebx, eax
; output in ebx
no branching, but more operations.. so profile.
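A C model of the branchless sequence above, as a sketch only. The mask is built from an unsigned shift and a subtraction so the result does not depend on how a given compiler treats right shifts of negative values. The function names are assumptions of mine.

```c
#include <stdint.h>

/* All ones if bit 31 is set, otherwise zero: the value "sar eax, 31" leaves in eax. */
uint32_t sign_mask(uint32_t x)
{
    return 0u - (x >> 31);
}

uint32_t abs_sar_add_xor(uint32_t eax)
{
    uint32_t ebx = eax;        /* mov ebx, eax */
    eax = sign_mask(eax);      /* sar eax, 31 */
    ebx += eax;                /* add ebx, eax : subtracts 1 when negative */
    ebx ^= eax;                /* xor ebx, eax : bitwise NOT when negative */
    return ebx;
}
```

For a zero or positive input the mask is zero, so both the add and the xor leave the value alone.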
Hi Rockoon:
Yes that works as well.
Thank you!
some more :)
Abs macro __rm:req
.repeat
neg __rm
.until !sign?
endm
AbsEAX macro
cdq
xor eax,edx
sub eax,edx
endm
Quote from: drizz on May 21, 2008, 06:10:49 PM
some more :)
AbsEAX macro
cdq
xor eax,edx
sub eax,edx
endm
Very humbling. This one is clearly superior to all of the others posted thus far. No benchmarking required. It even uses eax as much as possible for shorter instruction lengths.
Hi drizz:
The AbsEAX macro works.
But I don't need 64 bits yet.
Thanks.
:bg
Like it. Compliments drizz. :U
AbsEAX macro
cdq
xor eax,edx
sub eax,edx
endm
The attachment is a quick test of the code here, based on the assumption that the goal is to convert the number in eax to its absolute value. The cycle counts in the results below are for a total of 200 conversions, of alternating positive and negative values, running on a P3. Considering the call overhead, even the CRT function is surprisingly fast.
0 0 0 0 0 0
1 1 1 1 1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
1 1 1 1 1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483648 2147483648 2147483648 2147483648 2147483648 2147483648
804 cycles
432 cycles
426 cycles
510 cycles
406 cycles
1605 cycles
[attachment deleted by admin]
Quote from: herge on May 21, 2008, 07:44:46 PM
Hi drizz:
The AbsEAX macro works.
But I don't need 64 bits yet.
Thanks.
It doesn't do 64-bit values.
It just relies on cdq sign-extending eax into edx, which leaves an all-1s mask in edx if eax is negative, or an all-0s mask if eax is zero or positive.
Negation in twos complement:
take the NOT of the value, and then add 1.
or
subtract 1 from the value, then take the NOT
The mask can be used for both the NOT and the add/subtract of 1, conditionally, based on the sign of eax. (An all-1s mask is equivalent to the value -1, and XORing with the all-1s mask is equivalent to a NOT.)
My methodology is the same, except that I wasn't exploiting the CDQ instruction (instead I was making a copy of the input and then doing an arithmetic shift by 31 to create the mask). I think I've been spending too much time in HLLs.
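The two negation identities and the mask trick described above can be checked with a short C sketch. This is only a model of the register arithmetic on uint32_t values. The function names are invented for illustration.

```c
#include <stdint.h>

/* Negation in two's complement, both ways round. */
uint32_t negate_not_then_inc(uint32_t x) { return ~x + 1u; }     /* NOT, then add 1 */
uint32_t negate_dec_then_not(uint32_t x) { return ~(x - 1u); }   /* subtract 1, then NOT */

/* Model of: cdq / xor eax,edx / sub eax,edx */
uint32_t abs_cdq_xor_sub(uint32_t eax)
{
    uint32_t edx = 0u - (eax >> 31);   /* cdq: edx = all ones if eax is negative, else 0 */
    eax ^= edx;                        /* xor eax, edx : NOT when the mask is all ones */
    eax -= edx;                        /* sub eax, edx : subtracting -1 adds 1 */
    return eax;
}
```

With a zero mask both steps do nothing, so the same three instructions cover the negative and the non-negative case.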
Quote from: MichaelW on May 22, 2008, 04:37:45 AM
The cycle counts in the results below are for a total of 200 conversions, of alternating positive and negative values, running on a P3.
Alternating signs isn't a very good test of anything useful.
The best way to test things like this is to make a list of typical inputs, shuffle it, and then run your test on each item in the now randomly ordered sequence. Anything else biases the results in favor of branching versions, because branch predictors love patterns, and such patterns aren't typical in real-world usage of a function like this.
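One way the shuffling step above could look in C, as a rough sketch. It uses a Fisher-Yates shuffle driven by a small linear congruential generator, so that the same seed gives every candidate the same input order. The generator, its seed handling, and all the names here are my assumptions and are not taken from any test posted in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator (constants from Numerical Recipes). */
uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fisher-Yates shuffle: the same seed always produces the same order,
   so each routine under test can be fed an identical sequence. */
void shuffle_inputs(int32_t *values, size_t count, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = count; i > 1; --i) {
        /* Use the upper bits; the modulo leaves a slight bias that is fine for a benchmark. */
        size_t j = (size_t)((lcg_next(&state) >> 16) % i);
        int32_t tmp = values[i - 1];
        values[i - 1] = values[j];
        values[j] = tmp;
    }
}
```

Shuffling once and reusing the array for every candidate keeps the comparison fair, since no routine gets a luckier sequence than another.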
This is what I got with Michaels test. I added the macro name for each so I knew what was what.
0 0 0 0 0 0
1 1 1 1 1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
1 1 1 1 1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483648 2147483648 2147483648 2147483648 2147483648 2147483648
761 cycles abs0
364 cycles abs1
403 cycles abs2
395 cycles abs3
455 cycles abs4
3601 cycles crt_abs
Press any key to exit...
Now all I wonder is whether there would be a time difference with the TEST/Jxx code depending on whether the jump was taken or not.
Quote from: Rockoon on May 22, 2008, 05:00:27 AM
Alternating signs isn't a very good test of anything useful.
The best way to test things like this is to make a list of typical inputs, shuffle it, and then run your test on each item in the now randomly ordered sequence. Anything else biases the results in favor of branching versions, because branch predictors love patterns, and such patterns aren't typical in real-world usage of a function like this.
Good point. I changed the code so it now uses a random sequence of inputs, repeating the same sequence for each of the tests. On a P3 the cycle counts for the macros did not change significantly, and the ranking remained the same, but the random inputs caused the cycle counts for the CRT function to more than double.
800 cycles, abs0
450 cycles, abs1
444 cycles, abs2
516 cycles, abs3
401 cycles, abs4
3853 cycles, crt_abs
I have updated the attachment.
EDIT: Removed code that had absolutely nothing to do with the subject at hand :red
Sizewise the abs4 is also a clear winner. The high level macro (.if eax!>80000000h) scores not too bad, either - but I notice a certain volatility of timings.
1130 cycles, abs0
500 cycles, abs1 ; 9 bytes
523 cycles, abs2 ; 9 bytes
487 cycles, abs3 ; 9 bytes
407 cycles, abs4 ; 5 bytes
375 cycles, abs5 ; 9 bytes
207 cycles, abs6 = nop
3715 cycles, crt_abs ; 10 bytes
crt_abs means a lot of work ;-)
00401EB8 |. B8 00000000 mov eax, 0
00401EBD |. 50 push eax ; /x = 0
00401EBE |. FF15 20804000 call near dword ptr [<&msvc>; \labs
labs 8BFF mov edi, edi ; ntdll.7C910738
77C36BD2 55 push ebp
77C36BD3 8BEC mov ebp, esp
77C36BD5 8B45 08 mov eax, dword ptr [ebp+8]
77C36BD8 85C0 test eax, eax
77C36BDA 7D 02 jge short msvcrt.77C36BDE
77C36BDC F7D8 neg eax ; not taken
77C36BDE 5D pop ebp
77C36BDF C3 retn
00401EC4 |. 83C4 04 add esp, 4
QuoteThe high level macro (.if eax!>80000000h) scores not too bad, either - but I notice a certain volatility of timings.
How would .if eax!>80000000h be used, and what is in abs5? By volatility I assume you mean that the counts vary from run to run. There will always be some variation, and the longer the test the larger the absolute variation. Running on a P3, if I reduce the repeat count to 20, I get the following cycle counts for 4 consecutive runs:
74 74 73 74
43 42 42 43
42 42 42 42
47 47 46 47
38 38 38 38
291 291 291 290
I expect other processors will show more variation.
Hi All:
Mac Cycle
abs0 802
abs1 511
abs2 436
abs3 532
abs4 411
crt_ 4898
It appears abs4 beats all.
abs4 MACRO; AbsEAX
cdq
xor eax,edx
sub eax,edx
ENDM
Thanks again drizz.
If someone has the time, could they plug this into the benchmark? It seems to work OK. It has 5 instructions but no slow ones, so it may perform OK.
mov eax, 100
mov ecx, eax
neg ecx
test eax, eax
cmovs eax, ecx
print str$(eax),13,10
mov eax, -100
mov ecx, eax
neg ecx
test eax, eax
cmovs eax, ecx
print str$(eax),13,10
this is the output.
100
100
Press any key to continue ...
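For reference, the select-instead-of-branch idea in the snippet above can be modelled in C like this. It is a sketch under the assumption that eax is treated as a raw 32-bit pattern. Whether a compiler turns the conditional expression into a cmov is up to the compiler and its options. The function name is made up.

```c
#include <stdint.h>

/* Model of: mov ecx,eax / neg ecx / test eax,eax / cmovs eax,ecx */
uint32_t abs_cmov(uint32_t eax)
{
    uint32_t ecx = 0u - eax;        /* mov ecx, eax / neg ecx */
    int sign = (eax >> 31) != 0;    /* test eax, eax : sign flag = bit 31 of eax */
    return sign ? ecx : eax;        /* cmovs eax, ecx : take the negated copy if negative */
}
```

The negated copy is always computed, and the flag only decides which of the two values is kept.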
LATER: I added the idea into Michael's latest test piece, but it's not fast enough, at least on my old Northwood.
540 cycles, abs0 ; <<<< I used this macro for it.
503 cycles, abs1
537 cycles, abs2
393 cycles, abs3
489 cycles, abs4
4171 cycles, crt_abs
Press any key to exit...
This is the substitute macro.
abs0 MACRO
mov ecx, eax
neg ecx
test eax, eax
cmovs eax, ecx
ENDM
Here is the result on my 3.2 gig Prescott.
585 cycles, abs0
485 cycles, abs1
512 cycles, abs2
442 cycles, abs3
344 cycles, abs4
3825 cycles, crt_abs
Press any key to exit...
All of these algos are subject to hardware variation, it would seem.
LATER AGAIN: This seems to have a bit more legs but is not the fastest on either PIV.
abs0 MACRO
add eax, 0
js @F
neg eax
@@:
Old Northwood PIV.
442 cycles, abs0
535 cycles, abs1
475 cycles, abs2
393 cycles, abs3
444 cycles, abs4
4314 cycles, crt_abs
Press any key to exit...
and on the 3.2 gig Prescott,
433 cycles, abs0
519 cycles, abs1
504 cycles, abs2
444 cycles, abs3
344 cycles, abs4
3825 cycles, crt_abs
Press any key to exit...
Note: this test is totally invalid for Intel users-
Would someone with an AMD try this? I vowed to stay out of these cycle wars because AMD just doesn't time the same, but I couldn't resist.
I've run this 10 times with the same general results-
Code:
0 0 0 0 0 0 0
1 1 1 1 1 1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
1 1 1 1 1 1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
764 cycles, abs0
400 cycles, abs1
402 cycles, abs2
400 cycles, abs3
385 cycles, abs4
347 cycles, abs5
2499 cycles, crt_abs
881 cycles, abs0
396 cycles, abs1
407 cycles, abs2
400 cycles, abs3
386 cycles, abs4
349 cycles, abs5
2482 cycles, crt_abs
852 cycles, abs0
394 cycles, abs1
404 cycles, abs2
404 cycles, abs3
382 cycles, abs4
349 cycles, abs5
2470 cycles, crt_abs
732 cycles, abs0
403 cycles, abs1
395 cycles, abs2
412 cycles, abs3
373 cycles, abs4
357 cycles, abs5
2486 cycles, crt_abs
733 cycles, abs0
398 cycles, abs1
392 cycles, abs2
404 cycles, abs3
376 cycles, abs4
345 cycles, abs5
2486 cycles, crt_abs
Press any key to exit...
What am I doing wrong here?
I corrected a problem with the random number generator not being re-seeded between the abs4 and abs5 tests, which was causing the abs5 test to use a different input sequence than the other tests. I also added Hutch's macros (note that the first one fails the 1 and -1 function tests), added the user names to the results, and reduced the repeat count to 20 to shorten the duration of each test loop (the idea being fewer opportunities for interruptions).
Typical results on my P3, Windows 2000 system:
74 cycles, abs0, herge
42 cycles, abs1, evlcrn8
54 cycles, abs2, jimg
47 cycles, abs3, rockoon
38 cycles, abs4, drizz
40 cycles, abs5, jj2007
60 cycles, abs6, hutch1
37 cycles, abs7, hutch2
291 cycles, crt_abs
74 cycles, abs0, herge
52 cycles, abs1, evlcrn8
42 cycles, abs2, jimg
47 cycles, abs3, rockoon
38 cycles, abs4, drizz
40 cycles, abs5, jj2007
60 cycles, abs6, hutch1
37 cycles, abs7, hutch2
291 cycles, crt_abs
74 cycles, abs0, herge
42 cycles, abs1, evlcrn8
42 cycles, abs2, jimg
47 cycles, abs3, rockoon
38 cycles, abs4, drizz
40 cycles, abs5, jj2007
60 cycles, abs6, hutch1
37 cycles, abs7, hutch2
291 cycles, crt_abs
86 cycles, abs0, herge
42 cycles, abs1, evlcrn8
42 cycles, abs2, jimg
47 cycles, abs3, rockoon
38 cycles, abs4, drizz
40 cycles, abs5, jj2007
60 cycles, abs6, hutch1
38 cycles, abs7, hutch2
297 cycles, crt_abs
74 cycles, abs0, herge
42 cycles, abs1, evlcrn8
43 cycles, abs2, jimg
47 cycles, abs3, rockoon
38 cycles, abs4, drizz
40 cycles, abs5, jj2007
60 cycles, abs6, hutch1
37 cycles, abs7, hutch2
291 cycles, crt_abs
Another possibility, if you were willing to accept some risk of buggy code crashing Windows, would be REALTIME_PRIORITY_CLASS. This did not significantly improve the consistency of my results.
[attachment deleted by admin]
Quote from: MichaelW on May 22, 2008, 10:48:56 AM
QuoteThe high level macro (.if eax!>80000000h) scores not too bad, either - but I notice a certain volatility of timings.
How would .if eax!>80000000h be used, and what is in abs5? By volatility I assume you mean that the counts vary from run to run
I meant the simple
.if eax>=80000000h
neg eax
.endif
...which translates, if I remember well, to
cmp eax, 80000000h
jb @f ; unsigned compare: skip the neg when eax is below 80000000h
neg eax
@@:
Hi Michael-
When you set the one up with my name, you copied abs1 into it instead of the code I presented. Here's the same one with the correct code-
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 -1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647 -2147483647 2147483647
1 1 1 1 1 1 1 -1 1
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647 -2147483647 2147483647
-2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648 -2147483648
64 cycles, abs0, herge
37 cycles, abs1, evlcrn8
34 cycles, abs2, jimg
44 cycles, abs3, rockoon
33 cycles, abs4, drizz
34 cycles, abs5, jj2007
40 cycles, abs6, hutch1
31 cycles, abs7, hutch2
185 cycles, crt_abs
64 cycles, abs0, herge
37 cycles, abs1, evlcrn8
48 cycles, abs2, jimg
37 cycles, abs3, rockoon
33 cycles, abs4, drizz
34 cycles, abs5, jj2007
40 cycles, abs6, hutch1
31 cycles, abs7, hutch2
195 cycles, crt_abs
64 cycles, abs0, herge
37 cycles, abs1, evlcrn8
33 cycles, abs2, jimg
37 cycles, abs3, rockoon
33 cycles, abs4, drizz
33 cycles, abs5, jj2007
40 cycles, abs6, hutch1
31 cycles, abs7, hutch2
185 cycles, crt_abs
65 cycles, abs0, herge
51 cycles, abs1, evlcrn8
34 cycles, abs2, jimg
37 cycles, abs3, rockoon
33 cycles, abs4, drizz
33 cycles, abs5, jj2007
49 cycles, abs6, hutch1
31 cycles, abs7, hutch2
185 cycles, crt_abs
64 cycles, abs0, herge
37 cycles, abs1, evlcrn8
33 cycles, abs2, jimg
37 cycles, abs3, rockoon
33 cycles, abs4, drizz
34 cycles, abs5, jj2007
40 cycles, abs6, hutch1
31 cycles, abs7, hutch2
195 cycles, crt_abs
Press any key to exit...
hi,
2 more to test :
abs8 MACRO _Operand_:REQ
test _Operand_,_Operand_
jns @F
neg _Operand_
@@:
ENDM
abs9 MACRO _Operand_:REQ
bt _Operand_,31
jnc @F
neg _Operand_
@@:
ENDM
hutch, js @F (avoid neg if signed ?)
jimg, abs2 = abs5 in your test (but different result...)
Drat. Ok, I'll let Michael sort it out. I just knew the one with my name on it was the wrong one.
I don't know why but I am getting very large variations in the timings in the last test piece.
80 cycles, abs0, herge
51 cycles, abs1, evlcrn8
69 cycles, abs2, jimg
38 cycles, abs3, rockoon
32 cycles, abs4, drizz
12 cycles, abs5, jj2007
50 cycles, abs6, hutch1
29 cycles, abs7, hutch2
390 cycles, crt_abs
71 cycles, abs0, herge
71 cycles, abs1, evlcrn8
41 cycles, abs2, jimg
38 cycles, abs3, rockoon
32 cycles, abs4, drizz
41 cycles, abs5, jj2007
38 cycles, abs6, hutch1
29 cycles, abs7, hutch2
378 cycles, crt_abs
68 cycles, abs0, herge
20 cycles, abs1, evlcrn8
37 cycles, abs2, jimg
28 cycles, abs3, rockoon
32 cycles, abs4, drizz
13 cycles, abs5, jj2007
38 cycles, abs6, hutch1
14 cycles, abs7, hutch2
377 cycles, crt_abs
103 cycles, abs0, herge
45 cycles, abs1, evlcrn8
37 cycles, abs2, jimg
28 cycles, abs3, rockoon
32 cycles, abs4, drizz
13 cycles, abs5, jj2007
43 cycles, abs6, hutch1
14 cycles, abs7, hutch2
390 cycles, crt_abs
69 cycles, abs0, herge
45 cycles, abs1, evlcrn8
12 cycles, abs2, jimg
40 cycles, abs3, rockoon
32 cycles, abs4, drizz
13 cycles, abs5, jj2007
48 cycles, abs6, hutch1
33 cycles, abs7, hutch2
390 cycles, crt_abs
Press any key to exit...
Quote from: hutch-- on May 23, 2008, 06:52:55 AM
I don't know why but I am getting very large variations in the timings in the last test piece.
First, the good news: On Hutch's average, my code beats all the others. The bad news is that it isn't my code, since I actually proposed the rather ordinary
.if eax>=80000000h
neg eax
.endif
Never mind, I'll bear the false honour with dignity. But jokes apart: The variations are indeed very significant. I added the two by Nightware...
abs4 MACRO ; Drizz 5 bytes
cdq
xor eax,edx
sub eax,edx
ENDM
abs5 Macro ; jimg 6 bytes
or eax, eax
.if sign? ; jns @F
neg eax
.endif
endm
abs5jj Macro ; jj 9 bytes
.if eax>=80000000h
neg eax
.endif
endm
abs6 MACRO ; Hutch1 9 bytes
mov ecx, eax
neg ecx
test eax, eax
cmovs eax, ecx
ENDM
abs7 MACRO ; Hutch2 7 bytes
add eax, 0
js @F
neg eax
@@:
ENDM
abs8 Macro ; Nightware 6 bytes
test eax, eax
.if sign? ; jns @F
neg eax
.endif
endm
abs9 Macro ; Nightware 6 bytes
bt eax, 31
jnc @F
neg eax
@@:
endm
... and get these benchmarks:
71 cycles, abs0, herge
41 cycles, abs1, evlcrn8
41 cycles, abs2, jimg
41 cycles, abs3, rockoon
32 cycles, abs4, drizz
13 cycles, abs5
45 cycles, abs5, jj2007
39 cycles, abs6, hutch1
31 cycles, abs7, hutch2
13 cycles, abs8
51 cycles, abs9
390 cycles, crt_abs
69 cycles, abs0, herge
41 cycles, abs1, evlcrn8
41 cycles, abs2, jimg
29 cycles, abs3, rockoon
32 cycles, abs4, drizz
37 cycles, abs5
45 cycles, abs5, jj2007
38 cycles, abs6, hutch1
33 cycles, abs7, hutch2
37 cycles, abs8
37 cycles, abs9
380 cycles, crt_abs
69 cycles, abs0, herge
33 cycles, abs1, evlcrn8
13 cycles, abs2, jimg
38 cycles, abs3, rockoon
33 cycles, abs4, drizz
13 cycles, abs5
37 cycles, abs5, jj2007
38 cycles, abs6, hutch1
16 cycles, abs7, hutch2
41 cycles, abs8
47 cycles, abs9
378 cycles, crt_abs
75 cycles, abs0, herge
55 cycles, abs1, evlcrn8
41 cycles, abs2, jimg
28 cycles, abs3, rockoon
32 cycles, abs4, drizz
37 cycles, abs5
51 cycles, abs5, jj2007
39 cycles, abs6, hutch1
32 cycles, abs7, hutch2
38 cycles, abs8
37 cycles, abs9
380 cycles, crt_abs
71 cycles, abs0, herge
45 cycles, abs1, evlcrn8
13 cycles, abs2, jimg
38 cycles, abs3, rockoon
32 cycles, abs4, drizz
41 cycles, abs5
41 cycles, abs5, jj2007
38 cycles, abs6, hutch1
33 cycles, abs7, hutch2
51 cycles, abs8
47 cycles, abs9
384 cycles, crt_abs
jimg made it in only 13 cycles, but not for long ;-)
In the absence of a clear winner, let's go for a "diplomatic" solution: Take the product of size * speed... congrats, drizz :cheekygreen:
Because I doubt that this could be any more confusing than it is now, I reworked the test to use the second set of macros, the ones that start a new time slice at the start of the loops and capture the lowest cycle count that occurs in any loop. I also went back over the thread and attempted to get the names straight. Running on my P3 with the repeat count set to 20 the repeatability for the macro code is near perfect.
abs0 0 1 2147483647 1 2147
abs1 0 1 2147483647 1 2147
abs2 0 1 2147483647 1 2147
abs3t 0 1 2147483647 1 2147
abs4 0 1 2147483647 1 2147
abs5 0 1 2147483647 1 2147
abs6 0 1 2147483647 1 2147
abs7 0 1 2147483647 1 2147
abs8 0 1 2147483647 1 2147
abs9 0 1 2147483647 1 2147
crtabs 0 1 2147483647 1 2147
80 cycles, abs0 (herge)
47 cycles, abs1 (evlncrn8)
45 cycles, abs2 (jimg)
55 cycles, abs3 (Rockoon)
45 cycles, abs4 (drizz)
47 cycles, abs5 (jj2007)
70 cycles, abs6 (hutch1)
45 cycles, abs7 (hutch2)
45 cycles, abs8 (NightWare1)
48 cycles, abs9 (NightWare2)
228 cycles, crt_abs
80 cycles, abs0 (herge)
47 cycles, abs1 (evlncrn8)
45 cycles, abs2 (jimg)
55 cycles, abs3 (Rockoon)
45 cycles, abs4 (drizz)
47 cycles, abs5 (jj2007)
70 cycles, abs6 (hutch1)
45 cycles, abs7 (hutch2)
45 cycles, abs8 (NightWare1)
48 cycles, abs9 (NightWare2)
282 cycles, crt_abs
80 cycles, abs0 (herge)
47 cycles, abs1 (evlncrn8)
45 cycles, abs2 (jimg)
55 cycles, abs3 (Rockoon)
45 cycles, abs4 (drizz)
47 cycles, abs5 (jj2007)
70 cycles, abs6 (hutch1)
45 cycles, abs7 (hutch2)
45 cycles, abs8 (NightWare1)
48 cycles, abs9 (NightWare2)
282 cycles, crt_abs
80 cycles, abs0 (herge)
47 cycles, abs1 (evlncrn8)
45 cycles, abs2 (jimg)
55 cycles, abs3 (Rockoon)
45 cycles, abs4 (drizz)
47 cycles, abs5 (jj2007)
70 cycles, abs6 (hutch1)
45 cycles, abs7 (hutch2)
45 cycles, abs8 (NightWare1)
48 cycles, abs9 (NightWare2)
282 cycles, crt_abs
80 cycles, abs0 (herge)
47 cycles, abs1 (evlncrn8)
45 cycles, abs2 (jimg)
55 cycles, abs3 (Rockoon)
45 cycles, abs4 (drizz)
47 cycles, abs5 (jj2007)
70 cycles, abs6 (hutch1)
45 cycles, abs7 (hutch2)
45 cycles, abs8 (NightWare1)
48 cycles, abs9 (NightWare2)
282 cycles, crt_abs
On a P4 I expect the cycle counts will always be a multiple of 4.
[attachment deleted by admin]
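The "capture the lowest count that occurs in any loop" approach described above can be sketched in portable C as follows. This is not the timing macro set used in the thread. Here clock() merely stands in for a cycle counter, the time-slice handling is left out, and every name is hypothetical.

```c
#include <stdint.h>
#include <time.h>

typedef uint64_t (*tick_source)(void);   /* hypothetical: any monotonically increasing counter */
typedef void (*test_body)(void);

/* Stand-in tick source; real cycle counts would need the CPU's time-stamp counter. */
uint64_t clock_ticks(void)
{
    return (uint64_t)clock();
}

volatile uint32_t sink;   /* keeps the workload from being optimized away */

/* Sample workload: 20 branchless abs computations. */
void abs_workload(void)
{
    for (uint32_t i = 0; i < 20; ++i) {
        uint32_t x = 0u - i;
        uint32_t m = 0u - (x >> 31);
        sink = (x ^ m) - m;
    }
}

/* Run the body several times and keep the lowest elapsed count,
   on the theory that the fastest run had the fewest interruptions. */
uint64_t min_ticks(test_body body, tick_source now, int loops)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < loops; ++i) {
        uint64_t start = now();
        body();
        uint64_t elapsed = now() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

A call such as min_ticks(abs_workload, clock_ticks, 5) returns the smallest of five measurements. With clock() the resolution is far too coarse for code this short, so treat the numbers as a demonstration of the bookkeeping only.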
What have I done wrong here ? This is the test piece I used which with 1 + -1 both returned 1.
mov eax, 1
add eax, 0
jns @F
neg eax
@@:
print str$(eax),13,10
mov eax, -1
add eax, 0
jns @F
neg eax
@@:
print str$(eax),13,10
Result
1
1
Press any key to continue ...
They both should return 1 ..?
abs(1) = 1
abs(-1) = 1
I found the problem. I copied the code as it was posted in reply #19, without analyzing it, even after it failed the function tests. I should have assumed that there must have been an error in transit, and fixed it.
abs0 MACRO
add eax, 0
js @F
neg eax
@@:
I have corrected the problem and posted new results and a new attachment.
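To show how a function test separates the two variants, here is a C sketch that models the macro as posted (js) next to the corrected one (jns) and runs both through a table of expected values. The table uses the inputs 0, 1, 2147483647, -1, -2147483647 and -2147483648 as 32-bit patterns. All names are mine and nothing here comes from the actual test attachment.

```c
#include <stdint.h>

/* Model of: add eax,0 / js @F / neg eax  (as posted: negates the non-negative values) */
uint32_t abs_js_as_posted(uint32_t eax)
{
    if (!(eax >> 31))       /* js jumps over the neg only when the value is negative */
        eax = 0u - eax;
    return eax;
}

/* Model of: add eax,0 / jns @F / neg eax  (corrected) */
uint32_t abs_jns_fixed(uint32_t eax)
{
    if (eax >> 31)          /* jns jumps over the neg when the value is not negative */
        eax = 0u - eax;
    return eax;
}

/* Returns the number of table entries the candidate gets wrong. */
int run_function_tests(uint32_t (*candidate)(uint32_t))
{
    static const uint32_t input[]    = {0u, 1u, 0x7FFFFFFFu, 0xFFFFFFFFu, 0x80000001u, 0x80000000u};
    static const uint32_t expected[] = {0u, 1u, 0x7FFFFFFFu, 1u,          0x7FFFFFFFu, 0x80000000u};
    int failures = 0;
    for (int i = 0; i < 6; ++i) {
        if (candidate(input[i]) != expected[i])
            ++failures;
    }
    return failures;
}
```

The js model only gets 0 and 80000000h right, since negating either of those returns the same bit pattern.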
how about-
abs10 Macro
mov edx,eax
neg edx
cmovns eax,edx
endm
abs0 0 1 2147483647 1 -2147483648 size=16
abs1 0 1 2147483647 1 -2147483648 size=9
abs2 0 1 2147483647 1 -2147483648 size=6
abs3t 0 1 2147483647 1 -2147483648 size=11
abs4 0 1 2147483647 1 -2147483648 size=5
abs5 0 1 2147483647 1 -2147483648 size=9
abs6 0 1 2147483647 1 -2147483648 size=9
abs7 0 1 2147483647 1 -2147483648 size=7
abs8 0 1 2147483647 1 -2147483648 size=6
abs9 0 1 2147483647 1 -2147483648 size=8
abs10 0 1 2147483647 1 -2147483648 size=7
crtabs 0 1 2147483647 1 -2147483648 size=10
74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
36 cycles, abs8 (NightWare1)
38 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)
74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
35 cycles, abs8 (NightWare1)
38 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)
74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
35 cycles, abs8 (NightWare1)
39 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)
74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
36 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
35 cycles, abs8 (NightWare1)
39 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)
74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
36 cycles, abs8 (NightWare1)
38 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)
Press any key to exit...
Michael- The results seem very consistent, probably no need for 5 loops now. Also, I used a new macro.
[attachment deleted by admin]
I am not sure why I still get such a wide variation; it may just be the vagaries of a PIV with short test code of this type. Where Michael's PIII and Jim's AMD both seem to produce reasonably reliable timings, mine are all over the place.
80 cycles, abs0 (herge)
44 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
28 cycles, abs7 (hutch2)
48 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
368 cycles, crt_abs
120 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
372 cycles, crt_abs
76 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
32 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
372 cycles, crt_abs
80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
368 cycles, crt_abs
80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
368 cycles, crt_abs
Press any key to exit...
Jim's version with the extra macro.
80 cycles, abs0 (herge)
32 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
20 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
432 cycles, crtabs (crt_abs)
80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
48 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
16 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
432 cycles, crtabs (crt_abs)
80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
428 cycles, crtabs (crt_abs)
80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
52 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
424 cycles, crtabs (crt_abs)
80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
32 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
432 cycles, crtabs (crt_abs)
Press any key to exit...
Back when I still had a P4 I observed this problem and could find no way around it. The cycle counts being a multiple of 4, coupled with some Intel documents I have seen (and cannot find ATM), would seem to suggest that the TSC is updated in step with the external clock. I think this alone should account for an uncertainty in the counts of 8 or more cycles. Perhaps a reasonable solution for the P4 might be to use the second set of macros, and for each test, average the counts over 8-16 macro calls.
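The averaging suggested above is simple enough to sketch in C. This assumes the per-call counts have already been collected into an array by whatever timing code is in use. The function name and the rounding choice are mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Average a set of cycle counts, rounded to the nearest whole cycle.
   Returns 0 for an empty set. */
uint64_t average_count(const uint64_t *counts, size_t n)
{
    if (n == 0)
        return 0;
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += counts[i];
    return (total + n / 2) / n;
}
```

With counts that only ever move in steps of 4, averaging over 8 to 16 calls should smooth out the step size, though it cannot remove any systematic offset.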
I coded up another benchmark with a style that I know runs on a PIV OK and got these results. This does 2 passes, one with "1" as the test number, the second with "-1".
-------------
positive pass
-------------
562 abs0 herge
579 abs1 evlncrn8
578 abs2 jimg 1
563 abs3 rockoon 1
563 abs4 rockoon 2
547 abs5 drizz
579 abs6 jj2007
562 abs7 hutch 1
578 abs8 hutch 2
578 abs9 Nightware 1
562 abs10 Nightware 2
547 abs11 jimg 2
-------------
negative pass
-------------
562 abs0 herge
563 abs1 evlncrn8
563 abs2 jimg 1
547 abs3 rockoon 1
563 abs4 rockoon 2
547 abs5 drizz
562 abs6 jj2007
547 abs7 hutch 1
547 abs8 hutch 2
578 abs9 Nightware 1
563 abs10 Nightware 2
563 abs11 jimg 2
Press any key to continue ...
[attachment deleted by admin]
Hi All:
My results with hutch-
I have high numbers because I had a
game running.
-------------
positive pass
-------------
6870 abs0 herge
6890 abs1 evlncrn8
7922 abs2 jimg 1
11677 abs3 rockoon 1
6399 abs4 rockoon 2
6009 abs5 drizz
7441 abs6 jj2007
5898 abs7 hutch 1
7651 abs8 hutch 2
6659 abs9 Nightware 1
7671 abs10 Nightware 2
5157 abs11 jimg 2
-------------
negative pass
-------------
5548 abs0 herge
7591 abs1 evlncrn8
10986 abs2 jimg 1
9233 abs3 rockoon 1
5007 abs4 rockoon 2
6139 abs5 drizz
5799 abs6 jj2007
5828 abs7 hutch 1
6029 abs8 hutch 2
5117 abs9 Nightware 1
5978 abs10 Nightware 2
4607 abs11 jimg 2
Press any key to continue ...
Cheers.
G'day herge
This line made me laugh
QuoteI have high numbers because I had a
game running.
Me too (plus media player 11), but here are my numbers
-------------
positive pass
-------------
500 abs0 herge
562 abs1 evlncrn8
500 abs2 jimg 1
375 abs3 rockoon 1
375 abs4 rockoon 2
438 abs5 drizz
563 abs6 jj2007
375 abs7 hutch 1
500 abs8 hutch 2
500 abs9 Nightware 1
563 abs10 Nightware 2
391 abs11 jimg 2
-------------
negative pass
-------------
437 abs0 herge
437 abs1 evlncrn8
375 abs2 jimg 1
375 abs3 rockoon 1
375 abs4 rockoon 2
438 abs5 drizz
437 abs6 jj2007
375 abs7 hutch 1
375 abs8 hutch 2
375 abs9 Nightware 1
438 abs10 Nightware 2
375 abs11 jimg 2
Do we count a quad-core q6600 as a p4?
I've noticed that quite a few of these timing posts seem to rely a lot on the processor speed - I'm getting a bit disheartened when a 3GHz CPU can beat my quad :bdg
The attachment is a test piece that hopefully will minimize the variations for a P4. It does essentially what I described in reply #34.
[attachment deleted by admin]
Quote from: MichaelW on May 24, 2008, 08:33:30 AM
test piece that hopefully will minimize the variations for a P4.
It yields pretty stable results. Now we might ask what to choose as a Masm32 library candidate...
4 thoughts:
- function form, so that you can call it as mov MyMemLocation, Abs(esi)
- should not change any other registers (some of the candidates violate this condition)
- size should matter (5-9 bytes)
- speed should matter
Any other views?
Considering that the code ideally would execute in only a few clock cycles, I think it should be a macro of a form that would effectively provide an ABS opcode. As a basis for this I considered the four macros that were fastest on a P3, abs2, abs4, abs7, and abs8 in my last test, all at 45 cycles per 20 executions, and this count included a mov reg,immed that was not part of the macros. After eliminating those not suitable for memory operands and those that would affect more than one operand, I ended up with abs7, the second macro that hutch posted. The attachment is a test app, and these are the results on my P3:
0 1 2147483647 1 2147483648
0 1 2147483647 1 2147483648
0 1 32767 1 32768
0 1 32767 1 32768
0 1 127 1 128
0 1 127 1 128
45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8
45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8
45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8
45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8
45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8
[attachment deleted by admin]
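The function-test rows above (127 1 128, 32767 1 32768 and so on) follow directly from applying the same test-and-negate at each operand size. A C sketch of that, with the values held as unsigned bit patterns, is below. The names are hypothetical and this is not the macro from the attachment.

```c
#include <stdint.h>

/* Test the top bit for the operand size, negate modulo 2^N if it is set. */
uint8_t abs_u8(uint8_t al)
{
    return (al & 0x80u) ? (uint8_t)(0u - al) : al;
}

uint16_t abs_u16(uint16_t ax)
{
    return (ax & 0x8000u) ? (uint16_t)(0u - ax) : ax;
}

uint32_t abs_u32(uint32_t eax)
{
    return (eax & 0x80000000u) ? (0u - eax) : eax;
}
```

The most negative value at each size maps to itself, which is why it prints as 128, 32768 or 2147483648 when displayed as unsigned.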
Here are my results on a 2.5 GHz P4:
0 1 2147483647 1 2147483648
0 1 2147483647 1 2147483648
0 1 32767 1 32768
0 1 32767 1 32768
0 1 127 1 128
0 1 127 1 128
40 cycles, abs eax
81 cycles, abs m32
43 cycles, abs ax
81 cycles, abs m16
53 cycles, abs al
85 cycles, abs m8
44 cycles, abs eax
85 cycles, abs m32
40 cycles, abs ax
85 cycles, abs m16
51 cycles, abs al
83 cycles, abs m8
42 cycles, abs eax
83 cycles, abs m32
41 cycles, abs ax
85 cycles, abs m16
50 cycles, abs al
83 cycles, abs m8
47 cycles, abs eax
86 cycles, abs m32
42 cycles, abs ax
82 cycles, abs m16
52 cycles, abs al
83 cycles, abs m8
51 cycles, abs eax
85 cycles, abs m32
44 cycles, abs ax
85 cycles, abs m16
50 cycles, abs al
84 cycles, abs m8
For me the fastest is the following code from A. Fog's book:
"22.4. Avoiding conditional jumps by using flags (all processors)
The most important jumps to eliminate are conditional jumps, especially if they are poorly predictable.
Sometimes it is possible to obtain the same effect as a branch by ingenious manipulation of bits and flags.
For example you may calculate the absolute value of a signed number without branching:
CDQ
XOR EAX,EDX
SUB EAX,EDX
(On PPlain and PMMX, use MOV EDX,EAX / SAR EDX,31 instead of CDQ).
The carry flag is particularly useful for this kind of tricks:"
from "How to optimize for the Pentium family of microprocessors
Copyright © 1996, 2000 by Agner Fog"
Lingo,
I agree with the view but even with that code that Drizz posted, the branchless versions were no faster than the conditional jump versions. The CMOVxx versions were no faster either.
Quote from: lingo on May 25, 2008, 02:35:30 PM
For me the fastest is ...
CDQ
XOR EAX,EDX
SUB EAX,EDX
Hi Lingo,
Your code is somewhat incomplete. Check your timings with this one:
MOV eax, -123h
push edx
CDQ
XOR EAX,EDX
SUB EAX,EDX
pop edx
My tests were using data that alternated randomly between positive and negative, on the assumption that this would make the jumps unpredictable, and if there was a difference it was too small to measure. Since conditional jumps, including unpredictable ones, are so common in code, it seems plausible that the processors would have been designed to run such code as efficiently as possible.
Hi All:
FOR mac,<abs0,abs1,abs2,abs3,abs4,abs5,abs6,abs7,abs8,abs9,abs10,abs11,crtabs>
I am having trouble finding where mac is defined.
Must admit I can't find it?
thanks in advance.
MAC is defined in the statement starting with,
FOR mac
See FOR Loops and Variable-Length Parameters, about half way down the page here (http://webster.cs.ucr.edu/Page_TechDocs/MASMDoc/ProgrammersGuide/Chap_09.htm). I keep a copy of this page on my desktop.
Hi MichaelW:
Thanks.
How about using TEST and jumping on the flags