Is Negative

Started by herge, May 20, 2008, 02:04:45 PM

Tedd

They both should return 1 ..?

abs(1) = 1
abs(-1) = 1
No snowflake in an avalanche feels responsible.

MichaelW

I found the problem. I copied the code as it was posted in reply #19, without analyzing it, even after it failed the function tests. I should have assumed that there must have been an error in transit, and fixed it.

abs0 MACRO
    add eax, 0      ; set the sign flag from eax
    js @F           ; skip the negate when eax is negative
    neg eax
  @@:
ENDM

I have corrected the problem and posted new results and a new attachment.
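A C model of the listing may make the failure easier to see. This is only a sketch: it assumes the macro above is the version as copied from reply #19 (the post does not say which version is shown), and the `jns` variant is a guess at the correction, not the posted fix. The function names are made up for illustration, and the arithmetic is done on `uint32_t` so that negating the most negative value stays well defined in C.

```c
#include <stdint.h>

/* Models `add eax,0 / js @F / neg eax / @@:`.
   The negate is skipped when the sign flag is set,
   i.e. when eax is already negative. */
uint32_t abs0_with_js(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    int sign_flag = (int)(eax >> 31);  /* SF after add eax, 0 */
    if (!sign_flag)                    /* js @F not taken */
        eax = 0u - eax;                /* neg eax */
    return eax;                        /* 32-bit pattern left in eax */
}

/* Assumed correction: the same sequence with jns instead of js,
   so only negative inputs are negated. */
uint32_t abs0_with_jns(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    int sign_flag = (int)(eax >> 31);
    if (sign_flag)                     /* jns @F not taken */
        eax = 0u - eax;
    return eax;
}
```

With the `js` form both 1 and -1 come back as the bit pattern 0xFFFFFFFF (that is, -1), while the `jns` form returns 1 for both.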
eschew obfuscation

Jimg

How about:
    abs10 Macro
        mov edx,eax
        neg edx
        cmovns eax,edx
    endm

abs0    0 1 2147483647 1 -2147483648  size=16
abs1    0 1 2147483647 1 -2147483648  size=9
abs2    0 1 2147483647 1 -2147483648  size=6
abs3t   0 1 2147483647 1 -2147483648  size=11
abs4    0 1 2147483647 1 -2147483648  size=5
abs5    0 1 2147483647 1 -2147483648  size=9
abs6    0 1 2147483647 1 -2147483648  size=9
abs7    0 1 2147483647 1 -2147483648  size=7
abs8    0 1 2147483647 1 -2147483648  size=6
abs9    0 1 2147483647 1 -2147483648  size=8
abs10   0 1 2147483647 1 -2147483648  size=7
crtabs  0 1 2147483647 1 -2147483648  size=10

74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
36 cycles, abs8 (NightWare1)
38 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)

74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
35 cycles, abs8 (NightWare1)
38 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)

74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
35 cycles, abs8 (NightWare1)
39 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)

74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
36 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
35 cycles, abs8 (NightWare1)
39 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)

74 cycles, abs0 (herge)
38 cycles, abs1 (evlncrn8)
35 cycles, abs2 (jimg)
40 cycles, abs3 (Rockoon)
36 cycles, abs4 (drizz)
38 cycles, abs5 (jj2007)
42 cycles, abs6 (hutch1)
34 cycles, abs7 (hutch2)
36 cycles, abs8 (NightWare1)
38 cycles, abs9 (NightWare2)
31 cycles, abs10 (jimg2)
190 cycles, crtabs (crt_abs)

Press any key to exit...


Michael, the results seem very consistent, so there is probably no need for 5 loops now. Also, I used a new macro.
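For readers who want to check the abs10 row of the function test by hand, here is a rough C model of the `mov edx,eax / neg edx / cmovns eax,edx` sequence. It is a sketch under assumptions: the test inputs are taken to be 0, 1, 2147483647, -1 and -2147483648 (the output does not list them), and `abs10_model` is a made-up name. The result is returned as an unsigned bit pattern to keep the C well defined.

```c
#include <stdint.h>

/* Models abs10: negate a copy, then keep the copy only if it is
   non-negative (cmovns moves when the sign flag is clear). */
uint32_t abs10_model(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    uint32_t edx = eax;        /* mov edx, eax */
    edx = 0u - edx;            /* neg edx, sets SF from the result */
    if ((edx >> 31) == 0)      /* cmovns eax, edx */
        eax = edx;
    return eax;
}
```

For the most negative input the negated copy is again 0x80000000, the sign flag stays set, and eax is left unchanged, which is consistent with the -2147483648 in the last column.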

[attachment deleted by admin]

hutch--

I am not sure why I still get such a wide variation; it may just be the vagaries of a PIV with short test code of this type. Where Michael's PIII and Jim's AMD both seem to produce reasonably reliable timings, mine are all over the place.


80 cycles, abs0 (herge)
44 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
28 cycles, abs7 (hutch2)
48 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
368 cycles, crt_abs

120 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
372 cycles, crt_abs

76 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
32 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
372 cycles, crt_abs

80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
368 cycles, crt_abs

80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
368 cycles, crt_abs

Press any key to exit...


Jim's version with the extra macro.


80 cycles, abs0 (herge)
32 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
20 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
432 cycles, crtabs (crt_abs)

80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
48 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
16 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
432 cycles, crtabs (crt_abs)

80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
48 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
428 cycles, crtabs (crt_abs)

80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
52 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
424 cycles, crtabs (crt_abs)

80 cycles, abs0 (herge)
48 cycles, abs1 (evlncrn8)
44 cycles, abs2 (jimg)
36 cycles, abs3 (Rockoon)
44 cycles, abs4 (drizz)
32 cycles, abs5 (jj2007)
48 cycles, abs6 (hutch1)
48 cycles, abs7 (hutch2)
44 cycles, abs8 (NightWare1)
56 cycles, abs9 (NightWare2)
40 cycles, abs10 (jimg2)
432 cycles, crtabs (crt_abs)

Press any key to exit...
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

MichaelW

Back when I still had a P4 I observed this problem and could find no way around it. The cycle counts being a multiple of 4, coupled with some Intel documents I have seen (and cannot find ATM), would seem to suggest that the TSC is updated in step with the external clock. I think this alone should account for an uncertainty in the counts of 8 or more cycles. Perhaps a reasonable solution for the P4 might be to use the second set of macros, and for each test, average the counts over 8-16 macro calls.
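The averaging idea can be illustrated with a small simulation in C. This is not the timing macros and it reads no real counter: it assumes a counter that only advances in steps of 4 cycles and start phases that shift by one cycle per call, both of which are simplifications. All names are hypothetical.

```c
#include <stdint.h>

/* Reading of a counter that only advances in steps of `step` cycles. */
static uint64_t quantised(uint64_t t, uint64_t step)
{
    return (t / step) * step;
}

/* One simulated measurement of code that really takes `true_cycles`,
   started `phase` cycles after the counter last ticked. */
uint64_t measure_once(uint64_t true_cycles, uint64_t step, uint64_t phase)
{
    return quantised(phase + true_cycles, step) - quantised(phase, step);
}

/* Average over `calls` measurements whose start phases are 0, 1, 2, ... */
double measure_averaged(uint64_t true_cycles, uint64_t step, unsigned calls)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < calls; i++)
        sum += measure_once(true_cycles, step, i);
    return (double)sum / calls;
}
```

In this model a 42-cycle sequence reads as either 40 or 44 on any single call, while the average over 8 or 16 calls comes out at 42. Real start phases are not this evenly spread, so the improvement on hardware would be less tidy.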

hutch--

I coded up another benchmark with a style that I know runs on a PIV OK and got these results. This does 2 passes, one with "1" as the test number, the second with "-1".


-------------
positive pass
-------------
562 abs0 herge
579 abs1 evlncrn8
578 abs2 jimg 1
563 abs3 rockoon 1
563 abs4 rockoon 2
547 abs5 drizz
579 abs6 jj2007
562 abs7 hutch 1
578 abs8 hutch 2
578 abs9 Nightware 1
562 abs10 Nightware 2
547 abs11 jimg 2
-------------
negative pass
-------------
562 abs0 herge
563 abs1 evlncrn8
563 abs2 jimg 1
547 abs3 rockoon 1
563 abs4 rockoon 2
547 abs5 drizz
562 abs6 jj2007
547 abs7 hutch 1
547 abs8 hutch 2
578 abs9 Nightware 1
563 abs10 Nightware 2
563 abs11 jimg 2
Press any key to continue ...

[attachment deleted by admin]

herge


Hi All:

My results with hutch's benchmark.
I have high numbers because I had a game running.


-------------
positive pass
-------------
6870 abs0 herge
6890 abs1 evlncrn8
7922 abs2 jimg 1
11677 abs3 rockoon 1
6399 abs4 rockoon 2
6009 abs5 drizz
7441 abs6 jj2007
5898 abs7 hutch 1
7651 abs8 hutch 2
6659 abs9 Nightware 1
7671 abs10 Nightware 2
5157 abs11 jimg 2
-------------
negative pass
-------------
5548 abs0 herge
7591 abs1 evlncrn8
10986 abs2 jimg 1
9233 abs3 rockoon 1
5007 abs4 rockoon 2
6139 abs5 drizz
5799 abs6 jj2007
5828 abs7 hutch 1
6029 abs8 hutch 2
5117 abs9 Nightware 1
5978 abs10 Nightware 2
4607 abs11 jimg 2
Press any key to continue ...



Cheers.
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

sinsi

G'day herge

This line made me laugh
Quote: I have high numbers because I had a game running.

Me too (plus media player 11), but here are my numbers

-------------
positive pass
-------------
500 abs0 herge
562 abs1 evlncrn8
500 abs2 jimg 1
375 abs3 rockoon 1
375 abs4 rockoon 2
438 abs5 drizz
563 abs6 jj2007
375 abs7 hutch 1
500 abs8 hutch 2
500 abs9 Nightware 1
563 abs10 Nightware 2
391 abs11 jimg 2
-------------
negative pass
-------------
437 abs0 herge
437 abs1 evlncrn8
375 abs2 jimg 1
375 abs3 rockoon 1
375 abs4 rockoon 2
438 abs5 drizz
437 abs6 jj2007
375 abs7 hutch 1
375 abs8 hutch 2
375 abs9 Nightware 1
438 abs10 Nightware 2
375 abs11 jimg 2


Do we count a quad-core Q6600 as a P4?
I've noticed that quite a few of these timing posts seem to depend a lot on processor speed. I'm getting a bit disheartened when a 3GHz CPU can beat my quad.
Light travels faster than sound, that's why some people seem bright until you hear them.

MichaelW

The attachment is a test piece that hopefully will minimize the variations for a P4. It does essentially what I described in reply #34.



[attachment deleted by admin]

jj2007

Quote from: MichaelW on May 24, 2008, 08:33:30 AM
test piece that hopefully will minimize the variations for a P4.

It yields pretty stable results. Now we might ask what to choose as a Masm32 library candidate...
4 thoughts:
- function form, so that you can call it as mov MyMemLocation, Abs(esi)
- should not change any other registers (some of the candidates violate this condition)
- size should matter (5-9 bytes)
- speed should matter
Any other views?

MichaelW

Considering that the code ideally would execute in only a few clock cycles, I think it should be a macro of a form that would effectively provide an ABS opcode. As a basis for this I considered the four macros that were fastest on a P3, abs2, abs4, abs7, and abs8 in my last test, all at 45 cycles per 20 executions, and this count included a mov reg,immed that was not part of the macros. After eliminating those not suitable for memory operands and those that would affect more than one operand, I ended up with abs7, the second macro that hutch posted. The attachment is a test app, and these are the results on my P3:

0       1       2147483647      1       2147483648
0       1       2147483647      1       2147483648
0       1       32767   1       32768
0       1       32767   1       32768
0       1       127     1       128
0       1       127     1       128

45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8

45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8

45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8

45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8

45 cycles, abs eax
89 cycles, abs m32
157 cycles, abs ax
173 cycles, abs m16
42 cycles, abs al
79 cycles, abs m8



[attachment deleted by admin]

rags

Here are my results on a P-IV 2.5 GHz:

0       1       2147483647      1       2147483648
0       1       2147483647      1       2147483648
0       1       32767   1       32768
0       1       32767   1       32768
0       1       127     1       128
0       1       127     1       128

40 cycles, abs eax
81 cycles, abs m32
43 cycles, abs ax
81 cycles, abs m16
53 cycles, abs al
85 cycles, abs m8

44 cycles, abs eax
85 cycles, abs m32
40 cycles, abs ax
85 cycles, abs m16
51 cycles, abs al
83 cycles, abs m8

42 cycles, abs eax
83 cycles, abs m32
41 cycles, abs ax
85 cycles, abs m16
50 cycles, abs al
83 cycles, abs m8

47 cycles, abs eax
86 cycles, abs m32
42 cycles, abs ax
82 cycles, abs m16
52 cycles, abs al
83 cycles, abs m8

51 cycles, abs eax
85 cycles, abs m32
44 cycles, abs ax
85 cycles, abs m16
50 cycles, abs al
84 cycles, abs m8
God made Man, but the monkey applied the glue -DEVO

lingo

For me the fastest is the following code from A. Fog's book:

"22.4. Avoiding conditional jumps by using flags (all processors)

The most important jumps to eliminate are conditional jumps, especially if they are poorly predictable.
Sometimes it is possible to obtain the same effect as a branch by ingenious manipulation of bits and flags.
For example you may calculate the absolute value of a signed number without branching:
         CDQ
         XOR EAX,EDX
         SUB EAX,EDX

(On PPlain and PMMX, use MOV EDX,EAX / SAR EDX,31 instead of CDQ).
The carry flag is particularly useful for this kind of tricks:"


from "How to optimize for the Pentium family of microprocessors
Copyright © 1996, 2000 by Agner Fog"
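The arithmetic behind the quoted three-instruction sequence can be followed in a short C model. It is a sketch only: `abs_cdq_model` is a hypothetical name, the sign mask is built with a comparison instead of a shift so the C stays portable, and the result is returned as an unsigned bit pattern.

```c
#include <stdint.h>

/* Models CDQ / XOR EAX,EDX / SUB EAX,EDX. */
uint32_t abs_cdq_model(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    /* cdq: edx becomes all ones if eax is negative, else zero */
    uint32_t edx = (x < 0) ? 0xFFFFFFFFu : 0u;
    eax ^= edx;   /* xor: inverts every bit when negative, no-op otherwise */
    eax -= edx;   /* sub: subtracting all ones adds 1, completing the negation */
    return eax;
}
```

For a negative input this is the usual two's complement negation (invert, then add one); for a non-negative input both steps use a zero mask and leave the value alone.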

hutch--

Lingo,

I agree with the view but even with that code that Drizz posted, the branchless versions were no faster than the conditional jump versions. The CMOVxx versions were no faster either.

jj2007

Quote from: lingo on May 25, 2008, 02:35:30 PM
For me the fastest is ...
         CDQ
         XOR EAX,EDX
         SUB EAX,EDX

Hi Lingo,
Your code is somewhat incomplete. Check your timings with this one:

   MOV eax, -123h
   push edx
   CDQ
   XOR EAX,EDX
   SUB EAX,EDX
   pop edx
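The point about edx can be shown with a toy two-register model in C. The `Regs` struct and both function names are invented for illustration, and the save and restore through a local variable only stands in for the push and pop; it says nothing about their cost in cycles.

```c
#include <stdint.h>

/* Hypothetical model of just the two registers involved. */
typedef struct {
    uint32_t eax;
    uint32_t edx;
} Regs;

/* CDQ / XOR / SUB as quoted: the result lands in eax, edx is overwritten. */
void abs_cdq(Regs *r)
{
    r->edx = (r->eax >> 31) ? 0xFFFFFFFFu : 0u;  /* cdq */
    r->eax ^= r->edx;                            /* xor eax, edx */
    r->eax -= r->edx;                            /* sub eax, edx */
}

/* Same sequence wrapped in a save and restore of edx (push edx / pop edx). */
void abs_cdq_preserving(Regs *r)
{
    uint32_t saved = r->edx;   /* push edx */
    abs_cdq(r);
    r->edx = saved;            /* pop edx */
}
```

Starting from eax = -123h and some live value in edx, the bare sequence leaves edx as 0FFFFFFFFh, while the wrapped one returns with edx intact.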