Fastest Absolute Function

Started by Twister, August 23, 2010, 09:31:34 PM

Rockoon

Quote from: daydreamer on August 25, 2010, 10:26:29 AM
I'll test that and see if it works; I'm just not sure the IEEE encoding is so simple that you can just clear the high bit.

Well, I am quite certain that for valid float numbers (both normal and denormal!) just clearing the high bit works. Neither the 32-bit nor the 64-bit IEEE encoding uses two's complement; each just uses the sign bit as a flag.

I honestly don't know whether the high bit is set in the special error-condition values (Not a Number, etc.), though. If it's always set, then problems could arise; if it's only set when the sign means something (+Infinity, -Infinity, ...), then no problems should ever arise that weren't already problems.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.
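
A minimal C sketch of the sign-bit approach described in the post above, assuming the platform stores float and double in the IEEE 754 binary32 and binary64 formats. The function names abs_clear_sign and abs_clear_sign64 are made up for illustration and are not from any of the attachments in this thread. On the NaN question: a NaN is identified by an all-ones exponent and a non-zero mantissa, so clearing the sign bit leaves it a NaN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Absolute value of a 32-bit float by clearing bit 31 (the sign flag).
   Assumes float is IEEE 754 binary32. Hypothetical helper name. */
float abs_clear_sign(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);   /* the trick needs a 32-bit float */
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the bytes safely */
    bits &= 0x7FFFFFFFu;               /* clear the sign, keep exponent and mantissa */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Same idea for a 64-bit double: the sign flag is bit 63. */
double abs_clear_sign64(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFFFFFFFFFull;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Going through memcpy rather than a pointer cast keeps the sketch within the C aliasing rules; compilers typically reduce it to a single AND.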

lingo

"The thieves are always liars"

"Lingo's code is a bit bloated but for once it does not throw exceptions"

I changed my code with JJ's code in his "test" program
and received different numbers for equal algos: :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg

4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg


--- ok ---



Rockoon


:dance:


AMD Phenom(tm) II X6 1055T Processor (SSE3)
6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-3      cycles for SetAbsJJ arg, pos

6       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-2      cycles for mov arg, AbsJJ(arg), neg
-3      cycles for SetAbsJJ arg, neg

6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-1      cycles for SetAbsJJ arg, pos

6       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-2      cycles for mov arg, AbsJJ(arg), neg
-3      cycles for SetAbsJJ arg, neg



jj2007

Quote from: Rockoon on August 26, 2010, 03:53:45 PM

:dance:

AMD Phenom(tm) II X6 1055T Processor (SSE3)
6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-3      cycles for SetAbsJJ arg, pos

Minus three cycles is pretty fast, where can I buy that CPU?

Seriously: Does it change if you use SetProcessAffinityMask? If not, we'd have to look at MichaelW's calibration loop.

MichaelW

Processors have gotten so fast now that occasionally they skip back in time :bg

Assuming that everything is running on a single core, results like this indicate that something interfered with the reference loop, making the overhead count larger than the test count. This is why I insert a delay of several seconds before I start timing, to allow time for the system activities involved in launching an app to subside.

eschew obfuscation
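
A small C sketch of the arithmetic behind the explanation above, assuming the timing macros report the raw count of the timed loop minus the count of an empty reference loop. The name net_cycles and the sample counts are hypothetical; the real macros are not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Net cycles as a calibrated timer might report them: the raw count for the
   timed code minus the count for an empty reference loop (the overhead).
   The result is signed, so an inflated overhead count shows up as a
   negative value. Hypothetical helper, not the forum's macro. */
int64_t net_cycles(uint64_t test_count, uint64_t overhead_count)
{
    /* Both counts must fit in a signed 64-bit value for the subtraction. */
    assert(test_count <= (uint64_t)INT64_MAX);
    assert(overhead_count <= (uint64_t)INT64_MAX);
    return (int64_t)test_count - (int64_t)overhead_count;
}
```

With made-up counts, net_cycles(203, 200) gives 3, while net_cycles(197, 200) gives -3: any effect that makes the reference loop read higher than the test loop produces a negative result.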

Rockoon

Quote from: jj2007 on August 26, 2010, 04:11:22 PM
Quote from: Rockoon on August 26, 2010, 03:53:45 PM

:dance:

AMD Phenom(tm) II X6 1055T Processor (SSE3)
6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-3      cycles for SetAbsJJ arg, pos

Minus three cycles is pretty fast, where can I buy that CPU?

Seriously: Does it change if you use SetProcessAffinityMask? If not, we'd have to look at MichaelW's calibration loop.

No change. The results are consistent across many runs, with affinity and without.

The timing code uses CPUID as the out-of-order serializing fence, surrounding RDTSC with two such fences, but that's not actually good enough!

In more direct terms, these are the possibilities:

(A) All of the preceding instructions are completed before CPUID begins.
(B) None of the following instructions are begun before CPUID ends.
(C) Both (A) and (B)

For CPUID to be a foolproof fence that bars all pairings with it (rather than just an out-of-order fence), it must obey (C), but that is not part of the specification. The specification demands (A) or (B), but that does not preclude the code being timed from pairing up with the CPUID instruction itself on one end or the other, which is what appears to be happening on my AMD.

Basically, CPUID is an out-of-order fence, but is not by specification a pairing fence.

Agner Fog uses the same CPUID serialization technique for his timing code, but he repeats the code being timed a hundred times (by default) between the start and end RDTSCs, so pairing effects with CPUID would be at most 1% of the total count. There, he is more concerned with code throughput than with code latency, with latency being deduced from throughput, secondary execution unit knowledge, and reasoning about the dependency chains.
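
The amortization argument can be sketched in C as follows. This is only an illustration of the arithmetic, not Agner Fog's code: cycles_per_rep, true_cycles and fence_skew are hypothetical names, and fence_skew stands for a fixed error (possibly negative) caused by the timed code overlapping with the fence instructions.

```c
#include <assert.h>

/* Estimated cycles per repetition when `reps` copies of the timed code sit
   between a single pair of fenced RDTSC reads. A fixed skew at the fences
   is spread over all repetitions, so its share shrinks as 1/reps.
   All values here are hypothetical. */
double cycles_per_rep(double true_cycles, double fence_skew, int reps)
{
    assert(reps > 0);
    return (true_cycles * reps + fence_skew) / reps;
}
```

Assuming a true cost of 1 cycle and a skew of -1 cycle, one repetition reads as 0 cycles, while 100 repetitions read as 0.99 cycles each, which is the 1% bound mentioned above.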

jj2007

Quote from: MichaelW on August 26, 2010, 05:07:15 PM
This is why I insert a delay of several seconds before I start timing, to allow time for the system activities involved in launching an app to subside.

You can try changing the SleepMs variable, but a delay of 2,000 ms delivers exactly the same results as a delay of 50 ms:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
4       cycles for AbsLingo, pos
3       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

Rockoon

Quote from: MichaelW on August 26, 2010, 05:07:15 PM
Processors have gotten so fast now that occasionally they skip back in time :bg

Assuming that everything is running on a single core, results like this indicate that something interfered with the reference loop, making the overhead count larger than the test count. This is why I insert a delay of several seconds before I start timing, to allow time for the system activities involved in launching an app to subside.



Unfortunately, this explanation isn't valid given the structure of the timing macros. The baseline is calibrated repeatedly, since each counter_begin "call" re-evaluates it; changing the code's REPEAT to 10 yields the same "irrational" results on all repetitions.



dedndave

i always wanted a computer that was done BEFORE i hit enter - lol
select a single core - REALTIME_PRIORITY_CLASS - keep the test bursts brief

jj2007

Quote from: lingo on August 26, 2010, 03:26:53 PM
"The thieves are always liars"

"Lingo's code is a bit bloated but for once it does not throw exceptions"

I changed my code with JJ's code in his "test" program
and received different numbers for equal algos: :lol

While I am happy to see that you liked my code, I am very sorry that you don't get the "correct" results. Maybe because you forgot to insert some of your magic bytes before progstart?

But you are right that the testbed should be neutral, so I added the a16 macro that aligns the code and places 16 rets before progstart. The results are now exactly the same for both identical progs (3 cycles), but I am afraid your original algo is just utterly, painfully slow:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
8       cycles for AbsLingo, pos
3       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos


But it does not raise exceptions :U

jj2007


lingo

"The thieves are always liars"

I changed my code with JJ's code in his "test" program
and received different numbers for equal algos: 
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg

4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg


--- ok ---

Rockoon

AMD Phenom(tm) II X6 1055T Processor (SSE3)
7       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-3      cycles for mov arg, AbsJJ(arg), pos
-2      cycles for SetAbsJJ arg, pos

7       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-3      cycles for mov arg, AbsJJ(arg), neg
-2      cycles for SetAbsJJ arg, neg

7       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-3      cycles for mov arg, AbsJJ(arg), pos
-2      cycles for SetAbsJJ arg, pos

7       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-3      cycles for mov arg, AbsJJ(arg), neg
-2      cycles for SetAbsJJ arg, neg


Nice and consistent... still, this timing method just isn't practical. Hasn't been for years.

Rockoon

Unfortunately, the latest AMD CodeAnalyst no longer includes a Pipeline Simulation mode. With that mode I could have at least shown that the code was indeed pairing up with CPUID on my AMD.

Yay for downgrades?


jj2007

Quote from: lingo on August 26, 2010, 07:33:29 PM
I changed my code with JJ's code in his "test" program
and received different numbers for equal algos:

Lingo, you repeat yourself, and that problem has been dealt with in reply #54. Your attachment has old code.