Test Or Compare?

Started by Neil, May 04, 2009, 01:39:05 PM

mitchi

Quote from: dedndave on May 04, 2009, 03:18:06 PM
for 0, i always used OR (or TEST) reg,reg
OR EAX,EAX - is this not faster than immediate ?
Neil is using 9, but inquiring minds want to know


On my Intel E8500 (45 nm, 2 x 3.16 GHz):

196   cycles for 400*test reg, reg  ----
129   cycles for 400*or reg, reg  ----
196   cycles for 400*cmp reg, 0

UtillMasm


jj2007

Quote from: dedndave on May 04, 2009, 03:18:06 PM
for 0, i always used OR (or TEST) reg,reg
OR EAX,EAX - is this not faster than immediate ?
Neil is using 9, but inquiring minds want to know


On a Celeron M, no - see my edit above. Interesting that test eax, eax is not the same as test ecx, ecx - sizewise:

A9 FFFFFFFF                  test eax, FFFFFFFF
F7C1 FFFFFFFF                test ecx, FFFFFFFF


Speedwise they are identical on my Celeron M (test reg, reg is slower...)

dedndave

oops JJ - lol

.
.
.

Interesting that test eax, eax is not the same as test ecx, ecx

then you tested TEST eax,immed and TEST ecx,immed - not TEST reg,reg

btw - reg,immed has special forms for many instructions if reg is eax
also special is XCHG eax,reg, as opposed to XCHG ecx,reg
the assemblers will code XCHG ECX,EAX as XCHG EAX,ECX to save a byte
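
The size difference is easy to check by writing the encodings out by hand. The following is a minimal sketch in C, not assembler output: the function names (`enc_test_imm32`, `enc_test_reg`, `enc_xchg`) are made up for illustration, and only the register-to-register and register-immediate forms discussed above are covered. The opcode bytes are the standard ones: `A9 id` for TEST EAX,imm32, `F7 /0 id` for TEST r/m32,imm32, `85 /r` for TEST r/m32,r32, `90+r` for XCHG EAX,r32 and `87 /r` for XCHG r/m32,r32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit register numbers as they appear in ModRM and opcode fields. */
enum { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };

/* Append a 32-bit immediate in little-endian order; returns the new length. */
static size_t put_imm32(uint8_t *out, size_t n, uint32_t imm) {
    for (int i = 0; i < 4; i++)
        out[n++] = (uint8_t)(imm >> (8 * i));
    return n;
}

/* test r32, imm32: short form A9 id for EAX, F7 /0 id for other registers. */
size_t enc_test_imm32(uint8_t *out, int reg, uint32_t imm) {
    size_t n = 0;
    assert(reg >= 0 && reg < 8);
    if (reg == EAX) {
        out[n++] = 0xA9;                    /* accumulator short form */
    } else {
        out[n++] = 0xF7;
        out[n++] = (uint8_t)(0xC0 | reg);   /* ModRM: mod=11, /0, rm=reg */
    }
    return put_imm32(out, n, imm);
}

/* test r32, r32: 85 /r, always two bytes, no special case for EAX. */
size_t enc_test_reg(uint8_t *out, int dst, int src) {
    assert(dst >= 0 && dst < 8 && src >= 0 && src < 8);
    out[0] = 0x85;
    out[1] = (uint8_t)(0xC0 | (src << 3) | dst);
    return 2;
}

/* xchg r32, r32: one byte (90+r) if either operand is EAX, else 87 /r. */
size_t enc_xchg(uint8_t *out, int a, int b) {
    assert(a >= 0 && a < 8 && b >= 0 && b < 8);
    if (a == EAX || b == EAX) {
        out[0] = (uint8_t)(0x90 + (a == EAX ? b : a));
        return 1;
    }
    out[0] = 0x87;
    out[1] = (uint8_t)(0xC0 | (b << 3) | a);
    return 2;
}
```

Calling `enc_test_imm32` with `EAX` and `0xFFFFFFFF` gives the 5-byte `A9 FF FF FF FF`, while `ECX` gives the 6-byte `F7 C1 FF FF FF FF`, matching the listing JJ posted. `enc_test_reg` returns 2 bytes for any register pair, so test eax, eax and test ecx, ecx are the same size.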

Mark Jones

Dave, I was going to blindly say, "the TEST instruction is considerably faster than CMP on some modern hardware," but I decided to test this, and with good results: they appear identical on this hardware as far as I can tell. This is by no means a comprehensive or complete analysis, but take a peek at the attachment.

Quote from: AMD Athlon x64 4000+ dual-core, XP Pro SP3 x32
Reference null tests:
Null:               0 clocks.
10x NOP:            3 clocks.

Failure-mode CMP tests:
10x CMP REG,REG:    3 clocks.
10x CMP REG,IMMED:  4 clocks.
10x CMP MEM,REG:    4 clocks.
10x CMP MEM,IMMED:  6 clocks.

Success-mode CMP tests:
10x CMP REG,REG:    3 clocks.
10x CMP REG,IMMED:  4 clocks.
10x CMP MEM,REG:    4 clocks.
10x CMP MEM,IMMED:  6 clocks.

Failure-mode TEST tests:
10x TEST REG,REG:   3 clocks.
10x TEST REG,IMMED: 4 clocks.
10x TEST MEM,REG:   4 clocks.
10x TEST MEM,IMMED: 6 clocks.

Success-mode TEST tests:
10x TEST REG,REG:   3 clocks.
10x TEST REG,IMMED: 4 clocks.
10x TEST MEM,REG:   4 clocks.
10x TEST MEM,IMMED: 6 clocks.

Edit: small code typo. (I'm notorious for my bugs.) :bg

[attachment deleted by admin]
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

dedndave

see Mark, this is what is driving me nuts - lol

on this run, i ran it with output redirected into a text file...

Reference null tests:
Null:               20 clocks.
10x NOP:            25 clocks.

Failure-mode CMP tests:
10x CMP REG,REG:    7 clocks.
10x CMP REG,IMMED:  -210 clocks.
10x CMP MEM,REG:    -202 clocks.
10x CMP MEM,IMMED:  -202 clocks.

Success-mode CMP tests:
10x CMP REG,REG:    1 clocks.
10x CMP REG,IMMED:  554189126 clocks.
10x CMP MEM,REG:    -202 clocks.
10x CMP MEM,IMMED:  -202 clocks.

Failure-mode TEST tests:
10x TEST REG,REG:   1356305252 clocks.
10x TEST REG,IMMED: 554189125 clocks.
10x TEST MEM,REG:   6 clocks.
10x TEST MEM,IMMED: 54 clocks.

Success-mode TEST tests:
10x TEST REG,REG:   18 clocks.
10x TEST REG,IMMED: 15 clocks.
10x TEST MEM,REG:   -330 clocks.
10x TEST MEM,IMMED: 47 clocks.

Press any key to exit...


on this run, i ran it at the console...

Reference null tests:
Null:               286330943 clocks.
10x NOP:            1227133468 clocks.

Failure-mode CMP tests:
10x CMP REG,REG:    -858993677 clocks.
10x CMP REG,IMMED:  -203 clocks.
10x CMP MEM,REG:    -2004318275 clocks.
10x CMP MEM,IMMED:  -210 clocks.

Success-mode CMP tests:
10x CMP REG,REG:    954436967 clocks.
10x CMP REG,IMMED:  8 clocks.
10x CMP MEM,REG:    8 clocks.
10x CMP MEM,IMMED:  8 clocks.

Failure-mode TEST tests:
10x TEST REG,REG:   1908874143 clocks.
10x TEST REG,IMMED: 7 clocks.
10x TEST MEM,REG:   7 clocks.
10x TEST MEM,IMMED: 15 clocks.

Success-mode TEST tests:
10x TEST REG,REG:   -1840700479 clocks.
10x TEST REG,IMMED: 7 clocks.
10x TEST MEM,REG:   8 clocks.
10x TEST MEM,IMMED: 15 clocks.

Press any key to exit...


just imagine what i can do with clocks that are (+) - lol

i am using a prescott at 3 ghz

jj2007

Quote from: Mark Jones on May 04, 2009, 05:14:55 PM
Dave, I was going to blindly say, "the TEST instruction is considerably faster than CMP on some modern hardware," but I decided to test this
...
Edit: small code typo. (I'm notorious for my bugs.) :bg

Interesting. Celeron M:
Reference null tests:
Null:               0 clocks.
10x NOP:            12 clocks

Failure-mode CMP tests:
10x CMP REG,REG:    0 clocks.
10x CMP REG,IMMED:  12 clocks
10x CMP MEM,REG:    24 clocks
10x CMP MEM,IMMED:  24 clocks

Success-mode CMP tests:
10x CMP REG,REG:    0 clocks.
10x CMP REG,IMMED:  12 clocks
10x CMP MEM,REG:    12 clocks
10x CMP MEM,IMMED:  24 clocks

Failure-mode TEST tests:
10x TEST REG,REG:   12 clocks
10x TEST REG,IMMED: 12 clocks
10x TEST MEM,REG:   24 clocks
10x TEST MEM,IMMED: 12 clocks

Success-mode TEST tests:
10x TEST REG,REG:   12 clocks
10x TEST REG,IMMED: 0 clocks.
10x TEST MEM,REG:   24 clocks
10x TEST MEM,IMMED: 24 clocks

dedndave

obviously, it is running the timer from both cores

Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that
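
Until the thread is pinned to one core (on Windows this is commonly done with SetThreadAffinityMask), the bad samples can at least be kept out of the report. The sketch below is an assumption about how the numbers might be post-processed, not what Mark's attachment does: it takes raw before/after counter readings, skips any sample where the counter appears to run backwards, keeps the smallest remaining difference, and subtracts the empty-loop overhead without going below zero. The names `sample_t`, `min_elapsed` and `net_cycles` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One timing sample: raw counter values read before and after the code. */
typedef struct { uint64_t start, end; } sample_t;

/* Smallest elapsed count among the samples, or UINT64_MAX if none is usable.
   Samples with end < start are skipped: the two readings most likely came
   from counters that are not in sync (for example, different cores). */
uint64_t min_elapsed(const sample_t *s, size_t n) {
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < n; i++) {
        if (s[i].end < s[i].start)
            continue;
        uint64_t d = s[i].end - s[i].start;
        if (d < best)
            best = d;
    }
    return best;
}

/* Net cost = min(test loop) - min(empty loop), clamped at zero so that
   measurement noise cannot produce a negative clock count.
   Returns UINT64_MAX if either set has no usable sample. */
uint64_t net_cycles(const sample_t *test, size_t nt,
                    const sample_t *empty, size_t ne) {
    uint64_t t = min_elapsed(test, nt);
    uint64_t e = min_elapsed(empty, ne);
    if (t == UINT64_MAX || e == UINT64_MAX)
        return UINT64_MAX;
    return t > e ? t - e : 0;
}
```

Taking the minimum is a design choice: an interrupt or a task switch can only make a run look slower, so the smallest difference is the one least disturbed by the rest of the system. It hides the symptom only; it does not answer the question of how to total the cycles across all cores.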

Mark Jones

Wow, that is interesting indeed. :dazzled: :bg

I believe the timer may indeed be running from more than one core. Strange I would get such repeatable results. Are your results repeatable JJ?

Curious Dave, try changing "HIGH_PRIORITY_CLASS" to "REALTIME_PRIORITY_CLASS" and see if that makes any difference.

mitchi

From my own tests, OR REG, REG is the fastest test for zero. Does everyone agree?
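
Whichever one is fastest, the three candidates are interchangeable as far as the result goes. Here is a small C model of the flags they leave behind; it says nothing about speed, and the type and function names are made up for the sketch. One real difference the model does not show: OR writes its destination register (with the same value), while TEST and CMP only write the flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The four status flags a following Jcc would normally look at. */
typedef struct { bool zf, sf, cf, of; } flags_t;

/* Flags after a logical operation (OR, or the AND done by TEST):
   ZF and SF follow the result, CF and OF are always cleared. */
static flags_t logic_flags(uint32_t r) {
    flags_t f = { r == 0, (r >> 31) != 0, false, false };
    return f;
}

/* or reg, reg: the result is the register itself. */
flags_t flags_or_reg_reg(uint32_t x)   { return logic_flags(x | x); }

/* test reg, reg: AND of the register with itself, result discarded. */
flags_t flags_test_reg_reg(uint32_t x) { return logic_flags(x & x); }

/* cmp reg, 0: computes x - 0, so no borrow (CF=0) and no overflow (OF=0). */
flags_t flags_cmp_reg_zero(uint32_t x) {
    uint32_t r = x - 0u;
    flags_t f = { r == 0, (r >> 31) != 0, false, false };
    return f;
}

/* True when two flag sets agree on all four modelled flags. */
bool same_flags(flags_t a, flags_t b) {
    return a.zf == b.zf && a.sf == b.sf && a.cf == b.cf && a.of == b.of;
}
```

For any input, all three functions return the same flags, so a JZ or JNZ after any of them behaves identically; the choice between them comes down to size and timing on the target CPU.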

dedndave

i can't test it Mitchi - lol
but, good to know because that is what i have always used to test for 0

repeat...
Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that

jj2007

Quote from: Mark Jones on May 04, 2009, 05:34:38 PM
Are your results repeatable JJ?

Yes, more or less. Always multiples of 12.

Quote from: dedndave on May 04, 2009, 05:38:00 PM
that means that the advantage of having 2 cores (or more) is negated by the test

Not really. It is quite unlikely that you can convince the OS to run your fast inner loop in both cores simultaneously. In practice, your code runs on core 1, while Microsoft Word runs more or less independently on core 2. Now if you have three loop variants in "your" core 1, it is still very meaningful to compare their relative performance.

But there are guys here who know much more about parallel code execution - try the Search box :bg

jj2007

Quote from: mitchi on May 04, 2009, 05:36:56 PM
From my own tests, OR REG, REG is the fastest test for zero. Does everyone agree?

Not on a Celeron M:
197     cycles for 400*test eax, -1
197     cycles for 400*test ecx, -1
385     cycles for 400*or ecx, ecx
262     cycles for 400*test reg, reg

ecube

Quote from: jj2007 on May 04, 2009, 06:28:01 PM
Quote from: mitchi on May 04, 2009, 05:36:56 PM
From my own tests, OR REG, REG is the fastest test for zero. Does everyone agree?

Not on a Celeron M:
197     cycles for 400*test eax, -1
197     cycles for 400*test ecx, -1
385     cycles for 400*or ecx, ecx
262     cycles for 400*test reg, reg


Funny how your cpu always has different overall results than everyone else... ::)

mitchi

It's a good thing to have various results. Now I see why the Visual C++ compiler uses the test instruction whenever it needs to test for zero. It's because it has a good speed on all processors.