I have a piece of code :-
cmp eax,9
ja @F
if I change it to :-
test eax,9
ja @F
I get totally unpredictable results, am I missing something here?
TEST sets the flags as if you had used AND, but does not modify the destination register
CMP sets the flags as if you had used SUB, but does not modify the destination register
TEST is logical, CMP is arithmetic
TEST eax,9
JA @F
it is important to note that logical instructions always clear the CARRY FLAG (CF)
that includes AND, OR, XOR, TEST (the NOT instruction alters no flags)
the JA instruction branches if the ZERO FLAG (ZF) is clear (i.e. not zero) and the CF is clear
normally, after a TEST instruction, a JZ or JNZ instruction is used
we know the CF is clear, so use a branch that does not look at it
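To see why the JA misbehaves after TEST, here is a small model of the two instructions. It is a sketch in Python rather than MASM, and the helper names (flags_after_cmp, flags_after_test, ja_taken) are made up for illustration. It only tracks ZF and CF, the two flags JA looks at.

```python
MASK32 = 0xFFFFFFFF

def flags_after_cmp(a, b):
    """ZF and CF as CMP a, b leaves them: the flags of the 32-bit subtraction a - b."""
    a &= MASK32
    b &= MASK32
    return {"ZF": a == b, "CF": a < b}  # CF is the unsigned borrow

def flags_after_test(a, b):
    """ZF and CF as TEST a, b leaves them: the flags of a AND b, with CF always cleared."""
    return {"ZF": (a & b & MASK32) == 0, "CF": False}

def ja_taken(flags):
    """JA branches only when both CF and ZF are clear."""
    return not flags["CF"] and not flags["ZF"]

# Compare the branch decision for a few values of eax.
for eax in (0, 1, 2, 6, 9, 10, 16):
    print(eax,
          ja_taken(flags_after_cmp(eax, 9)),
          ja_taken(flags_after_test(eax, 9)))
```

With cmp the branch is taken exactly when eax is above 9 (unsigned). With test it is taken whenever eax has bit 0 or bit 3 set, whatever its magnitude, which is why the results look unpredictable.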
thanks dedndave, my brain wasn't in gear :dazzled: it's the JA instruction that's causing the problem, I thought that test might be quicker than cmp (hope this doesn't start another war :P)
what do these mean?
cmp eax,0
jl far 75550000
test eax,eax
jl far 75550000
lol - those aren't really new wars - they have been going on since the forum existed, i think
i think TEST and CMP are the same, time-wise
choose the one that most aptly suits the application
is the CMP eax,9 an arithmetic test or a logical test?
if it is arithmetic, use CMP
this helps make code more legible
Just had a look at opcodes.chm in the help file & here's what it says:-
CMP reg,mem 486 2 clock cycles & 2 to 4 bytes size
TEST reg,rmem 486 1 clock cycle & 2 to 4 bytes size
So it appears that, on a 486, TEST REG,MEM is twice as fast as CMP REG,MEM.
I have a feeling that someone will prove me wrong :toothy
lol - i dunno UtillMasm
there is no "far", that i know of
JL is a signed comparison
it branches if the OVERFLOW FLAG (OF) is not equal to the SIGN FLAG (SF)
TEST always clears the OF too, i think
so, after a TEST instruction, it is the same as JS - never see it used that way
after TEST eax,0, it would never branch
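The same kind of model works for the JL question. This is a Python sketch, not MASM, with hypothetical helper names. It tracks only SF and OF, and it assumes 32-bit operands.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000

def sf_of_after_cmp(a, b):
    """SF and OF as CMP a, b leaves them (flags of the 32-bit subtraction a - b)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    sf = bool(result & SIGN32)
    # Signed overflow: operands differ in sign and the result's sign differs from a's.
    of = bool((a ^ b) & (a ^ result) & SIGN32)
    return sf, of

def sf_of_after_test(a, b):
    """SF and OF as TEST a, b leaves them: SF from a AND b, OF always cleared."""
    result = a & b & MASK32
    return bool(result & SIGN32), False

def jl_taken(sf, of):
    """JL branches when SF differs from OF."""
    return sf != of

for eax in (0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
    print(hex(eax),
          jl_taken(*sf_of_after_cmp(eax, 0)),     # cmp eax,0   / jl
          jl_taken(*sf_of_after_test(eax, eax)),  # test eax,eax / jl
          jl_taken(*sf_of_after_test(eax, 0)))    # test eax,0   / jl
```

In this model cmp eax,0 / jl and test eax,eax / jl make the same decision for every value: they branch when the sign bit of eax is set. Only test eax,0 / jl never branches.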
hmmmm - good to know, Neil
i am new to 32-bit code, so i have to re-memorize all the dang timings
they were so much simpler for the 8088
oops - you grabbed different forms of the instructions ?
reg,mem
reg,rmem
you are comparing a register to an immediate value
reg,immed
Quote from: Neil on May 04, 2009, 02:17:38 PM
Just had a look at opcodes.chm in the help file & here's what it says:-
CMP reg,mem 486 2 clock cycles & 2 to 4 bytes size
TEST reg,rmem 486 1 clock cycle & 2 to 4 bytes size
So it appears that, on a 486, TEST REG,MEM is twice as fast as CMP REG,MEM.
I have a feeling that someone will prove me wrong :toothy
Nobody in this forum would do such a horrible thing :naughty:
But test yourself... :bg
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

; number of timing loops per measurement
LOOP_COUNT = 1000000

.code
start:
  ; run the whole set three times to check that the numbers repeat
  REPEAT 3
    ; 10 * 4 = 40 cmp reg, immediate
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 10
        cmp eax, 123
        cmp ecx, 123
        cmp edx, 123
        cmp edi, 123
      ENDM
    counter_end
    print str$(eax), 9, "cycles for 40*cmp", 13, 10

    ; 10 * 4 = 40 test reg, immediate
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 10
        test eax, 123
        test ecx, 123
        test edx, 123
        test edi, 123
      ENDM
    counter_end
    print str$(eax), 9, "cycles for 40*test ----", 13, 10

    ; 100 * 4 = 400 cmp reg, immediate
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 100
        cmp eax, 123
        cmp ecx, 123
        cmp edx, 123
        cmp edi, 123
      ENDM
    counter_end
    print str$(eax), 9, "cycles for 400*cmp", 13, 10

    ; 100 * 4 = 400 test reg, immediate
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 100
        test eax, 123
        test ecx, 123
        test edx, 123
        test edi, 123
      ENDM
    counter_end
    print str$(eax), 9, "cycles for 400*test", 13, 10, 10
  ENDM

  inkey "--- ok ---"
  exit
end start
EDIT: Results for a Celeron M:
21 cycles for 40*cmp
15 cycles for 40*test reg, 123
21 cycles for 40*test reg, reg
262 cycles for 400*cmp
195 cycles for 400*test reg, 123
262 cycles for 400*test reg, reg
21 cycles for 40*cmp
15 cycles for 40*test reg, 123
21 cycles for 40*test reg, reg
262 cycles for 400*cmp
195 cycles for 400*test reg, 123
262 cycles for 400*test reg, reg
21 cycles for 40*cmp
15 cycles for 40*test reg, 123
21 cycles for 40*test reg, reg
262 cycles for 400*cmp
195 cycles for 400*test reg, 123
262 cycles for 400*test reg, reg
Normally a test for zero is faster with the mnemonic TEST than with CMP REG, 0, but from memory this varies with hardware. Most Intel processors are faster with TEST, and Intel recommend using TEST for this purpose.
It's pretty much equally fast here on a recent Intel Processor (Intel E8500).
Thanks jj, all the tests have identical cycle counts, so I won't bother changing any of my code.
UtillMasm is giving me a headache, Hutch
I want him censored
for 0, i always used OR (or TEST) reg,reg
OR EAX,EAX - is this not faster than immediate ?
Neil is using 9, but inquiring minds want to know
Quote from: hutch-- on May 04, 2009, 02:31:26 PM
Normally a test for zero is faster with the mnemonic TEST than with CMP REG, 0, but from memory this varies with hardware. Most Intel processors are faster with TEST, and Intel recommend using TEST for this purpose.
For those that have fun debugging their code after seemingly
minor edits, note that:
TEST AX,AX
JZ @F
CMP AX,AX
JZ @F
have different results. Almost as much fun as mixing up signed
and unsigned conditional jumps.
Cheers,
Steve N.
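A quick way to check the claim is to walk every 16-bit value of AX through a model of both instructions. This is a Python sketch with made-up helper names, tracking only ZF, since that is all JZ looks at.

```python
MASK16 = 0xFFFF

def zf_after_test(a, b):
    """ZF after TEST a, b: set when the 16-bit AND of the operands is zero."""
    return (a & b & MASK16) == 0

def zf_after_cmp(a, b):
    """ZF after CMP a, b: set when the 16-bit difference a - b is zero."""
    return ((a - b) & MASK16) == 0

# Values of AX for which JZ is taken after TEST AX,AX.
test_taken = [ax for ax in range(0x10000) if zf_after_test(ax, ax)]

# Number of AX values for which JZ is taken after CMP AX,AX.
cmp_taken_count = sum(zf_after_cmp(ax, ax) for ax in range(0x10000))

# CMP AX,0 is the form that matches TEST AX,AX for a zero check.
cmp_zero_taken = [ax for ax in range(0x10000) if zf_after_cmp(ax, 0)]

print(test_taken, cmp_taken_count, cmp_zero_taken)
```

In the model, JZ after TEST AX,AX is taken only for AX = 0, while JZ after CMP AX,AX is taken for every value, so the second form behaves like an unconditional jump. CMP AX,0 is the one that agrees with TEST AX,AX.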
Quote from: dedndave on May 04, 2009, 03:18:06 PM
for 0, i always used OR (or TEST) reg,reg
OR EAX,EAX - is this not faster than immediate ?
Neil is using 9, but inquiring minds want to know
On my Intel E8500 45nm 2x3.16ghz :
196 cycles for 400*test reg, reg ----
129 cycles for 400*or reg, reg ----
196 cycles for 400* cmp reg, 0
me too. :bg
Quote from: dedndave on May 04, 2009, 03:18:06 PM
for 0, i always used OR (or TEST) reg,reg
OR EAX,EAX - is this not faster than immediate ?
Neil is using 9, but inquiring minds want to know
On a Celeron M, no - see my edit above. Interesting that test eax, eax is not the same as test ecx, ecx - sizewise:
A9 FFFFFFFF test eax, FFFFFFFF
F7C1 FFFFFFFF test ecx, FFFFFFFF
Speedwise they are identical on my Celeron M (test reg, reg is slower...)
oops JJ - lol
.
.
.
Interesting that test eax, eax is not the same as test ecx, ecx
then you tested TEST eax,immed and TEST ecx,immed - not TEST reg,reg
btw - reg,immed has special forms for many instructions if reg is eax
also special is XCHG eax,reg, as opposed to XCHG ecx,reg
the assemblers will code XCHG ECX,EAX as XCHG EAX,ECX to save a byte
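The size differences come straight from the opcode map. Below is a rough Python sketch that builds the bytes for the register forms discussed here. The function names are hypothetical, it handles 32-bit registers only, and it is not how any particular assembler is implemented.

```python
import struct

# Standard 32-bit register numbers used in opcode and ModRM fields.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_test_reg_imm32(reg, imm):
    """TEST r32, imm32: A9 id for eax, F7 /0 id for any other register."""
    imm_bytes = struct.pack("<I", imm & 0xFFFFFFFF)
    if reg == "eax":
        return bytes([0xA9]) + imm_bytes       # short accumulator form
    modrm = 0xC0 | REG32[reg]                  # mod=11, reg field=/0, rm=reg
    return bytes([0xF7, modrm]) + imm_bytes

def encode_test_reg_reg(dst, src):
    """TEST r/m32, r32: 85 /r."""
    return bytes([0x85, 0xC0 | (REG32[src] << 3) | REG32[dst]])

def encode_xchg_reg_reg(a, b):
    """XCHG: one-byte 90+r form when either operand is eax, else 87 /r."""
    if a == "eax" or b == "eax":
        other = b if a == "eax" else a
        return bytes([0x90 + REG32[other]])
    return bytes([0x87, 0xC0 | (REG32[b] << 3) | REG32[a]])

print(encode_test_reg_imm32("eax", 0xFFFFFFFF).hex())
print(encode_test_reg_imm32("ecx", 0xFFFFFFFF).hex())
print(encode_test_reg_reg("ecx", "ecx").hex())
print(encode_xchg_reg_reg("ecx", "eax").hex())
```

The first two results match the A9 and F7C1 listings above: 5 bytes for eax against 6 for ecx. The reg,reg form is 2 bytes for either register, and XCHG ECX,EAX collapses to the single byte 91.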
Dave, I was going to blindly say, "the TEST instruction is considerably faster than CMP on some modern hardware," but I decided to test this, and with good results: they appear identical on this hardware as far as I can tell. This is by no means a comprehensive or complete analysis, but take a peek at the attachment.
Quote from: AMD Athlon x64 4000+ dual-core, XP Pro SP3 x32
Reference null tests:
Null: 0 clocks.
10x NOP: 3 clocks.
Failure-mode CMP tests:
10x CMP REG,REG: 3 clocks.
10x CMP REG,IMMED: 4 clocks.
10x CMP MEM,REG: 4 clocks.
10x CMP MEM,IMMED: 6 clocks.
Success-mode CMP tests:
10x CMP REG,REG: 3 clocks.
10x CMP REG,IMMED: 4 clocks.
10x CMP MEM,REG: 4 clocks.
10x CMP MEM,IMMED: 6 clocks.
Failure-mode TEST tests:
10x TEST REG,REG: 3 clocks.
10x TEST REG,IMMED: 4 clocks.
10x TEST MEM,REG: 4 clocks.
10x TEST MEM,IMMED: 6 clocks.
Success-mode TEST tests:
10x TEST REG,REG: 3 clocks.
10x TEST REG,IMMED: 4 clocks.
10x TEST MEM,REG: 4 clocks.
10x TEST MEM,IMMED: 6 clocks.
Edit: small code typo. (I'm notorious for my bugs.) :bg
[attachment deleted by admin]
see Mark, this is what is driving me nuts - lol
on this run, i ran it with output redirected into a text file...
Reference null tests:
Null: 20 clocks.
10x NOP: 25 clocks.
Failure-mode CMP tests:
10x CMP REG,REG: 7 clocks.
10x CMP REG,IMMED: -210 clocks.
10x CMP MEM,REG: -202 clocks.
10x CMP MEM,IMMED: -202 clocks.
Success-mode CMP tests:
10x CMP REG,REG: 1 clocks.
10x CMP REG,IMMED: 554189126 clocks.
10x CMP MEM,REG: -202 clocks.
10x CMP MEM,IMMED: -202 clocks.
Failure-mode TEST tests:
10x TEST REG,REG: 1356305252 clocks.
10x TEST REG,IMMED: 554189125 clocks.
10x TEST MEM,REG: 6 clocks.
10x TEST MEM,IMMED: 54 clocks.
Success-mode TEST tests:
10x TEST REG,REG: 18 clocks.
10x TEST REG,IMMED: 15 clocks.
10x TEST MEM,REG: -330 clocks.
10x TEST MEM,IMMED: 47 clocks.
Press any key to exit...
on this run, i ran it at the console...
Reference null tests:
Null: 286330943 clocks.
10x NOP: 1227133468 clocks.
Failure-mode CMP tests:
10x CMP REG,REG: -858993677 clocks.
10x CMP REG,IMMED: -203 clocks.
10x CMP MEM,REG: -2004318275 clocks.
10x CMP MEM,IMMED: -210 clocks.
Success-mode CMP tests:
10x CMP REG,REG: 954436967 clocks.
10x CMP REG,IMMED: 8 clocks.
10x CMP MEM,REG: 8 clocks.
10x CMP MEM,IMMED: 8 clocks.
Failure-mode TEST tests:
10x TEST REG,REG: 1908874143 clocks.
10x TEST REG,IMMED: 7 clocks.
10x TEST MEM,REG: 7 clocks.
10x TEST MEM,IMMED: 15 clocks.
Success-mode TEST tests:
10x TEST REG,REG: -1840700479 clocks.
10x TEST REG,IMMED: 7 clocks.
10x TEST MEM,REG: 8 clocks.
10x TEST MEM,IMMED: 15 clocks.
Press any key to exit...
just imagine what i can do with clocks that are (+) - lol
i am using a prescott at 3 ghz
Quote from: Mark Jones on May 04, 2009, 05:14:55 PM
Dave, I was going to blindly say, "the TEST instruction is considerably faster than CMP on some modern hardware," but I decided to test this
...
Edit: small code typo. (I'm notorious for my bugs.) :bg
Interesting. Celeron M:
Reference null tests:
Null: 0 clocks.
10x NOP: 12 clocks
Failure-mode CMP tests:
10x CMP REG,REG: 0 clocks.
10x CMP REG,IMMED: 12 clocks
10x CMP MEM,REG: 24 clocks
10x CMP MEM,IMMED: 24 clocks
Success-mode CMP tests:
10x CMP REG,REG: 0 clocks.
10x CMP REG,IMMED: 12 clocks
10x CMP MEM,REG: 12 clocks
10x CMP MEM,IMMED: 24 clocks
Failure-mode TEST tests:
10x TEST REG,REG: 12 clocks
10x TEST REG,IMMED: 12 clocks
10x TEST MEM,REG: 24 clocks
10x TEST MEM,IMMED: 12 clocks
Success-mode TEST tests:
10x TEST REG,REG: 12 clocks
10x TEST REG,IMMED: 0 clocks.
10x TEST MEM,REG: 24 clocks
10x TEST MEM,IMMED: 24 clocks
obviously, it is running the timer from both cores
Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that
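If the guess about the timer being read on more than one core is right, the arithmetic would go wrong roughly as in the toy model below. It is a Python sketch built on assumptions: each core has its own cycle counter with its own offset, the offsets are invented numbers, and nothing here reads a real counter or reproduces the results posted above.

```python
def measure(tsc_offsets, start_core, end_core, true_cycles, start_time=1_000_000):
    """Return end - start as a timing macro would compute it from two counter reads.

    tsc_offsets maps core number -> offset of that core's counter (hypothetical).
    """
    start = start_time + tsc_offsets[start_core]
    end = start_time + true_cycles + tsc_offsets[end_core]
    return end - start

# Hypothetical offsets: core 1's counter lags core 0's by 500 ticks.
offsets = {0: 0, 1: -500}

pinned = measure(offsets, 0, 0, 30)    # thread stays on core 0 for both reads
migrated = measure(offsets, 0, 1, 30)  # thread moved to core 1 between the reads

print(pinned, migrated)
```

In this model the pinned measurement returns the true 30 cycles, while the migrated one returns a negative number. That is the reason usually given for confining a timing thread to one core, for example with SetThreadAffinityMask on Windows.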
Wow, that is interesting indeed. :dazzled: :bg
I believe the timer may indeed be running from more than one core. Strange I would get such repeatable results. Are your results repeatable JJ?
Curious Dave, try changing "HIGH_PRIORITY_CLASS" to "REALTIME_PRIORITY_CLASS" and see if that makes any difference.
From my own tests, OR REG, REG is the fastest test for zero. Does everyone agree?
i can't test it Mitchi - lol
but, good to know because that is what i have always used to test for 0
repeat...
Microsoft suggests confining the thread to a single core
that means that the advantage of having 2 cores (or more) is negated by the test
what is needed, is to acquire the timer values from all cores, run the test, then acquire them all again
and see how many total cycles were used
i am not experienced enough to know how to do that
Quote from: Mark Jones on May 04, 2009, 05:34:38 PM
Are your results repeatable JJ?
Yes, more or less. Always multiples of 12.
Quote from: dedndave on May 04, 2009, 05:38:00 PM
that means that the advantage of having 2 cores (or more) is negated by the test
Not really. It is quite unlikely that you can convince the OS to run your fast inner loop in both cores simultaneously. In practice, your code runs on core 1, while Microsoft Word runs more or less independently on core 2. Now if you have three loop variants in "your" core 1, it is still very meaningful to compare their relative performance.
But there are guys here who know much more about parallel code execution - try the Search box :bg
Quote from: mitchi on May 04, 2009, 05:36:56 PM
From my own tests, OR REG, REG is the fastest test for zero. Does everyone agree?
Not on a Celeron M:
197 cycles for 400*test eax, -1
197 cycles for 400*test ecx, -1
385 cycles for 400*or ecx, ecx
262 cycles for 400*test reg, reg
Quote from: jj2007 on May 04, 2009, 06:28:01 PM
Quote from: mitchi on May 04, 2009, 05:36:56 PM
From my own tests, OR REG, REG is the fastest test for zero. Does everyone agree?
Not on a Celeron M:
197 cycles for 400*test eax, -1
197 cycles for 400*test ecx, -1
385 cycles for 400*or ecx, ecx
262 cycles for 400*test reg, reg
Funny how your cpu always has different overall results than everyone else... ::)
It's a good thing to have various results. Now I see why the Visual C++ compiler uses the test instruction whenever it needs to test for zero. It's because it has a good speed on all processors.
Quote from: dedndave on Today at 11:38:00 AM
that means that the advantage of having 2 cores (or more) is negated by the test
Not really. It is quite unlikely that you can convince the OS to run your fast inner loop in both cores simultaneously. In practice, your code runs on core 1, while Microsoft Word runs more or less independently on core 2. Now if you have three loop variants in "your" core 1, it is still very meaningful to compare their relative performance.
JJ ! - lol
i am not even using MS word
please don't tell me MS has reserved half my brain for MS word - lol
i absolutely say that if you have a dual core processor, and you confine a test to one core, you will not see the advantage of having two cores - this seems pretty basic
Quote from: dedndave on May 04, 2009, 07:41:41 PM
i absolutely say that if you have a dual core processor, and you confine a test to one core, you will not see the advantage of having two cores - this seems pretty basic
Correct but not very meaningful. You do not see the absolute advantage, but you can still compare algo A with algo B on a relative basis - and you want to know which one is faster, right?
Quote from: Mark Jones on May 04, 2009, 05:34:38 PM
Are your results repeatable JJ?
One more on Celeron M:
Reference null tests:
Null: 12 clocks.
1000x NOP: 516 clocks.
Failure-mode CMP tests:
1000x CMP REG,REG: 1002 clocks.
1000x CMP REG,IMMED: 492 clocks.
1000x CMP MEM,REG: 1008 clocks.
1000x CMP MEM,IMMED: 1008 clocks.
Success-mode CMP tests:
1000x CMP REG,REG: 996 clocks.
1000x CMP REG,IMMED: 504 clocks.
1000x CMP MEM,REG: 1008 clocks.
1000x CMP MEM,IMMED: 1008 clocks.
Failure-mode TEST tests:
1000x TEST REG,REG: 996 clocks.
1000x TEST REG,IMMED: 492 clocks.
1000x TEST MEM,REG: 1008 clocks.
1000x TEST MEM,IMMED: 1008 clocks.
Success-mode TEST tests:
1000x TEST REG,REG: 996 clocks.
1000x TEST REG,IMMED: 504 clocks.
1000x TEST MEM,REG: 1008 clocks.
1000x TEST MEM,IMMED: 1008 clocks.
Now the picture is getting clearer...
Bloated exe attached :bg
[attachment deleted by admin]
ok JJ, my friend, you have me challenged, now - lol
let me see if i can figure out how to do this....
Quote from: dedndave on May 04, 2009, 08:11:44 PM
ok JJ, my friend, you have me challenged, now - lol
:bg
What makes me furious, though, is that a 6-byte test ecx, -1 is twice as fast as a 2-byte test ecx, ecx - that's just not fair! :8)
Highly repeatable results here.
Reference null tests:
Null: 0 clocks.
1000x NOP: 333 clocks.
Failure-mode CMP tests:
1000x CMP REG,REG: 333 clocks.
1000x CMP REG,IMMED: 363 clocks.
1000x CMP MEM,REG: 505 clocks.
1000x CMP MEM,IMMED: 625 clocks.
Success-mode CMP tests:
1000x CMP REG,REG: 333 clocks.
1000x CMP REG,IMMED: 363 clocks.
1000x CMP MEM,REG: 505 clocks.
1000x CMP MEM,IMMED: 625 clocks.
Failure-mode TEST tests:
1000x TEST REG,REG: 333 clocks.
1000x TEST REG,IMMED: 356 clocks.
1000x TEST MEM,REG: 505 clocks.
1000x TEST MEM,IMMED: 625 clocks.
Success-mode TEST tests:
1000x TEST REG,REG: 333 clocks.
1000x TEST REG,IMMED: 356 clocks.
1000x TEST MEM,REG: 505 clocks.
1000x TEST MEM,IMMED: 625 clocks.
It may be worth noting that I have BOINC (http://boinc.berkeley.edu/) running in the background, which pretty much keeps both cores fully utilized in all available idle time (thus preventing the CPU from entering any power-saving or clock-reducing state, but yielding to any higher-priority thread). However, the timing is exactly the same whether the BOINC worker threads are running or not. (Back to the drawing board, lol.)
yes - it seems logical (and fair - lol) that reg,reg is faster than reg,immed
of course, i have not learned these new CPUs well enough to say
certainly, on an 8088, reg,reg was always faster, and usually fewer bytes
it sounds to me like something is wrong with the measurement, somehow
btw JJ - i am looking into writing a simple test that reads RDTSC from each core, then runs the test sequence with both cores running
it may take me a while - and i may need some help, as i am already lost with thread limitation and access - lol
let me see how far i can get on my own before i ask for help
Steve,
Quote
For those that have fun debugging their code after seemingly
minor edits, note that:
Code:
TEST AX,AX
JZ @F
CMP AX,AX
JZ @F
have different results. Almost as much fun as mixing up signed
and unsigned conditional jumps.
The comment appears to be misleading. I posted a well-known technique to test for zero, so you have code like this:
test eax, eax
jz label
CMP code would be like this.
cmp eax, 0
je label
in newer cores, cmp/Jcc is fused together into a single macro-op
so what is the point of measuring cmp/test by itself, when branch misprediction causes most of the lost cycles and it is more interesting to know how much I lose there?
Quote from: dedndave on May 04, 2009, 09:13:00 PM
btw JJ - i am looking into writing a simple test that reads RDTSC from each core, then runs the test sequence with both cores running
it may take me a while - and i may need some help, as i am already lost with thread limitation and access - lol
let me see how far i can get on my own before i ask for help
Just found an excellent source: Performance measurements with RDTSC (http://www.strchr.com/performance_measurements_with_rdtsc#getting_not_only_timings_but_also_performance_event_counters) by Peter Kankowski.
great reading JJ - thanks
poking around, i am learning things i hadn't intended to - lol
on the bright side, this doesn't look as hard as i thought
Quote from: hutch-- on May 05, 2009, 01:52:52 AM
Steve,
The comment appears to be misleading,
Hutch,
Sorry, my point was intended to show that if you use CMP AX,AX,
or its ilk, you are probably misusing the instruction. CMP AX,AX can
only have one possible result, which negates any use outside clearing
flags to a known state or a NOP with side effects. Use of TEST and
CMP can sometimes confuse beginners, or (in my case, the last time it
bit) sleepy programmers. You (I) can stare at the code for a long time
before you notice that perfectly legal code is inappropriate.
Your code shows a much more reasonable use of CMP.
Regards,
Steve N.
The trick with TEST is to treat it like an AND that does not store its result and only updates the flags. The real problem for learners is not understanding the flags after an instruction has been run. It's just part of the learning curve of assembler, but I agree with you that TEST AX, AX can be confusing to those who are not familiar with how flags work.