and [reg32], 0 is 3 bytes shorter than mov [reg32], 0, but in theory it should be slower, since and has to both read the value from memory and then write it back.
The evidence says it doesn't matter, but I am curious how it behaves on other CPUs.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
130 cycles for 100*mov
131 cycles for 100*and
130 cycles for 100*mov
131 cycles for 100*and
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
112 cycles for 100*mov
132 cycles for 100*and
116 cycles for 100*mov
133 cycles for 100*and
--- ok ---
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
124 cycles for 100*mov
141 cycles for 100*and
123 cycles for 100*mov
120 cycles for 100*and
P3:
pre-P4 (SSE1)
130 cycles for 100*mov
177 cycles for 100*and
131 cycles for 100*mov
177 cycles for 100*and
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
102 cycles for 100*mov
120 cycles for 100*and
101 cycles for 100*mov
121 cycles for 100*and
--- ok ---
I would be inclined to try a more complex test, as MOV may have some advantage in tight loops that AND does not. Try a timed framework that has enough other instructions in it, then try both out. Make sure you don't use other instructions around it that stall, or you will get unreliable readings, as both MOV and AND will fill a hole left by a stall.
Thanks to everybody. Here is a version that checks the difference between a "real" loop and REPEAT 100. In any case, we are talking here about the rather hypothetical difference between 2.1 and 2.2 cycles, so in real-life apps it won't matter. Except that and occupies less space in the instruction cache...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
214 cycles for 100*mov, loop
224 cycles for 100*and, loop
130 cycles for 100*mov, REP
131 cycles for 100*and, REP
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
128 cycles for 100*mov, loop
209 cycles for 100*and, loop
100 cycles for 100*mov, REP
120 cycles for 100*and, REP
118 cycles for 100*mov, loop
209 cycles for 100*and, loop
95 cycles for 100*mov, REP
120 cycles for 100*and, REP
--- ok ---
Am I the lone AMD?
AMD Phenom(tm) II X6 1100T Processor (SSE3)
278 cycles for 100*mov, loop
186 cycles for 100*and, loop
79 cycles for 100*mov, REP
129 cycles for 100*and, REP
184 cycles for 100*mov, loop
274 cycles for 100*and, loop
80 cycles for 100*mov, REP
130 cycles for 100*and, REP
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
292 cycles for 100*mov, loop
337 cycles for 100*and, loop
188 cycles for 100*mov, REP
207 cycles for 100*and, REP
294 cycles for 100*mov, loop
334 cycles for 100*and, loop
186 cycles for 100*mov, REP
213 cycles for 100*and, REP
Strange.
pre-P4
1153 cycles for 100*mov, loop
413 cycles for 100*and, loop
1024 cycles for 100*mov, REP
407 cycles for 100*and, REP
1020 cycles for 100*mov, loop
416 cycles for 100*and, loop
1018 cycles for 100*mov, REP
409 cycles for 100*and, REP
--- ok ---
pre-P4 (SSE1)
308 cycles for 100*mov, loop
309 cycles for 100*and, loop
131 cycles for 100*mov, REP
178 cycles for 100*and, REP
308 cycles for 100*mov, loop
308 cycles for 100*and, loop
131 cycles for 100*mov, REP
178 cycles for 100*and, REP
--- ok ---
JJ,
I had a quick play with 2 loops, one with AND, the other with MOV, and as soon as you start adding identical instructions to both loops, the timing becomes close enough to identical. It is probably because both AND and MOV are preferred instructions that pair through pipelines, so I would imagine they have very similar times in most contexts.
Yes, they are almost identical, which is surprising, as initially stated: in theory and [reg32] implies two actions, a read plus a write, while mov requires only one write. Since inc behaves very differently (see below), the reason for the "fast" and might be some special circuitry.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
215 cycles for 100*mov, loop
225 cycles for 100*and, loop
130 cycles for 100*mov, REP
132 cycles for 100*and, REP
642 cycles for 100*inc, loop
601 cycles for 100*inc, REP
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
129 cycles for 100*mov, loop
209 cycles for 100*and, loop
101 cycles for 100*mov, REP
121 cycles for 100*and, REP
608 cycles for 100*inc, loop
595 cycles for 100*inc, REP
118 cycles for 100*mov, loop
209 cycles for 100*and, loop
97 cycles for 100*mov, REP
122 cycles for 100*and, REP
607 cycles for 100*inc, loop
595 cycles for 100*inc, REP
--- ok ---
For a long time Intel have recommended using ADD over INC, and it's probably a case of putting a redundant instruction back to a lower priority in terms of die layout. The x86 instruction set that we see is in fact an interface to whatever lies below, which varies from one processor core to another, but it appears that they use a statistically derived instruction priority stacking that puts the most commonly used instructions a lot closer to the silicon and the less used ones back into the microcode. The very late Intel hardware is a lot faster with SSE2/3/4 than the earlier PIVs, and it seems that technology advances in die size are mainly being used for the SSE instruction sets, with only a subset of the integer instructions being in the fast lane.
Quote from: hutch-- on May 15, 2011, 12:39:25 AM
For a long time Intel have recommended using ADD over INC
Add & inc behave identically on my Celeron:
640 cycles for 100*inc, loop
598 cycles for 100*inc, REP
640 cycles for 100*add, loop
598 cycles for 100*add, REP
But that's not the point. For an and mem you need to know what is in mem, so you need to read it, and it, write it. That is not the case for a mov mem, immediate - no read necessary. That is why and mem should be slower than mov mem, but evidence shows it isn't slower.
You will probably find that, at the bit level, a copy bit occurs at about the same speed as a modify bit, so if the source is an immediate in both, they should take a similar amount of time if they have similar circuitry in hardware.
Intel(R) Atom(TM) CPU N475 @ 1.83GHz (SSE4)
430 cycles for 100*mov, loop
424 cycles for 100*and, loop
412 cycles for 100*mov, REP
407 cycles for 100*and, REP
424 cycles for 100*mov, loop
424 cycles for 100*and, loop
410 cycles for 100*mov, REP
411 cycles for 100*and, REP
AMD Phenom(tm) II X6 1055T Processor (SSE3)
308 cycles for 100*mov, loop
208 cycles for 100*and, loop
88 cycles for 100*mov, REP
146 cycles for 100*and, REP
696 cycles for 100*inc, loop
695 cycles for 100*inc, REP
207 cycles for 100*mov, loop
307 cycles for 100*and, loop
88 cycles for 100*mov, REP
146 cycles for 100*and, REP
695 cycles for 100*inc, loop
695 cycles for 100*inc, REP