mov [reg32], 0 or and [reg32], 0?

Started by jj2007, May 14, 2011, 07:43:54 AM

jj2007

and [reg32], 0 is 3 bytes shorter than mov [reg32], 0, but theory suggests it should be slower, since and has to read the value from memory and then write it back, while mov only writes.
The evidence says it doesn't matter, but I am curious how this behaves on other CPUs.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
130     cycles for 100*mov
131     cycles for 100*and

130     cycles for 100*mov
131     cycles for 100*and
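
For reference, the 3-byte difference mentioned above comes from the immediate size: the and form can use a sign-extended 8-bit immediate, while mov to memory needs a full 32-bit one. A minimal sketch that just counts the bytes, assuming a dword operand and the pointer in eax (the register choice is an assumption for illustration; esp- or ebp-based addressing adds bytes to both forms):

```python
# Machine-code bytes for the two forms, assuming the pointer is in eax
# and the operand is a dword (register choice is an illustrative assumption).

# and dword ptr [eax], 0  ->  83 /4 ib  (opcode, ModRM, sign-extended imm8)
AND_MEM_ZERO = bytes([0x83, 0x20, 0x00])

# mov dword ptr [eax], 0  ->  C7 /0 id  (opcode, ModRM, full imm32)
MOV_MEM_ZERO = bytes([0xC7, 0x00, 0x00, 0x00, 0x00, 0x00])


def size_saving(short_form: bytes, long_form: bytes) -> int:
    """Return how many bytes the shorter encoding saves."""
    return len(long_form) - len(short_form)


print(len(AND_MEM_ZERO), len(MOV_MEM_ZERO),
      size_saving(AND_MEM_ZERO, MOV_MEM_ZERO))
```

The byte values can be checked against a listing file or a disassembler for the exact instruction in question.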

ERNST

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
112     cycles for 100*mov
132     cycles for 100*and

116     cycles for 100*mov
133     cycles for 100*and


--- ok ---

Neil

Intel(R) Core(TM)2 Quad   CPU    Q9550  @ 2.83GHz  (SSE4)
124      cycles for 100*mov
141      cycles for 100*and

123      cycles for 100*mov
120      cycles for 100*and

MichaelW

P3:

pre-P4 (SSE1)
130     cycles for 100*mov
177     cycles for 100*and

131     cycles for 100*mov
177     cycles for 100*and

eschew obfuscation

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
102     cycles for 100*mov
120     cycles for 100*and

101     cycles for 100*mov
121     cycles for 100*and


--- ok ---


I would be inclined to try a more complex test, as MOV may have some advantage in tight looping that AND does not. Try a timed framework that has enough other instructions in it, then try both out. Make sure you don't put instructions around it that stall, or you will get unreliable readings, as both MOV and AND will fill a hole left by a stall.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Thanks to everybody. Here is a version that checks the difference between a "real" loop and REPEAT 100. In any case, we are talking here about the rather hypothetical difference between 2.1 and 2.2 cycles, so in real-life apps it won't matter. Except that and occupies less space in the instruction cache...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
214     cycles for 100*mov, loop
224     cycles for 100*and, loop
130     cycles for 100*mov, REP
131     cycles for 100*and, REP
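
The "2.1 vs 2.2 cycles" figure is simply each total divided by the 100 instructions timed. A small sketch of that arithmetic, using the Celeron M numbers above; the names RESULTS and cycles_per_instruction are made up for this example, and the loop figures include loop overhead, so they are not pure instruction costs:

```python
# Cycle totals copied from the Celeron M results above (100 instructions each).
RESULTS = {
    "mov, loop": 214,
    "and, loop": 224,
    "mov, REP": 130,
    "and, REP": 131,
}
COUNT = 100  # instructions per timed run


def cycles_per_instruction(total_cycles: int, count: int = COUNT) -> float:
    """Average cycles per instruction for one timed run."""
    return total_cycles / count


for label, cycles in RESULTS.items():
    print(f"{label:10s} {cycles_per_instruction(cycles):.2f} cycles each")
```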

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
128     cycles for 100*mov, loop
209     cycles for 100*and, loop
100     cycles for 100*mov, REP
120     cycles for 100*and, REP

118     cycles for 100*mov, loop
209     cycles for 100*and, loop
95      cycles for 100*mov, REP
120     cycles for 100*and, REP


--- ok ---

sinsi

Am I the lone AMD?

AMD Phenom(tm) II X6 1100T Processor (SSE3)
278     cycles for 100*mov, loop
186     cycles for 100*and, loop
79      cycles for 100*mov, REP
129     cycles for 100*and, REP

184     cycles for 100*mov, loop
274     cycles for 100*and, loop
80      cycles for 100*mov, REP
130     cycles for 100*and, REP

Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
292     cycles for 100*mov, loop
337     cycles for 100*and, loop
188     cycles for 100*mov, REP
207     cycles for 100*and, REP

294     cycles for 100*mov, loop
334     cycles for 100*and, loop
186     cycles for 100*mov, REP
213     cycles for 100*and, REP

FORTRANS

   Strange.

pre-P4
1153   cycles for 100*mov, loop
413   cycles for 100*and, loop
1024   cycles for 100*mov, REP
407   cycles for 100*and, REP

1020   cycles for 100*mov, loop
416   cycles for 100*and, loop
1018   cycles for 100*mov, REP
409   cycles for 100*and, REP


--- ok ---

pre-P4 (SSE1)
308   cycles for 100*mov, loop
309   cycles for 100*and, loop
131   cycles for 100*mov, REP
178   cycles for 100*and, REP

308   cycles for 100*mov, loop
308   cycles for 100*and, loop
131   cycles for 100*mov, REP
178   cycles for 100*and, REP


--- ok ---

hutch--

JJ,

I had a quick play with two loops, one with AND, the other with MOV, and as soon as you start adding identical instructions to both loops the timings become close to identical. It is probably because both AND and MOV are preferred instructions that pair through the pipelines, so I would imagine they have very similar times in most contexts.

jj2007

Yes, they are almost identical. As stated initially, that is surprising: in theory, and [reg32] implies two actions, a read plus a write, while mov requires only one write. Since inc behaves very differently (see below), the reason for the "fast" and might be some special circuitry.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
215     cycles for 100*mov, loop
225     cycles for 100*and, loop
130     cycles for 100*mov, REP
132     cycles for 100*and, REP
642     cycles for 100*inc, loop
601     cycles for 100*inc, REP

mineiro

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
129     cycles for 100*mov, loop
209     cycles for 100*and, loop
101     cycles for 100*mov, REP
121     cycles for 100*and, REP
608     cycles for 100*inc, loop
595     cycles for 100*inc, REP

118     cycles for 100*mov, loop
209     cycles for 100*and, loop
97      cycles for 100*mov, REP
122     cycles for 100*and, REP
607     cycles for 100*inc, loop
595     cycles for 100*inc, REP
--- ok ---
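
To put the REPEAT results posted so far on one scale, here is a rough sketch that computes the and/mov ratio per CPU. It uses only the first run of each post, the CPU labels are shortened by hand, and the names REP_RESULTS and and_over_mov are invented for this example, so treat it as a convenience rather than a proper analysis:

```python
# (mov cycles, and cycles) for the "REP" variant, first run of each post above.
REP_RESULTS = {
    "Celeron M 420": (130, 131),
    "Core 2 Quad Q9650": (100, 120),
    "Phenom II X6 1100T": (79, 129),
    "Pentium 4 Prescott": (188, 207),
    "pre-P4 (SSE1)": (131, 178),
    "Pentium Dual E2160": (101, 121),
}


def and_over_mov(mov_cycles: int, and_cycles: int) -> float:
    """Ratio of and time to mov time; 1.0 means no difference."""
    return and_cycles / mov_cycles


for cpu, (mov_c, and_c) in REP_RESULTS.items():
    print(f"{cpu:20s} and/mov = {and_over_mov(mov_c, and_c):.2f}")
```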

hutch--

For a long time Intel have recommended using ADD over INC, and it's probably a case of putting a redundant instruction back to a lower priority in terms of die layout. The x86 instruction set that we see is in fact an interface to whatever lies below, which varies from one processor core to another, but it appears that they use a statistically derived instruction priority stacking that puts the most commonly used instructions a lot closer to the silicon and the less used ones back into the microcode. The very late Intel hardware is a lot faster with SSE2/3/4 than the earlier PIVs, and it seems that technology advances in die size are mainly being used for the SSE instruction sets, with only a subset of the integer instructions being in the fast lane.

jj2007

Quote from: hutch-- on May 15, 2011, 12:39:25 AM
For a long time Intel have recommended using ADD over INC

Add & inc behave identically on my Celeron:
640     cycles for 100*inc, loop
598     cycles for 100*inc, REP
640     cycles for 100*add, loop
598     cycles for 100*add, REP


But that's not the point. For an and mem, you need to know what is in mem, so you need to read it, and it, and write it back. That is not the case for mov mem, immediate: no read is necessary. That is why and mem should be slower than mov mem, but the evidence shows it isn't.