One is inclined to think that "twins" like movsx and movzx behave similarly...
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
141 cycles for cwde
95 cycles for movzx
120 cycles for movsx
Depends on who's watching
AMD Phenom(tm) II X6 1100T Processor (SSE3)
66 cycles for cwde
51 cycles for movzx
62 cycles for movsx
50 cycles for cwde
62 cycles for movzx
51 cycles for movsx
Intel(R) Core(TM) i3 CPU 550 @ 3.20GHz (SSE4)
64 cycles for cwde
45 cycles for movzx
63 cycles for movsx
58 cycles for cwde
63 cycles for movzx
45 cycles for movsx
64 cycles for cwde
44 cycles for movzx
63 cycles for movsx
58 cycles for cwde
63 cycles for movzx
44 cycles for movsx
64 cycles for cwde
44 cycles for movzx
63 cycles for movsx
--- ok ---
Thanks, Sinsi & six_L. My P4 consistently favours movzx; I will see tonight how the Celeron behaves.
Not that it matters: you rarely have a choice between the two instructions ;-)
P3:
pre-P4 (SSE1)
70 cycles for cwde
70 cycles for movzx
70 cycles for movsx
70 cycles for cwde
70 cycles for movzx
70 cycles for movsx
70 cycles for cwde
70 cycles for movzx
70 cycles for movsx
71 cycles for cwde
70 cycles for movzx
70 cycles for movsx
70 cycles for cwde
70 cycles for movzx
70 cycles for movsx
Hi,
P-III and two laptops.
Regards,
Steve N.
++ P-III
pre-P4 (SSE1)
70 cycles for cwde
71 cycles for movzx
71 cycles for movsx
72 cycles for cwde
71 cycles for movzx
71 cycles for movsx
71 cycles for cwde
71 cycles for movzx
71 cycles for movsx
71 cycles for cwde
71 cycles for movzx
71 cycles for movsx
71 cycles for cwde
71 cycles for movzx
71 cycles for movsx
--- ok ---
++ P-MMX
pre-P4
136 cycles for cwde
115 cycles for movzx
114 cycles for movsx
135 cycles for cwde
114 cycles for movzx
113 cycles for movsx
136 cycles for cwde
122 cycles for movzx
114 cycles for movsx
136 cycles for cwde
113 cycles for movzx
113 cycles for movsx
134 cycles for cwde
114 cycles for movzx
113 cycles for movsx
--- ok ---
++ P-4?
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
60 cycles for cwde
59 cycles for movzx
62 cycles for movsx
61 cycles for cwde
61 cycles for movzx
60 cycles for movsx
60 cycles for cwde
66 cycles for movzx
64 cycles for movsx
61 cycles for cwde
60 cycles for movzx
61 cycles for movsx
60 cycles for cwde
61 cycles for movzx
60 cycles for movsx
--- ok ---
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (SSE4)
43 cycles for cwde
26 cycles for movzx
57 cycles for movsx
31 cycles for cwde
43 cycles for movzx
27 cycles for movsx
44 cycles for cwde
26 cycles for movzx
56 cycles for movsx
45 cycles for cwde
59 cycles for movzx
27 cycles for movsx
43 cycles for cwde
26 cycles for movzx
56 cycles for movsx
--- ok ---
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
57 cycles for cwde
37 cycles for movzx
57 cycles for movsx
37 cycles for cwde
57 cycles for movzx
37 cycles for movsx
87 cycles for cwde
58 cycles for movzx
86 cycles for movsx
37 cycles for cwde
57 cycles for movzx
37 cycles for movsx
57 cycles for cwde
59 cycles for movzx
57 cycles for movsx
--- ok ---
AMD Sempron(tm) Processor 3100+ (SSE3)
73 cycles for cwde
68 cycles for movzx
68 cycles for movsx
73 cycles for cwde
68 cycles for movzx
68 cycles for movsx
73 cycles for cwde
68 cycles for movzx
68 cycles for movsx
75 cycles for cwde
68 cycles for movzx
68 cycles for movsx
74 cycles for cwde
68 cycles for movzx
69 cycles for movsx
Intel(R) Xeon(R) CPU E7520 @ 1.87GHz (SSE4)
104 cycles for cwde
78 cycles for movzx
104 cycles for movsx
102 cycles for cwde
102 cycles for movzx
78 cycles for movsx
105 cycles for cwde
78 cycles for movzx
104 cycles for movsx
102 cycles for cwde
103 cycles for movzx
78 cycles for movsx
105 cycles for cwde
78 cycles for movzx
105 cycles for movsx
--- ok ---
Thanks to everybody. What is really odd is that several CPUs show an alternating pattern. In contrast, my Celeron favours cwde and yields very stable timings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
59 cycles for cwde
62 cycles for movzx
62 cycles for movsx
59 cycles for cwde
62 cycles for movzx
62 cycles for movsx
59 cycles for cwde
62 cycles for movzx
62 cycles for movsx
i get unrepeatable results - but i always do - lol
to be fair, the CWDE test uses a MOV with a size-override
so, i added a CWDE test with a dword MOV
for comparison with MOVZX, i also added a test with AND EAX,0FFFFh :P
see reply #13 for attachment
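For readers who want to check what the added variants are supposed to compute, here is a minimal C sketch of the semantics only. It is not the attached test code, and the function names are made up for illustration. It models movzx against a dword load followed by AND EAX,0FFFFh, and movsx against CWDE, which sign-extends AX and ignores whatever was in the upper half of EAX.

```c
#include <assert.h>
#include <stdint.h>

/* movzx eax, word ptr [mem]: zero-extend 16 bits to 32 bits. */
uint32_t zero_extend16(uint16_t w) {
    return (uint32_t)w;
}

/* mov eax, dword ptr [mem] followed by and eax, 0FFFFh:
   load 32 bits, then clear the upper 16. */
uint32_t mask_low16(uint32_t d) {
    return d & 0xFFFFu;
}

/* movsx eax, word ptr [mem]: sign-extend 16 bits to 32 bits.
   The xor/subtract form avoids an implementation-defined cast. */
int32_t sign_extend16(uint16_t w) {
    return (int32_t)((w ^ 0x8000) - 0x8000);
}

/* cwde: sign-extend AX into EAX. The old upper half of EAX is ignored,
   so a word-sized MOV (stale upper half) and a dword MOV give the same result. */
int32_t cwde(uint32_t eax) {
    return sign_extend16((uint16_t)(eax & 0xFFFFu));
}
```

Since the results are identical, any timing difference between the paired variants should come from the encoding (the size-override prefix) or from how the CPU handles partial registers, not from the arithmetic.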
Dave,
All around 60 cycles, no winner. The and eax, 0FFFFh is sometimes slower but that could be outliers.
Hehe, it's enough to make you cry :bdg
AMD Turion(tm) 64 X2 Mobile Technology TL-52 (SSE3)
213 cycles for cwde
-69 cycles for movzx
-69 cycles for movsx
74 cycles for cwde
-69 cycles for movzx
345 cycles for movsx
75 cycles for cwde
69 cycles for movzx
69 cycles for movsx
75 cycles for cwde
69 cycles for movzx
75 cycles for movsx
-55 cycles for cwde
69 cycles for movzx
69 cycles for movsx
i got better results by restricting execution to a single core...
prescott w/htt:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
144 cycles for cwde (mov word)
126 cycles for cwde (mov dword)
113 cycles for and 0FFFFh
90 cycles for movzx
111 cycles for movsx
141 cycles for cwde (mov word)
147 cycles for cwde (mov dword)
127 cycles for and 0FFFFh
95 cycles for movzx
111 cycles for movsx
200 cycles for cwde (mov word)
125 cycles for cwde (mov dword)
116 cycles for and 0FFFFh
107 cycles for movzx
129 cycles for movsx
140 cycles for cwde (mov word)
126 cycles for cwde (mov dword)
111 cycles for and 0FFFFh
95 cycles for movzx
111 cycles for movsx
152 cycles for cwde (mov word)
146 cycles for cwde (mov dword)
125 cycles for and 0FFFFh
97 cycles for movzx
111 cycles for movsx
see attached...
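A likely explanation for the negative counts on the dual-core Turion, and for why pinning to one core helps, is that each core keeps its own time-stamp counter and the two are not synchronised. The C sketch below is a toy model under that assumption, not code from the attachment. The struct, the function names and the 500-cycle offset are all invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: two cores whose counters tick at the same rate
   but started from different values. */
typedef struct {
    int64_t tsc_offset[2];   /* per-core counter offset, in cycles */
} two_core_model;

/* Counter value read on `core` at true time `now`. */
int64_t read_tsc(const two_core_model *m, int core, int64_t now) {
    return now + m->tsc_offset[core];
}

/* Elapsed cycles the way a timing test computes them:
   end reading minus start reading. */
int64_t measured_cycles(const two_core_model *m,
                        int start_core, int end_core,
                        int64_t start, int64_t end) {
    return read_tsc(m, end_core, end) - read_tsc(m, start_core, start);
}
```

With an offset of -500 on the second core, a 70-cycle block measures 70 if the thread stays put, -430 if it migrates from core 0 to core 1 between the two reads, and 570 in the other direction. On Windows the usual way to keep the thread on one core is SetThreadAffinityMask (or SetProcessAffinityMask) called before the timing starts.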
Try adding about a 3 second delay at the start of the code to allow time for the system activities involved in launching an app to finish.
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
148 cycles for cwde (mov word)
218 cycles for cwde (mov dword)
299 cycles for and 0FFFFh
257 cycles for movzx
126 cycles for movsx
147 cycles for cwde (mov word)
147 cycles for cwde (mov dword)
148 cycles for and 0FFFFh
152 cycles for movzx
154 cycles for movsx
210 cycles for cwde (mov word)
177 cycles for cwde (mov dword)
178 cycles for and 0FFFFh
152 cycles for movzx
191 cycles for movsx
230 cycles for cwde (mov word)
222 cycles for cwde (mov dword)
221 cycles for and 0FFFFh
193 cycles for movzx
191 cycles for movsx
225 cycles for cwde (mov word)
222 cycles for cwde (mov dword)
255 cycles for and 0FFFFh
193 cycles for movzx
197 cycles for movsx
by increasing LOOP_COUNT from 1,000,000 to 10,000,000, i get considerably more repeatable results
this value makes each test ~0.5 seconds
the culprit seems to be the CPUID instructions used to serialize
i guess we already knew that :P
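To put numbers on why the larger LOOP_COUNT helps, here is a small C sketch of the arithmetic. It assumes the harness reports (total cycles minus loop overhead) divided by LOOP_COUNT, and it treats the variable cost of the serializing CPUID as a one-off disturbance per measurement. Both are assumptions, and the function names and sample figures are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Reported figure under the stated assumption: remove the cost of the
   empty timing loop, then divide by the number of iterations. */
double cycles_per_iteration(uint64_t total, uint64_t overhead,
                            uint64_t loop_count) {
    return ((double)total - (double)overhead) / (double)loop_count;
}

/* How much a fixed one-off disturbance (for example a slow CPUID)
   shifts the per-iteration figure. */
double per_iteration_error(uint64_t disturbance, uint64_t loop_count) {
    return (double)disturbance / (double)loop_count;
}
```

For example, a disturbance of 500,000 cycles shifts the result by 0.5 cycles at 1,000,000 iterations but only by 0.05 at 10,000,000. As a sanity check on the ~0.5 second figure: 1.5 billion cycles over 10,000,000 iterations is 150 cycles per iteration, which is in the range the 3 GHz P4 reports.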