Sometimes you need the full FPU stack, and if you call the str$ macro or similar routines in between, you may discover that FPU regs ST(5) ... ST(7) are lost.
The fsave/frstor combi is designed to take care of that scenario, but it is slow. Since MasmBasic (http://www.masm32.com/board/index.php?topic=12460) Str$ does indeed trash two FPU regs, ST(6) and ST(7), I needed a way to save them in specific cases:
FpuSave ; no args means save 2 regs: ST(6) and ST(7)
Print Str$("This is a test with eax=%i", eax) ; Str$ trashes ST(6) and ST(7)
FpuRestore ; get ST(6) and ST(7) back, correct the stack
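For comparison, the slow fsave/frstor route mentioned above looks roughly like this. It is a minimal sketch, assuming 32-bit protected mode (where the saved state image is 108 bytes); it is not the exact code used in the timings.

```asm
; Sketch only: full FPU state save around code that clobbers FPU registers.
; Assumes 32-bit protected mode, where the fsave image is 108 bytes.
    sub esp, 108        ; room for the complete FPU state (a multiple of 4)
    fsave [esp]         ; store the state, then reinitialise the FPU
    ; ... code that may trash any FPU register goes here ...
    frstor [esp]        ; reload control, status and tag words plus all 8 regs
    add esp, 108        ; release the buffer
```

Note that fsave leaves the FPU in its initialised state, so everything between the two instructions starts with an empty register stack.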
For saving 4 FPU regs, the macros are about a factor of 15 faster than the fsave/frstor combination (1193 vs. 81 cycles on the P4 below).
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
1193 cycles for fsave/frstor
81 cycles for FpuSave 4
60 cycles for FpuSave 3
41 cycles for FpuSave 2
32 cycles for FpuSave 1
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
298 cycles for fsave/frstor
25 cycles for FpuSave 4
19 cycles for FpuSave 3
16 cycles for FpuSave 2
9 cycles for FpuSave 1
Source attached (MB not required, it's plain Masm32).
EDIT: Attachment removed, see new version below qWord's warning.
Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
376 cycles for fsave/frstor
63 cycles for FpuSave 4
43 cycles for FpuSave 3
25 cycles for FpuSave 2
6 cycles for FpuSave 1
435 cycles for fsave/frstor
65 cycles for FpuSave 4
44 cycles for FpuSave 3
25 cycles for FpuSave 2
8 cycles for FpuSave 1
AMD Phenom(tm) II X6 1055T Processor (SSE3)
286 cycles for fsave/frstor
63 cycles for FpuSave 4
47 cycles for FpuSave 3
35 cycles for FpuSave 2
30 cycles for FpuSave 1
285 cycles for fsave/frstor
56 cycles for FpuSave 4
43 cycles for FpuSave 3
41 cycles for FpuSave 2
30 cycles for FpuSave 1
hi,
your macros misalign the stack for all odd values of numregs. It should be N*16 instead of N*10.
Quote from: qWord on February 09, 2012, 06:58:00 PM
hi,
your macros misalign the stack for all odd values of numregs. It should be N*16 instead of N*10.
Thanks, qWord. You are absolutely right. It could be N*12, too, but *10 is definitely a fat bug :red
New version attached. The MasmBasic library (http://www.masm32.com/board/index.php?topic=12460.0) has also been updated, as MasmBasic9February2012b.zip - "b" like "bug free" :bg
By the way:
FpuSave MACRO numregs:=<2>
LOCAL ct
FpuSaveRegs=numregs
FpuSaveSize=10 ; <<<<<<<<<<<<<<<< bad size
...
.code
FillFPU
FpuSave 3
MsgBox 0, "You won't see this", "Hi", MB_OK
FpuRestore
At least on Win XP, that shows a crippled MessageBox, due to the misaligned stack. And if you check in Olly, you will see that a simple MessageBox trashes all FPU regs :snooty:
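To illustrate the alignment point, here is a hand-written partial save for three registers that keeps esp dword-aligned. It is only a sketch under stated assumptions: 12-byte slots (the N*12 variant mentioned above), and a called routine that leaves the three freed registers empty and ST(0)..ST(4) untouched. It is not the macro source from the attachment.

```asm
; Sketch only, not the macro from the attachment.
; Saves ST(5)..ST(7) in 12-byte slots so that esp stays dword-aligned
; (3*10 = 30 would leave esp off by 2; 3*12 = 36 does not).
    sub esp, 3*12               ; three 10-byte values, padded to 12 bytes each
    fdecstp                     ; rotate TOP down three times:
    fdecstp                     ;   the old ST(5) becomes ST(0),
    fdecstp                     ;   ST(6) -> ST(1), ST(7) -> ST(2)
    fstp tbyte ptr [esp]        ; store old ST(5) and mark it empty
    fstp tbyte ptr [esp+12]     ; store old ST(6)
    fstp tbyte ptr [esp+24]     ; store old ST(7); TOP is back where it started
    ; ... call the routine that needs free FPU registers here ...
    ; (assumed to leave these three registers empty on return)
    fld tbyte ptr [esp+24]      ; reload in reverse order: old ST(7) first,
    fld tbyte ptr [esp+12]      ;   then old ST(6),
    fld tbyte ptr [esp]         ;   then old ST(5)
    fincstp                     ; rotate TOP back up three times so the
    fincstp                     ;   original ST(0)..ST(4) are in place again
    fincstp
    add esp, 3*12               ; release the slots
```

With 16-byte slots (N*16) the same sequence works with offsets 0, 16 and 32 and a total of 3*16 bytes.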
Intel(R) Pentium(R) D CPU 3.00GHz (SSE3)
1099 cycles for fsave/frstor
79 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
1104 cycles for fsave/frstor
79 cycles for FpuSave 4
111 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
1103 cycles for fsave/frstor
113 cycles for FpuSave 4
59 cycles for FpuSave 3
42 cycles for FpuSave 2
32 cycles for FpuSave 1
1007 cycles for fsave/frstor
79 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
999 cycles for fsave/frstor
79 cycles for FpuSave 4
61 cycles for FpuSave 3
106 cycles for FpuSave 2
32 cycles for FpuSave 1
AMD Phenom(tm) II X6 1055T Processor (SSE3)
285 cycles for fsave/frstor
34 cycles for FpuSave 4
23 cycles for FpuSave 3
37 cycles for FpuSave 2
16 cycles for FpuSave 1
285 cycles for fsave/frstor
35 cycles for FpuSave 4
49 cycles for FpuSave 3
17 cycles for FpuSave 2
15 cycles for FpuSave 1
286 cycles for fsave/frstor
71 cycles for FpuSave 4
24 cycles for FpuSave 3
17 cycles for FpuSave 2
15 cycles for FpuSave 1
286 cycles for fsave/frstor
34 cycles for FpuSave 4
24 cycles for FpuSave 3
16 cycles for FpuSave 2
27 cycles for FpuSave 1
285 cycles for fsave/frstor
34 cycles for FpuSave 4
24 cycles for FpuSave 3
39 cycles for FpuSave 2
15 cycles for FpuSave 1
Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
348 cycles for fsave/frstor
58 cycles for FpuSave 4
45 cycles for FpuSave 3
22 cycles for FpuSave 2
3 cycles for FpuSave 1
437 cycles for fsave/frstor
62 cycles for FpuSave 4
43 cycles for FpuSave 3
25 cycles for FpuSave 2
7 cycles for FpuSave 1
403 cycles for fsave/frstor
41 cycles for FpuSave 4
46 cycles for FpuSave 3
24 cycles for FpuSave 2
7 cycles for FpuSave 1
384 cycles for fsave/frstor
63 cycles for FpuSave 4
38 cycles for FpuSave 3
21 cycles for FpuSave 2
6 cycles for FpuSave 1
435 cycles for fsave/frstor
63 cycles for FpuSave 4
44 cycles for FpuSave 3
24 cycles for FpuSave 2
8 cycles for FpuSave 1
AMD E-350 Processor (SSE4)
344 cycles for fsave/frstor
56 cycles for FpuSave 4
43 cycles for FpuSave 3
41 cycles for FpuSave 2
19 cycles for FpuSave 1
281 cycles for fsave/frstor
56 cycles for FpuSave 4
50 cycles for FpuSave 3
26 cycles for FpuSave 2
19 cycles for FpuSave 1
281 cycles for fsave/frstor
67 cycles for FpuSave 4
43 cycles for FpuSave 3
26 cycles for FpuSave 2
19 cycles for FpuSave 1
282 cycles for fsave/frstor
57 cycles for FpuSave 4
43 cycles for FpuSave 3
26 cycles for FpuSave 2
27 cycles for FpuSave 1
276 cycles for fsave/frstor
56 cycles for FpuSave 4
43 cycles for FpuSave 3
41 cycles for FpuSave 2
19 cycles for FpuSave 1
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
1108 cycles for fsave/frstor
113 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
1105 cycles for fsave/frstor
79 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
106 cycles for FpuSave 1
1104 cycles for fsave/frstor
79 cycles for FpuSave 4
59 cycles for FpuSave 3
40 cycles for FpuSave 2
32 cycles for FpuSave 1
1009 cycles for fsave/frstor
81 cycles for FpuSave 4
113 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
1016 cycles for fsave/frstor
114 cycles for FpuSave 4
59 cycles for FpuSave 3
41 cycles for FpuSave 2
32 cycles for FpuSave 1
Thanks, this should come in very handy and speed up my debugging, especially since I'm almost always working with the FPU. :U
Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz (SSE4)
381 cycles for fsave/frstor
29 cycles for FpuSave 4
20 cycles for FpuSave 3
14 cycles for FpuSave 2
4 cycles for FpuSave 1
356 cycles for fsave/frstor
30 cycles for FpuSave 4
31 cycles for FpuSave 3
12 cycles for FpuSave 2
9 cycles for FpuSave 1
363 cycles for fsave/frstor
49 cycles for FpuSave 4
20 cycles for FpuSave 3
25 cycles for FpuSave 2
5 cycles for FpuSave 1
346 cycles for fsave/frstor
39 cycles for FpuSave 4
20 cycles for FpuSave 3
15 cycles for FpuSave 2
11 cycles for FpuSave 1
337 cycles for fsave/frstor
28 cycles for FpuSave 4
32 cycles for FpuSave 3
28 cycles for FpuSave 2
4 cycles for FpuSave 1
AMD A6-3400M APU with Radeon(tm) HD Graphics (SSE3)
233 cycles for fsave/frstor
60 cycles for FpuSave 4
40 cycles for FpuSave 3
64 cycles for FpuSave 2
27 cycles for FpuSave 1
228 cycles for fsave/frstor
35 cycles for FpuSave 4
85 cycles for FpuSave 3
12 cycles for FpuSave 2
10 cycles for FpuSave 1
190 cycles for fsave/frstor
41 cycles for FpuSave 4
27 cycles for FpuSave 3
29 cycles for FpuSave 2
30 cycles for FpuSave 1
230 cycles for fsave/frstor
59 cycles for FpuSave 4
39 cycles for FpuSave 3
29 cycles for FpuSave 2
51 cycles for FpuSave 1
226 cycles for fsave/frstor
61 cycles for FpuSave 4
41 cycles for FpuSave 3
66 cycles for FpuSave 2
30 cycles for FpuSave 1
Intel(R) Atom(TM) CPU N450 @ 1.66GHz (SSE4)
600 cycles for fsave/frstor
111 cycles for FpuSave 4
79 cycles for FpuSave 3
51 cycles for FpuSave 2
27 cycles for FpuSave 1
552 cycles for fsave/frstor
103 cycles for FpuSave 4
96 cycles for FpuSave 3
49 cycles for FpuSave 2
28 cycles for FpuSave 1
549 cycles for fsave/frstor
118 cycles for FpuSave 4
76 cycles for FpuSave 3
53 cycles for FpuSave 2
25 cycles for FpuSave 1
512 cycles for fsave/frstor
104 cycles for FpuSave 4
75 cycles for FpuSave 3
50 cycles for FpuSave 2
22 cycles for FpuSave 1
514 cycles for fsave/frstor
103 cycles for FpuSave 4
77 cycles for FpuSave 3
82 cycles for FpuSave 2
28 cycles for FpuSave 1
Clive,
Are you running a lab??
:bg
Quote from: clive on February 11, 2012, 03:40:19 AM
Intel(R) Pentium(R) D CPU 3.00GHz (SSE3)
1099 cycles for fsave/frstor
79 cycles for FpuSave 4
AMD Phenom(tm) II X6 1055T Processor (SSE3)
285 cycles for fsave/frstor
34 cycles for FpuSave 4
AMD A6-3400M APU with Radeon(tm) HD Graphics (SSE3)
233 cycles for fsave/frstor
60 cycles for FpuSave 4
Intel(R) Atom(TM) CPU N450 @ 1.66GHz (SSE4)
600 cycles for fsave/frstor
111 cycles for FpuSave 4
Intel(R) Pentium(R) D CPU 3.20GHz (SSE3)
1095 cycles for fsave/frstor
112 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
1095 cycles for fsave/frstor
79 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
109 cycles for FpuSave 1
1089 cycles for fsave/frstor
79 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
999 cycles for fsave/frstor
79 cycles for FpuSave 4
113 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
999 cycles for fsave/frstor
118 cycles for FpuSave 4
59 cycles for FpuSave 3
39 cycles for FpuSave 2
32 cycles for FpuSave 1
--- ok ---
Quote from: jj2007
Clive, Are you running a lab??
Actually it's more like a Bat Cave than a Lab.
I do have a diverse collection of hardware at my disposal, of which this is a subset. I ignored the ones which duplicated the results already provided, for example the AMD E-350 and C-50 systems are identical performers, as are the 4 and 6 core Phenom II's.
Timing across a wide spectrum was what you were after, right?
Quote from: clive on February 11, 2012, 02:07:09 PM
Timing across a wide spectrum was what you were after, right?
Yes, thanks :thumbu
Hi,
Here are some more.
Cheers,
Steve N.
{P-III}
pre-P4 (SSE1)
211 cycles for fsave/frstor
51 cycles for FpuSave 4
16 cycles for FpuSave 3
12 cycles for FpuSave 2
4 cycles for FpuSave 1
195 cycles for fsave/frstor
23 cycles for FpuSave 4
21 cycles for FpuSave 3
9 cycles for FpuSave 2
19 cycles for FpuSave 1
198 cycles for fsave/frstor
24 cycles for FpuSave 4
16 cycles for FpuSave 3
22 cycles for FpuSave 2
4 cycles for FpuSave 1
183 cycles for fsave/frstor
23 cycles for FpuSave 4
49 cycles for FpuSave 3
9 cycles for FpuSave 2
13 cycles for FpuSave 1
198 cycles for fsave/frstor
58 cycles for FpuSave 4
16 cycles for FpuSave 3
13 cycles for FpuSave 2
4 cycles for FpuSave 1
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
227 cycles for fsave/frstor
32 cycles for FpuSave 4
15 cycles for FpuSave 3
14 cycles for FpuSave 2
3 cycles for FpuSave 1
227 cycles for fsave/frstor
20 cycles for FpuSave 4
15 cycles for FpuSave 3
8 cycles for FpuSave 2
18 cycles for FpuSave 1
230 cycles for fsave/frstor
20 cycles for FpuSave 4
15 cycles for FpuSave 3
14 cycles for FpuSave 2
3 cycles for FpuSave 1
212 cycles for fsave/frstor
20 cycles for FpuSave 4
31 cycles for FpuSave 3
8 cycles for FpuSave 2
10 cycles for FpuSave 1
212 cycles for fsave/frstor
33 cycles for FpuSave 4
15 cycles for FpuSave 3
15 cycles for FpuSave 2
3 cycles for FpuSave 1
--- ok ---
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
212 cycles for fsave/frstor
32 cycles for FpuSave 4
15 cycles for FpuSave 3
15 cycles for FpuSave 2
3 cycles for FpuSave 1
214 cycles for fsave/frstor
20 cycles for FpuSave 4
15 cycles for FpuSave 3
8 cycles for FpuSave 2
17 cycles for FpuSave 1
215 cycles for fsave/frstor
20 cycles for FpuSave 4
15 cycles for FpuSave 3
13 cycles for FpuSave 2
3 cycles for FpuSave 1
197 cycles for fsave/frstor
20 cycles for FpuSave 4
29 cycles for FpuSave 3
8 cycles for FpuSave 2
10 cycles for FpuSave 1
198 cycles for fsave/frstor
34 cycles for FpuSave 4
15 cycles for FpuSave 3
15 cycles for FpuSave 2
3 cycles for FpuSave 1
--- ok ---
{P-MMX}
pre-P4
181 cycles for fsave/frstor
86 cycles for FpuSave 4
34 cycles for FpuSave 3
42 cycles for FpuSave 2
13 cycles for FpuSave 1
181 cycles for fsave/frstor
44 cycles for FpuSave 4
61 cycles for FpuSave 3
24 cycles for FpuSave 2
22 cycles for FpuSave 1
180 cycles for fsave/frstor
80 cycles for FpuSave 4
33 cycles for FpuSave 3
41 cycles for FpuSave 2
13 cycles for FpuSave 1
179 cycles for fsave/frstor
43 cycles for FpuSave 4
61 cycles for FpuSave 3
23 cycles for FpuSave 2
22 cycles for FpuSave 1
180 cycles for fsave/frstor
80 cycles for FpuSave 4
33 cycles for FpuSave 3
42 cycles for FpuSave 2
13 cycles for FpuSave 1
--- ok ---
AMD Athlon(tm) II X2 215 Processor (SSE3)
384 cycles for fsave/frstor
116 cycles for FpuSave 4
80 cycles for FpuSave 3
124 cycles for FpuSave 2
53 cycles for FpuSave 1
398 cycles for fsave/frstor
118 cycles for FpuSave 4
161 cycles for FpuSave 3
58 cycles for FpuSave 2
52 cycles for FpuSave 1
407 cycles for fsave/frstor
222 cycles for FpuSave 4
79 cycles for FpuSave 3
57 cycles for FpuSave 2
52 cycles for FpuSave 1
388 cycles for fsave/frstor
94 cycles for FpuSave 4
81 cycles for FpuSave 3
56 cycles for FpuSave 2
93 cycles for FpuSave 1
428 cycles for fsave/frstor
115 cycles for FpuSave 4
80 cycles for FpuSave 3
132 cycles for FpuSave 2
52 cycles for FpuSave 1
Thanks to everybody :U
I think we can leave it "as is" - the timings show clearly enough that partial saving is a lot faster than fsave/frstor. It will hardly find a useful application in speeding up an innermost loop, but OK, we learnt something ;-)