Quote
GetPercent proc source:DWORD, percent:DWORD
LOCAL var1:DWORD
mov var1, 100 ; to divide by 100
fild source ; load source integer
fild var1 ; load 100
fdiv ; divide source by 100
fild percent ; load required percentage
fmul ; multiply 1% by required percentage
fistp var1 ; store result in variable
mov eax, var1 ;FPU STACK is +2 HERE and can't be used as this
FINIT ;---------- <<<<<<<<<<<< Added correction needed file getpcnt.asm
ret
GetPercent endp
I can't find any problem. The FDIV and FMUL are actually encoded as FDIVP and FMULP (or at least they are if I assemble with ML 6.15).
These instructions aren't the problem, I agree. If you have read the comments, the problem is:
Quote
THE FPU STACK IS AT +2 AT THE END OF THE FUNCTION, NOT 0
The function can't be called again without an ERROR.
The FPU stack is empty after the FISTP. I can call the procedure repeatedly without problems.
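The stack bookkeeping behind this can be sketched in C. This is only a model, not code from the thread: the function names are made up for illustration, and it assumes, as stated above, that the assembler encodes the no-operand FDIV and FMUL as the popping FDIVP and FMULP.

```c
#include <stddef.h>
#include <string.h>

/* Net change in x87 stack depth per mnemonic, under the assumption that
   the no-operand FDIV/FMUL are assembled as the popping FDIVP/FMULP. */
static int depth_delta(const char *mnemonic)
{
    if (strcmp(mnemonic, "fild") == 0)  return +1; /* push an integer     */
    if (strcmp(mnemonic, "fdiv") == 0)  return -1; /* encoded as fdivp    */
    if (strcmp(mnemonic, "fmul") == 0)  return -1; /* encoded as fmulp    */
    if (strcmp(mnemonic, "fistp") == 0) return -1; /* store and pop       */
    return 0;                                      /* non-FPU instruction */
}

/* Sum the depth changes over a whole instruction sequence. */
int final_stack_depth(const char *const *code, size_t count)
{
    int depth = 0;
    for (size_t i = 0; i < count; ++i)
        depth += depth_delta(code[i]);
    return depth;
}

/* The FPU instructions of the original GetPercent, in order. */
int get_percent_stack_depth(void)
{
    static const char *const code[] = {
        "fild", "fild", "fdiv", "fild", "fmul", "fistp"
    };
    return final_stack_depth(code, sizeof code / sizeof code[0]);
}
```

Under that assumption get_percent_stack_depth() comes out at 0. If FDIV and FMUL did not pop, the same count would end at +2, which is the figure the disputed comment arrives at.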
Michael is right, the problem exists only in your imagination. However, GetPercent can certainly be optimised a little bit: 20 bytes instead of 36:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
GetPercent proc source:DWORD, percent:DWORD
fild dword ptr [esp+4] ; load source integer
push 100
fidiv dword ptr [esp] ; divide source by 100
fimul dword ptr [esp+12] ; multiply 1% by required percentage
fistp dword ptr [esp] ; store result on stack
pop eax ; return in eax
ret 8
GetPercent endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Perhaps you can make an effort: count on your fingers
Quote
mov var1, 100 ; to divide by 100
fild source ; stack +1
fild var1 ; stack +1
fdiv ;
fild percent ; stack +1
fmul ;
fistp var1 ;stack -1
mov eax, var1
FINIT ;---------- <<<<<<<<<<<< Added correction needed file getpcnt.asm
ret
3-1 = 2, not 0
If you want more proof, try a loop with 4 or 5 calls of the function
Quote from: ToutEnMasm on November 07, 2010, 01:44:11 PM
Perhaps you can make an effort: count on your fingers
Perhaps you can make an effort: read Michael's comments ("The FDIV and FMUL are actually encoded as FDIVP and FMULP"), or launch Olly to see for yourself.
And still, it could be optimised :bg
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
27 cycles for GetPercent
11 cycles for GetPercentJJ1
9 cycles for GetPercentJJ2
32 cycles for GetPercent
11 cycles for GetPercentJJ1
9 cycles for GetPercentJJ2
32 cycles for GetPercent
11 cycles for GetPercentJJ1
9 cycles for GetPercentJJ2
Code sizes:
36 for GetPercent
32 for GetPercentJJ1
32 for GetPercentJJ2
he may be using a different assembler :P
might try the explicit...
fmul st0,st1
Quote
The FDIV and FMUL are actually encoded as FDIVP and FMULP"), or launch Olly to see yourself.
And what do you do if you have to work with an old compiler?
ML 6.14, 6.15, 9.0 and JWasm all exhibit this behaviour. If you have a "compiler" older than 6.14, move it into the dustbin.
Hi,
CPU and FPU do not matter. It is the assembler that
encodes the implied POP. It is a matter of convention in
MASM from version 1.0 onwards. Confuses some more
than others, and is only used with the no-operand form.
IIRC.
Regards,
Steve N.
i wasn't thinking he may be using an older masm
but, maybe tasm or fasm or some other creature
not sure how GoAsm handles it - knowing Jeremy, it is probably masm-compatible
Just for fun, the non-FPU integer (JJ3) and SSE2 versions:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
27 cycles for GetPercent
10 cycles for GetPercentJJ1
9 cycles for GetPercentJJ2
35 cycles for GetPercentJJ3
11 cycles for GetPercentJJ4
7 cycles for GetPercentSSE
Code sizes:
32 for GetPercentJJ1
32 for GetPercentJJ2
17 for GetPercentJJ3
33 for GetPercentJJ4
39 for GetPercentSSE
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
38 cycles for GetPercent
24 cycles for GetPercentJJ1
20 cycles for GetPercentJJ2
46 cycles for GetPercentJJ3
21 cycles for GetPercentJJ4
18 cycles for GetPercentSSE
43 cycles for GetPercent
21 cycles for GetPercentJJ1
18 cycles for GetPercentJJ2
42 cycles for GetPercentJJ3
22 cycles for GetPercentJJ4
21 cycles for GetPercentSSE
Atom
Quote
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
117 cycles for GetPercent
51 cycles for GetPercentJJ1
46 cycles for GetPercentJJ2
100 cycles for GetPercentJJ3
50 cycles for GetPercentJJ4
40 cycles for GetPercentSSE
125 cycles for GetPercent
48 cycles for GetPercentJJ1
46 cycles for GetPercentJJ2
87 cycles for GetPercentJJ3
51 cycles for GetPercentJJ4
43 cycles for GetPercentSSE
Code sizes:
32 for GetPercentJJ1
32 for GetPercentJJ2
17 for GetPercentJJ3
33 for GetPercentJJ4
39 for GetPercentSSE
I wrote this years ago to plug a simple requirement, things like integer sizing for screen display, and it has never been a performance issue where it was normally used. Now I have no doubt that it can be optimised, but in the context of its use it's a "who cares" issue.
Now I understand what Yves has said but the procedure has always handled high call counts with no problems at all. Here is a test piece that calls it in a loop 1 million times.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
push esi
mov esi, 1000000
lbl0:
print str$(rv(GetPercent,esi,50)),13,10
sub esi, 1
jnz lbl0
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
:bg
JJ,
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
14 cycles for GetPercent
6 cycles for GetPercentJJ1
5 cycles for GetPercentJJ2
16 cycles for GetPercent
6 cycles for GetPercentJJ1
5 cycles for GetPercentJJ2
16 cycles for GetPercent
6 cycles for GetPercentJJ1
5 cycles for GetPercentJJ2
Code sizes:
32 for GetPercentJJ1
32 for GetPercentJJ2
--- ok ---
Removing the FDIV certainly improves the time. :P
Quote from: hutch-- on November 08, 2010, 12:14:55 AM
Removing the FDIV certainly improves the time. :P
Certainly. SSE speeds it up once more, but I ran into a problem with the mulsd instruction: It is fast if the xmm register is already in float format, but if not, it costs over 200 cycles! The first version below works fine but needs lots of conversions.
GetPercentSSE_s:
GPJ005 REAL8 0.01
GetPercentSSE proc source:DWORD, percent:DWORD
cvtsi2sd xmm0, dword ptr [esp+4] ; source
cvtsi2sd xmm1, dword ptr [esp+8] ; percent
mulsd xmm0, xmm1
mulsd xmm0, GPJ005 ; multiply with 0.01, i.e. divide source by 100
cvtsd2si eax, xmm0
ret 8 ; 11 cycles
GetPercentSSE endp
GetPercentSSE_e:
GetPercentSSEs_s:
GPJ005s REAL8 0.01
GetPercentSSEs proc source:DWORD, percent:DWORD
; xorps xmm0, xmm0
; movaps xmm1, xmm0 ; no effect
movd xmm0, dword ptr [esp+4] ; source
movd xmm1, dword ptr [esp+8] ; percent
; int 3 ; OPT_Olly 2
pmuludq xmm0, xmm1 ; source * percent, OK (pm xmm0, qword ptr [esp+8] possible but slow)
MakeSlow = 1
if MakeSlow ; first branch: result is correct but >200 cycles
mulsd xmm0, GPJ005s ; multiply with 0.01, i.e. divide source by 100 - SLOOOOOOW ###
movd eax, xmm0
else ; second branch: result correct, 27 cycles
cvtdq2pd xmm0, xmm0 ; convert integer to float (Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values)
mulsd xmm0, GPJ005s ; multiply with 0.01, i.e. divide source by 100 - fast
cvtsd2si eax, xmm0
endif
ret 8 ; 259 or 348 cycles
GetPercentSSEs endp
GetPercentSSEs_e:
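One possible reading of the slow branch, not confirmed anywhere in the thread: after movd and pmuludq, xmm0 holds an integer bit pattern, and mulsd then interprets those bits as a subnormal double. Subnormal operands are commonly handled much more slowly than normal ones, and the low bits of the product still come out as the rounded integer result. The C sketch below models only the arithmetic, says nothing about timing, uses a made-up function name, and assumes the default floating-point mode (no flush-to-zero or denormals-are-zero) and a product below 2^52.

```c
#include <stdint.h>
#include <string.h>

/* Model of the MakeSlow branch: multiply the raw integer bits, read as a
   double, by 0.01 and read the low 32 bits of the result back. */
uint32_t percent_via_subnormal(uint32_t source, uint32_t percent)
{
    uint64_t bits = (uint64_t)source * percent;   /* like pmuludq xmm0, xmm1 */
    double as_double;

    /* Reinterpret the bits without converting: for values below 2^52
       this is a subnormal double equal to bits * 2^-1074. */
    memcpy(&as_double, &bits, sizeof as_double);

    /* Like mulsd xmm0, 0.01: the result is rounded to the nearest
       multiple of 2^-1074, i.e. its mantissa is the rounded bits/100. */
    volatile double product = as_double * 0.01;
    double result = product;

    memcpy(&bits, &result, sizeof bits);
    return (uint32_t)bits;                        /* like movd eax, xmm0 */
}
```

With source 12345678 and percent 55 this returns 6790123, the same value the testbed reports for GetPercentSSEs.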
New testbed attached, now with more realistic cycle counts (there is a REPEAT 64 ... ENDM followed by shr eax, 6).
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
30 cycles for GetPercent
14 cycles for GetPercentJJ2
11 cycles for GetPercentSSE
263 cycles for GetPercentSSEs
Code sizes:
36 bytes for GetPercent, result=6790123
32 bytes for GetPercentJJ2, result=6790123
39 bytes for GetPercentSSE, result=6790123
39 bytes for GetPercentSSEs, result=6790123
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
44 cycles for GetPercent
21 cycles for GetPercentJJ2
22 cycles for GetPercentSSE
1104 cycles for GetPercentSSEs
Code sizes:
36 bytes for GetPercent, result=6790123
32 bytes for GetPercentJJ2, result=6790123
39 bytes for GetPercentSSE, result=6790123
39 bytes for GetPercentSSEs, result=6790123
Thanks, Alex. So the ordinary FPU version is one cycle faster than the fast SSE2 version... and a whopping 1100 cycles for the bad SSE stuff ::)
JJ,
Inspired by your reciprocal multiply, this one has about 40% legs on the old version. I changed the calculation order as well.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
GetPercent2 proc source:DWORD, percent:DWORD
fild DWORD PTR [esp+8] ; load percent
fld10 0.01 ; load reciprocal of 100
fmul ; mul by reciprocal = div by 100
fild DWORD PTR [esp+4] ; load the source
fmul ; multiply by previous result
fistp DWORD PTR [esp+8] ; pop FP stack and store result in stack variable
mov eax, [esp+8] ; write result to EAX for return value
ret 8
GetPercent2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
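The two calculation orders can be compared in a small C model. It is a sketch under assumptions: it uses double arithmetic instead of the FPU's 80-bit registers (and the 10-byte constant loaded by fld10), the function names are invented, and the rounding helper rounds halves away from zero, whereas fistp by default rounds halves to even.

```c
#include <stdint.h>

/* Round to the nearest integer; exact .5 ties go away from zero, which
   differs from the FPU default only in those tie cases. */
static int32_t round_nearest(double x)
{
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Order used by the original GetPercent: source / 100 * percent. */
int32_t percent_divide(int32_t source, int32_t percent)
{
    return round_nearest((double)source / 100.0 * (double)percent);
}

/* Order used by GetPercent2: percent * 0.01 * source, no division. */
int32_t percent_reciprocal(int32_t source, int32_t percent)
{
    return round_nearest((double)percent * 0.01 * (double)source);
}
```

For the testbed's result of 6790123 (which corresponds to source 12345678 and percent 55) both orders agree in this model, and percent_reciprocal(-1000, 55) gives -550.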
Great, so now we are waiting for Lingo :bg
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
11 cycles for GetPercentSSE
36 cycles for GetPercent
13 cycles for GetPercent2c
14 cycles for GetPercent2nc
14 cycles for GetPercentJJ2
11 cycles for GetPercentSSE
36 cycles for GetPercent
14 cycles for GetPercent2c
14 cycles for GetPercent2nc
14 cycles for GetPercentJJ2
11 cycles for GetPercentSSE
36 cycles for GetPercent
13 cycles for GetPercent2c
14 cycles for GetPercent2nc
14 cycles for GetPercentJJ2
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=6790123
41 bytes for GetPercent2c, result=6790123
45 bytes for GetPercent2nc, result=6790123
32 bytes for GetPercentJJ2, result=6790123
P.S.: GetPercent2c and GetPercent2nc are two variants of Hutch's new algo. I like the JJ2 variant, it fits into 2 paras and is reasonably fast.
Edit: Prescott P4:
20 cycles for GetPercentSSE
47 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
20 cycles for GetPercentJJ2
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
14 cycles for GetPercentSSE
21 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
10 cycles for GetPercentJJ2
13 cycles for GetPercentSSE
21 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
10 cycles for GetPercentJJ2
13 cycles for GetPercentSSE
21 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
10 cycles for GetPercentJJ2
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=6790123
41 bytes for GetPercent2c, result=6790123
45 bytes for GetPercent2nc, result=6790123
32 bytes for GetPercentJJ2, result=6790123
--- ok ---
Quote
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
21 cycles for GetPercentSSE
47 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
21 cycles for GetPercentJJ2
21 cycles for GetPercentSSE
47 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
20 cycles for GetPercentJJ2
21 cycles for GetPercentSSE
48 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
21 cycles for GetPercentJJ2
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=6790123
41 bytes for GetPercent2c, result=6790123
45 bytes for GetPercent2nc, result=6790123
32 bytes for GetPercentJJ2, result=6790123
--- ok ---
Here is my integer version of GetPercent. In my tests it is fast and small.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxGetPercentInt proc source:DWORD, percent:DWORD
mov eax,[esp+4]
imul edx,[esp+8],28F5C29h
jl @F
mul edx
mov eax,edx
shr edx,32-2
sub eax,edx
@@:
ret 8
AxGetPercentInt endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
The code is intended for calculating percentages in the range 0-100. If percent is greater than or equal to 100, the source value is returned.
The code's speed does not depend on the value of the source number or of the percent.
An unusual correction is used.
If the algo will always be used with percents less than 100, the first 4 lines of the code can be replaced with:
imul eax,[esp+8],28F5C29h
jl @F
mul dword ptr [esp+4]
The timings are then 2 clocks faster.
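The arithmetic of AxGetPercentInt can be followed more easily in a C model. This is a sketch, not the author's code: the function and macro names are made up, and the range test is written as a plain comparison rather than as a branch on the IMUL flags. The constant 28F5C29h is 42949673, i.e. 2^32/100 rounded up.

```c
#include <stdint.h>

#define RECIP_100 0x28F5C29u   /* 42949673 = 2^32 / 100, rounded up */

uint32_t ax_get_percent_int(uint32_t source, uint32_t percent)
{
    if (percent >= 100u)
        return source;                         /* out of range: return source */

    /* imul edx,[esp+8],28F5C29h : percent/100 as a 32-bit fixed-point
       fraction; for percent <= 99 the product still fits in 32 bits. */
    uint32_t scaled = percent * RECIP_100;

    /* mul edx : 64-bit product, of which only the high DWORD is kept. */
    uint64_t wide = (uint64_t)source * scaled;
    uint32_t high = (uint32_t)(wide >> 32);    /* mov eax,edx */

    /* shr edx,32-2 / sub eax,edx : the correction, which subtracts
       0..3 depending on the top two bits of the result. */
    return high - (high >> 30);
}
```

For source 12345678 and percent 55 the model returns 6790122, one less than the rounding FPU versions report because the fixed-point result is truncated. For source 4294966296 and percent 55 it returns 2362231462 (8CCCCAA6h).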
Jochen's latest testbed is used for testing. But I have fixed a flaw in the ChkPrecision code, which now calls the right FPU calculation code.
Here are my timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
24 cycles for GetPercentSSE
47 cycles for GetPercent
23 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
20 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
24 cycles for GetPercentSSE
46 cycles for GetPercent
23 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
20 cycles for GetPercentJJ2
15 cycles for AxGetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
26 bytes for AxGetPercentInt, result=6790122
Please test it. Thanks!
Alex
Quote
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
24 cycles for GetPercentSSE
47 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
24 cycles for GetPercentSSE
47 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
26 bytes for AxGetPercentInt, result=6790122
--- ok ---
Alex,
I get a GP fault out of your last zip on xp sp2.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
-2125943931 for 7FFF0000/1, case 26750588
Quote from: hutch-- on December 01, 2010, 10:34:56 PM
Alex,
I get a GP fault out of your last zip on xp sp2.
That's apparently something in the ChkPrecision/FPU code. AxGetPercentInt has no instructions which can cause a GPF.
I have checked this and found the reason: Jochen intentionally made the FPU stack overflow in the macro:
Algo MACRO arg
finit
REPEAT 8
fldpi
ENDM
The ChkPrecision code uses simple FPU code as the reference ("etalone"), and this simple code does not handle the case where the FPU stack is full. That is why you get the exception.
I have now commented out the FLDPI line, and it should work properly.
As I said, that's not a bug in my code, Jochen just did not write documentation for his testing variant :bg
Hutch, and all, please test this one.
Alex
Thanks,
Luce!
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
24 cycles for GetPercentSSE
47 cycles for GetPercent
24 cycles for GetPercent2c
29 cycles for GetPercent2nc
26 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
25 cycles for GetPercentSSE
47 cycles for GetPercent
30 cycles for GetPercent2c
30 cycles for GetPercent2nc
26 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
:U
Thanks, Dave! Tested by the testbed's rules, it doesn't look bad.
i am beginning to think my machine gives the worst results - lol
that makes it good for testing, at least
it runs fast enough (with my tweaks)
Quote from: dedndave on December 01, 2010, 11:49:48 PM
i am beginning to think my machine gives the worst results - lol
that makes it good for testing, at least
it runs fast enough (with my tweaks)
No, your results are excellent...
...Because they are equal to my results :P
it looks ok this time
except this one...
24 cycles for GetPercent2c
30 cycles for GetPercent2c
i think i have a way to fix that problem, though :P
It doesn't work on my PC either. GPF:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
-2125943931 for 7FFF0000/1, case 0
Probably it depends on the fact you are not using the right Testbed :lol
Of course, it's the influence of the old testbed. :lol
Well, I have no desire to dig through the code of the entire testbed to find the culprit.
On PIV cores it works, so maybe it is something in the FPU part somewhere, which leads to inexact results etc. on other hardware.
We have worked hard to create a new Testbed, compatible with old machines and MASM versions.
It is a pity we have to see these horrible interfaces again. And they have some bugs as well ::)
:naughty: :naughty: :snooty: :snooty: :naughty: :naughty:
:green2 :green2 :green2 :green2 :green2 :green2 :green2 :green2 :green2 :green2 :green2 :green2
Quote from: Antariy on December 01, 2010, 11:13:54 PM
Quote from: hutch-- on December 01, 2010, 10:34:56 PM
Alex,
I get a GP fault out of your last zip on xp sp2.
That's apparently something in the ChkPrecision/FPU code. AxGetPercentInt has no instructions which can cause a GPF.
I have checked this and found the reason: Jochen intentionally made the FPU stack overflow in the macro:
Alex,
First, I don't make the FPU stack overflow - I just fill the FPU with valid numbers. From a general purpose algo, I would expect that it works even if other parts of the code use the FPU. That is what the ffree instruction is meant for.
Second, the GPF is caused by an int 3 in ChkPrecision.
Third, after commenting out the int 3, I see a series of incorrect results. Who wrote GetPercentEtalone?
:P
Well, not an overflow, but you prepare it for a possible later overflow :P :lol
A GPF is not int 3 (int 3 is a debug breakpoint exception and has a different code). So we were misinformed :P
The Etalone proc was written because GetPercent does not work properly for 2^31 and above: the FPU loads integers as signed, so when you load a DWORD the highest bit is treated as the sign. You can see the wrong GetPercent results as well.
"Etalone" was written to handle this, but probably in too little time and when too tired :P
Quote from: jj2007 on December 02, 2010, 12:17:38 AM
From a general purpose algo, I would expect that it works even if other parts of the code use the FPU. That is what the ffree instruction is meant for.
In a big program, I would really not expect any code to free registers that may contain my variables... The FPU rules for general-purpose algos require you *not* to hold FP values in the registers when calling external code.
From Raymond Filiatreault's FpuLib help:
Unless a source parameter was specified as being in the TOP data register, the original Fpulib was designed to initialize the FPU to prevent any potential "stack overflow". This destroyed any data which may have been present in the other FPU registers.
This was revised later to destroy only the data (if any) in the registers which were necessary to perform the function. This new version will not destroy any of the existing data, except possibly the data in the ST(7) register.
Re range of algos: Give me one reason why invoke &algo, -1000, 55 should not return -550.
But you know this feature of the FPU lib, right? So it should be documented, at least. And this applies to programming in ASM only, where you control the whole flow of the program. When I talk about "general purpose" algos, I mean algos which can be used in any environment, even with an HLL... To strictly follow the API rules, you should not hold FPU data in the registers. This is disputable, of course, but only for ASM.
Well, if you like to (or are forced by the FPU to) treat a DWORD as signed, then you are right. But for unsigned that's wrong. The right result is 8CCCCAA6, and it is returned by the unsigned integer code.
Alex
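The two readings of the same DWORD can be put side by side in C. This is an illustrative sketch with made-up function names, and it uses truncating integer division rather than the rounding of the FPU versions. The bit pattern FFFFFC18h is -1000 when read as signed and 4294966296 when read as unsigned.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a DWORD as signed, which is how fild sees it. */
static int32_t as_signed(uint32_t bits)
{
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Signed reading: FFFFFC18h is -1000. */
int32_t percent_signed(uint32_t source_bits, int32_t percent)
{
    int64_t product = (int64_t)as_signed(source_bits) * percent;
    return (int32_t)(product / 100);       /* truncates toward zero */
}

/* Unsigned reading: FFFFFC18h is 4294966296. */
uint32_t percent_unsigned(uint32_t source, uint32_t percent)
{
    return (uint32_t)((uint64_t)source * percent / 100u);
}
```

With percent 55, the signed reading gives -550 and the unsigned reading gives 2362231462, which is 8CCCCAA6h. Both answers are consistent; they simply answer different questions about the same 32 bits.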
Sorry Alex, no go here.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
-2125943931 for 7FFF0000/1, case 26750588
Press a key after this and you have a GP fault.
I am using XP SP3.
Hutch, please try commenting out
call ChkPrecision
in the sources. The code is precise enough for integer code, and I (and Dave, and Luce) have no problems with inexact results.
To sidestep any flaw in the checking code, comment out the call to it. Then you will go straight to the timing tests.
Thank you!
Alex
Yes, works fine with the call commented out.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
13 cycles for GetPercentSSE
21 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
11 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
21 cycles for AxGetPercentInt
13 cycles for GetPercentSSE
21 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
11 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
21 cycles for AxGetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
26 bytes for AxGetPercentInt, result=12345678
--- ok ---
Thank you, Hutch!
But the results look strange :eek
Very interesting thing :eek
Quote from: hutch-- on December 02, 2010, 01:45:31 AM
26 bytes for AxGetPercentInt, result=12345678
It seems that PIV hardware has a different IMUL implementation.
I guess the culprit is the branch after IMUL:
imul edx,[esp+8],28F5C29h
jl @F
So, if that piece is changed to:
mov edx,[esp+8]
cmp edx,99
ja @F
imul edx,28F5C29h
the code is guaranteed to work.
But anyway, this is a really strange difference :eek
I have changed the code, and the timings went up by 1 clock.
It should work on any CPU. But it is a pity that the sign and overflow flags behave differently.
Hutch, please test this new one!
Alex
Congrats, Alex, it works now, and it's very fast :U
However, for MasmBasic I will keep the old design that yields -550 for PerCent(-1000, 55); the need for an unsigned PerCent(4294966296, 55)=2362231462 is unclear to me.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
11 cycles for GetPercentSSE
37 cycles for GetPercent
14 cycles for GetPercent2c
15 cycles for GetPercent2nc
15 cycles for GetPercentJJ1
14 cycles for GetPercentJJ2
10 cycles for AxGetPercentInt
11 cycles for GetPercentSSE
37 cycles for GetPercent
14 cycles for GetPercent2c
15 cycles for GetPercent2nc
15 cycles for GetPercentJJ1
14 cycles for GetPercentJJ2
10 cycles for AxGetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
31 bytes for AxGetPercentInt, result=6790122
Quote from: jj2007 on December 02, 2010, 02:47:29 AM
Congrats, Alex, it works now, and it's very fast :U
However, for MasmBasic I will keep the old design that yields -550 for PerCent(-1000, 55); the need for an unsigned PerCent(4294966296, 55) is unclear to me.
Thanks!
Of course, you use whichever algo you want, I am not imposing mine at all. I just had some spare time, which I spent on it.
I prefer to treat numbers as unsigned, so I tried to make a version which works with numbers of any size without speed loss, and that is all :bg
EDITED: Why I prefer unsigned: because in programming you usually need positive numbers rather than negative ones. For example, for calculating coordinates.
Alex
Quote from: Antariy on December 02, 2010, 02:55:45 AM
For example for calculation of some coordinates.
The Earth's circumference is 40,000 km, that makes 40,000,000 metres or 40,000,000,000 millimetres
40,000,000,000/4294967296 = 9.3 millimetres
For GPS, that is a damn good resolution :bg
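The same arithmetic as a small C helper, for anyone who wants to try other distances. The function name is invented for this sketch; the only fixed quantity is the 2^32 distinct values of a DWORD.

```c
/* Size of one step, in millimetres, when a distance given in kilometres
   is spread evenly over all 2^32 values of a DWORD. */
double millimetres_per_step(double distance_km)
{
    double millimetres = distance_km * 1000.0 * 1000.0; /* km -> m -> mm */
    return millimetres / 4294967296.0;                  /* 2^32 steps    */
}
```

millimetres_per_step(40000.0) comes out at a little over 9.3, in line with the figure above.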
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
21 cycles for GetPercentSSE
53 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
21 cycles for GetPercentSSE
52 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
20 cycles for GetPercentJJ2
16 cycles for AxGetPercentInt
AMD Sempron(tm) Processor 3100+ (SSE3)
14 cycles for GetPercentSSE
30 cycles for GetPercent
13 cycles for GetPercent2c
15 cycles for GetPercent2nc
16 cycles for GetPercentJJ1
15 cycles for GetPercentJJ2
9 cycles for AxGetPercentInt
12 cycles for GetPercentSSE
29 cycles for GetPercent
13 cycles for GetPercent2c
14 cycles for GetPercent2nc
14 cycles for GetPercentJJ1
14 cycles for GetPercentJJ2
9 cycles for AxGetPercentInt
The last version is working:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
14 cycles for GetPercentSSE
36 cycles for GetPercent
9 cycles for GetPercent2c
9 cycles for GetPercent2nc
9 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
7 cycles for AxGetPercentInt
13 cycles for GetPercentSSE
36 cycles for GetPercent
9 cycles for GetPercent2c
9 cycles for GetPercent2nc
9 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
7 cycles for AxGetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
31 bytes for AxGetPercentInt, result=6790122
--- ok ---
:U
Things are getting a little unclear for me.
The latest version of AxGetPercentInt is this one:
Quote
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
PourCent proc source:DWORD, percent:DWORD
mov eax,[esp+4]
if 0
imul edx,[esp+8],28F5C29h
jl @F
else
mov edx,[esp+8]
cmp edx,99
ja @F
imul edx,28F5C29h
endif
mul edx
mov eax,edx
shr edx,32-2
sub eax,edx
@@:
ret 8
PourCent endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Am I correct ?
Quote from: ToutEnMasm on December 02, 2010, 07:32:07 AM
Things are getting a little unclear for me.
The latest version of AxGetPercentInt is this one:
Quote
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
PourCent proc source:DWORD, percent:DWORD
mov eax,[esp+4]
if 0
imul edx,[esp+8],28F5C29h
jl @F
else
mov edx,[esp+8]
cmp edx,99
ja @F
imul edx,28F5C29h
endif
mul edx
mov eax,edx
shr edx,32-2
sub eax,edx
@@:
ret 8
PourCent endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Am I correct ?
The above zip file is the one you should download:
http://www.masm32.com/board/index.php?action=dlattach;topic=15263.0;id=8563
Frank
Thanks,
I have extracted the right one.
Quote from: Antariy on December 02, 2010, 02:01:30 AM
It seems that Pentium 4 hardware has a different IMUL implementation.
I guess the culprit is the branch after IMUL:
imul edx,[esp+8],28F5C29h
jl @F
So, if that piece is changed to:
mov edx,[esp+8]
cmp edx,99
ja @F
imul edx,28F5C29h
the code is guaranteed to work.
But anyway, this is a really strange difference :eek
Hi,
According to some old documentation I am referring to, only
the carry and overflow flags are valid for the IMUL instruction.
And JL also uses the sign flag, so should not be used. Do you
have differing documentation as to what are the valid flags
when using IMUL?
Regards,
Steve N.
A very good question on the IMUL flags.
Intel's documentation says the SF, ZF, AF and PF flags are undefined (U). What does that mean?
undefined means it can be either true or false
Quote
undefined means it can be either true or false
Perhaps that is a bit short, because some comparisons seem to work on one CPU and not on another.
I suspect some secret of a particular CPU here.
not really - lol
if it doesn't work the same way on all CPU's, you don't want to use it
i think it's NexGen CPU's where they divide 25 by 5 or something
the ZF is clear for all CPU's except theirs (something like that)
pretty hokey, if you ask me
some other manufacturer that isn't aware of their quirk (flaw) might violate the rule
Quote from: jj2007 on December 02, 2010, 03:08:52 AM
Quote from: Antariy on December 02, 2010, 02:55:45 AM
For example for calculation of some coordinates.
The Earth's circumference is about 40,000 km, that makes 40,000,000 metres or 40,000,000,000 millimetres
40,000,000,000/4294967296 = 9.3 millimetres
For GPS, that is a damn good resolution :bg
Distance to the Sun: ~149,700,000,000,000 mm / DWORD range = 34,854.74 mm = about 34.85 meters. :bg
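Editor's note: the two resolution figures above are easy to re-check. This is a minimal sketch in which Python is used only as a calculator; the function name `resolution_mm` is made up for illustration, and the distances are the round figures quoted in the posts.

```python
# Resolution of one DWORD step when a distance is mapped onto the full
# unsigned 32-bit range (2**32 distinct values).
DWORD_RANGE = 2**32  # 4294967296


def resolution_mm(distance_mm):
    """Millimetres covered by a single DWORD step (hypothetical helper)."""
    return distance_mm / DWORD_RANGE


earth_mm = 40_000_000_000        # ~40,000 km, round figure from the post
sun_mm = 149_700_000_000_000     # distance to the Sun, figure from the post

print(f"Earth: {resolution_mm(earth_mm):.2f} mm per step")
print(f"Sun:   {resolution_mm(sun_mm) / 1000:.2f} m per step")
```

This gives roughly 9.3 mm per step for the Earth figure and just under 35 m per step for the Sun figure.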
Quote from: FORTRANS on December 02, 2010, 01:12:22 PM
Hi,
According to some old documentation I am referring to, only
the carry and overflow flags are valid for the IMUL instruction.
And JL also uses the sign flag, so should not be used. Do you
have differing documentation as to what are the valid flags
when using IMUL?
Hi Steve!
Yes, your documentation is right.
The original code simply relied on an assumption that looks logical at first glance: if the result of the multiplication has its high bit set, then IMUL sets CF and OF. But since the high bit is set, this is a negative signed value, so SF should be set too (in theory :lol). When a true overflow occurred, OF was set but SF was not (because of truncation). I checked this on my CPU, and this "rule" seemed to work, so I used it in the algo. But it does not seem to work on some other CPUs :green2
Steve, the old code is on page 2, here: "http://www.masm32.com/board/index.php?topic=15263.msg127043#msg127043". Please test it if you have spare time. But first you need to comment out the call to the SSE code.
Alex
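Editor's note: the sketch below illustrates why branching with JL after IMUL is fragile. It is a toy model, not a description of any real CPU. OF is computed as documented, while the two SF "policies" are purely hypothetical stand-ins for what "undefined" might turn out to be. All function and policy names are invented for this illustration.

```python
MULTIPLIER = 0x28F5C29  # constant from the thread, about 2**32 / 100


def imul32_flags(a, b, sf_policy):
    """Model a 32-bit IMUL of two non-negative values.

    OF is set when the product does not fit in a signed 32-bit value.
    SF is documented as undefined, so sf_policy is a hypothetical guess.
    """
    product = a * b
    truncated = product & 0xFFFFFFFF
    of = product > 0x7FFFFFFF
    if sf_policy == "sign_of_result":
        sf = bool(truncated >> 31)      # assumption: SF mirrors bit 31
    elif sf_policy == "always_clear":
        sf = False                      # assumption: SF is simply cleared
    else:
        raise ValueError(sf_policy)
    return truncated, sf, of


def skips_with_jl(percent, sf_policy):
    """imul edx,[esp+8],28F5C29h / jl @F: JL is taken when SF != OF."""
    _, sf, of = imul32_flags(percent, MULTIPLIER, sf_policy)
    return sf != of


def skips_with_cmp(percent):
    """cmp edx,99 / ja @F: plain unsigned range check on the input."""
    return percent > 99


for policy in ("sign_of_result", "always_clear"):
    wrong = [p for p in range(201)
             if skips_with_jl(p, policy) != skips_with_cmp(p)]
    print(policy, "- JL disagrees with cmp/ja for", len(wrong), "inputs")
```

Under the first assumption the JL variant behaves like the range check for 0 to 149 but not beyond. Under the second it already goes wrong from 50%. The cmp/ja variant does not depend on SF at all, which is the point of the fix.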
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
14 cycles for GetPercentSSE
47 cycles for GetPercent
9 cycles for GetPercent2c
9 cycles for GetPercent2nc
9 cycles for GetPercentJJ1
12 cycles for GetPercentJJ2
19 cycles for AxGetPercentInt
38 cycles for GetPercentSSE
37 cycles for GetPercent
22 cycles for GetPercent2c
9 cycles for GetPercent2nc
9 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
10 cycles for AxGetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
31 bytes for AxGetPercentInt, result=6790122
--- ok ---
:thumbu
-r
Quote from: Antariy on December 02, 2010, 11:26:02 PM
Hi Steve!
Steve, the old code is on page 2, here: "http://www.masm32.com/board/index.php?topic=15263.msg127043#msg127043". Please test it if you have spare time. But first you need to comment out the call to the SSE code.
Hi Alex,
Doesn't seem to work. Neither the included *.EXE nor a
rebuild with the SSE code commented out works. It seems to hang.
Results displayed (both cases):
G:\WORK\TEMP>2getperc
pre-P4 (SSE1)
-2125943931 for 7FFF0000/1, case 1
Regards,
Steve N.
Quote
Steve,
There is an int 3 in ChkPrecision:
inkey str$(edi), 13, 10
int 3
invoke AxGetPercentInt, esi, ebx
Just comment the call to ChkPrecision out, but keep in mind it was there for a reason. AxGetPercentInt behaves differently for negative values.
Hi jj2007,
Thanks for pointing that out. It now goes into an infinite loop.
But it does run. (?)
-2061530988 for 7FFF2710/4, case 1243708
-2040056706 for 7FFF2710/5, case 1243708
-2018582425 for 7FFF2710/6, case 1243708
-1997108144 for 7FFF2710/7, case 1243708
-1975633863 for 7FFF2710/8, case 1243708
-1954159582 for 7FFF2710/9, case 1243708
-1932685301 for 7FFF2710/10, case 1243708
Regards,
Steve N.
Quote from: FORTRANS on December 03, 2010, 02:07:26 PM
Hi jj2007,
Thanks for pointing that out. It now goes into an infinite loop.
But it does run. (?)
-2061530988 for 7FFF2710/4, case 1243708
-2040056706 for 7FFF2710/5, case 1243708
-2018582425 for 7FFF2710/6, case 1243708
-1997108144 for 7FFF2710/7, case 1243708
-1975633863 for 7FFF2710/8, case 1243708
-1954159582 for 7FFF2710/9, case 1243708
-1932685301 for 7FFF2710/10, case 1243708
Regards,
Steve N.
Hi Steve!
Thanks, that is right: the code from page 2, which I asked you to test, assumes that SF is set according to the result.
So it seems to work only on Prescotts :bg
You can get a working version here: "http://www.masm32.com/board/index.php?topic=15263.msg127085#msg127085".
Alex
Hi Alex,
After a couple of edits.
G:\WORK\TEMP> 2getperc
pre-P4 (SSE1)
52 cycles for GetPercent
19 cycles for GetPercent2c
21 cycles for GetPercent2nc
21 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
13 cycles for AxGetPercentInt
52 cycles for GetPercent
19 cycles for GetPercent2c
21 cycles for GetPercent2nc
21 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
13 cycles for AxGetPercentInt
Code sizes:
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
31 bytes for AxGetPercentInt, result=6790122
--- ok ---
Regards,
Steve N.
Quote from: FORTRANS on December 03, 2010, 04:51:37 PM
Hi Alex,
After a couple of edits.
Hi Steve!
Thank you!
Yes, here we see the drawbacks of the testbed used :bg
Alex
Quote from: Antariy on December 04, 2010, 01:18:38 AM
Yes, here we see the drawbacks of the testbed used :bg
The testbed is transparent - anybody with basic search skills can search for "case" and find the "culprit". ChkPrecision was introduced because some algos did not yield the expected results, and so I let it crash there for testing purposes. I didn't know that somebody would develop an algo that yields, as result for invoke GetPercent, -1000, 50, the number 2147483148 (and claims the result is correct...).
So stop blaming the testbed, and start explaining in which real coding situations an "unsigned" GetPercent algo is useful.
well - i know of a simple example
displaying percent completion in a file transfer or other procedure
that is a case where a simple percent routine can be used; always positive, 0 to 100 %
it could even be a macro, but speed is not really critical in that case
we shouldn't spend too much time in the laboratory speeding up algos that are used like that
so - you can assume that this algo should be general purpose, and cover a wide range of input values
Humorously enough, the GetPercent() algo was designed to do something really simple: calculate screen percentages to the nearest pixel for sizing windows on the screen. I was happy enough to use JJ's technique that made it faster and smaller, but for what it was designed for, it ain't like you are going to tell the difference.
Quote from: dedndave on December 04, 2010, 03:15:06 AM
... and cover a wide range of input values
Indeed. Including negative ones like -1000 (which is what algos above do, except AxGetpercent)
:bg
Quote from: jj2007 on December 04, 2010, 03:03:55 AM
Quote from: Antariy on December 04, 2010, 01:18:38 AM
Yes, here we see the drawbacks of the testbed used :bg
The testbed is transparent - anybody with basic search skills can search for "case" and find the "culprit". ChkPrecision was introduced because some algos did not yield the expected results, and so I let it crash there for testing purposes. I didn't know that somebody would develop an algo that yields, as result for invoke GetPercent, -1000, 50, the number 2147483148 (and claims the result is correct...).
So stop blaming the testbed, and start explaining in which real coding situations an "unsigned" GetPercent algo is useful.
Well...
1. You can always use this algo with any (ANY) kind of number. If you want the percentage of a *negative* number, just NEG it before the call and NEG the result :P
2. Like Dave said, yes - if you copy a file with a size of 3.21 GB, it would be nice if the routine said: "Copied -808.96 MB" :P
3. With screen coordinates, as Hutch said, you do not need the FPU's precision and rounding; you can use short, fast integer code that runs even on an i386 SX :P
Stop blaming integer code for its "unsignedness". You can use an unsigned algo with any kind of data, because it treats all bits as data bits. But FPU code working with a DWORD cannot work with unsigned numbers, because it treats only 31 bits as data. So unsigned code has no limitations, while signed code does.
Alex
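Editor's note: the "-808.96 MB" figure in point 2 is what a 3.21 GB byte count looks like when the same 32 bits are read as a signed DWORD. This is a minimal sketch, assuming GB and MB mean 2**30 and 2**20 bytes; the helper name `as_signed32` is made up for illustration.

```python
def as_signed32(value):
    """Reinterpret the low 32 bits of value as a signed DWORD."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


size_bytes = int(3.21 * 2**30)      # 3.21 GB, assuming 1 GB = 2**30 bytes
signed_bytes = as_signed32(size_bytes)

print(f"read as unsigned: {size_bytes / 2**20:.2f} MB")
print(f"read as signed:   {signed_bytes / 2**20:.2f} MB")

# The same reinterpretation links -1000 and FFFFFC18h from the thread.
print(as_signed32(0xFFFFFC18))
```

Any size above 2 GB has bit 31 set, so code that treats the DWORD as signed (as FILD on a DWORD does) sees a negative number.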
Quote from: jj2007 on December 04, 2010, 07:32:31 AM
Quote from: dedndave on December 04, 2010, 03:15:06 AM
... and cover a wide range of input values
Indeed. Including negative ones like -1000 (which is what algos above do, except AxGetpercent)
:bg
Some algos produce wrong results. Those are negative results, though.
Quote from: jj2007 on December 04, 2010, 03:03:55 AM
So stop blaming the testbed, and start explaining in which real coding situations an "unsigned" GetPercent algo is useful.
First. I was referring to the fact that Steve had to comment out the SSE2 code - that is the point of "blaming" the testbed: the old testbed does not skip unsupported algos.
Second. You can use an unsigned algo with signed numbers. It would cost ~1-2 clocks more, but it is more flexible than a signed algo, which does not work with unsigned numbers.
Third. This tweak of the algo is 2 clocks faster (for me):
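OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
AxGetPercentInt proc source:DWORD, percent:DWORD
mov edx,[esp+8] ; percent
mov eax,[esp+4] ; source
cmp edx,99
ja @F ; percent > 99 (unsigned): return source unchanged
imul edx,28F5C29h ; percent * 2^32/100 (constant rounded up)
mul edx ; edx:eax = source * scaled percent
mov eax,edx ; keep the high dword
shr edx,32-2 ; top two bits of the high dword
sub eax,edx ; subtract them (apparently offsets the rounded-up constant)
@@:
ret 8
AxGetPercentInt endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef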
Loading, checking and exiting take 16 bytes of the code. As a macro, this code would be 15 bytes and a few cycles long.
It *works* with negative numbers, at a cost of just 2 clocks and 4 bytes more. And I claim that 50% of -1000 = 2147483148 is correct, since -1000 is FFFFFC18h (4294966296).
Alex
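Editor's note: the listing above can be followed step by step with a small model. This is a sketch in Python that mirrors the instructions one by one with explicit 32-bit masking. It is not the author's code, and the second test input (12345678 at 55%) is a hypothetical value chosen for illustration, not taken from the testbed.

```python
MASK32 = 0xFFFFFFFF
MULTIPLIER = 0x28F5C29  # 42949673, i.e. 2**32 / 100 rounded up


def ax_get_percent_int(source, percent):
    """Python model of the AxGetPercentInt listing (unsigned 32-bit)."""
    eax = source & MASK32
    edx = percent & MASK32
    if edx > 99:                         # cmp edx,99 / ja @F
        return eax                       # source is returned unchanged
    edx = (edx * MULTIPLIER) & MASK32    # imul edx,28F5C29h
    product = eax * edx                  # mul edx -> edx:eax
    edx = product >> 32                  # high dword of the product
    eax = edx                            # mov eax,edx
    edx >>= 30                           # shr edx,32-2
    return (eax - edx) & MASK32          # sub eax,edx


print(ax_get_percent_int(-1000, 50))     # -1000 is FFFFFC18h -> 2147483148
print(ax_get_percent_int(12345678, 55))  # hypothetical input -> 6790122
```

The exact value of 55% of 12345678 is 6790122.9, so the integer routine truncates where an FPU version that rounds to nearest would give 6790123. Note also that for percent values above 99 the model, like the listing, returns the source unchanged.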
Ok, so we'll ask Hutch to put your fast unsigned AxGetPercent into Masm32.
Perhaps the documentation should then contain a little example:
Quote
In case you have to translate the y coordinate of a sine function into screen coordinates:
call MySinus ; get some value for the Y axis
mov edx, ScreenFactorY ; scale factor
test eax, eax
.if Sign?
neg eax
invoke AxGetPercent, eax, edx
neg eax
.else
invoke AxGetPercent, eax, edx
.endif
That is easy and elegant, and avoids starting a flame war here, right?
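Editor's note: the NEG / call / NEG pattern in the example above amounts to a signed wrapper around the unsigned routine. This is a minimal sketch, assuming the same instruction-by-instruction Python model of AxGetPercentInt as a stand-in; `signed_percent` is a hypothetical name, not part of MASM32.

```python
MASK32 = 0xFFFFFFFF


def ax_get_percent_int(source, percent):
    """Compact model of the unsigned routine (assumption: mirrors the listing)."""
    source &= MASK32
    percent &= MASK32
    if percent > 99:
        return source
    high = (source * ((percent * 0x28F5C29) & MASK32)) >> 32
    return (high - (high >> 30)) & MASK32


def signed_percent(value, percent):
    """Signed wrapper: negate, call the unsigned routine, negate back."""
    if value < 0:                                    # test eax,eax / .if Sign?
        return -ax_get_percent_int(-value, percent)  # neg / invoke / neg
    return ax_get_percent_int(value, percent)


print(signed_percent(-1000, 50))
print(signed_percent(1000, 50))
```

With the wrapper, -1000 at 50% comes out as -500, at the cost of the extra test and two negations.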
Quote from: jj2007 on December 05, 2010, 07:39:30 AM
That is easy and elegant, and avoids starting a flame war here, right?
Where do you see a flame war? This is not the Soap Box :bg
You just did not understand what I meant by "bad old testbed". I meant that the old testbed does not skip unsupported algos, and Steve (FORTRANS) was forced to comment out the SSE2 code.
Everything after that is the answer about real coding situations.
No flaming here. Just funny that such a small piece of code led to such a discussion :bg
Jochen, it would be interesting if you posted timings for the new tweak, because your clocks are always very different.
Alex
Quote from: Antariy on December 06, 2010, 02:42:19 AM
Just funny that such a small piece of code led to such a discussion :bg
Small is big on the MASM32 forum :lol