Here is an algo that yields the share of source expressed in % - a variant of the Masm32 library GetPercent function discussed here (http://www.masm32.com/board/index.php?topic=15263.0).
My problem is that I do not understand why the two options below do not behave identically. If the result of the multiplication does not fit into eax alone, edx will be nonzero, and the carry and overflow flags should be set. That is my understanding of the not so clear documentation.
Where am I wrong?
QuoteGetPercentInt proc source:DWORD, percent:DWORD
mov eax, [esp+8] ; source
imul dword ptr [esp+4] ; multiply first with percent, result in edx::eax, then div 100
if 0
jc @F ; fails although carry flag should be set if edx!=0
else
test edx, edx ; works fine
jne @F
endif
inc eax
mov edx, 0a3d70a3dh ; Agner Fog
mul edx
shr edx, 6
mov eax, edx
ret 8
@@:
push 100
idiv dword ptr [esp] ; yields correct result but slow
pop edx
ret 8
GetPercentInt endp
to made a signed multiply you need to modify the proc as this
GetPercentInt PROTO :Sdword,:Sdword
other suggest ,masm optimize very well this sort of test,you can verify it by used of:
.if ZERO?
...
.endif
don't you want esp+8 and esp+C, not 4 and 8? +4 is the return address, which will never fit in just one register
EDIT - on second thought, it probably would.
-r
red,
+4 is arg1. The proggie works fine, and is blazing fast, on a P4 faster than the SSE2 version, but I'd like to understand why jc @F does not work.
Quote from: ToutEnMasm on November 08, 2010, 09:41:21 PM
to made a signed multiply you need to modify the proc as this
GetPercentInt PROTO :Sdword,:Sdword
Yves, there is no stack frame.
Quote from: jj2007 on November 08, 2010, 09:25:39 PM
Here is an algo that yields the share of source expressed in % - a variant of the Masm32 library GetPercent function discussed here (http://www.masm32.com/board/index.php?topic=15263.0).
My problem is that I do not understand why the two options below do not behave identically. If the result of the multiplication does not fit into eax alone, edx will be nonzero, and the carry and overflow flags should be set. That is my understanding of the not so clear documentation.
Where am I wrong?
QuoteGetPercentInt proc source:DWORD, percent:DWORD
mov eax, [esp+8] ; source
imul dword ptr [esp+4] ; multiply first with percent, result in edx::eax, then div 100
if 0
jc @F ; fails although carry flag should be set if edx!=0
else
test edx, edx ; works fine
jne @F
endif
inc eax
mov edx, 0a3d70a3dh ; Agner Fog
mul edx
shr edx, 6
mov eax, edx
ret 8
@@:
push 100
idiv dword ptr [esp] ; yields correct result but slow
pop edx
ret 8
GetPercentInt endp
Oh nvm is what ToutEnASM give on its description :lol
No proto Don't be stopped by this:
Quote
GetPercentInt proc source:SDWORD, percent:SDWORD
Explain for cf and Of flags cleared are here
Quote
Multiplies two signed operands. The number of operands determines the form of the instruction.
If a single operand is specified, the instruction multiplies the value in the specified general-purpose
register or memory location by the value in the AL, AX, EAX, or RAX register (depending on the
operand size) and stores the product in AX, DX:AX, EDX:EAX, or RDX:RAX, respectively.
If two operands are specified, the instruction multiplies the value in a general-purpose register (first
operand) by an immediate value or the value in a general-purpose register or memory location (second
operand) and stores the product in the first operand location.
If three operands are specified, the instruction multiplies the value in a general-purpose register or
memory location (second operand), by an immediate value (third operand) and stores the product in a
register (first operand).
The IMUL instruction sign-extends an immediate operand to the length of the other register/memory
operand.
The CF and OF flags are set if, due to integer overflow, the double-width multiplication result cannot
be represented in the half-width destination register. Otherwise the CF and OF flags are cleared.
Quote from: jj2007 on November 08, 2010, 09:49:21 PM
blazing fast, on a P4 faster than the SSE2 versiondoes not work.
Jochen, it is very slow. ??? On your CPU it is faster than others? On mine - it slower that others. integer division very slow anyway.
Alex
Quote from: Antariy on November 08, 2010, 09:59:49 PM
Quote from: jj2007 on November 08, 2010, 09:49:21 PM
blazing fast, on a P4 faster than the SSE2 versiondoes not work.
Jochen, it is very slow. ??? On your CPU it is faster than others? On mine - it slower that others. integer division very slow anyway.
Just run the executable in post #1, Alex. It is damn fast.
QuoteIntel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
11 cycles for GetPercentSSE
36 cycles for GetPercent
14 cycles for GetPercent2c
14 cycles for GetPercent2nc
15 cycles for GetPercentJJ1
14 cycles for GetPercentJJ2
12 cycles for GetPercentInt
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
21 cycles for GetPercentSSE
16 cycles for GetPercentInt
@theunknownguy:
if 0 means conditional assembly, that's why you don't see it.
Quote from: jj2007 on November 08, 2010, 10:03:50 PM
Quote from: Antariy on November 08, 2010, 09:59:49 PM
Quote from: jj2007 on November 08, 2010, 09:49:21 PM
blazing fast, on a P4 faster than the SSE2 versiondoes not work.
Jochen, it is very slow. ??? On your CPU it is faster than others? On mine - it slower that others. integer division very slow anyway.
Just run the executable in post #1, Alex. It is damn fast.
@theunknownguy: if 0 means conditional assembly, that's why you don't see it.
Rolf sorry i read it fast (its late here):
7FFFC350 * 0 = CF (0) OF (0)
7FFFC350 * 1 = CF (0) OF (0)
7FFFC350 * >1 = CF (1) OF (1)
The CF and OF flags are set if, due to integer overflow, the double-width multiplication result cannot
be represented in the half-width destination register. Otherwise the CF and OF flags are cleared.
ToutEnASM explain...
Thats all i can do for today lol at this hour i am watching little ASM goblins on my bead :lol
Quote from: jj2007 on November 08, 2010, 10:03:50 PM
Just run the executable in post #1, Alex. It is damn fast.
It fast until result is fit into 32bit after imul, otherwise...
Quote from: jj2007 on November 08, 2010, 10:03:50 PM
Just run the executable in post #1, Alex. It is damn fast.
Run it with getting of 50% of the number 85899346 ;)
I got this:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
22 cycles for GetPercentSSE
49 cycles for GetPercent
25 cycles for GetPercent2c
27 cycles for GetPercent2nc
25 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
53 cycles for GetPercentInt
22 cycles for GetPercentSSE
49 cycles for GetPercent
24 cycles for GetPercent2c
27 cycles for GetPercent2nc
25 cycles for GetPercentJJ1
21 cycles for GetPercentJJ2
53 cycles for GetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
37 bytes for GetPercentInt, result=6790122
GetPercent(2c) have nice results.
Note: I have changed only Algo macro, so the CodeSize "result" is not affected.
For this short algo I get 42 clocks with nubers mentinioned in previous post:
align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
AxGetPercent proc source:DWORD, percent:DWORD
mov ecx,100
mov eax,[esp+4]
imul dword ptr [esp+8]
idiv ecx
ret 8
AxGetPercent endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Edited: code size is something about 19-20 bytes, I guess.
Alex
:bg
AxGetPercent proc source:DWORD, percent:DWORD
AxVersion = 2
if AxVersion eq 1
push 100 ; 17 bytes, 39 cycles, preserves ecx
mov eax, [esp+8]
imul dword ptr [esp+12]
idiv dword ptr [esp]
pop edx
ret 8
elseif AxVersion eq 2
pop ecx ; only 13 bytes, 58 cycles, trashes ecx
pop eax
pop edx
imul edx
push 100
idiv dword ptr [esp]
pop edx
jmp ecx
else
mov ecx, 100 ; 18 bytes, 37 cycles, trashes ecx
mov eax, [esp+4]
imul dword ptr [esp+8]
idiv ecx
ret 8
endif
AxGetPercent endp
elseif AxVersion eq 2
pop ecx ; only 13 bytes, 58 cycles, trashes ecx
pop eax
pop edx
push ecx
imul edx
mov ecx,100
idiv ecx
ret
else
~14 bytes, 3 clocks faster :bg
but still many clocks slower than esp-frame based
Quote from: theunknownguy on November 08, 2010, 10:07:51 PM
Rolf Jochen sorry i read it fast (its late here):
7FFFC350 * 0 = CF (0) OF (0)
7FFFC350 * 1 = CF (0) OF (0)
7FFFC350 * >1 = CF (1) OF (1)
Thanks for pointing me to this. What I see in Olly is interesting, and contradicts the documentation in various places, although the Intel manual
might indicate the behaviour correctly:
QuoteThe CF and OF flags are set when significant bit (including the sign bit) are carried
into the upper half of the result. The CF and OF flags are cleared when the result
(including the sign bit) fits exactly in the lower half of the result.
Since that is cryptic, here what happens:
Quoteeax=7FFF0000h, dword ptr [esp+4]=
1
imul dword ptr [esp+4] ->edx=0, CF=0
2
imul dword ptr [esp+4] ->edx=0, CF=1
In other words: When multiplying by 2, edx is still zero (
no significant bits shifted into edx) but the carry flag is already set!
Now only Intel knows what "sign bit are carried into the upper half of the result" really means ::)
Quote from: Antariy on November 08, 2010, 11:49:39 PM
~14 bytes, 3 clocks faster :bg
but still many clocks slower than esp-frame based
38 cycles, 14 bytes - very good if speed is not the highest priority! But
m2m ecx, 100 makes it 37 cycles and 12 bytes on my CPU :bg
The rule for IMUL is that if the result is entirely contained in the lower register then the flags are cleared.
If the result of the multiplication sets EAX MSB then, as this is signed multiplication, the result in EAX is negative but the real result in EDX:EAX may be positive so the full result is not contained entirely in EAX as the value in EDX must be looked at to determine the full result.
If you look carefully at the algorithm used by IMUL as described in the Intel manual, the description has changed (presumably, because the earlier version had it wrong).
The version of the manual from 1997 shows this:
QuoteEDX:EAX ¬ EAX * SRC (* signed multiplication *)
IF ((EDX = 00000000H) OR (EDX = FFFFFFFFH))
THEN CF = 0; OF = 0;
ELSE CF = 1; OF = 1;
FI;
The version of the manual from May 2007 shows this:
QuoteEDX:EAX ← EAX ∗ SRC (* Signed multiplication *)
IF EAX = EDX:EAX
THEN CF ← 0; OF ← 0;
ELSE CF ← 1; OF ← 1; FI;
Note that it's no longer described as EDX=0 clears the flags but EAX = EDX:EAX clears the flags. Since these are treated as signed numbers, if the multiply sets the sign bit of EAX but there is no carry to EDX then EAX does not equal EDX:EAX so the carry and overflow flags will be set even though EDX = 0.
You can, of course, avoid the problem if you use the unsigned MUL instruction, but this may not suit your needs.
Paul.
AMD Phenom(tm) II X4 945 Processor (SSE3)
9 cycles for GetPercentSSE
21 cycles for GetPercent
11 cycles for GetPercent2c
14 cycles for GetPercent2nc
11 cycles for GetPercentJJ1
12 cycles for GetPercentJJ2
11 cycles for GetPercentInt
9 cycles for GetPercentSSE
21 cycles for GetPercent
11 cycles for GetPercent2c
14 cycles for GetPercent2nc
11 cycles for GetPercentJJ1
12 cycles for GetPercentJJ2
11 cycles for GetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
37 bytes for GetPercentInt, result=6790122
--- ok ---
Quote from: dioxin on November 09, 2010, 12:15:26 AM
The rule for IMUL is ...
Thanks a lot, Paul. This is indeed pretty confusing :(
With [esp]=80001170h:
eax=1, imul dword ptr [esp]: CF=0, edx=-1, eax=80001170 (both eax and edx negative)
eax=2, imul dword ptr [esp]: CF=1, edx=-1, eax=000022E0 (edx negative, eax positive)
eax=3, imul dword ptr [esp]: CF=1, edx=-2, eax=80003450 (both eax and edx negative)
Fortunately, the algo works fine with test edx, edx...
Quote from: jj2007 on November 09, 2010, 12:38:14 AM
Fortunately, the algo works fine with test edx, edx...
It will work fast 50/50. Only if percents and/or numbers small.
Alex
Quote from: Antariy on November 09, 2010, 12:45:14 AM
Quote from: jj2007 on November 09, 2010, 12:38:14 AM
Fortunately, the algo works fine with test edx, edx...
It will work fast 50/50. Only if percents and/or numbers small.
Depends on how you define "small": 100% of 21,474,836 is still in the fast range. For most practical purposes, this will be more than sufficient. For the rare cases, it's 40 cycles, 3 more than the original Masm32 library algo.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
11 cycles for GetPercentSSE
12 cycles for GetPercentInt
37 cycles for GetPercent
14 cycles for GetPercent2c
15 cycles for GetPercent2nc
14 cycles for GetPercentJJ2
37 cycles for AxGetPercent
Quote from: jj2007 on November 09, 2010, 12:55:55 AM
Depends on how you define "small": 100% of 21,474,836 is still in the fast range. For most practical purposes, this will be more than sufficient. For the rare cases, it's 40 cycles, 4 more than the original Masm32 library algo.
I not sayed that algo is bad - just it is speedy in not general way.
But this is dilemma: if use algo with small (relatively) numbers - for example: calculate screen coordinates - then such speed is not needed. If algo would be used in real world with possibilities of any numbers and/or percents - that is not speed algo. It is just interesting algo.
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
42 cycles for GetPercentSSE
126 cycles for GetPercent
42 cycles for GetPercent2c
43 cycles for GetPercent2nc
45 cycles for GetPercentJJ1
46 cycles for GetPercentJJ2
33 cycles for GetPercentInt
41 cycles for GetPercentSSE
126 cycles for GetPercent
42 cycles for GetPercent2c
43 cycles for GetPercent2nc
45 cycles for GetPercentJJ1
45 cycles for GetPercentJJ2
31 cycles for GetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
37 bytes for GetPercentInt, result=6790122
--- ok ---
On my Office PC:
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
13 cycles for GetPercentSSE
36 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
11 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
8 cycles for GetPercentInt
13 cycles for GetPercentSSE
36 cycles for GetPercent
8 cycles for GetPercent2c
9 cycles for GetPercent2nc
11 cycles for GetPercentJJ1
10 cycles for GetPercentJJ2
8 cycles for GetPercentInt
Code sizes:
39 bytes for GetPercentSSE, result=6790123
36 bytes for GetPercent, result=-2147483648
41 bytes for GetPercent2c, result=-2147483648
45 bytes for GetPercent2nc, result=6790123
37 bytes for GetPercentJJ1, result=6790123
32 bytes for GetPercentJJ2, result=6790123
37 bytes for GetPercentInt, result=6790122
--- ok ---
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
21 cycles for GetPercentSSE
49 cycles for GetPercent
24 cycles for GetPercent2c
26 cycles for GetPercent2nc
24 cycles for GetPercentJJ1
20 cycles for GetPercentJJ2
15 cycles for GetPercentInt
For something so simple and short it makes more sense to inline it.
fild percentage
fimul value
fmul OneHundredth
fistp result
This only uses 1 FPU register and always gives the right result and it's probably faster than any of the other methods as well. The only reason it doesn't appear as fast is the messing around to transfer the result from the FPU to EAX but if you inline it then that's not needed.
Paul.
Here is posted new GetPercentInt with using of IMUL only: "http://www.masm32.com/board/index.php?topic=15263.msg127002#msg127002"