The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on November 08, 2010, 09:25:39 PM

Title: IMUL and flags
Post by: jj2007 on November 08, 2010, 09:25:39 PM
Here is an algo that yields the share of source expressed in % - a variant of the Masm32 library GetPercent function discussed here (http://www.masm32.com/board/index.php?topic=15263.0).

My problem is that I do not understand why the two options below do not behave identically. If the result of the multiplication does not fit into eax alone, edx will be nonzero, and the carry and overflow flags should be set. That is my understanding of the not so clear documentation.

Where am I wrong?

QuoteGetPercentInt proc source:DWORD, percent:DWORD
   mov eax, [esp+8]   ; source
   imul dword ptr [esp+4]   ; multiply first with percent, result in edx::eax, then div 100
   if 0
      jc @F   ; fails although carry flag should be set if edx!=0
   else
      test edx, edx   ; works fine
      jne @F
   endif
   inc eax
   mov edx, 0a3d70a3dh   ; Agner Fog
   mul edx
   shr edx, 6
   mov eax, edx
   ret 8
@@:
   push 100
   idiv dword ptr [esp]   ; yields correct result but slow
   pop edx
   ret 8
GetPercentInt endp
Title: Re: IMUL and flags
Post by: ToutEnMasm on November 08, 2010, 09:41:21 PM

to made a signed multiply you need to modify the proc as this

GetPercentInt PROTO :Sdword,:Sdword

other suggest ,masm optimize very well this sort of test,you can verify it by used of:

.if ZERO?
...
.endif
Title: Re: IMUL and flags
Post by: redskull on November 08, 2010, 09:45:01 PM
don't you want esp+8 and esp+C, not 4 and 8?  +4 is the return address, which will never fit in just one register

EDIT - on second thought, it probably would.
-r
Title: Re: IMUL and flags
Post by: jj2007 on November 08, 2010, 09:49:21 PM
red,
+4 is arg1. The proggie works fine, and is blazing fast, on a P4 faster than the SSE2 version, but I'd like to understand why jc @F does not work.
Title: Re: IMUL and flags
Post by: jj2007 on November 08, 2010, 09:50:54 PM
Quote from: ToutEnMasm on November 08, 2010, 09:41:21 PM

to made a signed multiply you need to modify the proc as this

GetPercentInt PROTO :Sdword,:Sdword


Yves, there is no stack frame.
Title: Re: IMUL and flags
Post by: theunknownguy on November 08, 2010, 09:55:29 PM
Quote from: jj2007 on November 08, 2010, 09:25:39 PM
Here is an algo that yields the share of source expressed in % - a variant of the Masm32 library GetPercent function discussed here (http://www.masm32.com/board/index.php?topic=15263.0).

My problem is that I do not understand why the two options below do not behave identically. If the result of the multiplication does not fit into eax alone, edx will be nonzero, and the carry and overflow flags should be set. That is my understanding of the not so clear documentation.

Where am I wrong?

QuoteGetPercentInt proc source:DWORD, percent:DWORD
   mov eax, [esp+8]   ; source
   imul dword ptr [esp+4]   ; multiply first with percent, result in edx::eax, then div 100
   if 0
      jc @F   ; fails although carry flag should be set if edx!=0
   else
      test edx, edx   ; works fine
      jne @F
   endif
   inc eax
   mov edx, 0a3d70a3dh   ; Agner Fog
   mul edx
   shr edx, 6
   mov eax, edx
   ret 8
@@:
   push 100
   idiv dword ptr [esp]   ; yields correct result but slow
   pop edx
   ret 8
GetPercentInt endp

Oh nvm is what ToutEnASM give on its description  :lol

Title: Re: IMUL and flags
Post by: ToutEnMasm on November 08, 2010, 09:57:39 PM
No proto Don't be stopped by this:

Quote
GetPercentInt proc source:SDWORD, percent:SDWORD

Explain for cf and Of flags cleared are here

Quote
Multiplies two signed operands. The number of operands determines the form of the instruction.
If a single operand is specified, the instruction multiplies the value in the specified general-purpose
register or memory location by the value in the AL, AX, EAX, or RAX register (depending on the
operand size) and stores the product in AX, DX:AX, EDX:EAX, or RDX:RAX, respectively.
If two operands are specified, the instruction multiplies the value in a general-purpose register (first
operand) by an immediate value or the value in a general-purpose register or memory location (second
operand) and stores the product in the first operand location.
If three operands are specified, the instruction multiplies the value in a general-purpose register or
memory location (second operand), by an immediate value (third operand) and stores the product in a
register (first operand).
The IMUL instruction sign-extends an immediate operand to the length of the other register/memory
operand.
The CF and OF flags are set if, due to integer overflow, the double-width multiplication result cannot
be represented in the half-width destination register. Otherwise the CF and OF flags are cleared.
Title: Re: IMUL and flags
Post by: Antariy on November 08, 2010, 09:59:49 PM
Quote from: jj2007 on November 08, 2010, 09:49:21 PM
blazing fast, on a P4 faster than the SSE2 versiondoes not work.

Jochen, it is very slow. ??? On your CPU it is faster than others? On mine - it slower that others. integer division very slow anyway.



Alex
Title: Re: IMUL and flags
Post by: jj2007 on November 08, 2010, 10:03:50 PM
Quote from: Antariy on November 08, 2010, 09:59:49 PM
Quote from: jj2007 on November 08, 2010, 09:49:21 PM
blazing fast, on a P4 faster than the SSE2 versiondoes not work.

Jochen, it is very slow. ??? On your CPU it is faster than others? On mine - it slower that others. integer division very slow anyway.

Just run the executable in post #1, Alex. It is damn fast.
QuoteIntel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11      cycles for GetPercentSSE
36      cycles for GetPercent
14      cycles for GetPercent2c
14      cycles for GetPercent2nc
15      cycles for GetPercentJJ1
14      cycles for GetPercentJJ2
12      cycles for GetPercentInt

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
21      cycles for GetPercentSSE
16      cycles for GetPercentInt

@theunknownguy: if 0 means conditional assembly, that's why you don't see it.
Title: Re: IMUL and flags
Post by: theunknownguy on November 08, 2010, 10:07:51 PM
Quote from: jj2007 on November 08, 2010, 10:03:50 PM
Quote from: Antariy on November 08, 2010, 09:59:49 PM
Quote from: jj2007 on November 08, 2010, 09:49:21 PM
blazing fast, on a P4 faster than the SSE2 versiondoes not work.

Jochen, it is very slow. ??? On your CPU it is faster than others? On mine - it slower that others. integer division very slow anyway.

Just run the executable in post #1, Alex. It is damn fast.

@theunknownguy: if 0 means conditional assembly, that's why you don't see it.

Rolf sorry i read it fast (its late here):

7FFFC350 * 0 = CF (0) OF (0)
7FFFC350 * 1 = CF (0) OF (0)
7FFFC350 * >1 = CF (1) OF (1)

The CF and OF flags are set if, due to integer overflow, the double-width multiplication result cannot
be represented in the half-width destination register. Otherwise the CF and OF flags are cleared.


ToutEnASM explain...

Thats all i can do for today lol at this hour i am watching little ASM goblins on my bead  :lol
Title: Re: IMUL and flags
Post by: Antariy on November 08, 2010, 10:13:08 PM
Quote from: jj2007 on November 08, 2010, 10:03:50 PM
Just run the executable in post #1, Alex. It is damn fast.
It fast until result is fit into 32bit after imul, otherwise...
Title: Re: IMUL and flags
Post by: Antariy on November 08, 2010, 10:22:37 PM
Quote from: jj2007 on November 08, 2010, 10:03:50 PM
Just run the executable in post #1, Alex. It is damn fast.

Run it with getting of 50% of the number 85899346 ;)

I got this:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
22      cycles for GetPercentSSE
49      cycles for GetPercent
25      cycles for GetPercent2c
27      cycles for GetPercent2nc
25      cycles for GetPercentJJ1
21      cycles for GetPercentJJ2
53      cycles for GetPercentInt

22      cycles for GetPercentSSE
49      cycles for GetPercent
24      cycles for GetPercent2c
27      cycles for GetPercent2nc
25      cycles for GetPercentJJ1
21      cycles for GetPercentJJ2
53      cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122


GetPercent(2c) have nice results.
Note: I have changed only Algo macro, so the CodeSize "result" is not affected.
Title: Re: IMUL and flags
Post by: Antariy on November 08, 2010, 10:26:38 PM
For this short algo I get 42 clocks with nubers mentinioned in previous post:


align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
AxGetPercent proc source:DWORD, percent:DWORD
mov ecx,100
mov eax,[esp+4]
imul dword ptr [esp+8]
idiv ecx
ret 8
AxGetPercent endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


Edited: code size is something about 19-20 bytes, I guess.



Alex
Title: Re: IMUL and flags
Post by: jj2007 on November 08, 2010, 11:42:50 PM
 :bg
AxGetPercent proc source:DWORD, percent:DWORD
AxVersion = 2
if AxVersion eq 1
push 100 ; 17 bytes, 39 cycles, preserves ecx
mov eax, [esp+8]
imul dword ptr [esp+12]
idiv dword ptr [esp]
pop edx
ret 8
elseif AxVersion eq 2
pop ecx ; only 13 bytes, 58 cycles, trashes ecx
pop eax
pop edx
imul edx
push 100
idiv dword ptr [esp]
pop edx
jmp ecx
else
mov ecx, 100 ; 18 bytes, 37 cycles, trashes ecx
mov eax, [esp+4]
imul dword ptr [esp+8]
idiv ecx
ret 8
endif
AxGetPercent endp
Title: Re: IMUL and flags
Post by: Antariy on November 08, 2010, 11:49:39 PM

elseif AxVersion eq 2
pop ecx ; only 13 bytes, 58 cycles, trashes ecx
pop eax
pop edx
push ecx
imul edx
mov ecx,100
idiv ecx
ret
else


~14 bytes, 3 clocks faster  :bg
but still many clocks slower than esp-frame based
Title: Re: IMUL and flags
Post by: jj2007 on November 09, 2010, 12:08:56 AM
Quote from: theunknownguy on November 08, 2010, 10:07:51 PM
Rolf Jochen sorry i read it fast (its late here):
7FFFC350 * 0 = CF (0) OF (0)
7FFFC350 * 1 = CF (0) OF (0)
7FFFC350 * >1 = CF (1) OF (1)

Thanks for pointing me to this. What I see in Olly is interesting, and contradicts the documentation in various places, although the Intel manual might indicate the behaviour correctly:

QuoteThe CF and OF flags are set when significant bit (including the sign bit) are carried
into the upper half of the result. The CF and OF flags are cleared when the result
(including the sign bit) fits exactly in the lower half of the result.

Since that is cryptic, here what happens:
Quoteeax=7FFF0000h, dword ptr [esp+4]=
1
imul dword ptr [esp+4] ->edx=0, CF=0
2
imul dword ptr [esp+4] ->edx=0, CF=1

In other words: When multiplying by 2, edx is still zero (no significant bits shifted into edx) but the carry flag is already set!
Now only Intel knows what "sign bit are carried into the upper half of the result" really means ::)
Title: Re: IMUL and flags
Post by: jj2007 on November 09, 2010, 12:11:54 AM
Quote from: Antariy on November 08, 2010, 11:49:39 PM
~14 bytes, 3 clocks faster  :bg
but still many clocks slower than esp-frame based

38 cycles, 14 bytes - very good if speed is not the highest priority! But m2m ecx, 100 makes it 37 cycles and 12 bytes on my CPU :bg
Title: Re: IMUL and flags
Post by: dioxin on November 09, 2010, 12:15:26 AM
The rule for IMUL is that if the result is entirely contained in the lower register then the flags are cleared.
If the result of the multiplication sets EAX MSB then, as this is signed multiplication, the result in EAX is negative but the real result in EDX:EAX may be positive so the full result is not contained entirely in EAX as the value in EDX must be looked at to determine the full result.


If you look carefully at the algorithm used by IMUL as described in the Intel manual, the description has changed (presumably, because the earlier version had it wrong).


The version of the manual from 1997 shows this:

QuoteEDX:EAX ¬ EAX * SRC (* signed multiplication *)
IF ((EDX = 00000000H) OR (EDX = FFFFFFFFH))
THEN CF = 0; OF = 0;
ELSE CF = 1; OF = 1;
FI;


The version of the manual from May 2007 shows this:

QuoteEDX:EAX ← EAX ∗ SRC (* Signed multiplication *)
IF EAX = EDX:EAX
THEN CF ← 0; OF ← 0;
ELSE CF ← 1; OF ← 1; FI;


Note that it's no longer described as EDX=0 clears the flags but EAX = EDX:EAX clears the flags. Since these are treated as signed numbers, if the multiply sets the sign bit of EAX but there is no carry to EDX then EAX does not equal EDX:EAX so the carry and overflow flags will be set even though EDX = 0.


You can, of course, avoid the problem if you use the unsigned MUL instruction, but this may not suit your needs.

Paul.
Title: Re: IMUL and flags
Post by: dioxin on November 09, 2010, 12:26:50 AM
AMD Phenom(tm) II X4 945 Processor (SSE3)
9       cycles for GetPercentSSE
21      cycles for GetPercent
11      cycles for GetPercent2c
14      cycles for GetPercent2nc
11      cycles for GetPercentJJ1
12      cycles for GetPercentJJ2
11      cycles for GetPercentInt

9       cycles for GetPercentSSE
21      cycles for GetPercent
11      cycles for GetPercent2c
14      cycles for GetPercent2nc
11      cycles for GetPercentJJ1
12      cycles for GetPercentJJ2
11      cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122

--- ok ---
Title: Re: IMUL and flags
Post by: jj2007 on November 09, 2010, 12:38:14 AM
Quote from: dioxin on November 09, 2010, 12:15:26 AM
The rule for IMUL is ...

Thanks a lot, Paul. This is indeed pretty confusing :(

With [esp]=80001170h:
eax=1, imul dword ptr [esp]:   CF=0, edx=-1, eax=80001170 (both eax and edx negative)
eax=2, imul dword ptr [esp]:   CF=1, edx=-1, eax=000022E0 (edx negative, eax positive)
eax=3, imul dword ptr [esp]:   CF=1, edx=-2, eax=80003450 (both eax and edx negative)

Fortunately, the algo works fine with test edx, edx...
Title: Re: IMUL and flags
Post by: Antariy on November 09, 2010, 12:45:14 AM
Quote from: jj2007 on November 09, 2010, 12:38:14 AM
Fortunately, the algo works fine with test edx, edx...

It will work fast 50/50. Only if percents and/or numbers small.



Alex
Title: Re: IMUL and flags
Post by: jj2007 on November 09, 2010, 12:55:55 AM
Quote from: Antariy on November 09, 2010, 12:45:14 AM
Quote from: jj2007 on November 09, 2010, 12:38:14 AM
Fortunately, the algo works fine with test edx, edx...

It will work fast 50/50. Only if percents and/or numbers small.

Depends on how you define "small": 100% of 21,474,836 is still in the fast range. For most practical purposes, this will be more than sufficient. For the rare cases, it's 40 cycles, 3 more than the original Masm32 library algo.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11      cycles for GetPercentSSE
12      cycles for GetPercentInt
37      cycles for GetPercent
14      cycles for GetPercent2c
15      cycles for GetPercent2nc
14      cycles for GetPercentJJ2
37      cycles for AxGetPercent
Title: Re: IMUL and flags
Post by: Antariy on November 09, 2010, 12:59:37 AM
Quote from: jj2007 on November 09, 2010, 12:55:55 AM
Depends on how you define "small": 100% of 21,474,836 is still in the fast range. For most practical purposes, this will be more than sufficient. For the rare cases, it's 40 cycles, 4 more than the original Masm32 library algo.

I not sayed that algo is bad - just it is speedy in not general way.

But this is dilemma: if use algo with small (relatively) numbers - for example: calculate screen coordinates - then such speed is not needed. If algo would be used in real world with possibilities of any numbers and/or percents - that is not speed algo. It is just interesting algo.
Title: Re: IMUL and flags
Post by: clive on November 09, 2010, 03:13:14 AM
Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
42      cycles for GetPercentSSE
126     cycles for GetPercent
42      cycles for GetPercent2c
43      cycles for GetPercent2nc
45      cycles for GetPercentJJ1
46      cycles for GetPercentJJ2
33      cycles for GetPercentInt

41      cycles for GetPercentSSE
126     cycles for GetPercent
42      cycles for GetPercent2c
43      cycles for GetPercent2nc
45      cycles for GetPercentJJ1
45      cycles for GetPercentJJ2
31      cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122

--- ok ---
Title: Re: IMUL and flags
Post by: frktons on November 09, 2010, 10:37:52 AM
On my Office PC:

Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
13      cycles for GetPercentSSE
36      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
11      cycles for GetPercentJJ1
10      cycles for GetPercentJJ2
8       cycles for GetPercentInt

13      cycles for GetPercentSSE
36      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
11      cycles for GetPercentJJ1
10      cycles for GetPercentJJ2
8       cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122

--- ok ---
Title: Re: IMUL and flags
Post by: dedndave on November 09, 2010, 10:57:23 AM
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
21      cycles for GetPercentSSE
49      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
24      cycles for GetPercentJJ1
20      cycles for GetPercentJJ2
15      cycles for GetPercentInt
Title: Re: IMUL and flags
Post by: dioxin on November 09, 2010, 04:39:30 PM
For something so simple and short it makes more sense to inline it.
fild percentage
fimul value
fmul OneHundredth
fistp result 


This only uses 1 FPU register and always gives the right result and it's probably faster than any of the other methods as well. The only reason it doesn't appear as fast is the messing around to transfer the result from the FPU to EAX but if you inline it then that's not needed.


Paul.
Title: Re: IMUL and flags
Post by: Antariy on December 01, 2010, 02:03:18 PM
Here is posted new GetPercentInt with using of IMUL only: "http://www.masm32.com/board/index.php?topic=15263.msg127002#msg127002"