IMUL and flags

Started by jj2007, November 08, 2010, 09:25:39 PM

jj2007

Quote from: theunknownguy on November 08, 2010, 10:07:51 PM
Rolf Jochen, sorry, I read it too fast (it's late here):
7FFFC350 * 0 = CF (0) OF (0)
7FFFC350 * 1 = CF (0) OF (0)
7FFFC350 * >1 = CF (1) OF (1)

Thanks for pointing me to this. What I see in Olly is interesting, and contradicts the documentation in various places, although the Intel manual might indicate the behaviour correctly:

Quote
The CF and OF flags are set when significant bits (including the sign bit) are carried
into the upper half of the result. The CF and OF flags are cleared when the result
(including the sign bit) fits exactly in the lower half of the result.

Since that is cryptic, here is what happens:
Quote
eax=7FFF0000h, dword ptr [esp+4]=1: imul dword ptr [esp+4] -> edx=0, CF=0
eax=7FFF0000h, dword ptr [esp+4]=2: imul dword ptr [esp+4] -> edx=0, CF=1

In other words: When multiplying by 2, edx is still zero (no significant bits shifted into edx) but the carry flag is already set!
Now only Intel knows what "sign bit are carried into the upper half of the result" really means ::)

jj2007

Quote from: Antariy on November 08, 2010, 11:49:39 PM
~14 bytes, 3 clocks faster  :bg
but still many clocks slower than esp-frame based

38 cycles, 14 bytes - very good if speed is not the highest priority! But m2m ecx, 100 makes it 37 cycles and 12 bytes on my CPU :bg

dioxin

The rule for IMUL is that if the result is entirely contained in the lower register then the flags are cleared.
If the multiplication sets the MSB of EAX then, as this is signed multiplication, EAX on its own reads as negative while the real result in EDX:EAX may be positive. In that case the full result is not contained entirely in EAX, because the value in EDX must be looked at to determine it.


If you look carefully at the algorithm used by IMUL as described in the Intel manual, the description has changed (presumably, because the earlier version had it wrong).


The version of the manual from 1997 shows this:

Quote
EDX:EAX ← EAX * SRC (* signed multiplication *)
IF ((EDX = 00000000H) OR (EDX = FFFFFFFFH))
THEN CF = 0; OF = 0;
ELSE CF = 1; OF = 1;
FI;


The version of the manual from May 2007 shows this:

Quote
EDX:EAX ← EAX ∗ SRC (* Signed multiplication *)
IF EAX = EDX:EAX
THEN CF ← 0; OF ← 0;
ELSE CF ← 1; OF ← 1; FI;


Note that the condition for clearing the flags is no longer that EDX is 0 (or FFFFFFFFh) but that EAX = EDX:EAX, i.e. that the sign-extended EAX equals the full 64-bit result. Since these are treated as signed numbers, if the multiply sets the sign bit of EAX but there is no carry into EDX, then EAX does not equal EDX:EAX, so the carry and overflow flags are set even though EDX = 0.
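
To make the difference between the two manual versions concrete, here is a small Python model of the one-operand 32-bit IMUL. It is only a sketch of the pseudocode quoted above, not a statement about any particular CPU; the helper names (to_signed, imul32, flags_1997, flags_2007) are made up for this illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def imul32(eax, src):
    """Model one-operand 32-bit IMUL; returns (edx, eax) as unsigned 32-bit values."""
    product = (to_signed(eax, 32) * to_signed(src, 32)) & MASK64
    return product >> 32, product & MASK32


def flags_1997(edx, eax):
    """1997 wording: CF/OF are cleared when EDX is all zeros or all ones."""
    return 0 if edx in (0x00000000, 0xFFFFFFFF) else 1


def flags_2007(edx, eax):
    """May 2007 wording: CF/OF are cleared when sign-extended EAX equals EDX:EAX."""
    full = to_signed((edx << 32) | eax, 64)
    return 0 if to_signed(eax, 32) == full else 1


for multiplier in (1, 2):
    edx, eax = imul32(0x7FFF0000, multiplier)
    print(f"*{multiplier}: edx={edx:08X} eax={eax:08X} "
          f"CF(1997)={flags_1997(edx, eax)} CF(2007)={flags_2007(edx, eax)}")
```

For 7FFF0000h * 2 the model gives edx=00000000 and eax=FFFE0000: the 1997 rule would leave the flags clear, while the 2007 rule sets them, which matches what Olly showed earlier in the thread.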


You can, of course, avoid the problem if you use the unsigned MUL instruction, but this may not suit your needs.
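
For comparison, a minimal sketch of the unsigned case, again in Python and with a made-up helper name (mul32). It assumes the documented MUL behaviour that CF and OF are set exactly when the upper half of the product is non-zero.

```python
def mul32(eax, src):
    """Model one-operand 32-bit MUL; returns (edx, eax, carry)."""
    product = (eax & 0xFFFFFFFF) * (src & 0xFFFFFFFF)
    edx, low = product >> 32, product & 0xFFFFFFFF
    # Unsigned multiply: the flags only report whether EDX is non-zero.
    return edx, low, int(edx != 0)


edx, low, carry = mul32(0x7FFF0000, 2)
print(f"edx={edx:08X} eax={low:08X} CF={carry}")
```

With the same operands as above (7FFF0000h * 2) the carry stays clear, because the result FFFE0000h is simply a large unsigned number that still fits in EAX.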

Paul.

dioxin

AMD Phenom(tm) II X4 945 Processor (SSE3)
9       cycles for GetPercentSSE
21      cycles for GetPercent
11      cycles for GetPercent2c
14      cycles for GetPercent2nc
11      cycles for GetPercentJJ1
12      cycles for GetPercentJJ2
11      cycles for GetPercentInt

9       cycles for GetPercentSSE
21      cycles for GetPercent
11      cycles for GetPercent2c
14      cycles for GetPercent2nc
11      cycles for GetPercentJJ1
12      cycles for GetPercentJJ2
11      cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122

--- ok ---

jj2007

Quote from: dioxin on November 09, 2010, 12:15:26 AM
The rule for IMUL is ...

Thanks a lot, Paul. This is indeed pretty confusing :(

With [esp]=80001170h:
eax=1, imul dword ptr [esp]:   CF=0, edx=-1, eax=80001170 (both eax and edx negative)
eax=2, imul dword ptr [esp]:   CF=1, edx=-1, eax=000022E0 (edx negative, eax positive)
eax=3, imul dword ptr [esp]:   CF=1, edx=-2, eax=80003450 (both eax and edx negative)
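
The three rows above can be reproduced with a short Python model. This is a sketch under the assumption that CF follows the "EAX = EDX:EAX" rule from the 2007 manual, expressed here as "EDX must be the sign fill of EAX"; the function name imul32_with_carry is hypothetical.

```python
def imul32_with_carry(eax, src):
    """Model one-operand 32-bit IMUL; returns (edx, eax, carry)."""
    def signed32(value):
        value &= 0xFFFFFFFF
        return value - 0x100000000 if value & 0x80000000 else value

    product = (signed32(eax) * signed32(src)) & 0xFFFFFFFFFFFFFFFF
    edx, low = product >> 32, product & 0xFFFFFFFF
    # The result fits in EAX only if EDX is a copy of EAX's sign bit.
    sign_fill = 0xFFFFFFFF if low & 0x80000000 else 0x00000000
    return edx, low, int(edx != sign_fill)


for eax in (1, 2, 3):
    edx, low, carry = imul32_with_carry(eax, 0x80001170)
    print(f"eax={eax}: CF={carry}, edx={edx:08X}, eax={low:08X}")
```

In the second row edx is FFFFFFFFh (-1) but eax is positive, so edx is not the sign fill of eax and the carry is set even though edx looks like an "ordinary" sign extension.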

Fortunately, the algo works fine with test edx, edx...

Antariy

Quote from: jj2007 on November 09, 2010, 12:38:14 AM
Fortunately, the algo works fine with test edx, edx...

It will work fast 50/50: only if the percentages and/or numbers are small.



Alex

jj2007

Quote from: Antariy on November 09, 2010, 12:45:14 AM
Quote from: jj2007 on November 09, 2010, 12:38:14 AM
Fortunately, the algo works fine with test edx, edx...

It will work fast 50/50: only if the percentages and/or numbers are small.

Depends on how you define "small": 100% of 21,474,836 is still in the fast range. For most practical purposes, this will be more than sufficient. For the rare cases, it's 40 cycles, 3 more than the original Masm32 library algo.
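
The 21,474,836 figure can be checked with a few lines of Python. The sketch assumes that "fast range" means the product value * percent still fits in a signed 32-bit register; the function name largest_fast_value is invented for the example.

```python
INT32_MAX = 2**31 - 1


def largest_fast_value(percent):
    """Largest value whose product with percent still fits in a signed 32-bit integer."""
    return INT32_MAX // percent


limit = largest_fast_value(100)
print(limit)                          # 21474836
print(limit * 100)                    # 2147483600, still below 2**31
print((limit + 1) * 100 > INT32_MAX)  # True: the next value overflows
```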

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11      cycles for GetPercentSSE
12      cycles for GetPercentInt
37      cycles for GetPercent
14      cycles for GetPercent2c
15      cycles for GetPercent2nc
14      cycles for GetPercentJJ2
37      cycles for AxGetPercent

Antariy

Quote from: jj2007 on November 09, 2010, 12:55:55 AM
Depends on how you define "small": 100% of 21,474,836 is still in the fast range. For most practical purposes, this will be more than sufficient. For the rare cases, it's 40 cycles, 3 more than the original Masm32 library algo.

I didn't say that the algo is bad, just that it is not fast in the general case.

But this is the dilemma: if the algo is used with (relatively) small numbers, for example to calculate screen coordinates, then such speed is not needed. If the algo is used in the real world, where any numbers and/or percentages are possible, then it is not a fast algo. It is just an interesting algo.

clive

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
42      cycles for GetPercentSSE
126     cycles for GetPercent
42      cycles for GetPercent2c
43      cycles for GetPercent2nc
45      cycles for GetPercentJJ1
46      cycles for GetPercentJJ2
33      cycles for GetPercentInt

41      cycles for GetPercentSSE
126     cycles for GetPercent
42      cycles for GetPercent2c
43      cycles for GetPercent2nc
45      cycles for GetPercentJJ1
45      cycles for GetPercentJJ2
31      cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122

--- ok ---

frktons

On my Office PC:

Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
13      cycles for GetPercentSSE
36      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
11      cycles for GetPercentJJ1
10      cycles for GetPercentJJ2
8       cycles for GetPercentInt

13      cycles for GetPercentSSE
36      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
11      cycles for GetPercentJJ1
10      cycles for GetPercentJJ2
8       cycles for GetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
37      bytes for GetPercentInt, result=6790122

--- ok ---

dedndave

prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
21      cycles for GetPercentSSE
49      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
24      cycles for GetPercentJJ1
20      cycles for GetPercentJJ2
15      cycles for GetPercentInt

dioxin

For something so simple and short it makes more sense to inline it.
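fild percentage      ; ST(0) = percentage
fimul value          ; ST(0) = percentage * value
fmul OneHundredth    ; ST(0) = percentage * value * OneHundredth (presumably 0.01)
fistp result         ; store as integer (current FPU rounding mode) and pop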


This uses only one FPU register and always gives the right result, and it's probably faster than any of the other methods as well. The only reason it doesn't appear as fast is the overhead of transferring the result from the FPU to EAX; if you inline it, that transfer isn't needed.
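
As a rough illustration of what the four FPU instructions compute, here is a Python sketch. It assumes OneHundredth holds 0.01 and that the FPU is in its default round-to-nearest-even mode (which Python's round() also uses for ties); Python floats are 64-bit doubles rather than the 80-bit x87 format, so this is an approximation, and both function names are hypothetical.

```python
def percent_fpu_style(value, percentage):
    """Mimic fild / fimul / fmul 0.01 / fistp: multiply, scale, round to nearest."""
    return round(percentage * value * 0.01)


def percent_truncating(value, percentage):
    """Plain integer version for comparison: the fraction is dropped."""
    return (percentage * value) // 100


print(percent_fpu_style(200, 50), percent_truncating(200, 50))
print(percent_fpu_style(199, 1), percent_truncating(199, 1))
```

The second call shows where a rounding and a truncating implementation can differ by one: 1% of 199 is 1.99, which rounds to 2 but truncates to 1.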


Paul.

Antariy