Hi all,
In the last days I've written an tool for determining the "magic number" for unsigned and signed division of 64, 32 and 16 bit numbers.
The algorithm based on c-source that can be found in AMD's "Software Optimization Guide".
I've test the program randomly and it looks like (for me :bg) that it works propper - but there is no guarantee at the moment.
Hope people here can help me to improve the program - so, feel free and post helpful suggestions, criticism and of course failure reports.
In the attached file you found the executable and source.
regards, qWord
-----
28-09-2009: new version attached
02-10-2009: fixed some bugs - new version uploaded
[attachment deleted by admin]
This is great!
I also wanted to write that sort of program ( damn you :P), but been working on other things and just wrote this Mathematica function:
(based on Agner Fog code)
<<Developer`
MagicDiv64[d_]:=
If[FractionalPart[2^(64+Floor[Log[2,d]])/d]>=0.5,
Print[BaseForm[Round[2^(64+Round[Log[2,d]])/d],16],"<-"],
Print[BaseForm[Floor[2^(64+Floor[Log[2,d]])/d],16]," add 1"]];
And i have also compiled programs from AMD manual :)
About your program:
- i would like to see an editbox instead of a static for display
- make the startup position : centerscreen :)
Code Suggestion:
- Log2(N) == bsr eax,N (count leading zeros)
- use also the mu64l,div64,add64,sub64 from AMD manual :)
Anyway it was time to ditch the old The_Svin "Magic divider" prog :)
hi drizz,
Quote from: drizz on September 18, 2008, 03:23:42 PM
I also wanted to write that sort of program ( damn you :P) ...
that's life :bdg
Quote
i would like to see an editbox instead of a static for display
I've replace the static with an RichEdit, so the user can select and copy code from it.
Quote
- make the startup position : centerscreen :)
-> done :U
Log[sub]2[/sub](N) == bsr eax,N
OMG , i should know this :bg
The next days i will rewrite some part of the mathematic routins.
An updated version is attached in my first post.
regards qWord
Good work qWord.
Recently, I had a more in-depth look at the MagicNumber64 source code. The original C code seemed to emanate from programmers such as those from MS using MMX registers to add 2+2. Maybe the language used doesn't lend itself to anything better. However, I was a wee bit disappointed to see it translated "verbatim" to assembly, specially the section for 32-bit numbers.
For example, from the C code (where t is the source dword divided by 2 until it becomes an odd number)
j = 0xFFFFFFFF % t
where j is declared as a U64; also declared as a qword in assembly. However, it is obvious that the modulo of a dword cannot be any larger than a dword.
That j is then used to compute k (also declared as a U64 in C and a qword in assembly)
k = (((U64) (1)) << (32 + l)) / ((U64) (0xffffffff - j))
From the above description of j, it cannot obviously ever be larger than 0xffffffff and the Cflag can never be set by the subtraction. But the assembly code is:
mov DWORD ptr tmp[0],-1
mov eax,DWORD ptr j
sub DWORD ptr tmp[0],eax
mov eax,DWORD ptr j+4
sbb DWORD ptr tmp[0+4],eax
With a bit of analysis, I'm sure the entire code for the 32-bit divisors (and possibly for 16-bit divisors) could be simplified significantly. I will eventually have a crack at the 32-bit one.
Also, your displayed text for one of the cases looks as follows, having replaced the magic number and the shifting value by variable names:
; valid for: 0 <= X < 0ffffffffh
mov eax,X
mov edx,magic_number
inc eax
mul edx
shr edx,shift
; edx = quotient
; REMARKS:
; if it's possible that X gets 0ffffffffh then
; use following code:
; mov edx,X
; mov eax,magic_number
; mul edx
; add eax,magic_number
; adc edx,0
; shr edx,shift
The valid range and REMARKS could be eliminated with the following:
mov eax,X
mov edx,magic_number
add eax,1
.if !CARRY?
mul edx
.endif
shr edx,shift
; edx = quotient
yes - i have been using qWord's tool to generate numbers - although i am using some code i snagged off Drizz :bg
i find that some of the numbers may require incrementing, rather than incrementing the multiplicand as Ray suggested
something else that would be nice is a way to show the range of values, although, it is asking a bit much
what i mean is, for some applications, i am not using the full register width on the input
so, i can shr the magic number (sometimes incrementing it again) and reduce the number of shifts at the end to get the quotient
in cases where i have to use 32-bit multiple precision multiplication, this can save an ADC and sometimes the use of a register
in all cases, i am using Drizz's examples to further extend the result and extract the remainder, as well
Drizz has some neat tricks :U
i think he should go ahead and write one - lol
thanks for your contribution raymond,
The routines are 1:1 translate from c to asm - this is done so, because I was a bit scared to change the behave of the routine when changing datatypes or something else. I will clean up the code the next days and follow your suggestions. How ever, if you are working on a better routine ("cracked" one :bg) please inform me.
regards, qWord
Quoteplease inform me
I will certainly do. I should have a bit of time during the next few days to work on it. But, it will not be a "better" routine.
How does a magic number work exactly? i've seen lingo use them alot to speed things up, but i'm abit fuzzy as to what they are.
Quote from: E^cube on September 28, 2009, 06:26:48 AM
How does a magic number work exactly? i've seen lingo use them alot to speed things up, but i'm abit fuzzy as to what they are.
http://blogs.msdn.com/devdev/archive/2005/12/12/502980.aspx
awesome :U
nothing really "magic" E^Cube - lol
MUL is much faster than DIV
so - multiply by the inverse of the number you want to divide by
then - take the high order register as a result
d/l qWord's tool in the first post of this thread and try it out :U
Attached is the source code of my simplified version for computing the magic number for unsigned 32-bit divisors. I've left in place the list of LOCALs used by qWord, commenting out those which had become unused. I've also left the C code translated by qWord and added my comments for the coding I used.
It returns the values required by qWord's program in a struct declared in his global variables. If you are not familiar with his source code and its goal, the output may be useless.
I've checked the output of this simplified version with several random numbers and all were identical to qWord's results.
I've also written procedures to convert dwords (signed and unsigned) and unsigned qwords (signed qwords still to come) using magic numbers (the qword one works all the way up to -1 with results identical to those obtained with Window's calculator). I'll post those within a few days after I add required comments. I may also add those procedures to the Fpulib when I get the time to update the help file so that their description will be included.
wow - you have reduce the problem noticeable :U
looking your code I've found that the loops could be eliminated using bsr/bsf instruction:
; Reduce uDword until not divisible by 2
xor eax,eax ; EAX = n for the 2^n factor of uDword
@@:
shr ebx,1
jc @F ; break when too far
inc eax
dec ecx
jmp @B
@@: ;ECX = log2(t)
--------------------------------------------------
==>
mov edx,ecx ; save log2(uDword)
xor ecx,ecx
bsf ecx,ebx ; ecx = n
shr ebx,cl ; reduce uDword ==> t
mov eax,edx
sub eax,ecx ; log2(uDword)-n = log2(t)
mov n,ecx
mov ecx,eax ; ecx = log2(t)
the same for the last loop:
@@:
shr edx,1
jc @F
dec ecx
jmp @B
@@:
lea edx,[edx*2+1]
add eax,n
mov d32.m,edx
mov d32.s,eax
ret
-------------------------------------------------
==>
mov eax,ecx
xor ecx,ecx
bsf ecx,edx
shr edx,cl
sub eax,ecx
add eax,n
mov d32.m,edx
mov d32.s,eax
ret
.if eax != ebx
mov d32.a,0
@@:
dec L
mov ecx,eax ; m_low
mov edx,ebx ; m_high
shr eax,1
shr ebx,1
cmp eax,ebx
jnz @B
; EDX = magic number, L = shift
.else
with this loop I'm not sure (manly caused by 'jnz'):
m_low and m_high must be at least bit-equal with their highest bit, other wise this loop will never exit?
mov d32.a,0
dec L
mov edx,ebx ; save m_high to edx
xor eax,ebx ; make different bits "visible"
xor ecx,ecx
bsr ecx,eax ; get the position of highest unequal bit ...
shr edx,cl
sub L,ecx
I've test it and it seems like it is correct - how ever, I'm not sure with this one.
regards, qWord
EDIT: just striked out some babel
Your additional code reductions are great. Using them, let's continue reducing.
Starting from the very begining, how about:
mov ebx,uDword
bsr ecx,ebx ; get log2(uDword) in ECX
mov edx,ecx ; save log2(uDword)
;your code with some rearranging
; xor ecx,ecx zeroing ecx is not necessary, same with all your other bsr/bsf
bsf ecx,ebx ; ecx = n
mov eax,edx
sub eax,ecx ; log2(uDword)-n = log2(t)
.if ZERO? ;would have been a power of 2
mov d32.m,0
mov d32.s,edx
mov d32.a,3
ret
.endif
mov n,ecx
shr ebx,cl ; reduce uDword ==> t
mov ecx,eax ; ecx = log2(t)
Your other loop eliminations also look good. However, as I mentioned in the attachement, the very last loop could even be eliminated entirely, not being necessary anymore.
i am using xp pro (mce2005) sp2
the copy to clipboard does not work for me - am i doing something wrong ?
dedndave,
On WinVista the button "copy to clipboard" works (the same with Copy and Past: Ctrl +c/v). Maybe the problem is the rich edit control - I'm using WM_GETTEXT to receive text. I'm also noticed that there are problems with the edit-control (were you type in the divisor): copying a number to it using CTRL+v/c fails, but through the context-menu it works. I fix this issues in the next version.
cool - many thanks qWord
As promised, the attached zip file contains my four procedures to convert signed/unsigned dword/qword to ascii. They are in modular format ready for inclusion into a library. Or just copy/paste the procedure code.
You will notice that they all require the address of a buffer of adequate size to return the ascii string. Each procedure returns the address of the first byte of the string in the EAX register. Since conversions generate the digits in reverse order, this approach avoided code to place the string starting always at the first byte of the supplied buffer.
Dword to ascii algos have been beaten almost to death as to how fast they compare. I thus doubt that the attached ones would fare any better (nor much worse) than the best.
Qword to ascii algos have not been as studied. The attached ones may have an edge if anyone is interested in making comparisons. (Timings will obviously go down with 64-bit computers and related code.)
very nice Ray
qwords have been worked over pretty well, also - lol
Lingo, Drizz, and Jochen all have pretty fast algo's
i do like yours, though - they are nice and small and should time favorably :U
this is a link to Drizz's u64toa
http://www.masm32.com/board/index.php?topic=11741.msg88897#msg88897
as you can see, it is quite a bit longer than yours
Lingo modified a P. Dixon routine that has a look-up table
i think that is the fastest one around, but it is fairly large
as you know, i am working on a bignum routine that combines ling long kai fang (horner's rule) and multiply-to-divide
i may take a shot at a 64-bit version of that once i get finished - lol
here are some times...
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
2147483647
-2147483648
4294967295
9223372036854775807
-9223372032559808512
18446744073709551615
smdtoa (2147483647) 266 267 265 clock cycles
smdtoa (-2147483648) 265 266 275 clock cycles
umdtoa (4294967295) 227 231 233 clock cycles
smqtoa (9223372036854775807) 1192 1181 1183 clock cycles
smqtoa (-9223372036854775808) 1173 1173 1169 clock cycles
umqtoa (18446744073709551615) 1084 1056 1054 clock cycles
it looks like smqtoa has a problem displaying the correct digits for a negative value
(i know that feeling - lol - my llkf bignum had a similar problem)
(program and source attached)
this might be the problem...
test esi,esi
pushfd
.if SIGN?
not esi
neg ebx
.endif
here is one possible fix...
test esi,esi
pushfd
mov eax,esi
cdq
xor ebx,edx
xor esi,edx
sub ebx,edx
sbb esi,edx
the sign bit is extended into edx
then, if negative, invert both dwords and add 1
i applied that fix and smqtoa is now faster than umqtoa
Ray,
if you want to speed up both signed routines.....
something i learned while writing llkf bignum is that pushfd and popfd are slow
rather than pushing and poping the flags, extend the sign bit into edx with cdq
then, push the edx value
at the end, pop it and....
pop edx ;recall extended sign
mov al,2Dh ;minus sign "-"
and eax,edx
jz @F
dec edi
mov [edi],al
@@:
here are some times for comparison...
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
2147483647
-2147483648
4294967295
9223372036854775807
-9223372036854775808
18446744073709551615
smdtoa (2147483647) 197 197 198 clock cycles
smdtoa (-2147483648) 202 201 201 clock cycles
umdtoa (4294967295) 229 229 228 clock cycles
smqtoa (9223372036854775807) 1034 1032 1030 clock cycles
smqtoa (-9223372036854775808) 1018 1018 1049 clock cycles
umqtoa (18446744073709551615) 1057 1056 1055 clock cycles
(program and source attached)
Thanks Dave. Here's my correction for the stupid error I made negating the number. That's the standard way it should have been done in the first place.
.if SIGN?
not esi
not ebx
add ebx,1
adc esi,0
.endif
As for the pushfd, I'll take your word for it. However, I simply push the high dword and pop it at the end and retest it. Using cdq for the qword routine would have involved a few more instructions.
I've modified my atttachement in the previous post.
very good, Ray
did you time the new fixes ?
you should be able to use that program as a skeleton
I'm probably missing something here, but what do the last nine posts have to do with the original topic? Perhaps they should be moved to their own thread?
i think Ray wanted to demonstrate the advantages of using multiply-to-divide
but, you may be right - lol - oh well - i am sure that qWord won't mind
i have a good question for you guys...
is there some equation or algorithm that will calculate the minimum number of multiplier bits
required for some number of dividend bits and some number of divisor bits ?
it is not just to reduce the number of shifts,
but to determine how much precision is required for the multiplication for a given set of input parameters
i am going to try to write a simple program to test it for divisors that are multiples of 10
and see if i can "gleen" an equation - lol - we all know how well my equations work :lol
qWord ?
Ray ?
anybody ?
is this a puzzle for project euler ???
don't make me be the one to solve it :P
dedndave ,
Not sure if I have understand you right, but even if you can calculate such thinks it makes not much sense on x86, because we have only 3 (or 4 on x86-64) kinds of multiplication: 8,16 and 32bit. In generally I would say that the needed precision depends on your dividend, nothing more.
But manly raymond is the man that could give you a right and more detailed answer.
qWord
BTW: have you received my PM? I've upload an new version of MagicNumber64.exe - did the 'copy to clipboard'-button now works on your machine?
yes qWord - the clipboard works great :U
as for the resolution, i am working with some larger values that require multiple precision multiplication
at the moment, i am using double-double precision multiplication (64-bit x 64-bit)
i would like to reduce it to single-double precision (32-bit x 64-bit) if i can
there are too many possible input values to do a 100% value test (it would take centuries)
so, i have to show it works in theory if i want to reduce it
i have written a program that runs with divisors of 10 to 1,000,000,000 and reduces the multiplier
as soon as it is finished running, i will post the results (text file)
here is a simple program i wrote to test multipliers, with the resultant text output
for the purpose of trying to find an equation, the values that work all the way to FFFFFFFFh aren't helping - lol
i may make them multiple precision to see where they die
to proof your theory you can reduce it to 8bit x 16 bit. If this works, it will also work with 32bit x 64bit.
i don't think that is really true, qWord
i would like to use simple algebraic statements to determine an equation
the problem, as i see it, is this:
in BASIC, we had a function named INT()
it truncated the fractional part of real number into an integer
this is a natural function in computing, as registers, values, etc have a defined number of bits
when we determine a multiplier, we use this function, as most multipliers have bits that are truncated
also, when we multiply to divide, we use the function again because the multiplication step limits the low-order bits
in algebra, however, there are no simple "rules" for working with this function
if the INT() function appears in an algebraic equation, it is hard to manipulate/evaluate/reduce
perhaps we can somehow define it in a form like this:
R = W + F, where R = a real value, W = the whole part, F = the fractional part
if we can seperate them algebraically, we can manipulate them in an equation
the problem here is how to define algebraically where the whole part ends and the fractional part starts - lol
let me toss it around in my brain for a while
dedndave,
What qword said has to be true because the difference between the two is a constant factor of 2x2. In other words
(8*4)*(16*4)=32*64
Paul
well - it may be that i can derive an equation using smaller registers - that is true
Quoteto proof your theory you can reduce it to 8bit x 16 bit. If this works, it will also work with 32bit x 64bit.
the problem with qWords statement is....
i don't have a theory, yet :lol
If you are talking about multiplying an integer by a reciprocal instead of dividing by an integer, consider this:
- The size of the integer divisor is totally immaterial. Its reciprocal can be computed with an infinite number of decimal places with full precision. That reciprocal is a fraction, whether it be in decimal or binary.
- The number of significant digits (precision) in the product of an integer by a fraction will be dependant on the number of significant digits retained in the fraction.
In other words, if you have a 10 digit integer and you multiply it by a fraction having only 5 significant digits, the result will only be precise to 5 significant digits. The number of significant digits in the multiplying fraction must thus have as many significant digits as the integer being multiplied to result in full precision. (In binary, digit = bit)
Similarly, the precision of the product of an integer by a fraction will not be any more precise by using more digits than necessary in the fraction. As an example, if you need the circumference of a circle whose diameter is given with 3 significant digits, using a value of pi with 4 significant digits would be sufficient, The result would not be any more precise by using a value of pi with 30 significant digits.
that makes sense, Raymond
more or less - lol
so - if i have a 57 bit maximum dividend, i need a 57 bit multiplier ?
that makes it easy
what about this....
i have a 62 bit maximum dividend
after the multiplication, i shift right 29 bits (essentially, "unneccessary resolution")
i have a 33 bit maximum quotient
does this mean i can get away with a 33 bit multiplier ? (62-29=33)
if i understand what you are saying, it is the size of the quotient that determines the required multiplier resolution
(which would also mean - the divisor does make a difference - larger divisor -> smaller quotient)
this seems to make sense
i can speed up my routine by not using a full 64x64 multiplication
i will see what i can come up with and test it
ok - we have a theory, now - lol
not only should the multiplier have as many bits as the max quotient, but....
Dave's Theory #1 (assuming it will change - lol)
The minimum multiplier should be equal to or greater than the maximum expected quotient.
Hey Dave,
Good luck with your project. I am extremely interested.
btw - theories often give birth to theorems...
Paul
thanks Paul - well - "Dave's Multiplier Theory #1" looks great on paper
it may wind up on a roll in the bathroom - lol
S-o-o-o,
Either way it will prove useful; sounds like a win-win situation. ::)
Paul
ok - scratch Dave's Multiplier Theory #1 - lol
i have a good handle on it, now
i just need to state it mathematically
here is a case that never makes it to FFFF_FFFF_FFFF_FFFF
i am dividing by 1,000,000,000
i am using a multiplier of 8970_5F41_36B4_A598 and a shift of 29 (which is really /2^93)
HiA LoA
X HiB LoB
------------------
2^96 2^64 2^32 2^0
LoALoB1 LoALoB0
HiALoB1 HiALoB0
LoAHiB1 LoAHiB0
HiAHiB1 HiAHiB0
Multiplier: 89705F41_36B4A598 Shift: 29 Fail: AA4EA381_AE3507FF
A: 89705F41_36B4A598
B: AA4EA381_AE3507FF
carry: 1 2
253A1DA9 06F01A68
5D86D94C DFFEA8BF
2464C3F1 FAC73998
5B6ED745 FE1462C1
-------------------------------------------------
5B6ED746 80000000 00000000 06F01A68
5B6ED746_80000000 / 2^29 = 2_DB76_BA34
result: 2_DB76_BA34 (5B6ED746_80000000 / 2^29)
expected: 2_DB76_BA33 (AA4EA381_AE3507FF / 1,000,000,000)
in this specific application, my maximum dividend is 3B9A_CA03_4B82_FA17, so i am ok
but, the multiply-to-divide should work all the way to at least FFFF_FFFF_FFFF_FFFE
i can't make any of our statements about multiplier resolution work for me, here
here is the test program and data (it takes a while to run - lol)
notice that there is little or no logic in the failure pattern as i shift the multiplier right
Multiplier: 89705F41_36B4A598 Shift: 29 Fail: AA4EA381_AE3507FF
Multiplier: 44B82FA0_9B5A52CC Shift: 28 Fail: AA4EA381_AE3507FF
Multiplier: 225C17D0_4DAD2966 Shift: 27 Fail: AA4EA381_AE3507FF
Multiplier: 112E0BE8_26D694B3 Shift: 26 Fail: AA4EA381_AE3507FF
Multiplier: 089705F4_136B4A5A Shift: 25 Fail: 0F9B0B1C_F79815FF
Multiplier: 044B82FA_09B5A52D Shift: 24 Fail: 0F9B0B1C_F79815FF
Multiplier: 0225C17D_04DAD297 Shift: 23 Fail: 035E36C8_CDE0B3FF
Multiplier: 0112E0BE_826D694C Shift: 22 Fail: 014FB4D0_15340DFF
Multiplier: 0089705F_4136B4A6 Shift: 21 Fail: 014FB4D0_15340DFF
Multiplier: 0044B82F_A09B5A53 Shift: 20 Fail: 014FB4D0_15340DFF
Multiplier: 00225C17_D04DAD2A Shift: 19 Fail: 00390AF2_316B81FF
Multiplier: 00112E0B_E826D695 Shift: 18 Fail: 00390AF2_316B81FF
Multiplier: 00089705_F4136B4B Shift: 17 Fail: 000D340E_991957FF
Multiplier: 00044B82_FA09B5A6 Shift: 16 Fail: 00053444_56AAADFF
Multiplier: 000225C1_7D04DAD3 Shift: 15 Fail: 00053444_56AAADFF
Multiplier: 000112E0_BE826D6A Shift: 14 Fail: 0001852A_D5DF1BFF
Multiplier: 00008970_5F4136B5 Shift: 13 Fail: 0001852A_D5DF1BFF
Multiplier: 000044B8_2FA09B5B Shift: 12 Fail: 00006591_D3EA21FF
Multiplier: 0000225C_17D04DAE Shift: 11 Fail: 000028FC_FA86ADFF
Multiplier: 0000112E_0BE826D7 Shift: 10 Fail: 000028FC_FA86ADFF
Multiplier: 00000897_05F4136C Shift: 9 Fail: 00000C1B_356D35FF
Multiplier: 0000044B_82FA09B6 Shift: 8 Fail: 00000C1B_356D35FF
Multiplier: 00000225_C17D04DB Shift: 7 Fail: 00000C1B_356D35FF
Multiplier: 00000112_E0BE826E Shift: 6 Fail: 000001D3_0EEADBFF
Multiplier: 00000089_705F4137 Shift: 5 Fail: 000001D3_0EEADBFF
Multiplier: 00000044_B82FA09C Shift: 4 Fail: 0000006A_675299FF
Multiplier: 00000022_5C17D04E Shift: 3 Fail: 0000006A_675299FF
Multiplier: 00000011_2E0BE827 Shift: 2 Fail: 0000006A_675299FF
Multiplier: 00000008_9705F414 Shift: 1 Fail: 0000000E_E6B27FFF
Multiplier: 00000004_4B82FA0A Shift: 0 Fail: 0000000E_E6B27FFF
According to qWord's dialog box, the multiplier should be 89705F4136B4A597h (instead of 89705F4136B4A598h) and you then need to increment the dividend before the multiplication.
You then get the expected result.
thanks Ray
oh - i thought we were incrementing the multipliers - lol
let me modify the test program and run it again
running it now, Ray
thanks again
that's a little better - lol
of course, i can make the first one pass easy enough
this test checks each point just before and just after the quotient rollover
i want to add all the points where the low dword is 00000000 and FFFFFFFF to the sequence, as well
that will make for a fairly comprehensive 64-bit test
Multiplier: 89705F41_36B4A597 Shift: 29 Fail: FFFFFFFF_FFFFFFFF
Multiplier: 44B82FA0_9B5A52CB Shift: 28 Fail: 73348093_3BA6C600
Multiplier: 225C17D0_4DAD2965 Shift: 27 Fail: 2B0B3E03_EF940E00
Multiplier: 112E0BE8_26D694B2 Shift: 26 Fail: 131B7A69_5A7B4000
Multiplier: 089705F4_136B4A59 Shift: 25 Fail: 131B7A69_5A7B4000
Multiplier: 044B82FA_09B5A52C Shift: 24 Fail: 05ED06A0_F2511800
Multiplier: 0225C17D_04DAD296 Shift: 23 Fail: 05ED06A0_F2511800
Multiplier: 0112E0BE_826D694B Shift: 22 Fail: 05ED06A0_F2511800
Multiplier: 0089705F_4136B4A5 Shift: 21 Fail: 00E8B62F_54D78200
Multiplier: 0044B82F_A09B5A52 Shift: 20 Fail: 00566838_C33B0A00
Multiplier: 00225C17_D04DAD29 Shift: 19 Fail: 00566838_C33B0A00
Multiplier: 00112E0B_E826D694 Shift: 18 Fail: 0018957D_925BDC00
Multiplier: 00089705_F4136B4A Shift: 17 Fail: 0018957D_925BDC00
Multiplier: 00044B82_FA09B5A5 Shift: 16 Fail: 0018957D_925BDC00
Multiplier: 000225C1_7D04DAD2 Shift: 15 Fail: 0003A7FD_3BAEAE00
Multiplier: 000112E0_BE826D69 Shift: 14 Fail: 0003A7FD_3BAEAE00
Multiplier: 00008970_5F4136B4 Shift: 13 Fail: 0000D47A_97ED2600
Multiplier: 000044B8_2FA09B5A Shift: 12 Fail: 0000D47A_97ED2600
Multiplier: 0000225C_17D04DAD Shift: 11 Fail: 0000D47A_97ED2600
Multiplier: 0000112E_0BE826D6 Shift: 10 Fail: 00001D93_E2A71A00
Multiplier: 00000897_05F4136B Shift: 9 Fail: 00001D93_E2A71A00
Multiplier: 0000044B_82FA09B5 Shift: 8 Fail: 000006A8_51FFF000
Multiplier: 00000225_C17D04DA Shift: 7 Fail: 0000029C_74EF6600
Multiplier: 00000112_E0BE826D Shift: 6 Fail: 0000029C_74EF6600
Multiplier: 00000089_705F4136 Shift: 5 Fail: 000000C2_E1167200
Multiplier: 00000044_B82FA09B Shift: 4 Fail: 000000C2_E1167200
Multiplier: 00000022_5C17D04D Shift: 3 Fail: 00000032_FD6ACE00
Multiplier: 00000011_2E0BE826 Shift: 2 Fail: 00000014_B8D03A00
Multiplier: 00000008_9705F413 Shift: 1 Fail: 00000014_B8D03A00
Multiplier: 00000004_4B82FA09 Shift: 0 Fail: 00000006_0DB88400
QuoteMultiplier: 89705F41_36B4A597 Shift: 29 Fail: FFFFFFFF_FFFFFFFF
This only fails if you blindly add 1 resulting in 0, without considering the carry, and then multiplying. If you shift your multiplier by 29, you get the correct result (Look at my very first post in this thread for such an occurence.)
i was more interested in all the other lines, Ray - lol
i was hoping to make some kind of correlation between the result and multiplier resolution
i have totally re-written this program
it checks all values where the low 32-bits are zero or FFFFFFFF
it checks all values where the high 32-bits are zero or FFFFFFFF
it checks all values just before/after quotient rollover (this is where most errors occur first)
it checks them in sequence
if an error is found, it backs up to the last "pass" dividend and checks one by one until the earliest fail is found
it also runs 64 shift values - some have been removed in the following text (they all fail like the last one after a point)
it takes a while to run - i have throttled it back to only use 25% CPU time
for a divisor of 1,000,000,000, it took about 4 1/2 hrs on my machine
for smaller divisors, it will take even longer
anyways, i thought you guys might like to have a 64 x 64 32-bit mul2div test program
source is attached...
Multiplier: 89705F41_36B4A597 Mult Divisor: 2^93 Pass: FFFFFFFF_FFFFFFFF
Multiplier: 44B82FA0_9B5A52CB Mult Divisor: 2^92 Fail: 73348093_3BA6C600
Multiplier: 225C17D0_4DAD2965 Mult Divisor: 2^91 Fail: 2B0B3E03_EF940E00
Multiplier: 112E0BE8_26D694B2 Mult Divisor: 2^90 Fail: 131B7A69_5A7B4000
Multiplier: 089705F4_136B4A59 Mult Divisor: 2^89 Fail: 131B7A69_5A7B4000
Multiplier: 044B82FA_09B5A52C Mult Divisor: 2^88 Fail: 05ED06A0_F2511800
Multiplier: 0225C17D_04DAD296 Mult Divisor: 2^87 Fail: 05ED06A0_F2511800
Multiplier: 0112E0BE_826D694B Mult Divisor: 2^86 Fail: 05ED06A0_F2511800
Multiplier: 0089705F_4136B4A5 Mult Divisor: 2^85 Fail: 00E8B62F_54D78200
Multiplier: 0044B82F_A09B5A52 Mult Divisor: 2^84 Fail: 00566838_C33B0A00
Multiplier: 00225C17_D04DAD29 Mult Divisor: 2^83 Fail: 00566838_C33B0A00
Multiplier: 00112E0B_E826D694 Mult Divisor: 2^82 Fail: 0018957D_925BDC00
Multiplier: 00089705_F4136B4A Mult Divisor: 2^81 Fail: 0018957D_925BDC00
Multiplier: 00044B82_FA09B5A5 Mult Divisor: 2^80 Fail: 0018957D_925BDC00
Multiplier: 000225C1_7D04DAD2 Mult Divisor: 2^79 Fail: 0003A7FD_3BAEAE00
Multiplier: 000112E0_BE826D69 Mult Divisor: 2^78 Fail: 0003A7FD_3BAEAE00
Multiplier: 00008970_5F4136B4 Mult Divisor: 2^77 Fail: 0000D47A_97ED2600
Multiplier: 000044B8_2FA09B5A Mult Divisor: 2^76 Fail: 0000D47A_97ED2600
Multiplier: 0000225C_17D04DAD Mult Divisor: 2^75 Fail: 0000D47A_97ED2600
Multiplier: 0000112E_0BE826D6 Mult Divisor: 2^74 Fail: 00001D93_E2A71A00
Multiplier: 00000897_05F4136B Mult Divisor: 2^73 Fail: 00001D93_E2A71A00
Multiplier: 0000044B_82FA09B5 Mult Divisor: 2^72 Fail: 000006A8_51FFF000
Multiplier: 00000225_C17D04DA Mult Divisor: 2^71 Fail: 0000029C_74EF6600
Multiplier: 00000112_E0BE826D Mult Divisor: 2^70 Fail: 0000029C_74EF6600
Multiplier: 00000089_705F4136 Mult Divisor: 2^69 Fail: 000000C2_E1167200
Multiplier: 00000044_B82FA09B Mult Divisor: 2^68 Fail: 000000C2_E1167200
Multiplier: 00000022_5C17D04D Mult Divisor: 2^67 Fail: 00000032_FD6ACE00
Multiplier: 00000011_2E0BE826 Mult Divisor: 2^66 Fail: 00000014_B8D03A00
Multiplier: 00000008_9705F413 Mult Divisor: 2^65 Fail: 00000014_B8D03A00
Multiplier: 00000004_4B82FA09 Mult Divisor: 2^64 Fail: 00000006_0DB88400
Multiplier: 00000002_25C17D04 Mult Divisor: 2^63 Fail: 00000000_3B9ACA00
now, to put the finishing touches on my base 1,000,000,000 ling long kai fang bignum routine :bg
i really like using qWords program
but, it occured to me that what it needs, now, is a nice icon - lol
the attached file has an ico with several sizes in it
(http://img98.imageshack.us/img98/3966/numbers128x128x32.png)
i was playing with this..... again :P
i want to write some code that uses multiply-to-divide, but with variable multiplier and shift values
when the divisor is changed, i calculate new mult and shift values and store them
then, the multiply-to-divide code uses those values until the divisor changes again
the problem here is, the code in the magic number generator program may be different from one divisor to another
i can probably overcome this by carefully choosing the multiplier
at any rate, i was looking at the code that is generated for an example divisor of 7...
mov eax,X
mov edx,092492492h
add eax,1
.if !CARRY?
mul edx
.endif
shr edx,02h
; edx = quotient
when disassembled, it looks something like this...
mov eax,X
mov edx,92492492h
add eax,1
jc @F
mul edx
@@: shr edx,2
i was wondering why ADD EAX,1 and JC are used instead of INC EAX and JZ...
mov eax,X
mov edx,92492492h
inc eax
jz @F
mul edx
@@: shr edx,2
or, in code that everyone else but me seems to like :P
mov eax,X
mov edx,092492492h
inc eax
.if !ZERO?
mul edx
.endif
shr edx,02h
; edx = quotient
Quote from: dedndave on June 26, 2011, 05:05:28 PM
i was wondering why ADD EAX,1 and JC are used instead of INC EAX and JZ...
why? - code size is irrelevant in this case.
Also, INC and DEC will usually cause a stall if the next instruction tests the flags (since they avoid modifying carry.)
is that so for newer cores ?
i would think that instruction pair would be optimized to some degree :P
Quote from: Tedd on July 01, 2011, 01:08:03 PM
Also, INC and DEC will usually cause a stall if the next instruction tests the flags (since they avoid modifying carry.)
That is a 0.2 cycles stall on my P4 - but why do we need an extra
test anyway?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
1648 cycles for dec
1665 cycles for sub
1898 cycles for dec + test
1698 cycles for sub + test
Quote from: jj2007 on July 01, 2011, 02:36:35 PMbut why do we need an extra test anyway?
which
test ?
Quote from: Tedd on July 01, 2011, 01:08:03 PM... stall if the next instruction tests the flags
Maybe I misunderstood something ::)
that is a p4 - newer cores should be a little better :P
the test is necessary if you want a multiply-to-divide routine to cover the full range
in my case, i am dividing small numbers and managed to select multipliers that do not require the INC/JZ
but some of these multipliers may fail if the dividend is large
Sorry, bad wording on my part.
"INC and DEC will usually cause a stall if the next instruction tests depends on the flags" -- e.g. a conditional jump.
It's an out-of-order execution effect. The conditional instruction can't complete until the state of the flags is known, and partially modifying the flags causes an uncertainty (the full state also depends on the outcome of previous instructions.) So it's not something that can be worked around easily.
ADD and SUB fully modify the flags, so their outcome can be predicted, allowing the jump to be taken before the previous instructions have necessarily completed. So prefer ADD or SUB in place of INC or DEC when the next instruction is conditional.
Tedd,
The theory sounds plausible, but the effect is 0.2 cycles on my P4 and 0.0 on my Celeron...
i have seen cases where it is advantageous to put an unrelated instruction between the 2
cmp eax,5
mov SomePlace,SomeThing
ja SomeLabel
but, i find this applies to ADD, SUB, CMP, INC, TEST - any instruction where the flags are tested, really
I have just finished reading this thread as I was after qWords technique but in relation to the last comment of Ted, Intel have been recommending ADD and SUB over INC and DEC for a long time in optimisation manuals. Often it does not matter but in enough places I have found it faster using their recommendation that INC DEC so I rarely ever use INC DEC any longer.
don't toss them out altogether
they are great if the next instruction does not examine the flags, such as conditional branches
inc and dec do not alter the carry flag
that is a sometimes a desirable feature
also - inc/dec on a dword register is a single byte :P
Quote from: qWord on September 18, 2008, 12:07:32 PM
Hi all,
In the last days I've written an tool for determining the "magic number" for unsigned and signed division of 64, 32 and 16 bit numbers.
The algorithm based on c-source that can be found in AMD's "Software Optimization Guide".
I've test the program randomly and it looks like (for me :bg) that it works propper - but there is no guarantee at the moment.
Hope people here can help me to improve the program - so, feel free and post helpful suggestions, criticism and of course failure reports.
In the attached file you found the executable and source.
regards, qWord
-----
28-09-2009: new version attached
02-10-2009: fixed some bugs - new version uploaded
qWord,
Thank you! Thank You! Thank You!
I had been using the old magic number routine and suddenly found that it failed for a value of 100 and so I was faced with resolving what was wrong with the old one, or getting your new one. Your new one works!
Dave.