The attachment contains an implementation of the Park-Miller-Carta generator from here (http://www.firstpr.com.au/dsp/rand31/). This implementation has the same interface as nrandom, produces a sequence with similar characteristics and the same period, and is significantly faster. On a P3 it takes 56 clock cycles, versus 98 for nrandom. On a P4 (Willamette) it takes 44 clock cycles, versus 100 for nrandom. If the code looks a little strange it's because I started with the assembly output from GCC, using the optimization option that produced the fastest code (surprisingly -Os, instead of -O3). I managed to make a few improvements, but I have no doubt that the code could be made faster still.
[attachment deleted by admin]
Michael,
Nice find. It might be noted that the Carta modification can also be applied to 32/64 bit operations as well and avoid one of the imul ops.
The Park-Miller algorithm for generating a seed sequence is:
N = newseed = (16807 * seed) mod (m)
Where m = 0x7fff ffff and 2*(m+1) = 0x1 0000 0000
The 64 bit product, N, is created in edx:eax with a single mul op.
To find (N mod m) we note that
N= eax + 2*(m+1)*edx = A + 2*(m+1)*D
and
N mod m = [A + 2*(m+1)*D] mod (m)
We may subtract m any integer number of times from the term [...] without altering the result so that
N mod m = [A + 2*D] mod (m)
The simplification is that we need subtract m no more than twice instead of doing the division to reduce [A + 2*D]
to a value <m (worst case, A=0xffffffff):
if A>m then A =A-m
if [A + 2*D]>m then [A + 2*D]=[A + 2*D]-m
Rand32 proc base:DWORD
mov eax,seed
mov ecx,16807 ;a = 7^5
mul ecx ;edx:eax == a*seed == D:A
mov ecx,7fffffffh ;ecx = m
;**** Park-Miller mod
; div ecx
; mov eax,edx ;(a*seed) % 7fffffff
;**** Carta alternative mod
add edx,edx ;edx = 2*D
cmp eax,ecx ;eax = A
jna @F
sub eax,ecx ;if A>m, A = A - m
@@: add eax,edx ;eax = A + 2*D
jns @F
sub eax,ecx ;If (A + 2*D)>m
;****
@@: mov seed,eax ;save new seed
xor edx,edx
div base
mov eax,edx
ret
Rand32 endp
CHI Square probabilities for identical input parameters showing identical outputs for
20 successive arrays of 10^6 random numbers. As CHI square results depend only on the
numbers within a sequence, a sorted sequence would yield the same result. Hence
difference and summed arrays derived from the same initial array are also analyzed to
pick up such correlations. (Michael's 16/32 bit Carta code generates a table identical
to the 32/64 bit results shown.)
RNG Plugin: RandDefault.dll (Park-Miller)
Array Length: 1000000
Array Modulus: 256
Current Seed: 0x16D88040
Array Difference Integrated Seed
Run 1 0.45893123 0.49381066 0.56116807 0x00001000
Run 2 0.13524775 0.79452499 0.81649648 0x6DB93924
Run 3 0.20580701 0.89417035 0.35652004 0x734A8D78
Run 4 0.49430838 0.96259768 0.22683721 0x7488687E
Run 5 0.78547309 0.46700421 0.74466750 0x4F04F658
Run 6 0.60489427 0.22812298 0.30355235 0x74CF8E0C
Run 7 0.99529184 0.80108366 0.36893360 0x663CA9E6
Run 8 0.06161230 0.05181587 0.06558641 0x22BEF35B
Run 9 0.00539331 0.79619091 0.76937956 0x74320955
Run 10 0.16010214 0.17519416 0.99083166 0x68BCE24F
Run 11 0.72523433 0.75827752 0.98996407 0x14A50D2D
Run 12 0.76084247 0.88293980 0.37090924 0x628675C4
Run 13 0.41637353 0.10459893 0.82441603 0x315AE3A4
Run 14 0.19960434 0.34343956 0.98568563 0x427AD4A5
Run 15 0.88955753 0.85056813 0.33715978 0x03ED7D87
Run 16 0.17128498 0.33123620 0.13632519 0x0F0A345A
Run 17 0.93527746 0.10697403 0.44926112 0x5599705D
Run 18 0.60718463 0.07669533 0.35294815 0x4DF492E0
Run 19 0.04966568 0.52528021 0.09335049 0x56E31CB9
Run 20 0.47137351 0.00182499 0.14546479 0x6419A815
Mean (x2) 0.91334598 0.96463502 0.98894573
Variance (x12) 1.19511535 1.28637631 1.13330639
------------------------------------------------------------------------------
RNG Plugin: RandPMC.dll (Park-Miller-Carta)
Array Length: 1000000
Array Modulus: 256
Current Seed: 0x16D88040
Array Difference Integrated Seed
Run 1 0.45893123 0.49381066 0.56116807 0x00001000
Run 2 0.13524775 0.79452499 0.81649648 0x6DB93924
Run 3 0.20580701 0.89417035 0.35652004 0x734A8D78
Run 4 0.49430838 0.96259768 0.22683721 0x7488687E
Run 5 0.78547309 0.46700421 0.74466750 0x4F04F658
Run 6 0.60489427 0.22812298 0.30355235 0x74CF8E0C
Run 7 0.99529184 0.80108366 0.36893360 0x663CA9E6
Run 8 0.06161230 0.05181587 0.06558641 0x22BEF35B
Run 9 0.00539331 0.79619091 0.76937956 0x74320955
Run 10 0.16010214 0.17519416 0.99083166 0x68BCE24F
Run 11 0.72523433 0.75827752 0.98996407 0x14A50D2D
Run 12 0.76084247 0.88293980 0.37090924 0x628675C4
Run 13 0.41637353 0.10459893 0.82441603 0x315AE3A4
Run 14 0.19960434 0.34343956 0.98568563 0x427AD4A5
Run 15 0.88955753 0.85056813 0.33715978 0x03ED7D87
Run 16 0.17128498 0.33123620 0.13632519 0x0F0A345A
Run 17 0.93527746 0.10697403 0.44926112 0x5599705D
Run 18 0.60718463 0.07669533 0.35294815 0x4DF492E0
Run 19 0.04966568 0.52528021 0.09335049 0x56E31CB9
Run 20 0.47137351 0.00182499 0.14546479 0x6419A815
Mean (x2) 0.91334598 0.96463502 0.98894573
Variance (x12) 1.19511535 1.28637631 1.13330639
------------------------------------------------------------------------------
What i would be interested to see with this new algo is how well it performs with the testing program ENT by John Wlaker. In partitular the chi squared test.
Abel, thank you again for providing a better method.
The attachment contains a test app with Abel's version added, an app to generate data files for the ENT tests, and the output from ENT tests. Abel's version generates the same output as my version, in every test, but even without optimization it is 3 cycles faster on a P3.
Edit:
I forgot to add rns.asm to the zip. It will be needed to rebuild the test app, and you can get it from the attachment in my first post.
[attachment deleted by admin]
Hutch,
The decimal numbers shown correspond the same chi-square probabilities given by ENT, if my code
is correct. The value 0.45893123 is equivalent to an ENT result that this value would be exceeded
45.89% of the time.
There may be a misconception that the better the RNG, the closer the ENT result will be to 0.500. Not
so! A 'perfect' RNG will return chi-square probabilities randomly distributed between 0 and 1 for a
series of runs given with different seeds (mean = 0.5, variance = 1/12). Any generator always returning
results close to 0.5 is definitely not a RNG.
ENT has several drawbacks. It is restricted to base 256 sequences, requires files for input, and
sorts probabilities into broad categories, >50%, >25%, ..., >0.01%. The program I've used, still
downloadable from this forum, allows arbitrary bases, generates and analyzes sequences in RAM, and returns
numerical probabilities. All that was required was pasting the above code into a DLL RNG template. With
a 2GHz P4 XP, generation and analysis of the 20 sequences of 10^6 bytes took 2 sec. total.
Abel
This it the ENT output for 1000000 bytes generated by the Windows 2000 cryptographic service provider with the user default provider of type PROV_RSA_FULL:
Entropy = 7.999810 bits per byte.
Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 263.37, and randomly
would exceed this value 50.00 percent of the times.
Arithmetic mean value of data bytes is 127.5135 (127.5 = random).
Monte Carlo value for Pi is 3.141948568 (error 0.01 percent).
Serial correlation coefficient is -0.000298 (totally uncorrelated = 0.0).
Is it faster without two jumps? :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
; mov eax, rand32_seed
; mov ecx, 16807 ; a = 7^5
; mul ecx ; edx:eax == a*seed == D:A
; mov ecx, 7fffffffh ; ecx = m
;
;**** Park-Miller mod
; div ecx
; mov eax, edx ; (a*seed) % 7fffffff
;**** Carta alternative mod
;
; add edx, edx ; edx = 2*D
; cmp eax, ecx ; eax = A
; jna @F
; sub eax, ecx ; if A>m, A = A - m
; @@:
; add eax, edx ; eax = A + 2*D
; jns @F
; sub eax, ecx ; If (A + 2*D)>m
;****
; @@:
; mov rand32_seed, eax ; save new seed
; xor edx, edx
; div DWORD PTR [esp+4] ; base
; mov eax, edx
; ret 4
mov eax, 16807 ;
mul rand32_seed ;edx:eax == a*seed == D:A
cmp eax, 80000000h ;
lea ecx, [eax+edx*2-7fffffffh]
sbb eax, eax ; eax->0 or -1
shr eax, 1 ; eax->0 or 7FFFFFFFh
xor edx, edx
add eax, ecx
sets dl
mov ecx, [esp+4] ; ecx->base
neg edx ; edx-> 0 or -1
shr edx, 1 ; edx-> 0 or 7FFFFFFFh
sub eax, edx ; eax-7FFFFFFFh or eax-0
xor edx, edx
mov rand32_seed, eax ; save new seed
div ecx
mov eax, edx
ret 4
rand32 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Regards,
Lingo
Michael,
As an experiment, try assembling rand_ent.exe with the seeds below instead of 0x00000001.
Seed = 6ef88367h
Entropy = 7.999868 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 182.20, and randomly
would exceed this value 99.95 percent of the times.
Arithmetic mean value of data bytes is 127.4925 (127.5 = random).
Monte Carlo value for Pi is 3.142452570 (error 0.03 percent).
Serial correlation coefficient is 0.001189 (totally uncorrelated = 0.0).
Seed = 504859a5h
Entropy = 7.999748 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 349.53, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 127.4839 (127.5 = random).
Monte Carlo value for Pi is 3.146964588 (error 0.17 percent).
Serial correlation coefficient is -0.000552 (totally uncorrelated = 0.0).
My code gave probabilities of 0.99981758 and 0.00007638 for these seeds.
Note that when ENT says 50.00%, it means anything between 25% and 75%.
Lingo,
Looks like a neat optimization, but I'll defer to the cyclists.
In any case, it yields identical results here to its predecessors.
Abel
I'm having trouble understanding why these seeds affect the Chi square distribution as they do. I have seen generators that would produce more than one sequence, depending on the starting seed, but after testing, that does not appear to be what is happening here. The period for these seeds is the same as for every other seed that I have tested, 2147483646, and the value 6ef88367h is in the sequence produced from a starting seed of 1. These seeds also disturb the Chi square distribution for nrandom, here starting with 6ef88367h:
Entropy = 7.999862 bits per byte.
Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 191.80, and randomly
would exceed this value 99.50 percent of the times.
Arithmetic mean value of data bytes is 127.4923 (127.5 = random).
Monte Carlo value for Pi is 3.142524570 (error 0.03 percent).
Serial correlation coefficient is 0.001094 (totally uncorrelated = 0.0).
And here starting with 504859a5h:
Entropy = 7.999752 bits per byte.
Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 344.76, and randomly
would exceed this value 0.05 percent of the times.
Arithmetic mean value of data bytes is 127.4841 (127.5 = random).
Monte Carlo value for Pi is 3.146916588 (error 0.17 percent).
Serial correlation coefficient is -0.000597 (totally uncorrelated = 0.0).
For the "cyclists":
P3:
98 cycles, nrandom
56 cycles, nrand
53 cycles, rand32
62 cycles, rand32_lingo
P4:
100 cycles, nrandom
44 cycles, nrand
36 cycles, rand32
60 cycles, rand32_lingo
EDIT:
The above cycle counts were from my second set of macros. I should have anticipated that others would be using the first set of macros, and used those instead. Here are the counts for the first set of macros:
P3:
86 cycles, nrandom
38 cycles, nrand
42 cycles, rand32
39 cycles, rand32_lingo
P4 (Willamette):
98 cycles, nrandom
49 cycles, nrand
52 cycles, rand32
52 cycles, rand32_lingo
Michael,
To clarify your understanding of chi-square results, a few experiments may be insightful. As a hypothesis, suppose that
the expression, "Chi square distribution for 1000000 samples is 191.80, and randomly would exceed this value 99.50 percent
of the times.", means that if we were to repeat the test 200 times we might expect to see a result less that 191.80.
The only options that could lead to different results are changes of seed or sequence length or both.
My results for 20 tests with different seeds were reported above, e.g.
Seed Probability
0x00001000 0.45893123
0x6DB93924 0.13524775
.....
If you reassemble rand_ent.exe with these seeds and ENT the BINs you should find equivalent values. I did that half a
dozen times before the exercise became overly onerous. Should you examine the source file ent.c, you'll find that ENT
translates probability ranges into % as:
.....
.025-.050 5%
.050-.100 10%
.100-.250 25%
.250-.750 50%
.750-.900 75%
.900-.950 90%
.950-.975 95%
.....
To show that probabilities are, in fact, uniformly distributed in the 0 to 1 interval, their mean should be
0.5 and their variance 1/12, within a factor of order 1/sqrt(number of tests).
The two "bad" seeds I provided were culled from analyses of several hundred 1M sequences in a minute or so of calculation.
I'm a bit surprised that Lingo's suggestion faired so poorly. Intel docs suggest using SETxx and CMOVxx to optimize and
eliminate conditional jumps. CMOVS and CMOVC anyone?
I've attached a zip containing my test program for anyone interested in playing with numbers. Any discrepancies with ENT
results would be appreciated.
Abel
[attachment deleted by admin]
"I'm a bit surprised that Lingo's suggestion faired so poorly."
Don't worry! here is the real result with the testing program :lol
My CPU is P4 3.6Ghz Prescott, OS - Vista
Please terminate any high-priority tasks and press ENTER to begin.
rand32 - Tests:
rand32 : 76 clocks; Return value: 4
rand32 - Lingo : 56 clocks; Return value: 4
Press ENTER to exit...
Regards,
Lingo
[attachment deleted by admin]
Quote from: Abel on February 02, 2007, 02:21:05 PM
Intel docs suggest using SETxx and CMOVxx to optimize and
eliminate conditional jumps. CMOVS and CMOVC anyone?
Abel
I'm a cyclist :) The problem with CMOVxx is that unless the probability of taking each branch is close to 50%, then it runs slower than a standard conditional jump. I code it up both ways and time it to see which is faster for my particular code/data.
On my A64:
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 47 clocks; Return value: 4
Being a RNG jumps inside it won't be miss-predicted almost half the times?
Here is a quick tweak on Abel's algo that made it a little faster.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
; *********************
; Carta alternative mod
; *********************
mov eax, 16807
mov ecx, [esp+4] ; load "base" into ECX
mul rand32_seed
add edx, edx ; edx = 2*D
cmp eax, 7FFFFFFFh ; eax = A
jbe @F
sub eax, 7FFFFFFFh ; if A>m, A = A - m
@@:
add eax, edx ; eax = A + 2*D
jns @F
sub eax, 7FFFFFFFh ; If (A + 2*D)>m
@@:
mov rand32_seed, eax ; save new seed
sub edx, edx
div ecx ; div is faster with register
mov eax, edx
ret 4
rand32 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Times on my Northwood core PIV are as follows.
Please terminate any high-priority tasks and press ENTER to begin.
rand32 - Tests:
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
Press ENTER to exit...
Running Lingo's code here showed a virtual dead heat at 53-54 clocks.
When both jumps were replaced with CMOVxx codes, the former dropped to 46 clocks.
When only the jump with a 50:50 probability was replaced:
Quote
Please terminate any high-priority tasks and press ENTER to begin.
rand32 - Tests:
rand32 : 41 clocks; Return value: 68 (base changed from 16 to 256)
rand32 - Lingo : 53 clocks; Return value: 68
Press ENTER to exit...rand32 - Tests:
This fits in with Mark's observations. Couldn't see a further improvement with Hutch's suggestion.
The chi-square probability table was unchanged, indicating that the random sequences generated
were the same as the less-optimized code.
Abel
Pentium 4, 2.00GHZ, 768 MB RAM, XP/SP2
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
mov eax, rand32_seed
mov ecx, 16807 ;a = 7^5
mul ecx ;edx:eax == a*seed == D:A
mov ecx, 7fffffffh ;ecx = m
add edx, edx ;edx = 2*D
add eax,edx ;N = A + 2*D
mov edx,0
jnc @F ;if N > ffffffffh (less unlikely)
sub eax,ecx ; N = N-m
@@: cmovns ecx,edx ;if N > 7fffffffh (50:50)
sub eax, ecx ; N = N-m
mov rand32_seed, eax ; save new seed
xor edx, edx
div dword ptr [esp+4] ; base
mov eax, edx
ret 4
rand32 endp
Abel,
Two rows less in your code (1-2 clocks less)
but I have with all reserve about
1st jump -> if N > ffffffffh (less unlikely) :wink
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
mov eax, 16807
mov ecx, 80000001h
mul rand32_seed ;edx:eax == a*seed == D:A
add edx, edx ;edx = 2*D
add eax, edx ;N = A + 2*D
mov edx, 0
jnc @F ;if N > ffffffffh (less unlikely)
add eax, ecx
@@: cmovns ecx, edx ;if N > 7fffffffh (50:50)
add eax, ecx
mov rand32_seed, eax ; save new seed
div dword ptr [esp+4]; base
mov eax, edx
ret 4
rand32 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Regards,
Lingo
Anytime lingo posts is becomes a thread of epic proportions :U
Lingo,
Thanks for the polish. I'm picking up echoes off the bottom of the barrel.
The most obvious remaining weak point is the closing DIV. I think the rules are to recreate the Park-Miller sequence as expeditiously as possible.
Hutch may have some reservations about using .686 code in library functions. What else should be returned apart from the Park-Miller sequence
is a moot point. We've assumed integers modulo a given base. Others may want floating points 0->1 or integers modulo 2^n.
DIV gets integers using the lowest bits of the Park-Miller seeds. In another thread several years ago, Michael suggested replacing DIV with a MUL
which emphasizes the highest bits of the Park-Miller seeds and chi-square tests didn't turn up any problems. This option reduces my cycle tallies
from 40-41 to 29-32.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
mov eax, 16807 ;a = 7^5
mul rand32_seed ;edx:eax == a*seed == D:A
mov ecx, 7fffffffh ;ecx = m
add edx, edx ;edx = 2*D
add eax,edx ;N = A + 2*D
mov edx,0
jnc @F ;if N > ffffffffh (less likely)
sub eax,ecx ; N = N-m
@@: cmovns ecx,edx ;if N > 7fffffffh (50:50)
sub eax, ecx ; N = N-m
mov rand32_seed, eax ; save new seed
;Using LoBits
; div dword ptr [esp+4] ;base
;Using HiBits
shl eax,1
mul dword ptr [esp+4] ;base
mov eax, edx
ret 4
rand32 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
RNG Plugin: Rand32-Hi.dll
Array Length: 1000000
Array Modulus: 256
Current Seed: 0x02016D88
Array Difference Integrated Seed
Run 1 0.15352543 0.17302848 0.82842721 0x00000001
Run 2 0.25338037 0.42654848 0.30013492 0x4926DB93
Run 3 0.69406418 0.33732162 0.95430686 0x6BC734A8
Run 4 0.24896364 0.85992266 0.04448574 0x43F74886
Run 5 0.75102686 0.95050725 0.93853727 0x32C4F04F
Run 6 0.25241672 0.51695021 0.62466855 0x70674CF8
Run 7 0.88426325 0.89111771 0.98547929 0x4F3663CA
Run 8 0.94895542 0.33389802 0.21863232 0x1ADA2BEF
Run 9 0.66103225 0.51705909 0.71129295 0x4AAF4320
Run 10 0.09923830 0.05070660 0.85115210 0x127E8BCE
Run 11 0.99092664 0.11889222 0.03941943 0x69694A50
Run 12 0.52057062 0.74912639 0.40157083 0x2E262867
Run 13 0.92096010 0.94554446 0.01294125 0x1D2315AE
Run 14 0.06210419 0.15717322 0.37764770 0x252C27AD
Run 15 0.46503752 0.99448840 0.35831148 0x6C383ED7
Run 16 0.04933810 0.45469887 0.21313750 0x22D0F0A3
Run 17 0.93996576 0.99498513 0.48553769 0x02ED5997
Run 18 0.96162106 0.51362063 0.00314153 0x1704DF49
Run 19 0.99891330 0.92949402 0.93302415 0x65CD6E31
Run 20 0.00951699 0.01706214 0.46989837 0x40AE419A
Mean (x2) 1.08658207 1.09321456 0.97517472
Variance (x12) 1.59148875 1.37225082 1.35854662
The rather high variances don't show up in continued runs or shorter array sizes and could just be a statistical fluke associated
with the chosen starting seed. (With small seeds, the high bits of the next several Park-Miller values will also be zero. There are
suggestions for choosing a value higher than 16807, but that's another issue.)
Abel
"There are suggestions for choosing a value higher than 16807,.."
You can try 48271
Regards,
Lingo
> Hutch may have some reservations about using .686 code in library functions.
The trick is to save the last 486 compatible version which was faster than the older Park Miller algo and put it into the library as the standard procedure and add the faster CMOVxx version as an extra. The only difference is documentation so that a prospective user does not try and use the standard version on an old box that does not support the CMOVxx familiy of instructions.
My main concern with an algo of thsi type is that it continues to test well on the range of random data tests which is actually more important than the more recent speed gains.
Under the heading "What I don't understand about ..."
Yesterday for 29-30 cycles:
rand32 proc base:DWORD
mov eax, 16807 ;a = 7^5
mul rand32_seed ;edx:eax == a*seed == D:A
mov ecx, 7fffffffh ;ecx = m
add edx, edx ;edx = 2*D
....
ret 4
rand32 endp
Today for 20-21 cycles:
rand32 proc base:DWORD
mov eax, 16807 ;a = 7^5
mov ecx, 7fffffffh ;ecx = m
mul rand32_seed ;edx:eax == a*seed == D:A
add edx, edx ;edx = 2*D
....
ret 4
rand32 endp
Couldn't 'chimp' similar effect in the rand32lingo code.
Abel