Possible replacement for nrandom

MichaelW · January 21, 2007, 12:27:43 AM

The attachment contains an implementation of the Park-Miller-Carta generator from here. This implementation has the same interface as nrandom, produces a sequence with similar characteristics and the same period, and is significantly faster. On a P3 it takes 56 clock cycles, versus 98 for nrandom. On a P4 (Willamette) it takes 44 clock cycles, versus 100 for nrandom. If the code looks a little strange it's because I started with the assembly output from GCC, using the optimization option that produced the fastest code (surprisingly -Os, instead of -O3). I managed to make a few improvements, but I have no doubt that the code could be made faster still.

[attachment deleted by admin]

Abel · January 31, 2007, 09:27:56 PM

Michael,

Nice find. It might be noted that the Carta modification can also be applied to 32/64 bit operations as well and avoid one of the imul ops.

The Park-Miller algorithm for generating a seed sequence is:
   N = newseed = (16807 * seed) mod (m)

Where m = 0x7fff ffff and 2*(m+1) = 0x1 0000 0000

The 64 bit product, N, is created in edx:eax with a single mul op.

To find (N mod m) we note that
   N= eax + 2*(m+1)*edx = A + 2*(m+1)*D
and
   N mod m = [A + 2*(m+1)*D] mod (m)

We may subtract m any integer number of times from the term [...] without altering the result so that
   N mod m = [A + 2*D] mod (m)

The simplification is that we need subtract m no more than twice instead of doing the division to reduce [A + 2*D]
to a value <m (worst case, A=0xffffffff):
   if A>m then A =A-m
   if [A + 2*D]>m then [A + 2*D]=[A + 2*D]-m

Code Select



Rand32	proc	base:DWORD
	mov	eax,seed    
	mov	ecx,16807		;a = 7^5
	mul	ecx			;edx:eax == a*seed == D:A
	mov	ecx,7fffffffh		;ecx = m
	
;**** Park-Miller mod
;	div	ecx
;	mov	eax,edx			;(a*seed) % 7fffffff
	
;**** Carta alternative mod
	add	edx,edx			;edx = 2*D
        cmp     eax,ecx			;eax = A
        jna     @F
        sub     eax,ecx			;if A>m, A = A - m
@@:	add	eax,edx			;eax = A + 2*D
	jns	@F
	sub	eax,ecx			;If (A + 2*D)>m
;****
@@:	mov	seed,eax		;save new seed 

	xor	edx,edx
	div	base
	mov	eax,edx
  	ret

Rand32	endp

CHI Square probabilities for identical input parameters showing identical outputs for
20 successive arrays of 10^6 random numbers. As CHI square results depend only on the
numbers within a sequence, a sorted sequence would yield the same result. Hence
difference and summed arrays derived from the same initial array are also analyzed to
pick up such correlations. (Michael's 16/32 bit Carta code generates a table identical
to the 32/64 bit results shown.)

RNG Plugin: RandDefault.dll (Park-Miller)
Array Length: 1000000
Array Modulus: 256
Current Seed: 0x16D88040
Array Difference Integrated Seed
Run 1 0.45893123 0.49381066 0.56116807 0x00001000
Run 2 0.13524775 0.79452499 0.81649648 0x6DB93924
Run 3 0.20580701 0.89417035 0.35652004 0x734A8D78
Run 4 0.49430838 0.96259768 0.22683721 0x7488687E
Run 5 0.78547309 0.46700421 0.74466750 0x4F04F658
Run 6 0.60489427 0.22812298 0.30355235 0x74CF8E0C
Run 7 0.99529184 0.80108366 0.36893360 0x663CA9E6
Run 8 0.06161230 0.05181587 0.06558641 0x22BEF35B
Run 9 0.00539331 0.79619091 0.76937956 0x74320955
Run 10 0.16010214 0.17519416 0.99083166 0x68BCE24F
Run 11 0.72523433 0.75827752 0.98996407 0x14A50D2D
Run 12 0.76084247 0.88293980 0.37090924 0x628675C4
Run 13 0.41637353 0.10459893 0.82441603 0x315AE3A4
Run 14 0.19960434 0.34343956 0.98568563 0x427AD4A5
Run 15 0.88955753 0.85056813 0.33715978 0x03ED7D87
Run 16 0.17128498 0.33123620 0.13632519 0x0F0A345A
Run 17 0.93527746 0.10697403 0.44926112 0x5599705D
Run 18 0.60718463 0.07669533 0.35294815 0x4DF492E0
Run 19 0.04966568 0.52528021 0.09335049 0x56E31CB9
Run 20 0.47137351 0.00182499 0.14546479 0x6419A815
Mean (x2) 0.91334598 0.96463502 0.98894573
Variance (x12) 1.19511535 1.28637631 1.13330639
------------------------------------------------------------------------------

RNG Plugin: RandPMC.dll (Park-Miller-Carta)
Array Length: 1000000
Array Modulus: 256
Current Seed: 0x16D88040
Array Difference Integrated Seed
Run 1 0.45893123 0.49381066 0.56116807 0x00001000
Run 2 0.13524775 0.79452499 0.81649648 0x6DB93924
Run 3 0.20580701 0.89417035 0.35652004 0x734A8D78
Run 4 0.49430838 0.96259768 0.22683721 0x7488687E
Run 5 0.78547309 0.46700421 0.74466750 0x4F04F658
Run 6 0.60489427 0.22812298 0.30355235 0x74CF8E0C
Run 7 0.99529184 0.80108366 0.36893360 0x663CA9E6
Run 8 0.06161230 0.05181587 0.06558641 0x22BEF35B
Run 9 0.00539331 0.79619091 0.76937956 0x74320955
Run 10 0.16010214 0.17519416 0.99083166 0x68BCE24F
Run 11 0.72523433 0.75827752 0.98996407 0x14A50D2D
Run 12 0.76084247 0.88293980 0.37090924 0x628675C4
Run 13 0.41637353 0.10459893 0.82441603 0x315AE3A4
Run 14 0.19960434 0.34343956 0.98568563 0x427AD4A5
Run 15 0.88955753 0.85056813 0.33715978 0x03ED7D87
Run 16 0.17128498 0.33123620 0.13632519 0x0F0A345A
Run 17 0.93527746 0.10697403 0.44926112 0x5599705D
Run 18 0.60718463 0.07669533 0.35294815 0x4DF492E0
Run 19 0.04966568 0.52528021 0.09335049 0x56E31CB9
Run 20 0.47137351 0.00182499 0.14546479 0x6419A815
Mean (x2) 0.91334598 0.96463502 0.98894573
Variance (x12) 1.19511535 1.28637631 1.13330639
------------------------------------------------------------------------------

hutch-- · February 01, 2007, 12:19:14 AM

What i would be interested to see with this new algo is how well it performs with the testing program ENT by John Wlaker. In partitular the chi squared test.

MichaelW · February 01, 2007, 03:00:50 AM

Abel, thank you again for providing a better method.

The attachment contains a test app with Abel's version added, an app to generate data files for the ENT tests, and the output from ENT tests. Abel's version generates the same output as my version, in every test, but even without optimization it is 3 cycles faster on a P3.

Edit:

I forgot to add rns.asm to the zip. It will be needed to rebuild the test app, and you can get it from the attachment in my first post.

[attachment deleted by admin]

Abel · February 01, 2007, 11:14:41 AM

Hutch,

The decimal numbers shown correspond the same chi-square probabilities given by ENT, if my code
is correct. The value 0.45893123 is equivalent to an ENT result that this value would be exceeded
45.89% of the time.

There may be a misconception that the better the RNG, the closer the ENT result will be to 0.500. Not
so! A 'perfect' RNG will return chi-square probabilities randomly distributed between 0 and 1 for a
series of runs given with different seeds (mean = 0.5, variance = 1/12). Any generator always returning
results close to 0.5 is definitely not a RNG.

ENT has several drawbacks. It is restricted to base 256 sequences, requires files for input, and
sorts probabilities into broad categories, >50%, >25%, ..., >0.01%. The program I've used, still
downloadable from this forum, allows arbitrary bases, generates and analyzes sequences in RAM, and returns
numerical probabilities. All that was required was pasting the above code into a DLL RNG template. With
a 2GHz P4 XP, generation and analysis of the 20 sequences of 10^6 bytes took 2 sec. total.

Abel

MichaelW · February 01, 2007, 04:25:08 PM

This it the ENT output for 1000000 bytes generated by the Windows 2000 cryptographic service provider with the user default provider of type PROV_RSA_FULL:

Entropy = 7.999810 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 263.37, and randomly
would exceed this value 50.00 percent of the times.

Arithmetic mean value of data bytes is 127.5135 (127.5 = random).
Monte Carlo value for Pi is 3.141948568 (error 0.01 percent).
Serial correlation coefficient is -0.000298 (totally uncorrelated = 0.0).

lingo · February 01, 2007, 08:40:33 PM

Is it faster without two jumps? :lol

Code Select

 
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
;    mov eax, rand32_seed
;    mov ecx, 16807              ; a = 7^5
;    mul ecx                     ; edx:eax == a*seed == D:A
;    mov ecx, 7fffffffh          ; ecx = m
;
;**** Park-Miller mod
;   div     ecx
;   mov     eax, edx            ; (a*seed) % 7fffffff
;**** Carta alternative mod
;
;    add edx, edx                ; edx = 2*D
;    cmp eax, ecx                ; eax = A
;    jna @F
;    sub eax, ecx                ; if A>m, A = A - m
;  @@:
;    add eax, edx                ; eax = A + 2*D
;    jns @F
;    sub eax, ecx                ; If (A + 2*D)>m
;****
;  @@:
;    mov rand32_seed, eax        ; save new seed
;    xor edx, edx
;    div DWORD PTR [esp+4]       ; base
;    mov eax, edx
;    ret 4

 mov    eax, 16807       ;
 mul    rand32_seed      ;edx:eax == a*seed == D:A
 cmp    eax, 80000000h   ;   
 lea    ecx, [eax+edx*2-7fffffffh]
 sbb    eax, eax         ; eax->0 or -1
 shr    eax, 1           ; eax->0 or 7FFFFFFFh 
 xor    edx, edx 
 add    eax, ecx
 sets   dl
 mov    ecx, [esp+4]     ; ecx->base
 neg    edx              ; edx-> 0 or -1
 shr    edx, 1	        ; edx-> 0 or 7FFFFFFFh
 sub    eax, edx         ; eax-7FFFFFFFh or eax-0  
 xor    edx, edx 
 mov    rand32_seed, eax ; save new seed 	
 div    ecx 
 mov    eax, edx
 ret    4
rand32  endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Regards,
Lingo

Abel · February 01, 2007, 10:13:08 PM

Michael,
As an experiment, try assembling rand_ent.exe with the seeds below instead of 0x00000001.

Seed = 6ef88367h
Entropy = 7.999868 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 182.20, and randomly
would exceed this value 99.95 percent of the times.
Arithmetic mean value of data bytes is 127.4925 (127.5 = random).
Monte Carlo value for Pi is 3.142452570 (error 0.03 percent).
Serial correlation coefficient is 0.001189 (totally uncorrelated = 0.0).

Seed = 504859a5h
Entropy = 7.999748 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 349.53, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 127.4839 (127.5 = random).
Monte Carlo value for Pi is 3.146964588 (error 0.17 percent).
Serial correlation coefficient is -0.000552 (totally uncorrelated = 0.0).

My code gave probabilities of 0.99981758 and 0.00007638 for these seeds.
Note that when ENT says 50.00%, it means anything between 25% and 75%.

Lingo,
Looks like a neat optimization, but I'll defer to the cyclists.
In any case, it yields identical results here to its predecessors.

Abel

MichaelW · February 02, 2007, 03:03:09 AM

I'm having trouble understanding why these seeds affect the Chi square distribution as they do. I have seen generators that would produce more than one sequence, depending on the starting seed, but after testing, that does not appear to be what is happening here. The period for these seeds is the same as for every other seed that I have tested, 2147483646, and the value 6ef88367h is in the sequence produced from a starting seed of 1. These seeds also disturb the Chi square distribution for nrandom, here starting with 6ef88367h:

Entropy = 7.999862 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 191.80, and randomly
would exceed this value 99.50 percent of the times.

Arithmetic mean value of data bytes is 127.4923 (127.5 = random).
Monte Carlo value for Pi is 3.142524570 (error 0.03 percent).
Serial correlation coefficient is 0.001094 (totally uncorrelated = 0.0).

And here starting with 504859a5h:

Entropy = 7.999752 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 344.76, and randomly
would exceed this value 0.05 percent of the times.

Arithmetic mean value of data bytes is 127.4841 (127.5 = random).
Monte Carlo value for Pi is 3.146916588 (error 0.17 percent).
Serial correlation coefficient is -0.000597 (totally uncorrelated = 0.0).

For the "cyclists":

P3:

Code Select


98 cycles, nrandom
56 cycles, nrand
53 cycles, rand32
62 cycles, rand32_lingo

P4:

Code Select


100 cycles, nrandom
44 cycles, nrand
36 cycles, rand32
60 cycles, rand32_lingo

EDIT:

The above cycle counts were from my second set of macros. I should have anticipated that others would be using the first set of macros, and used those instead. Here are the counts for the first set of macros:

P3:

Code Select


86 cycles, nrandom
38 cycles, nrand
42 cycles, rand32
39 cycles, rand32_lingo

P4 (Willamette):

Code Select


98 cycles, nrandom
49 cycles, nrand
52 cycles, rand32
52 cycles, rand32_lingo

Abel · February 02, 2007, 02:21:05 PM

Michael,
To clarify your understanding of chi-square results, a few experiments may be insightful. As a hypothesis, suppose that
the expression, "Chi square distribution for 1000000 samples is 191.80, and randomly would exceed this value 99.50 percent
of the times.", means that if we were to repeat the test 200 times we might expect to see a result less that 191.80.
The only options that could lead to different results are changes of seed or sequence length or both.

My results for 20 tests with different seeds were reported above, e.g.

Seed      Probability
0x00001000   0.45893123
0x6DB93924   0.13524775
.....

If you reassemble rand_ent.exe with these seeds and ENT the BINs you should find equivalent values. I did that half a
dozen times before the exercise became overly onerous. Should you examine the source file ent.c, you'll find that ENT
translates probability ranges into % as:
.....
.025-.050 5%
.050-.100 10%
.100-.250 25%
.250-.750 50%
.750-.900 75%
.900-.950 90%
.950-.975 95%
.....

To show that probabilities are, in fact, uniformly distributed in the 0 to 1 interval, their mean should be
0.5 and their variance 1/12, within a factor of order 1/sqrt(number of tests).

The two "bad" seeds I provided were culled from analyses of several hundred 1M sequences in a minute or so of calculation.

I'm a bit surprised that Lingo's suggestion faired so poorly. Intel docs suggest using SETxx and CMOVxx to optimize and
eliminate conditional jumps. CMOVS and CMOVC anyone?

I've attached a zip containing my test program for anyone interested in playing with numbers. Any discrepancies with ENT
results would be appreciated.

Abel

[attachment deleted by admin]

lingo · February 02, 2007, 04:57:23 PM

"I'm a bit surprised that Lingo's suggestion faired so poorly."

Don't worry! here is the real result with the testing program :lol

My CPU is P4 3.6Ghz Prescott, OS - Vista

Code Select


 Please terminate any high-priority tasks and press ENTER to begin.


 rand32 - Tests:

 rand32 : 76 clocks; Return value: 4
 rand32 - Lingo : 56 clocks; Return value: 4

 Press ENTER to exit...

Regards,
Lingo

[attachment deleted by admin]

Mark_Larson · February 02, 2007, 05:24:33 PM

Quote from: Abel on February 02, 2007, 02:21:05 PM
Intel docs suggest using SETxx and CMOVxx to optimize and
eliminate conditional jumps. CMOVS and CMOVC anyone?

Abel

I'm a cyclist :) The problem with CMOVxx is that unless the probability of taking each branch is close to 50%, then it runs slower than a standard conditional jump. I code it up both ways and time it to see which is faster for my particular code/data.

EduardoS · February 03, 2007, 12:22:17 AM

On my A64:

Code Select


rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 47 clocks; Return value: 4

Being a RNG jumps inside it won't be miss-predicted almost half the times?

hutch-- · February 03, 2007, 06:29:02 AM

Here is a quick tweak on Abel's algo that made it a little faster.

Code Select


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

rand32 proc base:DWORD

  ; *********************
  ; Carta alternative mod
  ; *********************

    mov eax, 16807
    mov ecx, [esp+4]            ; load "base" into ECX
    mul rand32_seed
    add edx, edx                ; edx = 2*D

    cmp eax, 7FFFFFFFh          ; eax = A
    jbe @F
    sub eax, 7FFFFFFFh          ; if A>m, A = A - m

  @@:
    add eax, edx                ; eax = A + 2*D
    jns @F
    sub eax, 7FFFFFFFh          ; If (A + 2*D)>m

  @@:
    mov rand32_seed, eax        ; save new seed
    sub edx, edx
    div ecx                     ; div is faster with register
    mov eax, edx
    ret 4

rand32  endp

OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef 

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

Times on my Northwood core PIV are as follows.

Code Select


 Please terminate any high-priority tasks and press ENTER to begin.


 rand32 - Tests:

rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4

 Press ENTER to exit...

Abel · February 03, 2007, 11:42:03 AM

Running Lingo's code here showed a virtual dead heat at 53-54 clocks.

When both jumps were replaced with CMOVxx codes, the former dropped to 46 clocks.
When only the jump with a 50:50 probability was replaced:

Quote
Please terminate any high-priority tasks and press ENTER to begin.
rand32 - Tests:
rand32 : 41 clocks; Return value: 68 (base changed from 16 to 256)
rand32 - Lingo : 53 clocks; Return value: 68
Press ENTER to exit...rand32 - Tests:

This fits in with Mark's observations. Couldn't see a further improvement with Hutch's suggestion.

The chi-square probability table was unchanged, indicating that the random sequences generated
were the same as the less-optimized code.

Abel
Pentium 4, 2.00GHZ, 768 MB RAM, XP/SP2

Code Select


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
    mov eax, rand32_seed
    mov ecx, 16807              ;a = 7^5
    mul ecx                     ;edx:eax == a*seed == D:A
    mov ecx, 7fffffffh          ;ecx = m
    add edx, edx                ;edx = 2*D
    add eax,edx                 ;N = A + 2*D
    mov edx,0
    jnc @F                      ;if N > ffffffffh (less unlikely)
    sub eax,ecx                 ;  N = N-m
@@: cmovns  ecx,edx             ;if N > 7fffffffh (50:50)
    sub eax, ecx                ;  N = N-m
    mov rand32_seed, eax        ; save new seed
    xor edx, edx
    div dword ptr [esp+4]       ; base
    mov eax, edx
    ret 4
rand32  endp

News:

Possible replacement for nrandom

Abel

Abel

Abel

Abel

EduardoS

Abel