News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Possible replacement for nrandom

Started by MichaelW, January 21, 2007, 12:27:43 AM

Previous topic - Next topic

MichaelW

The attachment contains an implementation of the Park-Miller-Carta generator from here. This implementation has the same interface as nrandom, produces a sequence with similar characteristics and the same period, and is significantly faster. On a P3 it takes 56 clock cycles, versus 98 for nrandom. On a P4 (Willamette) it takes 44 clock cycles, versus 100 for nrandom. If the code looks a little strange it's because I started with the assembly output from GCC, using the optimization option that produced the fastest code (surprisingly -Os, instead of -O3). I managed to make a few improvements, but I have no doubt that the code could be made faster still.


[attachment deleted by admin]
eschew obfuscation

Abel

Michael,

Nice find.  It might be noted that the Carta modification can also be applied to 32/64 bit operations as well and avoid one of the imul ops.

The Park-Miller algorithm for generating a seed sequence is:
   N = newseed = (16807 * seed) mod (m)

Where m = 0x7fff ffff and 2*(m+1) = 0x1 0000 0000

The 64 bit product, N, is created in edx:eax with a single mul op.

To find (N mod m) we note that
   N= eax + 2*(m+1)*edx = A + 2*(m+1)*D
and
   N mod m = [A + 2*(m+1)*D] mod (m)

We may subtract m any integer number of times from the term [...] without altering the result so that
   N mod m = [A + 2*D] mod (m)

The simplification is that we need subtract m no more than twice instead of doing the division to reduce [A + 2*D]
to a value <m (worst case, A=0xffffffff):
   if A>m then A =A-m
   if [A + 2*D]>m then [A + 2*D]=[A + 2*D]-m



Rand32 proc base:DWORD
mov eax,seed   
mov ecx,16807 ;a = 7^5
mul ecx ;edx:eax == a*seed == D:A
mov ecx,7fffffffh ;ecx = m

;**** Park-Miller mod
; div ecx
; mov eax,edx ;(a*seed) % 7fffffff

;**** Carta alternative mod
add edx,edx ;edx = 2*D
        cmp     eax,ecx ;eax = A
        jna     @F
        sub     eax,ecx ;if A>m, A = A - m
@@: add eax,edx ;eax = A + 2*D
jns @F
sub eax,ecx ;If (A + 2*D)>m
;****
@@: mov seed,eax ;save new seed

xor edx,edx
div base
mov eax,edx
  ret

Rand32 endp


CHI Square probabilities for identical input parameters showing identical outputs for
20 successive arrays of 10^6 random numbers.  As CHI square results depend only on the
numbers within a sequence, a sorted sequence would yield the same result.  Hence
difference and summed arrays derived from the same initial array are also analyzed to
pick up such correlations.  (Michael's 16/32 bit Carta code generates a table identical
to the 32/64 bit results shown.)


RNG Plugin: RandDefault.dll (Park-Miller)
Array Length:  1000000
Array Modulus:  256
Current Seed: 0x16D88040
                Array           Difference      Integrated      Seed   
Run 1           0.45893123      0.49381066      0.56116807      0x00001000     
Run 2           0.13524775      0.79452499      0.81649648      0x6DB93924     
Run 3           0.20580701      0.89417035      0.35652004      0x734A8D78     
Run 4           0.49430838      0.96259768      0.22683721      0x7488687E     
Run 5           0.78547309      0.46700421      0.74466750      0x4F04F658     
Run 6           0.60489427      0.22812298      0.30355235      0x74CF8E0C     
Run 7           0.99529184      0.80108366      0.36893360      0x663CA9E6     
Run 8           0.06161230      0.05181587      0.06558641      0x22BEF35B     
Run 9           0.00539331      0.79619091      0.76937956      0x74320955     
Run 10          0.16010214      0.17519416      0.99083166      0x68BCE24F     
Run 11          0.72523433      0.75827752      0.98996407      0x14A50D2D     
Run 12          0.76084247      0.88293980      0.37090924      0x628675C4     
Run 13          0.41637353      0.10459893      0.82441603      0x315AE3A4     
Run 14          0.19960434      0.34343956      0.98568563      0x427AD4A5     
Run 15          0.88955753      0.85056813      0.33715978      0x03ED7D87     
Run 16          0.17128498      0.33123620      0.13632519      0x0F0A345A     
Run 17          0.93527746      0.10697403      0.44926112      0x5599705D     
Run 18          0.60718463      0.07669533      0.35294815      0x4DF492E0     
Run 19          0.04966568      0.52528021      0.09335049      0x56E31CB9     
Run 20          0.47137351      0.00182499      0.14546479      0x6419A815     
Mean (x2)       0.91334598      0.96463502      0.98894573             
Variance (x12)  1.19511535      1.28637631      1.13330639             
------------------------------------------------------------------------------

RNG Plugin: RandPMC.dll (Park-Miller-Carta)
Array Length:  1000000
Array Modulus:  256
Current Seed: 0x16D88040
                Array           Difference      Integrated      Seed   
Run 1           0.45893123      0.49381066      0.56116807      0x00001000     
Run 2           0.13524775      0.79452499      0.81649648      0x6DB93924     
Run 3           0.20580701      0.89417035      0.35652004      0x734A8D78     
Run 4           0.49430838      0.96259768      0.22683721      0x7488687E     
Run 5           0.78547309      0.46700421      0.74466750      0x4F04F658     
Run 6           0.60489427      0.22812298      0.30355235      0x74CF8E0C     
Run 7           0.99529184      0.80108366      0.36893360      0x663CA9E6     
Run 8           0.06161230      0.05181587      0.06558641      0x22BEF35B     
Run 9           0.00539331      0.79619091      0.76937956      0x74320955     
Run 10          0.16010214      0.17519416      0.99083166      0x68BCE24F     
Run 11          0.72523433      0.75827752      0.98996407      0x14A50D2D     
Run 12          0.76084247      0.88293980      0.37090924      0x628675C4     
Run 13          0.41637353      0.10459893      0.82441603      0x315AE3A4     
Run 14          0.19960434      0.34343956      0.98568563      0x427AD4A5     
Run 15          0.88955753      0.85056813      0.33715978      0x03ED7D87     
Run 16          0.17128498      0.33123620      0.13632519      0x0F0A345A     
Run 17          0.93527746      0.10697403      0.44926112      0x5599705D     
Run 18          0.60718463      0.07669533      0.35294815      0x4DF492E0     
Run 19          0.04966568      0.52528021      0.09335049      0x56E31CB9     
Run 20          0.47137351      0.00182499      0.14546479      0x6419A815     
Mean (x2)       0.91334598      0.96463502      0.98894573             
Variance (x12)  1.19511535      1.28637631      1.13330639             
------------------------------------------------------------------------------


hutch--

What i would be interested to see with this new algo is how well it performs with the testing program ENT by John Wlaker. In partitular the chi squared test.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

Abel, thank you again for providing a better method.

The attachment contains a test app with Abel's version added, an app to generate data files for the ENT tests, and the output from ENT tests. Abel's version generates the same output as my version, in every test, but even without optimization it is 3 cycles faster on a P3.

Edit:

I forgot to add rns.asm to the zip. It will be needed to rebuild the test app, and you can get it from the attachment in my first post.

[attachment deleted by admin]
eschew obfuscation

Abel

Hutch,

The decimal numbers shown correspond the same chi-square probabilities given by ENT, if my code
is correct.  The value 0.45893123 is equivalent to an ENT result that this value would be exceeded
45.89% of the time.

There may be a misconception that the better the RNG, the closer the ENT result will be to 0.500. Not
so!  A 'perfect' RNG will return chi-square probabilities randomly distributed between 0 and 1 for a
series of runs given with different seeds (mean = 0.5, variance = 1/12).  Any generator always returning
results close to 0.5 is definitely not a RNG.

ENT has several drawbacks.  It is restricted to base 256 sequences, requires files for input, and
sorts probabilities into broad categories, >50%, >25%, ..., >0.01%.  The program I've used, still
downloadable from this forum, allows arbitrary bases, generates and analyzes sequences in RAM, and returns
numerical probabilities.  All that was required was pasting the above code into a DLL RNG template.  With
a 2GHz P4 XP, generation and analysis of the 20 sequences of 10^6 bytes took 2 sec. total.

Abel

MichaelW

This it the ENT output for 1000000 bytes generated by the Windows 2000 cryptographic service provider with the user default provider of type PROV_RSA_FULL:

Entropy = 7.999810 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 263.37, and randomly
would exceed this value 50.00 percent of the times.

Arithmetic mean value of data bytes is 127.5135 (127.5 = random).
Monte Carlo value for Pi is 3.141948568 (error 0.01 percent).
Serial correlation coefficient is -0.000298 (totally uncorrelated = 0.0).
eschew obfuscation

lingo

Is it faster without two jumps? :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
;    mov eax, rand32_seed
;    mov ecx, 16807              ; a = 7^5
;    mul ecx                     ; edx:eax == a*seed == D:A
;    mov ecx, 7fffffffh          ; ecx = m
;
;**** Park-Miller mod
;   div     ecx
;   mov     eax, edx            ; (a*seed) % 7fffffff
;**** Carta alternative mod
;
;    add edx, edx                ; edx = 2*D
;    cmp eax, ecx                ; eax = A
;    jna @F
;    sub eax, ecx                ; if A>m, A = A - m
;  @@:
;    add eax, edx                ; eax = A + 2*D
;    jns @F
;    sub eax, ecx                ; If (A + 2*D)>m
;****
;  @@:
;    mov rand32_seed, eax        ; save new seed
;    xor edx, edx
;    div DWORD PTR [esp+4]       ; base
;    mov eax, edx
;    ret 4

mov    eax, 16807       ;
mul    rand32_seed      ;edx:eax == a*seed == D:A
cmp    eax, 80000000h   ;   
lea    ecx, [eax+edx*2-7fffffffh]
sbb    eax, eax         ; eax->0 or -1
shr    eax, 1           ; eax->0 or 7FFFFFFFh
xor    edx, edx
add    eax, ecx
sets   dl
mov    ecx, [esp+4]     ; ecx->base
neg    edx              ; edx-> 0 or -1
shr    edx, 1         ; edx-> 0 or 7FFFFFFFh
sub    eax, edx         ; eax-7FFFFFFFh or eax-0 
xor    edx, edx
mov    rand32_seed, eax ; save new seed
div    ecx
mov    eax, edx
ret    4
rand32  endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Regards,
Lingo

Abel

Michael,
As an experiment, try assembling rand_ent.exe with the seeds below instead of 0x00000001.


Seed = 6ef88367h
Entropy = 7.999868 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 182.20, and randomly
   would exceed this value 99.95 percent of the times.
Arithmetic mean value of data bytes is 127.4925 (127.5 = random).
Monte Carlo value for Pi is 3.142452570 (error 0.03 percent).
Serial correlation coefficient is 0.001189 (totally uncorrelated = 0.0).

Seed = 504859a5h
Entropy = 7.999748 bits per byte.
Optimum compression would reduce the size of this 1000000 byte file by 0 percent.
Chi square distribution for 1000000 samples is 349.53, and randomly
   would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 127.4839 (127.5 = random).
Monte Carlo value for Pi is 3.146964588 (error 0.17 percent).
Serial correlation coefficient is -0.000552 (totally uncorrelated = 0.0).


My code gave probabilities of 0.99981758 and 0.00007638 for these seeds.
Note that when ENT says 50.00%, it means anything between 25% and 75%.


Lingo,
Looks like a neat optimization, but I'll defer to the cyclists.
In any case, it yields identical results here to its predecessors.

Abel

MichaelW

#8
I'm having trouble understanding why these seeds affect the Chi square distribution as they do. I have seen generators that would produce more than one sequence, depending on the starting seed, but after testing, that does not appear to be what is happening here. The period for these seeds is the same as for every other seed that I have tested, 2147483646, and the value 6ef88367h is in the sequence produced from a starting seed of 1. These seeds also disturb the Chi square distribution for nrandom, here starting with 6ef88367h:

Entropy = 7.999862 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 191.80, and randomly
would exceed this value 99.50 percent of the times.

Arithmetic mean value of data bytes is 127.4923 (127.5 = random).
Monte Carlo value for Pi is 3.142524570 (error 0.03 percent).
Serial correlation coefficient is 0.001094 (totally uncorrelated = 0.0).

And here starting with 504859a5h:

Entropy = 7.999752 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 344.76, and randomly
would exceed this value 0.05 percent of the times.

Arithmetic mean value of data bytes is 127.4841 (127.5 = random).
Monte Carlo value for Pi is 3.146916588 (error 0.17 percent).
Serial correlation coefficient is -0.000597 (totally uncorrelated = 0.0).

For the "cyclists":

P3:

98 cycles, nrandom
56 cycles, nrand
53 cycles, rand32
62 cycles, rand32_lingo


P4:

100 cycles, nrandom
44 cycles, nrand
36 cycles, rand32
60 cycles, rand32_lingo


EDIT:

The above cycle counts were from my second set of macros. I should have anticipated that others would be using the first set of macros, and used those instead. Here are the counts for the first set of macros:

P3:

86 cycles, nrandom
38 cycles, nrand
42 cycles, rand32
39 cycles, rand32_lingo


P4 (Willamette):

98 cycles, nrandom
49 cycles, nrand
52 cycles, rand32
52 cycles, rand32_lingo

eschew obfuscation

Abel

Michael,
To clarify your understanding of chi-square results, a few experiments may be insightful.  As a hypothesis, suppose that
the expression, "Chi square distribution for 1000000 samples is 191.80, and randomly would exceed this value 99.50 percent
of the times.", means that if we were to repeat the test 200 times we might expect to see a result less that 191.80. 
The only options that could lead to different results are changes of seed or sequence length or both.

My results for 20 tests with different seeds were reported above, e.g.

Seed      Probability
0x00001000   0.45893123
0x6DB93924   0.13524775
.....

If you reassemble rand_ent.exe with these seeds and ENT the BINs you should find equivalent values.  I did that half a
dozen times before the exercise became overly onerous.  Should you examine the source file ent.c, you'll find that ENT
translates probability ranges into % as:
.....
.025-.050    5%
.050-.100   10%
.100-.250   25%
.250-.750   50%
.750-.900   75%
.900-.950   90%
.950-.975   95%
.....

To show that probabilities are, in fact, uniformly distributed in the 0 to 1 interval, their mean should be
0.5 and their variance 1/12, within a factor of order 1/sqrt(number of tests).

The two "bad" seeds I provided were culled from analyses of several hundred 1M sequences in a minute or so of calculation.

I'm a bit surprised that Lingo's suggestion faired so poorly.  Intel docs suggest using SETxx and CMOVxx to optimize and
eliminate conditional jumps.  CMOVS and CMOVC anyone?

I've attached a zip containing my test program for anyone interested in playing with numbers.  Any discrepancies with ENT
results would be appreciated.

Abel


[attachment deleted by admin]

lingo

"I'm a bit surprised that Lingo's suggestion faired so poorly."

Don't worry! here is the real result with the testing program  :lol

My CPU is P4 3.6Ghz Prescott, OS - Vista
Please terminate any high-priority tasks and press ENTER to begin.


rand32 - Tests:

rand32 : 76 clocks; Return value: 4
rand32 - Lingo : 56 clocks; Return value: 4

Press ENTER to exit...


Regards,
Lingo

[attachment deleted by admin]

Mark_Larson

Quote from: Abel on February 02, 2007, 02:21:05 PM
Intel docs suggest using SETxx and CMOVxx to optimize and
eliminate conditional jumps.  CMOVS and CMOVC anyone?

Abel


  I'm a cyclist :)  The problem with CMOVxx is that unless the probability of taking each branch is close to 50%, then it runs slower than a standard conditional jump.  I code it up both ways and time it to see which is faster for my particular code/data.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

EduardoS

On my A64:

rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 47 clocks; Return value: 4


Being a RNG jumps inside it won't be miss-predicted almost half the times?

hutch--

Here is a quick tweak on Abel's algo that made it a little faster.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

rand32 proc base:DWORD

  ; *********************
  ; Carta alternative mod
  ; *********************

    mov eax, 16807
    mov ecx, [esp+4]            ; load "base" into ECX
    mul rand32_seed
    add edx, edx                ; edx = 2*D

    cmp eax, 7FFFFFFFh          ; eax = A
    jbe @F
    sub eax, 7FFFFFFFh          ; if A>m, A = A - m

  @@:
    add eax, edx                ; eax = A + 2*D
    jns @F
    sub eax, 7FFFFFFFh          ; If (A + 2*D)>m

  @@:
    mov rand32_seed, eax        ; save new seed
    sub edx, edx
    div ecx                     ; div is faster with register
    mov eax, edx
    ret 4

rand32  endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


Times on my Northwood core PIV are as follows.


Please terminate any high-priority tasks and press ENTER to begin.


rand32 - Tests:

rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 52 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4
rand32 : 53 clocks; Return value: 4
rand32 - Lingo : 53 clocks; Return value: 4

Press ENTER to exit...
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Abel

Running Lingo's code here showed a virtual dead heat at 53-54 clocks.

When both jumps were replaced with CMOVxx codes, the former dropped to 46 clocks.
When only the jump with a 50:50 probability was replaced:

Quote
Please terminate any high-priority tasks and press ENTER to begin.
rand32 - Tests:
  rand32 : 41 clocks; Return value: 68 (base changed from 16 to 256)
  rand32 - Lingo : 53 clocks; Return value: 68
Press ENTER to exit...rand32 - Tests:

This fits in with Mark's observations.  Couldn't see a further improvement with Hutch's suggestion.

The chi-square probability table was unchanged, indicating that the random sequences generated
were the same as the less-optimized code.

Abel
Pentium 4, 2.00GHZ, 768 MB RAM, XP/SP2


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
rand32 proc base:DWORD
    mov eax, rand32_seed
    mov ecx, 16807              ;a = 7^5
    mul ecx                     ;edx:eax == a*seed == D:A
    mov ecx, 7fffffffh          ;ecx = m
    add edx, edx                ;edx = 2*D
    add eax,edx                 ;N = A + 2*D
    mov edx,0
    jnc @F                      ;if N > ffffffffh (less unlikely)
    sub eax,ecx                 ;  N = N-m
@@: cmovns  ecx,edx             ;if N > 7fffffffh (50:50)
    sub eax, ecx                ;  N = N-m
    mov rand32_seed, eax        ; save new seed
    xor edx, edx
    div dword ptr [esp+4]       ; base
    mov eax, edx
    ret 4
rand32  endp