News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

RNG

Started by MichaelW, January 26, 2005, 07:31:16 PM

Previous topic - Next topic

MichaelW

RNG
I have essentially completed the RNG project I have been working on for the last month or so. The attachment contains my results.

In the final tests I included 8 generators:

The original nrandom.

The "fixed" nrandom, with the correction for negative seeds suggested by Abel (fixed_nrandom).

The shorter Park-Miller implementation that was posted by Abel (abel_nrandom).

The shorter Park-Miller implementation that was posted by Abel, with the final MOD operation changed to a shift and multiply (abel_nrandom_opt)

The msvcrt.dll (multithreaded) rand function (crt_rand).

An ASM version of the single-threaded CRT rand function (crt_rand_asm).

An ASM version of the single-threaded CRT rand function, with the final MOD operation changed to a shift and multiply (crt_rand_asm_opt).

The Cryptographic Service Provider RNG (csp_rand).

I am satisfied that all of these generators (other than the original nrandom) are at least reasonable. In the ENT tests, fixed_nrandom, abel_nrandom, abel_nrandom_opt, crt_rand_asm_opt, and csp_rand all produce similar results. And in the visual test they all (other than the original nrandom) produce apparently identical results.

The shift and multiply method of scaling the results, in addition to reducing execution time significantly, actually seems to improve the randomness of the results over those obtained by using the DIV method. But note in the MODTEST application, for calculating n mod m, the shift and multiply method and the faster AND method duplicate the results of the DIV method only when m is a power of two. I could not discover any reasonable method of correcting the results for either of the faster mod methods so they would duplicate the results for the DIV method over the entire range of m (even though I feel that it should be possible). My attempts to substitute the AND method for the shift and multiply method always destroyed the randomness of the results, as did my attempts to replace the first (64-bit) divide in the Park-Miller generators with a shift and multiply (or a combination of shifts and multiplies).

Due to the limitations of the ENT program I was unable to verify that the shorter Park-Miller generators have the same effective range as the longer Park-Miller generators (the original and fixed nrandom). AFAICT the first multiply (by 16807) cannot overflow the 64-bit result, so there is no justification for using the longer implementation, and both implementations should produce the same sequences (which they do) and have the same effective range.

The CRT generators are limited to a 15-bit effective range.

The .txt files contain the results for the ENT tests. I did not include the ENT test program on the assumption that the license prohibits redistribution. It is available (with the source code) here:

http://www.fourmilab.ch/hotbits/

Note that RNG.ASM requires that the new MSVCRT include file and library (recently posted by Hutch) be in the appropriate MASM32 directories.

The final version of csp_rand was tested under Windows 2000 only. It probably will not work under Windows 95, and it might have problems under Windows 98 or Windows XP.


[attachment deleted by admin]
eschew obfuscation

hutch--

Micael,

Thanks for doing all of this work, the results look very useful. Just to make sure I have got it right, the "fixed_nrandom" algo can be used to replace the original "nrandom" so that it has an improvement in random distribution ?
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

Quote from: hutch-- on January 27, 2005, 12:26:56 AM
Just to make sure I have got it right, the "fixed_nrandom" algo can be used to replace the original "nrandom" so that it has an improvement in random distribution ?
Yes, a giant improvement. Here are the Chi square results for the original nrandom:

Chi square distribution for 1000000 samples is 414.26, and randomly
would exceed this value 0.01 percent of the times.

And the results for the fixed version:

Chi square distribution for 1000000 samples is 255.77, and randomly
would exceed this value 50.00 percent of the times.

From the ENT documentation:

"The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated. We interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect. Percentages between 90% and 95% and 5% and 10% indicate the sequence is "almost suspect". Note that our JPEG file, while very dense in information, is far from random as revealed by the chi-square test. "

And here is the original nrandom source with the fix applied:

nrandom PROC base:DWORD

    mov eax, nrandom_seed
    xor edx, edx
    ;-------------------------
    test eax, 80000000h
    jz  @F
    add eax, 7fffffffh
  @@:
    ;-------------------------
    mov ecx, 127773
    div ecx
    mov ecx, eax
    mov eax, 16807
    mul edx
    mov edx, ecx
    mov ecx, eax
    mov eax, 2836
    mul edx
    sub ecx, eax
    xor edx, edx
    mov eax, ecx
    mov nrandom_seed, ecx
    div base

    mov eax, edx
    ret

nrandom ENDP



eschew obfuscation

MichaelW

#3
I came across an interesting generator that Paul Dixon posted to the PowerBASIC forum several years back.
Quote
The sequence repeats after 2^33-1 numbers (8.5E9). It is a fast implementation of a standard pseudo random binary sequence generator using a 33 bit feedback shift register.

So I created three different MASM versions of the generator and included them in the tests. Paul had posted several fast generators on the old forum, and I think one of them was at least similar to this one. But I recall detecting some significant problems with all of the SR-based generators I tested at that time, where this generator seems to do well in all of the tests. In my initial tests the fastest generator was crt_rand_asm_opt at 13 clock cycles on a P3, including the call and scaling overhead. Paul's generator, without scaling but including the call overhead, runs in 7 clock cycles on a P3, and this is for a full 32-bit return value, where the crt-based generators are limited to a 15-bit return value. With scaling added, the clock cycle counts increase to 17 when scaling with a multiply and 45 when scaling with a divide. I included both scaling methods because, for this generator only, scaling with a multiply degraded the chi-square distribution significantly.

I also devised what I believe to be a valid test to determine if the shorter Park-Miller generators have the same effective range as the longer Park-Miller generators. According to the test they do not, see PM_TEST.ASM.

I have updated the original attachment.
eschew obfuscation

Mark_Larson


Thanks for updating the attachment with the new stuff.  I wanna take a look see!


BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

dioxin

Michael,
   for information, here are some of the timings from my AthlonXP:
crt_rand_asm_opt: 12 clock cycles
pd_rand_base_div: 52 clock cycles
pd_rand_base_mul: 13 clock cycles
pd_rand: 6 clock cycles

I think I can explain why multiply degraded the chi-square and fix it so MUL gives the good distribution.

If you want less than 32 bits, I can do you a quicker version as it won't need to handle the 33rd bit. The code you're using was the fastest way I found to do the full 32 bits as there is no single tap maximal length 32 bit shift register we need to calculate 33 and then extract 32 of them. There are versions which will produce less than 32 bits and save a few opcodes.

I don't know if you want to discuss these things in this thread, if you do I'll try to explain them and give some examples but, for now, here's a faster 63 bit version, but it needs mmx.

Paul.

rng&&=1234566789 :REM the seed, ANY number other than zero can be used
mask&&=&h7fffffffffffffff    'keep the 63 used bits only


movq mm0,rng&&  ;get the last number
movq mm1,mm0
psllq mm0,1
pxor mm1,mm0
pand mm1,mask&&
movq mm0,mm1
psrlq mm1,62
pxor mm0,mm1
movq rng&&,mm0  ;write back the result



MichaelW

#6
Hi Paul,

After reading your post and thinking more about why the multiply degraded the chi-square, I realized that after learning from the modtest app exactly what was required to do a proper mod with a shift and multiply, I had not applied that knowledge to the scaling code. The multiply method would work correctly only if the base were a power of two, and would be unlikely to provide a speed advantage if the required shift count had to be computed at run time. And if the base were a power of two, scaling with an AND would obviously be preferable. The genrands app generates byte streams so it uses a base that is a power of 2 (the 1024 value in the update was supposed to be 256). Applying the AND method of scaling to pd_rand_base_mul, the ENT output was identical to that for pd_rand_base_div.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; This proc is a MASM version of the generator that Paul
; Dixon posted to the PowerBASIC forum on October 29, 2000,
; modified to take a base argument, scale the result with
; a multiply, and return the result in EAX.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
  ALIGN 4
  rng_base_mul  dd 1
  rngh_base_mul dd 1
.code
ALIGN 4
pd_rand_base_mul proc base:DWORD
    mov   edx, rngh_base_mul
    shr   edx, 1
    mov   eax, rng_base_mul
    mov   ecx, eax
    rcr   ecx, 1
    rcl   edx, 1
    mov   rngh_base_mul, edx
    shl   eax, 12
    xor   ecx, eax
    mov   eax, ecx
    shr   eax, 20
    xor   ecx, eax   
    mov   rng_base_mul, ecx
    ; -------------
    ;mov   eax, ecx
    ;shl   eax, 2
    ;mul   base
    ;mov   eax, edx
    mov   edx, base
    mov   eax, ecx
    sub   edx, 1
    and   eax, edx       
    ; -------------
    ret
pd_rand_base_mul endp


Entropy = 7.999809 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 264.32, and randomly
would exceed this value 50.00 percent of the times.

Arithmetic mean value of data bytes is 127.5590 (127.5 = random).
Monte Carlo value for Pi is 3.145764583 (error 0.13 percent).
Serial correlation coefficient is 0.000932 (totally uncorrelated = 0.0).


But I could be wrong (again) and I would still like to see your explanation.

My goal with the scaling was to provide uniformly distributed PRNs in a given range. The vast majority of my use of PRNs has been in test code, and the ranges produced by the RTL generators have seldom been suitable. So I view a PRNG that takes a base (range) argument as a worthwhile convenience.

Given the 32-bit nature of current processors, I think a general-purpose PRNG should be capable of returning a full 32-bit value. For reference, the Park-Miller generators are limited to 31 bits. I also think that a general-purpose PRNG should be capable of running on a 486, P1, etc. Your 33-bit feedback shift register code meets all of my requirements, and it is the fastest generator I have tested so far. The return value could be easily and quickly truncated to any smaller bit length, and a wrapper procedure that scaled with a divide could be provided for ranges that do not correspond to bit lengths.

I tested your 63-bit MMX code, scaling the lower dword with a divide. The chi-square was way off and in the visual test, at certain window sizes, the screen would stabilize to a checker board pattern of single set and clear pixels. I repeated the ENT tests, just returning the lower dword (genrands uses AL only), and the chi-square was still way off, as shown below.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
  ALIGN 4
  pd_rand63_rng  dq 1234566789 ;the seed, ANY number other than zero can be used
  pd_rand63_mask dq 7fffffffffffffffh    ;keep the 63 used bits only
.code
ALIGN 4
pd_rand63 proc base:DWORD
    movq  mm0, pd_rand63_rng  ;get the last number
    movq  mm1, mm0
    psllq mm0, 1
    pxor  mm1, mm0
    pand  mm1, pd_rand63_mask
    movq  mm0, mm1
    psrlq mm1, 62
    pxor  mm0, mm1
    movq  pd_rand63_rng, mm0  ;write back the result
    ; -------------
    mov   eax, DWORD PTR pd_rand63_rng
    ;xor   edx, edx
    ;div   base
    ;mov   eax,edx
    ; -------------   
    ret
pd_rand63 endp


Entropy = 7.999728 bits per byte.

Optimum compression would reduce the size
of this 1000000 byte file by 0 percent.

Chi square distribution for 1000000 samples is 378.02, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 127.6186 (127.5 = random).
Monte Carlo value for Pi is 3.118068472 (error 0.75 percent).
Serial correlation coefficient is -0.002671 (totally uncorrelated = 0.0).

eschew obfuscation

dioxin

Linear Feedback Shift Registers
  The "shift register with feedback" type of random number generator is straight forward and it expalined in lots of places e.g.: http://www.newwaveinstruments.com/resources/articles/m_sequence_linear_feedback_shift_register_lfsr.htm
   
The principle is to take a shift register of the required number of bits then shift it one bit at a time while shifting in a new bit which consists of the bit being shifted out XORed with a number of other bits from the register which are the "tap" positions.
This is repeated for each bit in the register so that all bits are updated. So, for a 32 bit shift register, it's necessary to do 32 shift/XORs.
Certain combinations of register length and tap positions give rise to a maximal length sequence which gives a sequence length of (2^NumberOfBits)-1. Which sequences have this property have been figured out by others and I just take their results. Websites are out there with lists of such things such as the webiste mentioned above.
At the bottom of that page is a set of liks to lots of maximal length sequences.
There is another list at: http://homepage.mac.com/afj/taplist.html

All I do is implement these sequences in an efficient way.


How the New Bits form

Suppose we have a 32bit shift register with a tap at bits 32(MSB, there is always a tap at the MSB) and 20 and we shift it left each time.
Let's give each bit a number, 1 to 32, representing where that bit was in the register at the start.
The first shift will drop bit 32 and the new bit 1 which gets shifted in will be (32 XOR 20).
The next shift will give a new bit of (31 XOR 19), then (30 XOR 18) and so on for 20 bits. The 20th shift will give a new bit of (13 XOR 1).
Now things change, the next shift will give a new bit of the current MSB XORed with the bit in the tap position, these are now 12 and the first new bit, i.e.(32 XOR 20) as the first new bit has now been shifted along to the tap position.

So this new bit is (12 XOR 32 XOR 20).
Similarly, the next new bit will be (11 XOR 31 XOR 19) and so on until and so on until the 32nd and final bit is done which will be (1 XOR 21 XOR 9)


You can see that the 20 most significant new bits are combinations of 2 old bits. The remaining 12 new bits are combinations of 3 old bits. So the amount of toggling done by the lower order bits is greater than that of the higher order bits.
This is a characteristic of all pseudo random binary sequences based on this type of feedback shift register.


Why MUL and DIV scale differently

I think this is the reason Michael gets different results when scaling with MUL and DIV. When DIV is used to scale it takes the remainder, which consists of fast changing low order bits. When MUL is used to scale it then it's the higher order bits which are multiplied up to give the scaled number.
So, using MUL in this case will give less rapidly varying random numbers than using DIV.

The solution should be to reverse the shifting so the bits are shifted in to the high end so the rapidly varying bits will end up at the most significant end of the register and then be used by MUL. It makes little difference to the shift register sequence as it's the same logic just in different bit positions.

If I'm right then this version should do better when scaled by MUL than by DIV. The only difference is the shift register is shifted the other way.

mov edx,rngh???
shr edx,1
mov eax,rng???
mov ecx,eax
rcl eax,1
rcl edx,1
mov rngh???,edx

shr ecx,12
xor eax,ecx
mov ecx,eax
shl ecx,20
xor eax,ecx
mov rng???,eax 



How to implement this type of RNG
The obvious way to implement this type of random number generator is to loop the required number of times (once for each bit in the shift register) and in each iteration XOR the relevant bits and shift in the result.
This turns out to be very slow since 32 loops would be needed for a 32 bit register.

But if we look at the bits we need in the output and compare them to the input bits there is a short cut that allows lots of the bits to be shifted and XORed in parallel, saving lots of time.

Provided there is a single tap and it's in the upper half of the register then we can do the following,
Begin with the current seed.
Copy the seed so we have it twice.
Shift the copy left by (register length - tap position) bits, in the above example that would be 32 - 20 = 12 bits.
XOR the original value with the shifted value.
optional MASK to remove unused bits if the register is larger than the sequence generator.
Copy the XORed value so we have it twice.
Shift the copy right by (Tap Position) bits, in the above example this would be 20 bits.
XOR with the unshifted version.
That's it!

The first shift and XOR does all 12 MSBs. The second shift and XOR does the remaining 20 LSBs.

The big advantage of this is that, regardless of the number of bits in the shift register, we can do all of the bits in 2 steps.

Let's do an example.. we may as well use the one I posted earlier!

        Tap position = 62, shift register length = 63 bits.
    movq  mm0, pd_rand63_rng    ;get the last number
    movq  mm1, mm0              ;copy it
    psllq mm0, 1                ;shift left by (register length - tap position) bits = 63-62 = 1
    pxor  mm1, mm0              ;XOR the two
    pand  mm1, pd_rand63_mask   ;this step is required if the register is not fully used, to mask unused bits.
    movq  mm0, mm1              ;take a copy
    psrlq mm1, 62               ;Shift the copy right by (Tap Position) bits, = 62
    pxor  mm0, mm1              ;XOR the two numbers
    movq  pd_rand63_rng, mm0    ;write back the result


        We can now guess why this version, although giving a much longer sequence length, is not a good as the 33 bit version.
Because the tap is only 1 bit from the end then only 1 of the new bits will depend on 3 old bits.
The remaining 62 new bits are combinations of only 2 old bits so the will appear to vary less between consecutive values.

        It would appear best to choose a tap position in the top half but as near to the middle as possible.

        We could increase the number of taps used in order to improve the randomness but it turns out that all maximal length sequences have an even number of taps (including the bit being shifted out) so we would need to triple the work being done and hence increase the time taken, if we try to implement the next number of taps.
        One tap is usually the fastest but let's try a 3 tap version anyway to see how it looks. From one of the tap lists above there's a 32bit maximal length sequence with taps at 32,27,20,18. These are reasonably spread out and have taps near the middle, but still in the top half so we can use the above method:

'a 3 tap, 32 bit PRBS with taps at 27,20 & 18.
mov eax,rng???  ;get the current value
mov ecx,eax     ;take a copy
shr eax,5       ;1st tap @27, 32-27=5 (note, we're shifting right to give MUL the scaling advantage)
xor ecx,eax
shr eax,7       ;2nd tap at 20, that's 7 more bits that last tap (27)
xor ecx,eax
shr eax,2       ;3rd tap at 18, 2 more bits than 20
xor ecx,eax
'and eax,mask   ;no need to mask off unused bits as the entire is being used (32 bits)

mov eax,ecx     ;copy
shl ecx,18      ;2nd half of 3rd tap
xor eax,ecx
shl ecx,2       ;2nd half of 2nd tap
xor eax,ecx
shl ecx,7       ;2nd half of 1st tap
xor eax,ecx
mov rng???,eax
mul base       ;scale if needed


        This one is longer than the 33bit version because of the extra 2 taps needed but it doesn't need to mess with the 33rd bit so time is saved there.
        I've shifted right first to give MUL the advantage when scaling, but with 3 taps this should be less of a problem.
        I've also rearranged the registers so the result is calculated in eax (where Michael seems to want it) instead of needing to be moved there.


Paul.

Abel

Michael,

For a proper coding of Park-Miller-Schrage, correction for negative signs should be done before the seed is used.  The pm_test differences then vanish.  Numerical Recipes has a readable description of Park-Miller (Chap. 7.1),
http://www.library.cornell.edu/nr/cbookfpdf.html

The Bays-Durham shuffle may improve the algorithm a bit, but that's probably gilding the lily.  Because of NR's copyright restrictions, I'd be hesitant to convert their code directly to asm, but the concepts are straightforward.

Abel

I can't get this page to load completely with Opera 7.54 on dial-up, it's hanging waiting for icons and whatever that never materialize, but suggested code is:

fixed_nrandom proc base:DWORD
    mov   eax, fixed_nrandom_seed
    xor   edx, edx
    mov   ecx, 127773
    div   ecx
    mov   ecx, eax
    mov   eax, 16807
    mul   edx
    mov   edx, ecx
    mov   ecx, eax
    mov   eax, 2836
    mul   edx
    sub   ecx, eax
    jns   @F
    add   ecx, 7fffffffh
@@: xor   edx, edx
    mov   eax, ecx
    mov   fixed_nrandom_seed, ecx
    div   base
    mov   eax, edx
    ret
fixed_nrandom endp

MichaelW

Abel,

Thanks for the information and the corrected code. So my fixed_nrandom isn't, and if it were it would produce exactly the same results as your shorter and faster version. That same chapter in Numerical Recipes lists some quick and dirty generators that I would like to investigate. Do you know of a good source for information on statistical testing of random number generators?

Paul,

Thanks for the code and explanations. I'm planning on getting back to this as soon as I get an opportunity.
eschew obfuscation

Abel

Michael,
Statistics isn't one of my long suites.  Along with politics and religion, randomness isn't a fit subject for mixed company.  Chapter 14 of NR delves into some discussion (14.3 for chi-2) of a variety of distributions.  Whether they offer any more than Walker's program, I can't say.  (It might be educational to cast his C code into asm to gain some insight into chi-2.)  In any case, a single number doesn't provide a full characterization for any distribution.  By analogy to skewness and kurtosis for normal distribution analyses, one can extrapolate to chi-3, chi-4, ..., etc., as additional tests, although I've not seen their citation.  Park-Miller has been around a long time and 16807 dates back to 1969 (Lewis, Goodman, and Miller).  Unless you're undertaking a PhD thesis, I'd recommend sticking to coding existing algorithms already exquisitely analyzed and tested by such candidates.

Regards,
Abel

Kestrel

#11
Simple 32bit RNG, try this: (can seed=0)

gsl_rng_vax:
a = 69069, c = 1 and m = 2^32
x(n+1) = (a * x(n) + c) mod m

Random Number Generation
http://www.gnu.org/software/gsl/manual/gsl-ref_17.html

Kestrel

#12
;==========================================================
        .data
        align 8

TinyRNG_x   dd  0
TinyRNG_a   dd  100711433

        .code
;----------------------------------------------------------
        align 8
Tiny_Random proc uses edx   iRange

    rdtsc
    adc eax, edx
    adc eax, TinyRNG_x
    mul TinyRNG_a
    adc eax, edx
    mov TinyRNG_x, eax

    mul iRange
    mov eax, edx
    ret

Tiny_Random endp
;==========================================================

I modified this version:
http://cyrillesavelief.sinfree.net/rng32.html

Mark Jones

Did these ever make it into SP2a?

A few months ago I tried the original nrand routine, using the fractional part of the RTC as seed and the results were very NON-random. In fact, every time I ran it, it produced the same sequence of numbers, regardless of the seed! We need a good, working RNG.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08