ASM for FUN - #4 SUB [Playing with random numbers generation]

Started by frktons, April 24, 2010, 04:18:07 PM

Previous topic - Next topic

frktons

In this topic I'd like to prepare a PB program that generates sequences of random
numbers, creating an array of 5000 rows of 64 long integer numbers, and writing the
array to a file when the generation is complete.

The rules are very simple:

1) numbers are in the range 1-50
2) they are generated in groups of 4 at a time
3) in the same group we can't have the same number more than 1 time
4) 16 groups of 4 numbers are a row in the array
5) when we have 5000 rows we write the whole array into a binary file

The elapsed CPU cycles are displayed on a screen format, along with the
progressive number of generated rows.

We'll have:

- RandGen.bas  - the source program
- RandGen.exe - the compiled program
- RandGen.scn - the screen format displaying the generation progress
- RandGen.dat - the output binary file we create

After I've created the PB version, I'll try to optimize any single SUB
with INLINE ASM.

Working on it, The Idea [Design] is ready, The Algorithm and CODE [implementation] not yet.  :P

I'll be back

Mind is like a parachute. You know what to do in order to use it :-)

dedndave

soooooooo
that's 80,000 groups of 4 dwords   :P
75% of the file is 0's

frktons

Quote from: dedndave on April 24, 2010, 05:02:43 PM
soooooooo
that's 80,000 groups of 4 dwords   :P
75% of the file is 0's

That's right, but I think it is faster doing so. Isn't it?

By the way here we have the first partial test attached.

I guess it is optimizable a lot. Let's see what we get at the end  :8)
And, of course, please suggest any optimization you can imagine  :P

@ Hutch: I suspect the MS tic you previously suggested, aren't milliseconds
or at least they looks like something else.
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

for the memory, i would use GlobalAlloc - don't bother clearing it out
allocate enough for the entire file ~1.22 Mb

for the innermost loop that generates 4 numbers from 1 to 50
use mutiply-to-divide for the scaling - that'll be much faster than using DIV

for the first number, scale the random gen result from 1 to 50
put it in the file buffer (call that #1)

for the second number, scale the random gen result from 1 to 49
if it is equal to or greater than #1, add one to it's value to adjust
put it in the file buffer (call that #2)

for the third number, scale the random gen result from 1 to 48
if it is equal to or greater than #1, add one to it's value to adjust
if it is equal to or greater than #2, add one to it's value to adjust
put it in the file buffer (call that #3)

for the forth number, scale the random gen result from 1 to 47
if it is equal to or greater than #1, add one to it's value to adjust
if it is equal to or greater than #2, add one to it's value to adjust
if it is equal to or greater than #3, add one to it's value to adjust
put it in the file buffer (call that #4)

that way, you generate 4 unique values from 1 to 50 without having to punch the random generator more than 4 times

do that 80,000 times, then write the file   :P
after you are done, free the allocated memory block

EDIT - there is a little flaw in this method
for example, if #4 does not get incremented when tested against #1
but, then, it does get incremented when tested against #2
it may cause it to miss an increment against the value of #1
let me think about it for a bit - there has to be a simple fix   :8)
the idea is, when you get around to picking the forth value, there are only 47 possible valid values for it
the other 3 values have already been used

one other note...
a lot will depend on how random you need the values to be
a "good" random number generator will likely be slower than a "run of the mill" generator
if it isn't that critical, a simpler generator can be pretty fast

dedndave

another approach that sounds promising would be to generate one value
from 0 to 49*48*47*46
use multiply-to-divide to decode the 4 values
then you have the same problem, plus you need to scramble the 4 values - lol
but - it sounds like it could have merit
the scramble permutation code could be part of the value
so, it would be 0 to 49*48*47*46*23, or 116,955,552 (the number of possible 4-value groups)
the advantage here is there would only be one pull from the generator for each group
this is the way to fly, if you can figure out how to translate the groups

frktons

@ Dave: if you'd like to do so, please try to download the previous attached zip file, and have a look at what I
already did, so we can see, step by step, how to put some INLINE ASM,
with a better algorithm, into the pgm.  ::)

The main part of the pgm related with random number generation:

'--------------------------------------------------------------------------------
' Generate 5000 rows of 64 integer numbers - PB version
'--------------------------------------------------------------------------------

SUB TestPB

    FOR x = 1 TO 5000
        CALL GenRowPB
        LOCATE 10,21
        PRINT FORMAT$(x,"###,")
    NEXT x
   
END SUB

'--------------------------------------------------------------------------------
' Generate single row of 64 integer numbers - PB version
'--------------------------------------------------------------------------------

SUB GenRowPB
     
    RANDOMIZE TIMER

    FOR y = 1 TO 16
   
    again:
   
        g&(1) = RND(Min_n,Max_n)
        g&(2) = RND(Min_n,Max_n)
        g&(3) = RND(Min_n,Max_n)
        g&(4) = RND(Min_n,Max_n)
       
        ok = si
       
        FOR x2 = 1 TO 3
            FOR x3 = x2 + 1 TO 4
                IF g&(x2) = g&(x3) THEN
                    ok = no
                    EXIT
                END IF
            NEXT x3
            IF ok = no THEN
                EXIT
            END IF
         NEXT x2
         IF ok = no THEN
             GOTO again
         END IF
    NEXT y

END SUB


Enjoy

Mind is like a parachute. You know what to do in order to use it :-)

dedndave

┌───────────────────────────────── RandGen ────────────────────────────────────┐
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│ Testing the performance of PB and INLINE ASSEMBLY for the generation of      │
│                                                                              │
│ random integer numbers [5000 rows of 64 numbers total].                      │
│                                                                              │
│         Version: PB                               Version: ASM               │
│┌────────────────────────────────────────────────────────────────────────────┐│
││ Rows generation: 5,000                  Rows generation: x,xxx             ││
│└────────────────────────────────────────────────────────────────────────────┘│
│┌────────────────────────────────────────────────────────────────────────────┐│
││ Elapsed CPU cycles: 5,614,049,693       Elapsed CPU cycles: x,xxx,xxx,xxx  ││
│└────────────────────────────────────────────────────────────────────────────┘│
│┌────────────────────────────────────────────────────────────────────────────┐│
││ Elapsed time in MS: 1,859               Elapsed time in MS: x,xxx          ││
│└────────────────────────────────────────────────────────────────────────────┘│
│┌────────────────────────────────────────────────────────────────────────────┐│
││                     M A S M - F O R U M    2 0 1 0                         ││
│└────────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ Messages from the program: Random numbers generation complete - PB version   │
│                                                                              │

Messages from Dave: that's over 70,000 cycles per group - i know we can beat that   :bg
but - you can speed that code up just by eliminating some loop overhead
try a single loop of 80,000 groups
i may play with it somemore - after my nap

frktons

Well, I eliminated the unnecessary print of the counter, and used a single loop
of 80000 cycles of 4 numbers, and some other tricks, and the CPU counter
shows a 50 fold improvement.
This is just the algorithm, we didn't put our hands on the bytes with ASM yet  :8)


'--------------------------------------------------------------------------------
' Generate 5000 rows of 64 integer numbers - PB version
'--------------------------------------------------------------------------------

SUB TestPB

    DIM gen(1 TO 50) AS LONG

    FOR x = 1 TO 50
        gen(x) = 0
    NEXT x

    RANDOMIZE TIMER

    FOR x = 1 TO 80000

        g&(1) = RND(Min_n,Max_n)
        gen(g&(1)) = 1

    gen2_again:

        g&(2) = RND(Min_n,Max_n)

        IF gen(g&(2)) = 0 THEN
            gen(g&(2)) = 1
        ELSE
            GOTO gen2_again
        END IF

    gen3_again:

        g&(3) = RND(Min_n,Max_n)

        IF gen(g&(3)) = 0 THEN
            gen(g&(3)) = 1
        ELSE
            GOTO gen3_again
        END IF

    gen4_again:

        g&(4) = RND(Min_n,Max_n)


        IF gen(g&(4)) = 0 THEN
            gen(g&(1)) = 0
        ELSE
            GOTO gen4_again
        END IF

        gen(g&(2)) = 0
        gen(g&(3)) = 0

    NEXT x

    LOCATE 10,21
    PRINT "5,000"

END SUB     

         
The main improvements are:

- the RANDOMIZE instruction is used only once instead of 80000 times
- the printing of the counter has been eliminated during the process
- instead of 6 IF we do 3 IF
- instead of regenerating the whole group if a number is generated more than once,
  we regenerate only the single number
- instead of 5000 CALLs to a SUB we have only one CALL
- we test if the number is already generated in the group with an array of flags

Maybe we can find some more points to optimize before ASM translation  :P
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

that's a little better - i get about a 30:1 improvement on my prescott CPU

the heart of the code is probably the random number generator, itself
you could make a big improvement by using an ASM random number generator inside the PB loop
as far as improving the loop, ASM isn't likely to make that much difference, as you saw in the loop test

but, deriving all 4 group values from a single random number would speed things up, too
that would hold true whether the code was written in ASM or PB
i am trying to come up with an algo for that in my head

frktons

Night time, going to sleep.  :dazzled:

I'll be back.

Cheers
Mind is like a parachute. You know what to do in order to use it :-)

frktons

Quote from: dedndave on April 24, 2010, 10:28:23 PM
deriving all 4 group values from a single random number would speed things up, too
that would hold true whether the code was written in ASM or PB
i am trying to come up with an algo for that in my head

Well I', curious about that single random number generating 4 numbers.
Let me know when the algo comes to you  :bg
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

well - we can use a simple "Mixed Linear Congruential (LGC)" random number generator
i miscalculated earlier - the number of possible 4-value groups is 132,652,800
we may be able to use that value as the divisor in the random number generator
that way, it spits out integers from 0 to 132,652,799
then, we can use a little algo to convert that value into the 4 group values
i am still working on that in my head   :P
i know it can be done - i did something very similar several years ago (~25 years - lol)
the question is - can we make it faster than 4 pulls on a random number generator

FORTRANS

Hi,

   One read of a random number can certainly be faster than
four.  However, the additional manipulation will slow things
down, unless it is like a table look up.  Big table.  Or just chop
up the returned random number into pieces.  Not too good
if the randomness is supposed to be good, but for "human"
use, it could be okay.

   However using a linear congruential generator is probably
fast enough for four "in line" usages.

Cheers,

Steve N.

dedndave

you may be right, Steve
if you can just make a fast RNG and pull it 4 times, or even 5 (to re-order)
it can be done the other way, i am sure
but it would take a lot of code development and perhaps an LUT (i am thinking 24 dwords for branch vectors)
i may write a 4 pull routine, just to see what the numbers are like
if it seems like too many clock cycles, then look at the single pull

the method i am thinking of extracting 4 values from a single one involves 4 multiply-to-divide's
if the RNG is fast enough, they will be about the same times

frktons

Quote from: dedndave on April 25, 2010, 09:57:15 AM
- the number of possible 4-value groups is 132,652,800

Sorry Dave, how did you get that 132,652,800 value?
We have numbers in the range 1-50, and we can't have the same number in the same 4-value group,
so it looks a little bit TOO BIG.

I don't think I'm getting your point, please can you detail your idea?
Mind is like a parachute. You know what to do in order to use it :-)