News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Dave's nickname hijacked??

Started by raymond, August 12, 2011, 03:04:56 AM

Previous topic - Next topic

dedndave

by changing the modulo divisors to 234, i get the correct result for Nmax=10
the problem comes when i have to multiply a 34-bit temporary operand onto a 34-bit exponential accumulator
( 234-1 )2 MOD 234
i'll have to extend the precision a little for that one operation
i need 68 bits of precision, just long enough to take the modulus - lol
i can do a sort of "partial multiply"

dioxin

On a 3GHz Phenom II x4 the following straight forward single threaded (using 1 core) PowerBASIC program gives these results:
N=1,000     time=       4.28 ms  Answer = 9110846700
N=10,000    time=     449.42 ms  Answer = 6237204500
N=100,000   time=   49900.84 ms  Answer = 3031782500
N=1,000,000 time= 5504076.67 ms  Answer = 4077562500


These are the same results as Clive and on a similar CPU but significantly faster than his c version.

I also have a better algorithm but it's not quite finished. It's heading for about 1.5ms for N=1000 and 3.3s for N=1,000,000
This would be about 10x slower than the times quoted earlier by Raymond:
Quote1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms

..so I might not bother finishing it!

Paul.
%CPUfreq = 3000000000
%MaxValue = 100000

#DIM ALL

FUNCTION PBMAIN () AS LONG
LOCAL Sum,k, x AS QUAD
LOCAL t,r AS LONG

PRINT "N=";%MaxValue
TIX x

FOR r = 1 TO %MaxValue
    k = 1
    FOR t = 1 TO r
        k = (k * r) MOD 10000000000

    NEXT

    sum = (sum + k) MOD 10000000000
NEXT

TIX END x
PRINT "Answer =";sum
PRINT "Tix";x
PRINT "time="x/%CPUfreq*1000;"ms"
WAITKEY$

END FUNCTION
 

dedndave

hi Paul - good to see you

those are very impressive times
it would be interesting to see how the compiler implements the algo

dioxin

Quoteit would be interesting to see how the compiler implements the algo
The above program FOR/NEXT loop compiles as follows:
CPU Disasm
Address   Hex dump          Command                                  Comments
002010DA    C7C7 01000000   MOV EDI,1                               ; FOR r = 1 ..
002010E0    DB05 943A2000   FILD DWORD PTR DS:[203A94]              ; k=1
002010E6    DDD9            FSTP ST(1)
002010E8    89BD 30FFFFFF   MOV DWORD PTR SS:[EBP-0D0],EDI
002010EE    C7C6 01000000   MOV ESI,1                               ; FOR t = 1 ..
002010F4    E9 1E000000     JMP 00201117
002010F9    90              NOP
002010FA    90              NOP
002010FB    90              NOP
002010FC    90              NOP                                     ;pad to align
002010FD    90              NOP
002010FE    90              NOP
002010FF    90              NOP
00201100    DF2D 983A2000   FILD QWORD PTR DS:[203A98]              ; 10,000,000,000             
00201106    897D 94         MOV DWORD PTR SS:[EBP-6C],EDI           ; r
00201109    DB45 94         FILD DWORD PTR SS:[EBP-6C]
0020110C    D8CA            FMUL ST,ST(2)                           ; k * r
0020110E    E8 B5190000     CALL 00202AC8                           ; MOD 10,000,000,000
00201113    DDD9            FSTP ST(1)                              ; k = result
00201115    FFC6            INC ESI                                 ; NEXT t
00201117    8BC6            MOV EAX,ESI
00201119    3B85 30FFFFFF   CMP EAX,DWORD PTR SS:[EBP-0D0]
0020111F  ^ 7E DF           JLE SHORT 00201100
00201121    DF2D 983A2000   FILD QWORD PTR DS:[203A98]              ; 10,000,000,000
00201127    D9C1            FLD ST(1)                               ; k
00201129    D8C3            FADD ST,ST(3)                           ; sum + k
0020112B    E8 98190000     CALL 00202AC8                           ; MOD 10,000,000,000
00201130    DDDA            FSTP ST(2)
00201132    FFC7            INC EDI                                 ; NEXT r
00201134    81FF 40420F00   CMP EDI,0F4240
0020113A  ^ 7E A4           JLE SHORT 002010E0


00202AC9    D9EE            FLDZ
00202ACB    D8DA            FCOMP ST(2)                              ; FLOAT 10000000000.00000000
00202ACD    9B              WAIT
00202ACE    DFE0            FSTSW AX
00202AD0    F6C4 40         TEST AH,40
00202AD3    75 0C           JNE SHORT 00202AE1
00202AD5    D9F8            FPREM                                    ; MOD
00202AD7    9B              WAIT
00202AD8    DFE0            FSTSW AX
00202ADA    F6C4 04         TEST AH,04
00202ADD  ^ 75 F6           JNE SHORT 00202AD5
00202ADF    D9C9            FXCH ST(1)
00202AE1    DDD8            FSTP ST
00202AE3    58              POP EAX
00202AE4    C3              RETN

dedndave

QuoteThis would be about 10x slower than the times quoted earlier by Raymond:
Quote
1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms

you never can tell - lol

he might have you beat for the small one, but yours may win on the big ones   :bg

dioxin

It can be sped up a little (4-5%) by a more efficient conversion to ASM but the huge majority of the time (about 80% of the total)  is spent on the fprem instruction so it's not going to get a lot faster unless the algorithm is changed or the fprem is replaced with something better.

Paul.
FOR r = 1 TO %MaxValue

!fld1               'k=1
!fild r
!fild TenBillion

!mov eax,r          't
#ALIGN 16
lp1:
!fld st(1)          'replicate r
!fmul st(0),st(3)   'r*k
!fprem              'MOD 10,000,000,000
skip:
!fstp st(3)         'mov result back to k

!dec eax
!jne short lp1

!fcompp             'dump 2 items off stack
!fistp k

    sum = sum + k
NEXT

raymond

Paul,
That 250 ms for 1000000 was to generate only the last 9 digits using 32-bit integers. A similar time would be expected to get the last 10-18 digits using 64-bit integers on a 64-bit machine (but not on a 32-bit machine).

As a side note, I'm really surprised that this thread is getting so much attention. 449 views at last count! :dazzled:

And I will be away without access to a computer nor to the internet starting tomorrow morning until the end of this week. :(
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dedndave

we'll save your seat, Ray   :bg

FORTRANS

Hi,

   Probably off-topic, or maybe a by the way kind of thing.  But
the second to last bug fix in my code that got me from N = 9
to N = 10 showed some unexpected behavior with the BCD
instructions.  (Well it surprised me a bit.)


; Off by 16 code contained the following.
        ADD     AL,[DI] ; If there is any existing data.
        XOR     AH,AH
        AAA             ; Adjust after addition.

; The working code is now.
        XOR     AH,AH   ; Required as AAA increments AH, not a put result.
        ADD     AL,[DI] ; If there is any existing data.
        AAA             ; ASCII Adjust after Addition.


   Pickey, pickey, pickey.

Cheers,

Steve N.

dedndave

that's because AAA works by using the AC and CF flags, i think
XOR clears them   :P

as for my algo, it really needs to be coded in SSE/SSE2
but then, it would have to be compared with the other algo, also written using SSE/SSE2 - lol
not sure i want to spend the time to learn SSE at this point

dioxin

QuoteThat 250 ms for 1000000 was to generate only the last 9 digits

I see. In that case, below is my solution.

Here are the results it gives:
N= 10         time=        0.003ms Answer =  405071317
N= 100        time=        0.030ms Answer = 9027641920
N= 1000       time=        0.447ms Answer = 9110846700  14.9x
N= 10000      time=        6.092ms Answer = 6237204500  13.6x
N= 100000     time=       76.722ms Answer = 3031782500  12.6x
N= 1000000    time=      926.138ms Answer = 4077562500  12.1x
N= 10000000   time=   10,878.353ms Answer = 4535362500  11.7x
N= 100000000  time=  125,184.016ms Answer = 9113362500  11.5x
N= 1000000000 time=1,412,944.585ms Answer = 4893362500  11.3x longer than the one above

It's not quite linear but isn't bad.

The program is in PowerBASIC but uses ASM to do the complicated (x * y) MOD 10000000000.

Paul.
%CPUfreq = 3000000000
%MaxValue = 1000000

#DIM ALL

FUNCTION PBMAIN () AS LONG
LOCAL Sum,k,j, x, TenBillion, temp AS QUAD
LOCAL t,r AS LONG
LOCAL prod3,prod2,prod1 AS LONG
LOCAL TwoToThePower34Mod1E10, TwoToThePower64Mod1E10 AS QUAD

TenBillion = 10000000000
TwoToThePower34Mod1E10 = 7179869184
TwoToThePower64Mod1E10 = 3709551616

TIX x

sum = 0

FOR r = 1 TO %MaxValue
    k = 1
    t = r
    temp = r
    WHILE  t > 0
        IF (t AND 1) THEN
            'k = (k * temp)  MOD 10000000000. k and temp are 34 bit (10 digits)
            !lea eax,k
            !call DoCalc

        END IF

        'temp = (temp x temp) MOD 1E10.  temp is 10 digits (34 bits)
        !lea eax,temp
        !call DoCalc

        SHIFT RIGHT t,1
    WEND

    sum = (sum + k) MOD 10000000000
NEXT

TIX END x

PRINT "N=";%MaxValue ,;
PRINT "time=";FORMAT$(x/%CPUfreq*1000,"###,###,###.000");"ms " ;
PRINT "Answer =";sum

WAITKEY$
EXIT FUNCTION


#ALIGN 16
DoCalc:
        'on entry eax = address of the variable to update
        'var = (var * temp) MOD 1E10   (all values 34 bit)
        'could maybe be done in FPU in 2 blocks instead, might be quicker.

        !pushad             'save registers

        !mov esi,[eax]      'get the variable into edi:esi
        !mov edi,[eax+4]

        !mov eax,temp       'do low x low dword first
        !mul esi
        !mov ebx,eax        'store in ecx:ebx
        !mov ecx,edx

        !mov eax,temp[4]    'do low x high dword
        !mul esi            'now finished with esi so I can reuse it as the high dword of result
        !add ecx,eax        'add to result so far
        !mov esi,edx
        !adc esi,0          'carry into high register



        !mov eax,temp       'do high x low dword
        !mul edi
        !add ecx,eax        'add to result so far
        !adc esi,edx        'carry into high register


        !mov eax,temp[4]    'high x high only a few bits in this one so it never overflows to the next register
        !mul edi
        !add esi,eax

        !mov prod1,ebx      'store the full result
        !mov prod2,ecx
        !mov prod3,esi

        !mov edi,ecx        'this will be the surplus bits (low part)
        !shr edi,2          'shift bits into place

        !and ecx,&h3        'remove bits 34 upwards


        !mov eax,TwoToThePower34Mod1E10[4]         'correct for the 1st block of overflow bits
        !mul edi
        !xchg eax,edi
        !mul dword ptr TwoToThePower34Mod1E10
        !add edx,edi

        !add ebx,eax
        !adc ecx,edx


        !mov eax,TwoToThePower64Mod1E10[4]         'correct for the 2nd block of overflow bits
        !mul esi
        !xchg eax,esi
        !mul dword ptr TwoToThePower64Mod1E10
        !add edx,esi

        !add ebx,eax
        !adc ecx,edx

        'now store the result
        !mov prod1,ebx
        !mov prod2,ecx

        'this sum may have overflowed a few bits due to all the adds and needs to be reduced mod 1e10
        !fild TenBillion
        !fild qword ptr prod1
        !fprem


        !mov eax,[esp+28]       'get address of answer
        !fistp qword ptr [eax]  'save answer

        !fcomp              'discard unused item off the stack

        !popad              'restore registers
        !ret


END FUNCTION

                     

dedndave

that beats my best effort using the FPU   :U

        'this sum may have overflowed a few bits due to all the adds and needs to be reduced mod 1e10
        !fild TenBillion
        !fild qword ptr prod1
        !fprem


couldn't you simply AND off any excess bits ?
10 billion can be fully expressed with 34 bits - just whack off those above

dioxin

Dave,
you can't just discard the most significant bits. The problem would be trivial if you could!
The first bit beyond the 10 digits is bit 34. 2^34=17,179,869,184
If you chop off that 1 bit then you effectively subtract 17,179,869,184 from the number and the values of all the lower digits are affected.
It would be different in decimal where discarding the 11th digit would be the same as subtracting 10,000,000,000 but all of those zeros mean that none of the lower digits would be affected.

Paul.

dedndave

ahhhhhh
that's where i am going wrong !!!!

in the algo i have been playing with, i truncated bits - not decimal digits
let me see if it helps   :P

dedndave

well - it does help - it increases the range
however, not far enough
no matter how i slice it, i think my algo needs more precision

if we only needed 9 digits, mine would work
of course, if we only needed 9 digits, the whole problem is easier   :bg