News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Dear Math and FPU-Experts - help needed!

Started by cobold, August 16, 2008, 03:47:03 AM

Previous topic - Next topic

dsouza123

This version (still base 1 000 000 000) finds k. 

Modified and renamed dword to ascii string routine, now ddtoa,
to return in eax the position/index of the zero byte in the sz result.

The index was already available and is used in appending the ascii string version
of the second dword (9 digits), because the first dword may not have 9 digits
so the next 9 are needed to ensure at least 9.

Takes about 22 seconds on 1.2 Ghz Athlon, XP Pro SP2.

Output is 4 MessageBoxes stating:
  found k, first 9(+) digits, last 9 digits, max index used in the array.

Source and executable in the attached FibB05c.zip

[attachment deleted by admin]

Mark_Larson

#16
Quote from: dsouza123 on August 17, 2008, 03:46:35 AM
I used base 1 000 000 000 ( 1 billion ), with each dword holding 9 digits
and the array size used, 2048 dwords, it can handle a fibonnaci number with 18432 digits.

It does checking to prevent overflowing bounds.

Mixed pascal/masm pseudo code.

     for i := 0 to max do
       a[i] := a[i] + b[i]

     clc
     for i := 0 to max do
       adc a[i]
       if a[i] > 999 999 999 then
         a[i] := a[i] - 1 000 000 000
         stc
       endif;
     endfor;

     if c then
       inc max
       mov edx, max
       shl edx, 2
       a[max] := a[max] + 1
     endif;


did splitting up the adds and the carry into two seperate loops run slower or faster than doing them in the same loop?

EDIT: I found a bug in your code where you set 'max' to 0 and not 2048.

      mov max, 0

you can speed up the ADD loop by using SSE2.  If 'max' is in dwords.  both arrays need to be 16 byte aligned.  It'll do a 128-bit add, by adding 4 dwords in parallel.

     for i := 0 to max/4 do
          movdqa     xmm0,  b[i]
          paddd       xmm0,a[i]

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

dsouza123

Got a scare for a moment

mov max, 0

is correct, max is a running max it gradually increases as the numbers increase in size,
it holds the current high point/index for the arrays, it is the highest dword with content.

arrmax is the variable that holds the array upper bound, max is compared to arrmax once per k

Yes, combining the adds and carrys saves time, 21 downto 18, no other changes.

There are likely many optimizations possible, keyhole, alignment, instruction reorder, stall reduction,
register/variable switches, instruction reduction.

I hoped that there was some way of using SSE2, since it is just dwords and not using all 32 bits,
and no conventional carry, so propagating a carry across dwords and SSE2 registers shouldn't be an obstacle.

Attached FibB06.zip has the combined add carry but no other modifications.

[attachment deleted by admin]

dsouza123

A couple of tests added to allow skipping chunks of code.

One check, if the last is less than 9 digits long
so can skip the ddtoa and pandigital
as a consequence of the last 9 being false
the test of first 9 is also skipped (not a new feature)

Another check, if the first is 9 digits long getting the second 9 digits is skipped,
saves a ddtoa, concatenation and pandigital.

The cutoff values used are correct but might be slightly non optimal.
For example the cutoff condition for pandigital eval in the last 9 check is  >= 100 000 000
(a demostratively correct value, any thing less is 8 digits), maybe 123 456 789 is optimal.

Attachment FibB06a.zip has both source and executable.

[attachment deleted by admin]

Mark_Larson

try adding the sse2 stuff in there and see if it still takes 22 seconds
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

FORTRANS

   This is probably a dim idea, but here it is anyway.  The FPU can load and
store packed BCD with up to 18 digits.  So you could load two 16 digit
numbers, add them and store a possible 17 digit number.  While the FPU
BCD loads and stores were notoriously slow, so you might not gain much
over code based on DAA.  But you could load two 8 digit numbers, multiply,
and store the 16 digit result.  Which does not look to be easy to do
otherwise.  So N x 8 BCD multiplies should be fairly easy.  And N x M digit
multiplies at least possible.

??

Steve

cobold

Yo, Yo,

I went the hard way, generating the sequence of fibonacci-numbers right from the start with

.const
    NUMSIZE equ  50000
    TXTSIZE equ 100000


50000 bytes BCD-buffer, which allows for numbers of up to 100000 digits!

.data
    k       dd ?
    F1      db NUMSIZE-1 dup(0),1   ;Fib1 NUMSIZE-digits pBCD
    F2      db NUMSIZE-1 dup(0),1   ;Fib2 dito
    szNum   db TXTSIZE dup("0"),0   ;TxtBuffer for ASCII-representation of pBCD
    NumLen  dd ?

Yes, I am a masochist, so I began with F1 = 1 and F2 = 1  :wink
then wrote proc to add bcd-number, another proc to convert bcd-number to decimal.

k=329468 and the number has 68855 decimal-digits, and szNum contains them all.
Code ist a little bit slow though!!!  :(
Think it's my stupid way of testing for pandigitality - did it this way:

test91 proc
    pushad
    mov dword ptr cnt[0],"0000"
    mov dword ptr cnt[4],"0000"
    mov  word ptr cnt[8],"00"
    xor esi,esi
    add esi,NumLen
    sub esi,9
    xor ebx,ebx
    .while ebx != 9
        movzx eax,szNum[esi]
        sub al,30h
        add cnt[eax],1
        inc ebx
        inc esi
    .endw
    popad
    switch$ addr cnt
    case$ "0111111111"
        mov eax,1
    else$
        mov eax,0
    endsw$
    ret
test91 endp

and test19 which does the some for first 9 digits.

Anyway - I'm astonished that an old idiot like me - who wasted a few years of his life programming in COBOL - was able to find this solution in assembler. :bdg

Mark_Larson

Quote from: cobold on August 20, 2008, 03:31:27 AM

50000 bytes BCD-buffer, which allows for numbers of up to 100000 digits!

Yes, I am a masochist, so I began with F1 = 1 and F2 = 1  :wink
then wrote proc to add bcd-number, another proc to convert bcd-number to decimal.

k=329468 and the number has 68855 decimal-digits, and szNum contains them all.
Code ist a little bit slow though!!!  :(
Think it's my stupid way of testing for pandigitality - did it this way:

test91 proc
    pushad
    mov dword ptr cnt[0],"0000"
    mov dword ptr cnt[4],"0000"
    mov  word ptr cnt[8],"00"
    xor esi,esi
    add esi,NumLen
    sub esi,9
    xor ebx,ebx
    .while ebx != 9
        movzx eax,szNum[esi]
        sub al,30h
        add cnt[eax],1
        inc ebx
        inc esi
    .endw
    popad
    switch$ addr cnt
    case$ "0111111111"
        mov eax,1
    else$
        mov eax,0
    endsw$
    ret
test91 endp

and test19 which does the some for first 9 digits.

Anyway - I'm astonished that an old idiot like me - who wasted a few years of his life programming in COBOL - was able to find this solution in assembler. :bdg

very innovative!

BCD on today's processors is very slow.  So I don't recommend it using it if you are looking for speed.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

Here's my algo to test for pandigitality.
First, I used unpacked BCDs throughout.
Second, I never converted any BCD to ASCII.

      mov   al,9
   @@:
      mov   edi,tailsrc    ;start of BCD string to test for pandigitality
      mov   ecx,9
      repnz scasb
      jnz   notpandigital
      dec   al
      jnz   @B
;is pandigital if it falls through


For your information, my initial program got the answer to the problem in 0.7 sec. (I was not as "masochist" as you were). A later modification brought the time down to 0.25 sec on my old P4.

QuoteSo I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

cobold

Quote from: raymond on August 21, 2008, 12:40:21 AM
Here's my algo to test for pandigitality.
First, I used unpacked BCDs throughout.
Excuse me if I'm asking so silly, but why do you use unpacked BCD?
Wouldn't packed ones be faster? (I think because you add 2 decimal digits in ONE instruction, while unpacked you add only ONE, or am I wrong here)

Quote from: raymond on August 21, 2008, 12:40:21 AM
Second, I never converted any BCD to ASCII.
Yup, wastes a lot of time! I didn't realize that when I started.

Never used SCAS - REP before, so thanks for posting your code.
I added my comments to your code, as far as I understand it.
Do you mind telling me, if I understood it? Shouldn't be a big deal for you. Just drop a short hint, if I understood something wrong: Thank you!

    mov   al,9            ;starting value to compare with BCD string
                          ;we start 9 because direction=down
@@:
    mov   edi,tailsrc     ;edi points to value of 9th BCD string
                          ;tailsrc = offset BCDstring+8  if we want to test 1st nine digits, or
                          ;tailsrc = offset BCDstring+len(BCDstring)-1 if we want to test last nine digits
                         
    mov   ecx,9           ;is for REP instruction to work properly
   
    repnz scasb           ;compare al with 9,8,7 .....1
                          ;decrement edi after comparison (SCAS) &&
                          ;decrement ecx after compare (REP)
                          ;ie. compare until nth BCD = 9 ???
                          ;we're out of repnz
                           
                           
    jnz   notpandigital   ;if nz ---> Number does not contain 9, cannot be pandigital
    dec   al              ;else (if ZERO-FLAG) repeat the above with 8,7...and so on
    jnz   @B

;is pandigital if it falls through

Mark_Larson

#25
Quote from: raymond on August 21, 2008, 12:40:21 AM

QuoteSo I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???


SSE2 of course :)  I'll write up some code tomorrow.  it's 5:45 am here.  I just woke up, so thought I"d check my email and go back to bed.  too early for me.  Most of the SSE2 instructions on a core 2 duo run in 1 cycle.  On Intel processors P4 and up, will also run this faster.

EDIT: here is a short example off the top of my head.  It gets rid of your repnz scasb


    mov  eax,099999999h
    movd xmm0,eax
    pshufd  xmm1,xmm0,00000000b        ;blasts 0x99999999 to all dwords for a compare
   @@:
;tailsrc needs to be 16 byte aligned
      mov   edi,tailsrc    ;start of BCD string to test for pandigitality
      movdqa  xmm0,[edi]
;we are comparing 9 bytes.  9 bytes fit into ONE sse2register.
      pcmpeqb xmm0,xmm1        ;all 16-bytes in the compare it'll be FFs if it is equal for that byte, and 00 if it is not
      ;but we only care about 9 bytes, so we can do an and and get rid of the rest.
      pand  xmm2,[zero_out_bytes_10_16]
;lpmovmskb, grabs the high bit from each byte. and puts them in a 32-bit register, remember all the values are 00 or FF?   So if it's FF it gets a 1, otherwise a 0
; so that means if you get a 0 in ecx, that means that byte didn't exist.
      pmovmskb   ecx,xmm2
;pmovmskb might not update the flags.  so adding a test for 0
      test ecx,ecx
      jnz   notpandigital   ;if nz ---> Number does not contain 9, cannot be pandigital
     dec   al
      jnz   @B
;is pandigital if it falls through


EDIT2: you should really look at learning SIMD, Raymond.  On P4 and up on Intel processors it is actually faster to use SIMD.  You can do a scalar code just like FP code


      fild            [32-bit_float_value]
       movss        xmm0,[32-bit_float_value]

       fadd             [32-bit_float_value]
        addss         [32-bit_float_value]

        fmul            [32-bit_float_value]
        mulss         [32-bit_float_value]

         ;64-bit is the same, except you add a 'd' in the instruction name, I capitalized it in the instruction to make it easier to see.  The 's' before the 'd' means "scalar", so the  instruction only operates on one value at a time.  you don't have to do finit, fxch, and other archaic FP stuff. 
      fild            [64-bit_float_value]
       movsD        xmm0,[64-bit_float_value]

       fadd             [64-bit_float_value]
        addsD         xmm0,[64-bit_float_value]

        fmul            [64-bit_float_value]
        mulsD         xmm0,[64-bit_float_value]




EDIT3: I forgot I wrote a long article on how to use sclar sse
http://www.masm32.com/board/index.php?topic=1140.0
Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

  I am using YASM under 64-bit linux.  So the syntax will be  a bit different.  I ran 100 Million loops with the priority set to the highest under Linux.  I ran the code 10 times and took the lowest number.  I got 18.391898 CYCLES.  I set it so it would always find a match, and so it would be the slowest path.  You were getting stuff in the seconds range, so mine is definitely a lot faster.  I actually cut and pasted your code, and changed a few lines to support SIMD.  I completely unrolled the loop, which gave me a nice speed up.  I changed your DEC AL to SUB AL,1.  DEC/INC is slower on Intel processors, since the P4.  You had a "mov edi,tailsrc", when you should have just gotten rid of "edi" all together and just used" [tailsrc]"

I would be happy to discuss in detail how I sped it up so much if anyone is interested.


section .data
align 16
tailsrc: db 0,1,2,3,4,5,6,7,8,9
align 16
zero_out_bytes_10_16: dd 0FFFFFFFFh,0FFh,0,0
subtract_one: dd 11111111h,11111111h,11111111h,11111111h

section .code

mov ecx,099999999h
movd xmm0,ecx
movdqa xmm7,[zero_out_bytes_10_16]
movdqa xmm6,[subtract_one]
pshufd  xmm1,xmm0,00000000b        ;blasts 0x99999999 to all dwords for a compare
pand_loop:
;unroll the loop 9 times
%rep 9
movdqa  xmm0,[tailsrc]
pcmpeqb xmm0,xmm1        ;all 16-bytes in the compare it'll be FFs if it is equal for that byte, and 00 if it is not
pand  xmm0,xmm7 ;[zero_out_bytes_10_16]
pmovmskb edx,xmm2
test edx,edx
jnz    notpandigital   ;if nz ---> Number does not contain 9, cannot be pandigital
psubb xmm1,xmm6
%endrep
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

QuoteDo you mind telling me, if I understood it? Shouldn't be a big deal for you. Just drop a short hint, if I understood something wrong: Thank you!

Some of it is wrong because you probably did not read up on how SCAS works.

      mov   al,9
;AL,AX,EAX are the default registers for the SCAS instruction
;The presence of all digits 1-9 must be checked.
;By starting at 1, I would have to perform an additional comparison with 9 on the way up.
;By starting from the top, i avoided that comparison and use the ZF flag to exit.

   @@:
      mov   edi,tailsrc
;tailsrc was a global variable containing the address where the unpacked BCD string STARTS
;Whether the string is kept in forward or reverse order is totaly immaterial.
;This avoids changing the direction flag if you start from the END of the string
;EDI is the default pointer register for the SCAS instruction

      mov   ecx,9
;ECX is the default counter for the REP instruction
;In our case, we need to check 9 consecutive memory locations

      repnz scasb
;Read up on the SCAS instruction (and scasb, scasw, scasd)
;If the target number is found, the scas stops with the ZF flag set

      jnz   notpandigital
;If the target number is not found, then the string is not pandigital
;no use to continue checking other digits

      dec   al
;change target number

      jnz   @B
;continue until all numbers checked

;is pandigital if it falls through

When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

raymond

QuoteQuote
So I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???

Maybe I should have been more specific with my question or used your entire sentence in my quote. Scanning for pandigitality is only a small part of the problem at hand and has nothing to do with BCDs per se. I do agree that part could possibly be done a bit faster. The main part of the problem is to compute Fibonacci numbers and you don't recommend using BCDs for speed.

A more specific question would thus have been:
How would you compute Fibonacci numbers faster without using BCDs and then check ithe pandigitality of the last 9 digits (and first 9 digits) of each Fibonacci number.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

Quote from: raymond on August 21, 2008, 05:42:42 PM
QuoteQuote
So I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???

Maybe I should have been more specific with my question or used your entire sentence in my quote. Scanning for pandigitality is only a small part of the problem at hand and has nothing to do with BCDs per se. I do agree that part could possibly be done a bit faster. The main part of the problem is to compute Fibonacci numbers and you don't recommend using BCDs for speed.

A more specific question would thus have been:
How would you compute Fibonacci numbers faster without using BCDs and then check ithe pandigitality of the last 9 digits (and first 9 digits) of each Fibonacci number.


Ah I misunderstood.  :)  Still SSE2.  The love of my life :)  We can also speed up the check faster by checking the rear and the front at the same time.  Let me work on it a bit and I will post something :)

EDIT:  I've done a LOT of big number math before in assembly language.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm