Dear Math and FPU-Experts - help needed!

dsouza123 · August 17, 2008, 08:56:18 PM

This version (still base 1 000 000 000) finds k.

Modified and renamed dword to ascii string routine, now ddtoa,
to return in eax the position/index of the zero byte in the sz result.

The index was already available and is used in appending the ascii string version
of the second dword (9 digits), because the first dword may not have 9 digits
so the next 9 are needed to ensure at least 9.

Takes about 22 seconds on 1.2 Ghz Athlon, XP Pro SP2.

Output is 4 MessageBoxes stating:
found k, first 9(+) digits, last 9 digits, max index used in the array.

Source and executable in the attached FibB05c.zip

[attachment deleted by admin]

Mark_Larson · August 17, 2008, 10:41:21 PM

Quote from: dsouza123 on August 17, 2008, 03:46:35 AM
I used base 1 000 000 000 ( 1 billion ), with each dword holding 9 digits
and the array size used, 2048 dwords, it can handle a fibonnaci number with 18432 digits.

It does checking to prevent overflowing bounds.

Mixed pascal/masm pseudo code.
Code Select Expand
for i := 0 to max do a[i] := a[i] + b[i] clc for i := 0 to max do adc a[i] if a[i] > 999 999 999 then a[i] := a[i] - 1 000 000 000 stc endif; endfor; if c then inc max mov edx, max shl edx, 2 a[max] := a[max] + 1 endif;

did splitting up the adds and the carry into two seperate loops run slower or faster than doing them in the same loop?

EDIT: I found a bug in your code where you set 'max' to 0 and not 2048.

mov max, 0

you can speed up the ADD loop by using SSE2. If 'max' is in dwords. both arrays need to be 16 byte aligned. It'll do a 128-bit add, by adding 4 dwords in parallel.

Code Select


     for i := 0 to max/4 do
          movdqa     xmm0,  b[i]
          paddd       xmm0,a[i]

dsouza123 · August 18, 2008, 12:26:08 AM

Got a scare for a moment

mov max, 0

is correct, max is a running max it gradually increases as the numbers increase in size,
it holds the current high point/index for the arrays, it is the highest dword with content.

arrmax is the variable that holds the array upper bound, max is compared to arrmax once per k

Yes, combining the adds and carrys saves time, 21 downto 18, no other changes.

There are likely many optimizations possible, keyhole, alignment, instruction reorder, stall reduction,
register/variable switches, instruction reduction.

I hoped that there was some way of using SSE2, since it is just dwords and not using all 32 bits,
and no conventional carry, so propagating a carry across dwords and SSE2 registers shouldn't be an obstacle.

Attached FibB06.zip has the combined add carry but no other modifications.

[attachment deleted by admin]

dsouza123 · August 18, 2008, 02:50:12 AM

A couple of tests added to allow skipping chunks of code.

One check, if the last is less than 9 digits long
so can skip the ddtoa and pandigital
as a consequence of the last 9 being false
the test of first 9 is also skipped (not a new feature)

Another check, if the first is 9 digits long getting the second 9 digits is skipped,
saves a ddtoa, concatenation and pandigital.

The cutoff values used are correct but might be slightly non optimal.
For example the cutoff condition for pandigital eval in the last 9 check is >= 100 000 000
(a demostratively correct value, any thing less is 8 digits), maybe 123 456 789 is optimal.

Attachment FibB06a.zip has both source and executable.

[attachment deleted by admin]

Mark_Larson · August 18, 2008, 12:46:56 PM

try adding the sse2 stuff in there and see if it still takes 22 seconds

FORTRANS · August 18, 2008, 01:53:57 PM

This is probably a dim idea, but here it is anyway. The FPU can load and
store packed BCD with up to 18 digits. So you could load two 16 digit
numbers, add them and store a possible 17 digit number. While the FPU
BCD loads and stores were notoriously slow, so you might not gain much
over code based on DAA. But you could load two 8 digit numbers, multiply,
and store the 16 digit result. Which does not look to be easy to do
otherwise. So N x 8 BCD multiplies should be fairly easy. And N x M digit
multiplies at least possible.

??

Steve

cobold · August 20, 2008, 03:31:27 AM

Yo, Yo,

I went the hard way, generating the sequence of fibonacci-numbers right from the start with

Code Select

.const
    NUMSIZE equ  50000
    TXTSIZE equ 100000

50000 bytes BCD-buffer, which allows for numbers of up to 100000 digits!

Code Select

.data
    k       dd ?
    F1      db NUMSIZE-1 dup(0),1   ;Fib1 NUMSIZE-digits pBCD
    F2      db NUMSIZE-1 dup(0),1   ;Fib2 dito
    szNum   db TXTSIZE dup("0"),0   ;TxtBuffer for ASCII-representation of pBCD
    NumLen  dd ?

Yes, I am a masochist, so I began with F1 = 1 and F2 = 1 :wink
then wrote proc to add bcd-number, another proc to convert bcd-number to decimal.

k=329468 and the number has 68855 decimal-digits, and szNum contains them all.
Code ist a little bit slow though!!! :(
Think it's my stupid way of testing for pandigitality - did it this way:

Code Select

test91 proc
    pushad
    mov dword ptr cnt[0],"0000"
    mov dword ptr cnt[4],"0000"
    mov  word ptr cnt[8],"00"
    xor esi,esi
    add esi,NumLen
    sub esi,9
    xor ebx,ebx
    .while ebx != 9
        movzx eax,szNum[esi]
        sub al,30h
        add cnt[eax],1
        inc ebx
        inc esi
    .endw
    popad
    switch$ addr cnt
    case$ "0111111111"
        mov eax,1
    else$
        mov eax,0
    endsw$
    ret
test91 endp

and test19 which does the some for first 9 digits.

Anyway - I'm astonished that an old idiot like me - who wasted a few years of his life programming in COBOL - was able to find this solution in assembler. :bdg

Mark_Larson · August 20, 2008, 11:04:34 AM

Quote from: cobold on August 20, 2008, 03:31:27 AM

50000 bytes BCD-buffer, which allows for numbers of up to 100000 digits!

Yes, I am a masochist, so I began with F1 = 1 and F2 = 1 :wink
then wrote proc to add bcd-number, another proc to convert bcd-number to decimal.

k=329468 and the number has 68855 decimal-digits, and szNum contains them all.
Code ist a little bit slow though!!! :(
Think it's my stupid way of testing for pandigitality - did it this way:

Code Select Expand
test91 proc pushad mov dword ptr cnt[0],"0000" mov dword ptr cnt[4],"0000" mov word ptr cnt[8],"00" xor esi,esi add esi,NumLen sub esi,9 xor ebx,ebx .while ebx != 9 movzx eax,szNum[esi] sub al,30h add cnt[eax],1 inc ebx inc esi .endw popad switch$ addr cnt case$ "0111111111" mov eax,1 else$ mov eax,0 endsw$ ret test91 endp
and test19 which does the some for first 9 digits.

Anyway - I'm astonished that an old idiot like me - who wasted a few years of his life programming in COBOL - was able to find this solution in assembler. :bdg

very innovative!

BCD on today's processors is very slow. So I don't recommend it using it if you are looking for speed.

raymond · August 21, 2008, 12:40:21 AM

Here's my algo to test for pandigitality.
First, I used unpacked BCDs throughout.
Second, I never converted any BCD to ASCII.

Code Select

      mov   al,9
   @@:
      mov   edi,tailsrc    ;start of BCD string to test for pandigitality
      mov   ecx,9
      repnz scasb
      jnz   notpandigital
      dec   al
      jnz   @B
;is pandigital if it falls through

For your information, my initial program got the answer to the problem in 0.7 sec. (I was not as "masochist" as you were). A later modification brought the time down to 0.25 sec on my old P4.

QuoteSo I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???

cobold · August 21, 2008, 06:22:12 AM

Quote from: raymond on August 21, 2008, 12:40:21 AM
Here's my algo to test for pandigitality.
First, I used unpacked BCDs throughout.

Excuse me if I'm asking so silly, but why do you use unpacked BCD?
Wouldn't packed ones be faster? (I think because you add 2 decimal digits in ONE instruction, while unpacked you add only ONE, or am I wrong here)

Quote from: raymond on August 21, 2008, 12:40:21 AM
Second, I never converted any BCD to ASCII.

Yup, wastes a lot of time! I didn't realize that when I started.

Never used SCAS - REP before, so thanks for posting your code.
I added my comments to your code, as far as I understand it.
Do you mind telling me, if I understood it? Shouldn't be a big deal for you. Just drop a short hint, if I understood something wrong: Thank you!

Code Select

    mov   al,9            ;starting value to compare with BCD string
                          ;we start 9 because direction=down 
@@:
    mov   edi,tailsrc     ;edi points to value of 9th BCD string
                          ;tailsrc = offset BCDstring+8  if we want to test 1st nine digits, or
                          ;tailsrc = offset BCDstring+len(BCDstring)-1 if we want to test last nine digits
                          
    mov   ecx,9           ;is for REP instruction to work properly
    
    repnz scasb           ;compare al with 9,8,7 .....1
                          ;decrement edi after comparison (SCAS) &&
                          ;decrement ecx after compare (REP)
                          ;ie. compare until nth BCD = 9 ???
                          ;we're out of repnz 
                            
                            
    jnz   notpandigital   ;if nz ---> Number does not contain 9, cannot be pandigital
    dec   al              ;else (if ZERO-FLAG) repeat the above with 8,7...and so on
    jnz   @B

;is pandigital if it falls through

Mark_Larson · August 21, 2008, 10:46:36 AM

Quote from: raymond on August 21, 2008, 12:40:21 AM

QuoteSo I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???

SSE2 of course :) I'll write up some code tomorrow. it's 5:45 am here. I just woke up, so thought I"d check my email and go back to bed. too early for me. Most of the SSE2 instructions on a core 2 duo run in 1 cycle. On Intel processors P4 and up, will also run this faster.

EDIT: here is a short example off the top of my head. It gets rid of your repnz scasb

Code Select


    mov  eax,099999999h
    movd xmm0,eax
    pshufd  xmm1,xmm0,00000000b        ;blasts 0x99999999 to all dwords for a compare
   @@:
;tailsrc needs to be 16 byte aligned
      mov   edi,tailsrc    ;start of BCD string to test for pandigitality
      movdqa  xmm0,[edi]
;we are comparing 9 bytes.  9 bytes fit into ONE sse2register.
      pcmpeqb xmm0,xmm1        ;all 16-bytes in the compare it'll be FFs if it is equal for that byte, and 00 if it is not
      ;but we only care about 9 bytes, so we can do an and and get rid of the rest.
      pand  xmm2,[zero_out_bytes_10_16]
;lpmovmskb, grabs the high bit from each byte. and puts them in a 32-bit register, remember all the values are 00 or FF?   So if it's FF it gets a 1, otherwise a 0
; so that means if you get a 0 in ecx, that means that byte didn't exist.
      pmovmskb   ecx,xmm2 
;pmovmskb might not update the flags.  so adding a test for 0
      test ecx,ecx
      jnz   notpandigital   ;if nz ---> Number does not contain 9, cannot be pandigital
     dec   al
      jnz   @B
;is pandigital if it falls through

EDIT2: you should really look at learning SIMD, Raymond. On P4 and up on Intel processors it is actually faster to use SIMD. You can do a scalar code just like FP code

Code Select


      fild            [32-bit_float_value]
       movss        xmm0,[32-bit_float_value]

       fadd             [32-bit_float_value]
        addss         [32-bit_float_value]

        fmul            [32-bit_float_value]
        mulss         [32-bit_float_value]

         ;64-bit is the same, except you add a 'd' in the instruction name, I capitalized it in the instruction to make it easier to see.  The 's' before the 'd' means "scalar", so the  instruction only operates on one value at a time.  you don't have to do finit, fxch, and other archaic FP stuff.  
      fild            [64-bit_float_value]
       movsD        xmm0,[64-bit_float_value]

       fadd             [64-bit_float_value]
        addsD         xmm0,[64-bit_float_value]

        fmul            [64-bit_float_value]
        mulsD         xmm0,[64-bit_float_value]

EDIT3: I forgot I wrote a long article on how to use sclar sse
http://www.masm32.com/board/index.php?topic=1140.0
Mark

Mark_Larson · August 21, 2008, 03:14:28 PM

I am using YASM under 64-bit linux. So the syntax will be a bit different. I ran 100 Million loops with the priority set to the highest under Linux. I ran the code 10 times and took the lowest number. I got 18.391898 CYCLES. I set it so it would always find a match, and so it would be the slowest path. You were getting stuff in the seconds range, so mine is definitely a lot faster. I actually cut and pasted your code, and changed a few lines to support SIMD. I completely unrolled the loop, which gave me a nice speed up. I changed your DEC AL to SUB AL,1. DEC/INC is slower on Intel processors, since the P4. You had a "mov edi,tailsrc", when you should have just gotten rid of "edi" all together and just used" [tailsrc]"

I would be happy to discuss in detail how I sped it up so much if anyone is interested.

Code Select


section .data
align 16
	tailsrc:						db		0,1,2,3,4,5,6,7,8,9
align 16
	zero_out_bytes_10_16:	dd		0FFFFFFFFh,0FFh,0,0
	subtract_one:				dd		11111111h,11111111h,11111111h,11111111h	

section .code

	mov		ecx,099999999h
	movd		xmm0,ecx
	movdqa	xmm7,[zero_out_bytes_10_16]
	movdqa	xmm6,[subtract_one]
	pshufd  	xmm1,xmm0,00000000b        ;blasts 0x99999999 to all dwords for a compare
pand_loop:
;unroll the loop 9 times
%rep 9
	movdqa  	xmm0,[tailsrc]
	pcmpeqb 	xmm0,xmm1        ;all 16-bytes in the compare it'll be FFs if it is equal for that byte, and 00 if it is not
	pand  	xmm0,xmm7			;[zero_out_bytes_10_16]
	pmovmskb	edx,xmm2
	test 		edx,edx
	jnz   	notpandigital   ;if nz ---> Number does not contain 9, cannot be pandigital
	psubb		xmm1,xmm6
%endrep

raymond · August 21, 2008, 05:22:41 PM

QuoteDo you mind telling me, if I understood it? Shouldn't be a big deal for you. Just drop a short hint, if I understood something wrong: Thank you!

Some of it is wrong because you probably did not read up on how SCAS works.

Code Select

      mov   al,9
;AL,AX,EAX are the default registers for the SCAS instruction
;The presence of all digits 1-9 must be checked.
;By starting at 1, I would have to perform an additional comparison with 9 on the way up.
;By starting from the top, i avoided that comparison and use the ZF flag to exit.

   @@:
      mov   edi,tailsrc
;tailsrc was a global variable containing the address where the unpacked BCD string STARTS
;Whether the string is kept in forward or reverse order is totaly immaterial.
;This avoids changing the direction flag if you start from the END of the string
;EDI is the default pointer register for the SCAS instruction

      mov   ecx,9
;ECX is the default counter for the REP instruction
;In our case, we need to check 9 consecutive memory locations

      repnz scasb
;Read up on the SCAS instruction (and scasb, scasw, scasd)
;If the target number is found, the scas stops with the ZF flag set

      jnz   notpandigital
;If the target number is not found, then the string is not pandigital
;no use to continue checking other digits

      dec   al
;change target number

      jnz   @B
;continue until all numbers checked

;is pandigital if it falls through

raymond · August 21, 2008, 05:42:42 PM

QuoteQuote
So I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???

Maybe I should have been more specific with my question or used your entire sentence in my quote. Scanning for pandigitality is only a small part of the problem at hand and has nothing to do with BCDs per se. I do agree that part could possibly be done a bit faster. The main part of the problem is to compute Fibonacci numbers and you don't recommend using BCDs for speed.

A more specific question would thus have been:
How would you compute Fibonacci numbers faster without using BCDs and then check ithe pandigitality of the last 9 digits (and first 9 digits) of each Fibonacci number.

Mark_Larson · August 21, 2008, 06:40:37 PM

Quote from: raymond on August 21, 2008, 05:42:42 PM
QuoteQuote
So I don't recommend it using it if you are looking for speed.

Mark, what else apart from BCDs would YOU use which would be FASTER to solve this particular problem???

Maybe I should have been more specific with my question or used your entire sentence in my quote. Scanning for pandigitality is only a small part of the problem at hand and has nothing to do with BCDs per se. I do agree that part could possibly be done a bit faster. The main part of the problem is to compute Fibonacci numbers and you don't recommend using BCDs for speed.

A more specific question would thus have been:
How would you compute Fibonacci numbers faster without using BCDs and then check ithe pandigitality of the last 9 digits (and first 9 digits) of each Fibonacci number.

Ah I misunderstood. :) Still SSE2. The love of my life :) We can also speed up the check faster by checking the rear and the front at the same time. Let me work on it a bit and I will post something :)

EDIT: I've done a LOT of big number math before in assembly language.

News:

Dear Math and FPU-Experts - help needed!

dsouza123

dsouza123

dsouza123