News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Dear Math and FPU-Experts - help needed!

Started by cobold, August 16, 2008, 03:47:03 AM

Previous topic - Next topic

Mark_Larson

 here is what the basic add loop looks like.  I am doing two 64-bit adds at the same time using sse2.  This is just the add part, I need to add the carry part next.



   xor eax,eax ; max = 0
mov ecx,eax ; i = max

outer_loop:

;yasm automatically uses OFFSET so using it will break the code
; mov rsi, OFFSET src1
mov rsi,src1 ; esi = src1 = f0
; mov rdi, OFFSET src2
mov rdi,src2 ; edi = src2 = f1-
   pxor xmm7,xmm7 ; carry =  0

do_adds:
movdqa xmm0, [rsi] ; f0
movdqa xmm1, [rdi] ; f1
paddq xmm0,xmm7 ; add f0 + carry, 64-bit add :)
pxor xmm7,xmm7 ; carry = 0
paddq xmm1,xmm0 ; add f0 + carry + f1

add rsi,16
add rdi,16
sub ecx,16 ; don't do the shr, just subtract by 16
jg do_adds ; since we don't do a SHR we need to change this to a JG instead of a JNE
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

  I decided to do MMX first.  It has a 64-bit add ( paddq).  There is no 128-bit add.  You can do two 64-bit adds in parallel with SSE2, but then it gets more complex.  So I wanted to get the basic stuff done, and then I'll tweak it from there.  I just finished this.  I don't know if it works or not.  it starts with f0=1, and f1=1, does 10,000 loops at the highest priority.   I only do the adding and carries, no pand.  I stop when the number of digits = 8192.  It takes me 89470.973000 cycles to get there.  Or 33.635704135 nanoseconds.  Still a lot faster.  I'll test it tomorrow.  And add the check in.

if you notice any bugs, please let me know :)

; we set all of the first 18 digits to 9
max_digits dq 999999999999999999,0
; we 'subtract' 1 followed by 18 0's
subtract dq 1000000000000000000,0
; used as a comparison
zero dq 0,0
; used to set the carry flag to 1
one dq 1,1

   mov byte [src1], 1 ; f0 = 1
   mov byte [src2], 1 ; f1 = 1
   xor eax,eax ; num_digits = 0 ( we're 0 based for the #)
mov ecx,eax ; loop_cnt = num_digits = 0 ( we're 0 based for the #)
movq mm6,[max_digits] ; 18 9's
movq mm5,[subtract] ; subtract 1 followed by 18 0's
movq mm4,[one] ; set carry flag to 1
movq mm3,[zero] ; ; used as a comparison

outer_loop:
;yasm automatically uses OFFSET so using it will break the code
; mov rsi, OFFSET src1
mov rsi,src1 ; esi = src1 = f0
; mov rdi, OFFSET src2
mov rdi,src2 ; edi = src2 = f1-
   pxor mm7,mm7 ; carry =  0

do_adds:
movq mm0, [rsi] ; f0
movq mm1, [rdi] ; f1
paddq mm0,mm7 ; add f0 + carry, 64-bit add :)
pxor mm7,mm7 ; carry = 0
paddq mm1,mm0 ; add f0 + carry + f1

movq mm0,mm1 ; save mm1, we will need it later
pcmpeqd mm0,mm6 ; 18 digits, all 9's
pmovmskb edx,mm0
test edx,edx
jne skip_carry

psubq mm1, mm5 ; subtract 1 followed by 18 0's
movq mm7, mm4 ; carry = 1
movq [rdi], mm1

skip_carry:
add rsi,8
add rdi,8
sub ecx,8 ; don't do the shr, just subtract by 4
jg do_adds ; since we don't do a SHR we need to change this to a JG instead of a JNE

movq mm0,mm7 ; save mm7 (carry), we will need it later
pcmpeqd mm0,mm3 ; carry = 0 ?
pmovmskb edx,mm0
test edx,edx
jz skip
add eax,1 ; num_digits += 1
add ecx,1 ; i += 1
skip:
cmp eax,8192                                 ; 8192 digits already done?
jl outer_loop


BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

dsouza123

An alternative pandigital test of bcd stored numbers.


.686p                     ; create 32 bit code
.model flat, stdcall      ; 32 bit memory model
option casemap :none      ; case sensitive
.xmm

include    \masm32\include\windows.inc
include    \masm32\include\user32.inc
include    \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib

.data
    a db 1,2,3,4,5,6,7,8,9
    i dd 1,2,4,8,16,32,64,128,256,512 ; 1022 for pandigital
   
.code

start:
    mov   esi, offset a
    mov   edi, offset i
    mov   edx, 0
    mov   ecx, 9
@@:
    dec   ecx
    movzx eax, byte ptr [esi + ecx]
    mov   eax, [edi + eax*4]
    or    edx, eax
    jecxz @F
    jmp   @B
@@:
    mov   eax, 0        ; 0 is false
    xor   edx, 1022     ; if becomes 0, matches pandigital
    jnz   @F
    inc   eax           ; 1 is true
@@:
    invoke  ExitProcess, 0
end start

raymond

QuoteAn alternative pandigital test of bcd stored numbers.

Nice algo. I'll try it and see if it improves the overall timing.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dsouza123

Thank you Raymond, a slight reordering of the instructions should reduce stalls due to read after write.


    mov   ecx, 8
    mov   esi, offset a
    mov   edi, offset i
    mov   edx, 0
@@:
    movzx eax, byte ptr [esi + ecx]
    mov   eax, [edi + eax*4]
    or    edx, eax
    jecxz @F
    dec   ecx
    jmp   @B


Given the hint you gave earlier, to get the incredible subsecond timing for the whole
routine of finding the k with a double pandigital (first and last 9 both pandigitals), I'm guessing
you were keeping track of the last 9 digits, doing a pandigital test on it, only if it passed
doing a direct calculation of the specific k without needing the preceding ones.
There are a few formulas to do a direct calculation but the ones I found are fairly involved
but you must have figured out a way of simplifying one of them or creating one of your own.

Well done Raymond, on the subsecond timing of your complete program.

Mark_Larson

Quote from: dsouza123 on August 22, 2008, 11:28:41 AM


    mov   ecx, 8
    mov   esi, offset a
    mov   edi, offset i
    mov   edx, 0
@@:
    movzx eax, byte ptr [esi + ecx]
    mov   eax, [edi + eax*4]
    or    edx, eax
    jecxz @F
    dec   ecx
    jmp   @B



  good idea :)
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

Here's the variant I used for testing:
.data
      numbits     dd    1,2,4,8,10h,20h,40h,80h,100h,200h

.code
      mov   edi,offset source
      xor   eax,eax
      xor   edx,edx
      mov   ecx,8
   @@:
      mov   dl,[edi+ecx]
      add   eax,numbits[edx*4]
      sub   cl,1
      jnc   @B
      cmp   eax,3FEh
      jnz   not_pandigital


The timing for my initial program dropped from 0.703 sec to 0.656 sec.
The timing for my improved program dropped from 0.250 sec to only 0.219 sec although the pandigitality test is performed the same number of times (over 300,000 times) on the same data.

BTW, the only memory I use for either program is strictly in the .data section and is less than 150 bytes.

dsouza123: Your guess is partly right but only for my improved algo.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

raymond,

  Why don't you try my MMX code and see if it works?  It's just missing the pand test.  I am curious to see how much it would speed up your code.  I also haven't tested it yet.  For your pand test look at doing both the tests at the same time.  It'll be faster since it'll break up stalls.


Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

dsouza123: Amazing!!!! :clap: :dance: :dazzled: :U

I ran some more tests to check why there was a difference in speed up between my two programs. If I modify them to bypass all pandigitality tests while performing all other computations, the reported timings seem to be within 1 ms using your test algo. Meaning that running your algo 300000 times may take less than 1 ms.

I still don't understand why my initial pandigitality algo has such a large variation (31 ms vs 47 ms) between the two programs for the same number of tests.

Mark: I don't know what the format of the output with your MMX code would be in order to test for pandigitality. I don't have rsi nor rdi registers either. Can your code work also with less than 150 bytes of memory???
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

Quote from: raymond on August 22, 2008, 08:29:12 PM
Mark: I don't know what the format of the output with your MMX code would be in order to test for pandigitality. I don't have rsi nor rdi registers either. Can your code work also with less than 150 bytes of memory???


unlike d's code, I don't use any memory at all.  Everything is kept in registers.  I do have some XMM constants that are in memory.  But that is well under 200 bytes.  The only thing in memory is the two arrays.  I call the arrays in my code 'src1' and 'src2'.  Simply replace your array pointers in there.  I can install MASM32 under linux if it'll make it easier for you?  And then convert the code.

I cut and pasted MY ENTIRE code.  So what you see above is all of it.

rsi = esi in 32-bit mode
rdi = edi in 32-bit mode

change all the 'r's to 'e' and it will work.

Yasm doesn't use the "offset" keyword, I marked both places that needed to be fixed up with comments.

EDIT:  I ran 300,000 loops like you did, and my code ran in 1.233360564 MICROseconds not milli.

You have a P4 right?  I can unroll the code more to run faster on your P4.  If you have a core 2 duo, tell me now, because the optimization tricks are completely different.  For instance when unrolling you want to keep the code size in your inner loops to 64 bytes.  You have a greater potential to execute more than one instruction at the same time.  So you have to watch out for those kinds of opportunities.



BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

QuoteI don't use any memory at all
The only thing in memory is the two arrays

Which one is correct?? If the latter, what is the size of those arrays??? What's the format of the output of the code???

Those questions are probably due to the fact that I still can't figure out if your complex code is strictly to test for pandigitality, or to compute the Fibonacci numbers, or both.

If it's only to test for pandigitality, it can't speed up my code any more than dsouza123's algo within the 1 ms limit I can measure. At the microsecond level, it could very well be faster than d's algo if it's the sole purpose of the code, but would not have a significant effect (less than 1%) on the overall speed of my entire program using BCDs to compute the Fibonacci numbers.

(My P4 is the oldest one (Model 1) 1500 MHz. During the winter, I have a CoreDuo 1.89GHz.)
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

#41
Quote from: raymond on August 23, 2008, 03:21:13 AM
QuoteI don't use any memory at all
The only thing in memory is the two arrays

Which one is correct?? If the latter, what is the size of those arrays??? What's the format of the output of the code???

Those questions are probably due to the fact that I still can't figure out if your complex code is strictly to test for pandigitality, or to compute the Fibonacci numbers, or both.

If it's only to test for pandigitality, it can't speed up my code any more than dsouza123's algo within the 1 ms limit I can measure. At the microsecond level, it could very well be faster than d's algo if it's the sole purpose of the code, but would not have a significant effect (less than 1%) on the overall speed of my entire program using BCDs to compute the Fibonacci numbers.

(My P4 is the oldest one (Model 1) 1500 MHz. During the winter, I have a CoreDuo 1.89GHz.)


I am cheating and allocating the arrays statically since YASM can do that fast.  The array size is completely settable.  You can put your dynamic allocation code in there, and change src1 ( F0) and src2 ( f1)  and use your pointers to dynamic memory instead.

this code is NOT the pandigital test.

the array is 16 byte aligned.  you can use the SSE2 mem_alloc that routine that lets you allocate memory aligned to a certain boundary.  I think it is
  crt_aligned_malloc
  crt_aligned_free

but it's been a while. 

Try changing your ADDing code to use my code.  Leave d's pandigital test in there.  Get that working, and then look at adding my pandigital test.  I can speed it up more, now that I know you have a P4.  We can unroll the loops more.

the format is the same as d's adding code.

I start adding at the BEGINNING of the array. and move up from there. 

I init the first byte of f0 and f1 to a '1' in both cases.  And then I start adding them together.

which parts of my code do you not understand?  I can explain in more detail if you need me to.

EDIT: here is the code for the array.  I got rid of 'src1' and 'src2' and changed it to 'f0' and 'f1' to make it clearer.  I also added ARR_SIZE_BYTES, and that also changes one line of my code.  Read my comments carefully.  I added a lot more than normal for you.

NOTE: in 64-bit mode, it is faster to use 32-bit registers ( in general).  But you can't do that for pointers and offsets.  So the only code you will have to modify is the pointers and offsets.  The RDI gets changed to EDI, and the RSI gets changed to ESI



%define ARR_SIZE_BYTES 300000
f0 resb ARR_SIZE_BYTES + 64 ; 64 byte padding for SSE2
f1 resb ARR_SIZE_BYTES + 64 ; 64 byte padding for SSE2




I PURPOSELY ALLOCATED 16 BYTES FOR EACH OF THE MMX CONSTANTS SO THAT I CAN SWITCH TO XMM easier.

section .data          ; say we are in the data segment

align 16
zero_out_bytes_10_16: dd 0FFFFFFFFh,0FFh,0,0
subtract_one: dd 11111111h,11111111h,11111111h,11111111h
subtract_four: dd 44444444h,44444444h,44444444h,44444444h
; we set all of the first 18 digits to 9
max_digits dq 999999999999999999,0
; we 'subtract' 1 followed by 18 0's
subtract dq 1000000000000000000,0
; used as a comparison
zero dq 0,0
; used to set the carry flag to 1
one dq 1,1

section .code        ; say we are in the code segment

   mov byte [f0], 1 ; f0 = 1
   mov byte [f1], 1 ; f1 = 1
   xor eax,eax ; num_digits = 0 ( we're 0 based for the #)
mov ecx,eax ; loop_cnt = num_digits = 0 ( we're 0 based for the #)
movq mm6,[max_digits] ; 18 9's
movq mm5,[subtract] ; subtract 1 followed by 18 0's
movq mm4,[one] ; set carry flag to 1
movq mm3,[zero] ; used as a comparison

outer_loop:
;yasm automatically uses OFFSET so using it will break the code
; mov esi, OFFSET f0
mov rsi,f0 ; esi = f0
; mov edi, OFFSET f1
mov rdi,f1 ; edi = f1
   pxor mm7,mm7 ; carry =  0

do_adds:
movq mm0, [rsi] ; f0
movq mm1, [rdi] ; f1
paddq mm0,mm7 ; add f0 + carry, 64-bit add :)
pxor mm7,mm7 ; carry = 0
paddq mm1,mm0 ; add f0 + carry + f1

movq mm0,mm1 ; save mm1, we will need it later
pcmpeqd mm0,mm6 ; f0 + f1 + c = 999999999999999999
pmovmskb edx,mm0 ; grab the highest bit of each byte and write it to EDX
test edx,edx ;
jne skip_carry ;

psubq mm1, mm5 ; subtract 1000000000000000000
movq mm7, mm4 ; carry = 1
movq [rdi], mm1 ; save the value to f1

skip_carry:
add rsi,8 ; we deal with 8 bytes at a time
add rdi,8
sub ecx,8 ; don't do the shr, just subtract by 4
jg do_adds ; since we don't do a SHR we need to change this to a JG instead of a JNE

movq mm0,mm7 ; save mm7 (carry), we will need it later
pcmpeqd mm0,mm3 ; carry = 0 ?
pmovmskb edx,mm0 ; grab the highest bit of each byte and write it to EDX
test edx,edx ;
jz skip ;
add eax,1 ; max += 1
add ecx,1 ; i += 1
skip:
cmp eax,ARR_SIZE_BYTES ; have we reached the maximum digit size we support?  the size goes from 0 to ARR_SIZE_BYTES - 1
jl outer_loop


Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson


raymond, there is a bug in my code that I am trying to track down.  So hold off for a  bit.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

codewarp

Quote from: cobold on August 16, 2008, 03:47:03 AM
Cheers!

There's an interesting website with hundreds of Math-Problems to solve:
http://projecteuler.net/index.php?section=problems&id=104

I solved a few of the easier ones - no problem. But AARRRGHH!!!! - problem 104, I know _exactly_ how to solve it, but how shall I calcalute the 2749th Fibonacci-number with 32 or 64 bit!
It doesn't work - the numbers are simply too big!

I'll post my work in progress for problem 104 (code is a mess - I just changed generation of fib-sequence from 32 to 64 bit, via fpu but aaah!
k = 47  F(k) = 2971215073 is biggest Fib with 32bit-arithmetic....
k = 98  F(k) = 11810432329172123648 is the biggest Fib that one can generate with 64bit!!
Will this end up in manually adding 256,512 or 8192bit-fields???
Ideas? Help? (No code needed but perhaps a hint how this could be solved without using BIGBITFIELDS?


OK, let's think outside the box...  Fibonocci calculation is just simple addition.  A String is a sequence of bytes that can be an unlimited integer if you use it that way.  All you need is to be able to ADD two strings together like big integers, corresponding bytes, propagating the carry on up.  The sum needs more space only when the top bit is set.  Using my own dynamic Strings, my calculation of fib(#2749) using this approach takes 580,000 clock cycles in C++, and results in a string result of 476 hexdigits starting with 100c38d67c9f4... and ending ...479ee63a70b0d9.  I am counting fib( ) 1, 2, 3, 4, 5, 6, as 1, 2, 3, 5, 8, 13, and so on.

Also, BCD representation does not add anything to this process, except the ability to read off the answer in decimal.  But you can pull 9 decimal digits at a time, from a 2000 bit integer that you divide by one billion (decimal) and take the remainder.  This is not hard.

Also, those that think Fibonocci is uninteresting and a waste of time should find out about the Golden Section and its vast applications in nature and engineering.

Mark_Larson

  you are correct, that is embarassing  :bg

what makes it worse is I know that.  I've used a lot of big number addition.

EDIT: as you can see from my post a few weeks ago on a topic of large number addition
   http://www.masm32.com/board/index.php?topic=9631.msg70539#msg70539
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm