Dear Math and FPU-Experts - help needed!

Mark_Larson · August 21, 2008, 08:03:48 PM

here is what the basic add loop looks like. I am doing two 64-bit adds at the same time using sse2. This is just the add part, I need to add the carry part next.

Code Select


   xor	eax,eax					; max = 0
	mov	ecx,eax 					; i = max

outer_loop:

;yasm automatically uses OFFSET so using it will break the code
; mov rsi, OFFSET src1
	mov	rsi,src1					; esi = src1 = f0
; mov rdi, OFFSET src2
	mov	rdi,src2					; edi = src2 = f1-
   pxor	xmm7,xmm7				; carry =  0

do_adds:
	movdqa	xmm0, [rsi]			; f0
	movdqa	xmm1, [rdi]			; f1
	paddq		xmm0,xmm7			; add f0 + carry, 64-bit add :)
	pxor		xmm7,xmm7			; carry = 0
	paddq		xmm1,xmm0			; add f0 + carry + f1

	add		rsi,16
	add		rdi,16
	sub		ecx,16				; don't do the shr, just subtract by 16
	jg			do_adds				; since we don't do a SHR we need to change this to a JG instead of a JNE

Mark_Larson · August 21, 2008, 09:43:31 PM

I decided to do MMX first. It has a 64-bit add ( paddq). There is no 128-bit add. You can do two 64-bit adds in parallel with SSE2, but then it gets more complex. So I wanted to get the basic stuff done, and then I'll tweak it from there. I just finished this. I don't know if it works or not. it starts with f0=1, and f1=1, does 10,000 loops at the highest priority. I only do the adding and carries, no pand. I stop when the number of digits = 8192. It takes me 89470.973000 cycles to get there. Or 33.635704135 nanoseconds. Still a lot faster. I'll test it tomorrow. And add the check in.

if you notice any bugs, please let me know :)

Code Select

; we set all of the first 18 digits to 9
	max_digits					dq		999999999999999999,0
; we 'subtract' 1 followed by 18 0's
	subtract						dq		1000000000000000000,0
; used as a comparison
	zero							dq		0,0
; used to set the carry flag to 1
	one							dq		1,1

   mov byte [src1], 1				; f0 = 1
   mov byte [src2], 1				; f1 = 1
   xor		eax,eax					; num_digits = 0 ( we're 0 based for the #)
	mov		ecx,eax 					; loop_cnt = num_digits = 0 ( we're 0 based for the #)
	movq		mm6,[max_digits]		; 18 9's
	movq		mm5,[subtract]			; subtract 1 followed by 18 0's
	movq		mm4,[one]				; set carry flag to 1
	movq		mm3,[zero]				; ; used as a comparison

outer_loop:
;yasm automatically uses OFFSET so using it will break the code
; mov rsi, OFFSET src1
	mov	rsi,src1					; esi = src1 = f0
; mov rdi, OFFSET src2
	mov	rdi,src2					; edi = src2 = f1-
   pxor	mm7,mm7					; carry =  0

do_adds:
	movq		mm0, [rsi]			; f0
	movq		mm1, [rdi]			; f1
	paddq		mm0,mm7				; add f0 + carry, 64-bit add :)
	pxor		mm7,mm7				; carry = 0
	paddq		mm1,mm0				; add f0 + carry + f1

	movq		mm0,mm1				; save mm1, we will need it later
	pcmpeqd	mm0,mm6				; 18 digits, all 9's
	pmovmskb	edx,mm0
	test		edx,edx
	jne		skip_carry
	
	psubq 	mm1, mm5				; subtract 1 followed by 18 0's
	movq		mm7, mm4				; carry = 1
	movq		[rdi], mm1

skip_carry:
	add		rsi,8
	add		rdi,8
	sub		ecx,8					; don't do the shr, just subtract by 4
	jg			do_adds				; since we don't do a SHR we need to change this to a JG instead of a JNE

	movq		mm0,mm7				; save mm7 (carry), we will need it later
	pcmpeqd	mm0,mm3				; carry = 0 ?
	pmovmskb	edx,mm0
	test		edx,edx				
	jz			skip
	add		eax,1					; num_digits += 1
	add		ecx,1					; i += 1	
skip:
	cmp		eax,8192                                 ; 8192 digits already done?
	jl			outer_loop

dsouza123 · August 22, 2008, 04:12:11 AM

An alternative pandigital test of bcd stored numbers.

Code Select


.686p                     ; create 32 bit code
.model flat, stdcall      ; 32 bit memory model
option casemap :none      ; case sensitive
.xmm

include    \masm32\include\windows.inc
include    \masm32\include\user32.inc
include    \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib

.data
    a db 1,2,3,4,5,6,7,8,9
    i dd 1,2,4,8,16,32,64,128,256,512 ; 1022 for pandigital
    
.code

start:
    mov   esi, offset a
    mov   edi, offset i
    mov   edx, 0
    mov   ecx, 9
@@:
    dec   ecx
    movzx eax, byte ptr [esi + ecx]
    mov   eax, [edi + eax*4]
    or    edx, eax
    jecxz @F
    jmp   @B
@@:
    mov   eax, 0        ; 0 is false
    xor   edx, 1022     ; if becomes 0, matches pandigital
    jnz   @F
    inc   eax           ; 1 is true
@@:
    invoke  ExitProcess, 0
end start

raymond · August 22, 2008, 05:01:53 AM

QuoteAn alternative pandigital test of bcd stored numbers.

Nice algo. I'll try it and see if it improves the overall timing.

dsouza123 · August 22, 2008, 11:28:41 AM

Thank you Raymond, a slight reordering of the instructions should reduce stalls due to read after write.

Code Select


    mov   ecx, 8
    mov   esi, offset a
    mov   edi, offset i
    mov   edx, 0
@@:
    movzx eax, byte ptr [esi + ecx]
    mov   eax, [edi + eax*4]
    or    edx, eax
    jecxz @F
    dec   ecx
    jmp   @B

Given the hint you gave earlier, to get the incredible subsecond timing for the whole
routine of finding the k with a double pandigital (first and last 9 both pandigitals), I'm guessing
you were keeping track of the last 9 digits, doing a pandigital test on it, only if it passed
doing a direct calculation of the specific k without needing the preceding ones.
There are a few formulas to do a direct calculation but the ones I found are fairly involved
but you must have figured out a way of simplifying one of them or creating one of your own.

Well done Raymond, on the subsecond timing of your complete program.

Mark_Larson · August 22, 2008, 02:14:07 PM

Quote from: dsouza123 on August 22, 2008, 11:28:41 AM

Code Select Expand
mov ecx, 8 mov esi, offset a mov edi, offset i mov edx, 0 @@: movzx eax, byte ptr [esi + ecx] mov eax, [edi + eax*4] or edx, eax jecxz @F dec ecx jmp @B

good idea :)

raymond · August 22, 2008, 03:21:18 PM

Here's the variant I used for testing:

Code Select

.data
      numbits     dd    1,2,4,8,10h,20h,40h,80h,100h,200h

.code
      mov   edi,offset source
      xor   eax,eax
      xor   edx,edx
      mov   ecx,8
   @@:
      mov   dl,[edi+ecx]
      add   eax,numbits[edx*4]
      sub   cl,1
      jnc   @B
      cmp   eax,3FEh
      jnz   not_pandigital

The timing for my initial program dropped from 0.703 sec to 0.656 sec.
The timing for my improved program dropped from 0.250 sec to only 0.219 sec although the pandigitality test is performed the same number of times (over 300,000 times) on the same data.

BTW, the only memory I use for either program is strictly in the .data section and is less than 150 bytes.

dsouza123: Your guess is partly right but only for my improved algo.

Mark_Larson · August 22, 2008, 07:40:01 PM

raymond,

Why don't you try my MMX code and see if it works? It's just missing the pand test. I am curious to see how much it would speed up your code. I also haven't tested it yet. For your pand test look at doing both the tests at the same time. It'll be faster since it'll break up stalls.

Mark

raymond · August 22, 2008, 08:29:12 PM

dsouza123: Amazing!!!! :clap: :dance: :dazzled: :U

I ran some more tests to check why there was a difference in speed up between my two programs. If I modify them to bypass all pandigitality tests while performing all other computations, the reported timings seem to be within 1 ms using your test algo. Meaning that running your algo 300000 times may take less than 1 ms.

I still don't understand why my initial pandigitality algo has such a large variation (31 ms vs 47 ms) between the two programs for the same number of tests.

Mark: I don't know what the format of the output with your MMX code would be in order to test for pandigitality. I don't have rsi nor rdi registers either. Can your code work also with less than 150 bytes of memory???

Mark_Larson · August 22, 2008, 09:07:07 PM

Quote from: raymond on August 22, 2008, 08:29:12 PM
Mark: I don't know what the format of the output with your MMX code would be in order to test for pandigitality. I don't have rsi nor rdi registers either. Can your code work also with less than 150 bytes of memory???

unlike d's code, I don't use any memory at all. Everything is kept in registers. I do have some XMM constants that are in memory. But that is well under 200 bytes. The only thing in memory is the two arrays. I call the arrays in my code 'src1' and 'src2'. Simply replace your array pointers in there. I can install MASM32 under linux if it'll make it easier for you? And then convert the code.

I cut and pasted MY ENTIRE code. So what you see above is all of it.

rsi = esi in 32-bit mode
rdi = edi in 32-bit mode

change all the 'r's to 'e' and it will work.

Yasm doesn't use the "offset" keyword, I marked both places that needed to be fixed up with comments.

EDIT: I ran 300,000 loops like you did, and my code ran in 1.233360564 MICROseconds not milli.

You have a P4 right? I can unroll the code more to run faster on your P4. If you have a core 2 duo, tell me now, because the optimization tricks are completely different. For instance when unrolling you want to keep the code size in your inner loops to 64 bytes. You have a greater potential to execute more than one instruction at the same time. So you have to watch out for those kinds of opportunities.

raymond · August 23, 2008, 03:21:13 AM

QuoteI don't use any memory at all
The only thing in memory is the two arrays

Which one is correct?? If the latter, what is the size of those arrays??? What's the format of the output of the code???

Those questions are probably due to the fact that I still can't figure out if your complex code is strictly to test for pandigitality, or to compute the Fibonacci numbers, or both.

If it's only to test for pandigitality, it can't speed up my code any more than dsouza123's algo within the 1 ms limit I can measure. At the microsecond level, it could very well be faster than d's algo if it's the sole purpose of the code, but would not have a significant effect (less than 1%) on the overall speed of my entire program using BCDs to compute the Fibonacci numbers.

(My P4 is the oldest one (Model 1) 1500 MHz. During the winter, I have a CoreDuo 1.89GHz.)

Mark_Larson · August 23, 2008, 11:05:09 AM

Quote from: raymond on August 23, 2008, 03:21:13 AM
QuoteI don't use any memory at all
The only thing in memory is the two arrays

Which one is correct?? If the latter, what is the size of those arrays??? What's the format of the output of the code???

Those questions are probably due to the fact that I still can't figure out if your complex code is strictly to test for pandigitality, or to compute the Fibonacci numbers, or both.

If it's only to test for pandigitality, it can't speed up my code any more than dsouza123's algo within the 1 ms limit I can measure. At the microsecond level, it could very well be faster than d's algo if it's the sole purpose of the code, but would not have a significant effect (less than 1%) on the overall speed of my entire program using BCDs to compute the Fibonacci numbers.

(My P4 is the oldest one (Model 1) 1500 MHz. During the winter, I have a CoreDuo 1.89GHz.)

I am cheating and allocating the arrays statically since YASM can do that fast. The array size is completely settable. You can put your dynamic allocation code in there, and change src1 ( F0) and src2 ( f1) and use your pointers to dynamic memory instead.

this code is NOT the pandigital test.

the array is 16 byte aligned. you can use the SSE2 mem_alloc that routine that lets you allocate memory aligned to a certain boundary. I think it is
crt_aligned_malloc
crt_aligned_free

but it's been a while.

Try changing your ADDing code to use my code. Leave d's pandigital test in there. Get that working, and then look at adding my pandigital test. I can speed it up more, now that I know you have a P4. We can unroll the loops more.

the format is the same as d's adding code.

I start adding at the BEGINNING of the array. and move up from there.

I init the first byte of f0 and f1 to a '1' in both cases. And then I start adding them together.

which parts of my code do you not understand? I can explain in more detail if you need me to.

EDIT: here is the code for the array. I got rid of 'src1' and 'src2' and changed it to 'f0' and 'f1' to make it clearer. I also added ARR_SIZE_BYTES, and that also changes one line of my code. Read my comments carefully. I added a lot more than normal for you.

NOTE: in 64-bit mode, it is faster to use 32-bit registers ( in general). But you can't do that for pointers and offsets. So the only code you will have to modify is the pointers and offsets. The RDI gets changed to EDI, and the RSI gets changed to ESI

Code Select


%define ARR_SIZE_BYTES					300000
		f0			resb	ARR_SIZE_BYTES + 64	; 64 byte padding for SSE2
		f1			resb	ARR_SIZE_BYTES + 64	; 64 byte padding for SSE2

Code Select


I PURPOSELY ALLOCATED 16 BYTES FOR EACH OF THE MMX CONSTANTS SO THAT I CAN SWITCH TO XMM easier.

section .data          ; say we are in the data segment

align 16
	zero_out_bytes_10_16:	dd		0FFFFFFFFh,0FFh,0,0
	subtract_one:				dd		11111111h,11111111h,11111111h,11111111h	
	subtract_four:				dd		44444444h,44444444h,44444444h,44444444h	
; we set all of the first 18 digits to 9
	max_digits					dq		999999999999999999,0
; we 'subtract' 1 followed by 18 0's
	subtract						dq		1000000000000000000,0
; used as a comparison
	zero							dq		0,0
; used to set the carry flag to 1
	one							dq		1,1

section .code        ; say we are in the code segment

   mov byte [f0], 1					; f0 = 1
   mov byte [f1], 1					; f1 = 1
   xor		eax,eax					; num_digits = 0 ( we're 0 based for the #)
	mov		ecx,eax 					; loop_cnt = num_digits = 0 ( we're 0 based for the #)
	movq		mm6,[max_digits]		; 18 9's
	movq		mm5,[subtract]			; subtract 1 followed by 18 0's
	movq		mm4,[one]				; set carry flag to 1
	movq		mm3,[zero]				; used as a comparison

outer_loop:
;yasm automatically uses OFFSET so using it will break the code
; mov esi, OFFSET f0
	mov	rsi,f0					; esi = f0
; mov edi, OFFSET f1
	mov	rdi,f1					; edi = f1
   pxor	mm7,mm7					; carry =  0

do_adds:
	movq		mm0, [rsi]			; f0
	movq		mm1, [rdi]			; f1
	paddq		mm0,mm7				; add f0 + carry, 64-bit add :)
	pxor		mm7,mm7				; carry = 0
	paddq		mm1,mm0				; add f0 + carry + f1

	movq		mm0,mm1				; save mm1, we will need it later
	pcmpeqd	mm0,mm6				; f0 + f1 + c = 999999999999999999
	pmovmskb	edx,mm0				; grab the highest bit of each byte and write it to EDX
	test		edx,edx				;
	jne		skip_carry			;
	
	psubq 	mm1, mm5				; subtract 1000000000000000000
	movq		mm7, mm4				; carry = 1
	movq		[rdi], mm1			; save the value to f1

skip_carry:
	add		rsi,8					; we deal with 8 bytes at a time
	add		rdi,8
	sub		ecx,8					; don't do the shr, just subtract by 4
	jg			do_adds				; since we don't do a SHR we need to change this to a JG instead of a JNE

	movq		mm0,mm7				; save mm7 (carry), we will need it later
	pcmpeqd	mm0,mm3				; carry = 0 ?
	pmovmskb	edx,mm0				; grab the highest bit of each byte and write it to EDX
	test		edx,edx				;
	jz			skip					;
	add		eax,1					; max += 1
	add		ecx,1					; i += 1	
skip:
	cmp		eax,ARR_SIZE_BYTES	; have we reached the maximum digit size we support?  the size goes from 0 to ARR_SIZE_BYTES - 1
	jl			outer_loop

Mark

Mark_Larson · August 23, 2008, 01:34:24 PM

raymond, there is a bug in my code that I am trying to track down. So hold off for a bit.

codewarp · August 23, 2008, 10:58:36 PM

Quote from: cobold on August 16, 2008, 03:47:03 AM
Cheers!

There's an interesting website with hundreds of Math-Problems to solve:
http://projecteuler.net/index.php?section=problems&id=104

I solved a few of the easier ones - no problem. But AARRRGHH!!!! - problem 104, I know _exactly_ how to solve it, but how shall I calcalute the 2749th Fibonacci-number with 32 or 64 bit!
It doesn't work - the numbers are simply too big!

I'll post my work in progress for problem 104 (code is a mess - I just changed generation of fib-sequence from 32 to 64 bit, via fpu but aaah!
k = 47 F(k) = 2971215073 is biggest Fib with 32bit-arithmetic....
k = 98 F(k) = 11810432329172123648 is the biggest Fib that one can generate with 64bit!!
Will this end up in manually adding 256,512 or 8192bit-fields???
Ideas? Help? (No code needed but perhaps a hint how this could be solved without using BIGBITFIELDS?

OK, let's think outside the box... Fibonocci calculation is just simple addition. A String is a sequence of bytes that can be an unlimited integer if you use it that way. All you need is to be able to ADD two strings together like big integers, corresponding bytes, propagating the carry on up. The sum needs more space only when the top bit is set. Using my own dynamic Strings, my calculation of fib(#2749) using this approach takes 580,000 clock cycles in C++, and results in a string result of 476 hexdigits starting with 100c38d67c9f4... and ending ...479ee63a70b0d9. I am counting fib( ) 1, 2, 3, 4, 5, 6, as 1, 2, 3, 5, 8, 13, and so on.

Also, BCD representation does not add anything to this process, except the ability to read off the answer in decimal. But you can pull 9 decimal digits at a time, from a 2000 bit integer that you divide by one billion (decimal) and take the remainder. This is not hard.

Also, those that think Fibonocci is uninteresting and a waste of time should find out about the Golden Section and its vast applications in nature and engineering.

Mark_Larson · August 23, 2008, 11:13:38 PM

you are correct, that is embarassing :bg

what makes it worse is I know that. I've used a lot of big number addition.

EDIT: as you can see from my post a few weeks ago on a topic of large number addition
http://www.masm32.com/board/index.php?topic=9631.msg70539#msg70539

News:

Dear Math and FPU-Experts - help needed!

dsouza123

dsouza123

codewarp