Faster loop?

Eddy · May 27, 2005, 12:16:32 PM

Speed gain by loop unrolling is somewhat less dramatic than I mentioned above.
I got the big difference in running time by leaving the 'INC ecx' in my initial routine... ::)

After using there also the 'ADD ecx, 1' the running time of the optimised routine is about 95% of that of the original routine.
Still worth using..

I'm looking into MMX and x87 FPU now. Totally new to me, so have to study first... :dazzled:

Kind regards
Eddy

hutch-- · May 27, 2005, 12:52:59 PM

Eddy,

Is there a way you can achieve the same functionality without using ADC (add with carry) ? If you can perform the action without the ADC and use ADD instead, it is likely to be faster.

roticv · May 27, 2005, 02:45:35 PM

hutch, he needs adc because of the fact that he is doing big int addition. But of course, he can replace it with mmx. like for instance

Code Select


bnAdd2 proc bigintdes:dword,bigintsrc1,bigintsrc2
;bigintdes = bigint1 + bigint2	
		mov		eax, [esp+4]
		mov		ecx, [esp+8]
		mov		edx, [esp+12]
		movd	mm0, [ecx]
		movd	mm1, [edx]
		paddq	mm1, mm0
		movd	[eax], mm1
		psrlq	mm1, 32
	i = 4
	REPT N/2 - 1
		movd	mm0, [ecx+i]
		movd	mm2, [edx+i]
		paddq	mm2, mm0
		paddq	mm2, mm1
		movd	[eax+i],mm2
		psrlq	mm2, 32
		i = i + 4
		movd	mm0, [ecx+i]
		movd	mm1, [edx+i]
		paddq	mm1, mm0
		paddq	mm1, mm2
		movd	[eax+i], mm1
		psrlq	mm1, 32
		i = i + 4
	ENDM
		movd	mm0,[ecx+i]
		movd	mm2,[edx+i]
		paddq	mm2, mm0
		paddq	mm2, mm1
		movq	[eax+i], mm2
		retn	12
bnAdd2 endp

Just modify it slightly so that you can make the length of teh big int be variable

Eddy · May 27, 2005, 02:51:48 PM

Hutch,

I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.

Addition of the first (Least significant) dword could be done by ADD, but this could already set the carry bit, so for the rest of the additions, the carry bit has to be tagged along.
Depending on the operand values, the carry can 'ripple' through all the dwords, up to the end. So, when using ADD, you would have to add ways to handle the carry.
ADD sets the CF, so after every ADD I could check if a CF is set.
If CF set, I could do: ADD edx, 1 (where edx holds the next dword). I could try if that would gain something but I fear that the extra CF check destroys the benefit of using ADD over ADC.

This said, does anyone know of a handy document (or tool) to look up the number of clock cycles that a certain instruction requires?
Could not find that kind of info in the Intel manuals I checked...

Thanks for the idea!

Kind regards
Eddy

Eddy · May 27, 2005, 03:09:49 PM

Roticv,

Am I understanding your code correctly?
Do you use paddq instead of paddd to include the carry bit?
Something like this?

Code Select


movd mm0, [ecx+i]    '<--get dword of operand 1
movd mm2, [edx+i]    '<--get dword of operand 2
paddq mm2, mm0      'Add 2 quadwords
paddq mm2, mm1      'Add carry bit (generated previously)
movd [eax+i],mm2    'Store addition result in mm2 
psrlq mm2, 32           'Shift carry bit to bit 0 for next addition

Is there a way to do this with paddd instead of paddq ?

Kind regards
Eddy

MazeGen · May 27, 2005, 03:12:06 PM

Quote from: Eddy on May 27, 2005, 02:51:48 PM
This said, does anyone know of a handy document (or tool) to look up the number of clock cycles that a certain instruction requires?
Could not find that kind of info in the Intel manuals I checked...

There is really interesting document: http://www.agner.org/assem/pentopt.pdf

Quote from: Eddy on May 27, 2005, 02:51:48 PM
I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.

BTW, if you know how to measure the code speed, you can try the following modification. It may seem strange, but I remember that somebody said it is much quicker on P4. I don't have the proof, though.

Code Select


    Add1:
    ! mov  edx, [eax+ecx*4] 'get DWORD of p1
    ! jnc over
    ! add  edx, 1 ' or inc edx
    over:
    ! add  edx, [ebx+ecx*4] 'add DWORD of p2
    ! mov  [esi+ecx*4], edx 'store result in pr
    ! inc  ecx
    ! jnz Add1

Eddy · May 27, 2005, 09:42:59 PM

Mazegen,

your code with the ADD instead of ADC does not give a shorter run time. Execution time is twice as long on my system (centrino).

Thanks for your suggestion though!

Kind regards
Eddy

Posit · May 28, 2005, 12:59:05 AM

Another thing you could try is to first copy all of the first operand into the memory reserved for the result, then add the second operand to that memory. So the addition loop becomes

Code Select



     ;ecx - contains length of 2nd operand
     ;esi - points to 2nd operand
     ;edi - points to result containing copy of 1st operand

     xor   eax,eax
AddDigit:
     mov  edx, [esi+eax*4]    ;get DWORD of p1
     adc  [edi+eax*4], edx    ;add to DWORD of p2 stored in pr
     add  eax,1
     loop AddDigit

Eddy · May 29, 2005, 10:01:29 PM

Posit,

Good idea! But this way still takes about 250% of the original routine.. :(

Kind regards
Eddy

roticv · May 30, 2005, 09:48:04 AM

Quote from: Eddy on May 27, 2005, 03:09:49 PM
Roticv,

Am I understanding your code correctly?
Do you use paddq instead of paddd to include the carry bit?
Something like this?
Code Select Expand
movd mm0, [ecx+i] '<--get dword of operand 1 movd mm2, [edx+i] '<--get dword of operand 2 paddq mm2, mm0 'Add 2 quadwords paddq mm2, mm1 'Add carry bit (generated previously) movd [eax+i],mm2 'Store addition result in mm2 psrlq mm2, 32 'Shift carry bit to bit 0 for next addition

Is there a way to do this with paddd instead of paddq ?

Kind regards
Eddy

Yes you are correct, the carry bit is stored in mm2/mm1 depending on which part of the loop you are referring to.

PS: I think the code was from mark larson

Eddy · May 30, 2005, 02:34:09 PM

Nice list of pentium opcodes and clock cycles:

http://home.zcu.cz/~dudacek/SOJ/manualy/80x86set.pdf

Kind regards
Eddy

roticv · May 30, 2005, 04:27:16 PM

I think it is outdated because we are no longer using pentium and it does not apply to PIV.

Eddy · May 30, 2005, 04:48:48 PM

I couldn't find a more recent one... :(

Mark Jones · May 30, 2005, 05:01:17 PM

Thanks for the link Eddy.

If it were possible to time single instructions with Michael's timing macros, I'd code an app to time every instruction. Then we could see timings for every processor.

MazeGen · May 30, 2005, 05:02:18 PM

Quote from: Eddy on May 30, 2005, 04:48:48 PM
I couldn't find a more recent one... :(

Well known, updated one is here: http://www.agner.org/assem/pentopt.pdf

News:

Faster loop?

roticv

MazeGen

Posit

roticv

roticv

MazeGen