News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Faster loop?

Started by Eddy, May 24, 2005, 08:32:14 PM

Previous topic - Next topic

Eddy

Speed gain by loop unrolling is somewhat less dramatic than I mentioned above.
I got the big difference in running time by leaving the 'INC ecx' in my initial routine... ::)

After using there also the 'ADD ecx, 1' the running time of the optimised routine is about 95% of that of the original routine.
Still worth using..

I'm looking into MMX and x87 FPU now. Totally new to me, so have to study first... :dazzled:

Kind regards
Eddy
Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

hutch--

Eddy,

Is there a way you can achieve the same functionality without using ADC (add with carry) ? If you can perform the action without the ADC and use ADD instead, it is likely to be faster.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

roticv

hutch, he needs adc because of the fact that he is doing big int addition. But of course, he can replace it with mmx. like for instance


bnAdd2 proc bigintdes:dword,bigintsrc1,bigintsrc2
;bigintdes = bigint1 + bigint2
mov eax, [esp+4]
mov ecx, [esp+8]
mov edx, [esp+12]
movd mm0, [ecx]
movd mm1, [edx]
paddq mm1, mm0
movd [eax], mm1
psrlq mm1, 32
i = 4
REPT N/2 - 1
movd mm0, [ecx+i]
movd mm2, [edx+i]
paddq mm2, mm0
paddq mm2, mm1
movd [eax+i],mm2
psrlq mm2, 32
i = i + 4
movd mm0, [ecx+i]
movd mm1, [edx+i]
paddq mm1, mm0
paddq mm1, mm2
movd [eax+i], mm1
psrlq mm1, 32
i = i + 4
ENDM
movd mm0,[ecx+i]
movd mm2,[edx+i]
paddq mm2, mm0
paddq mm2, mm1
movq [eax+i], mm2
retn 12
bnAdd2 endp

Just modify it slightly so that you can make the length of teh big int be variable

Eddy

Hutch,

I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.

Addition of the first (Least significant) dword could be done by ADD, but this could already set the carry bit, so for the rest of the additions, the carry bit has to be tagged along.
Depending on the operand values, the carry can 'ripple' through all the dwords, up to the end. So, when using ADD, you would have to add ways to handle the carry.
ADD sets the CF, so after every ADD I could check if a CF is set.
If CF set, I could do:  ADD edx, 1   (where edx holds the next dword). I could try if that would gain something but I fear that the extra CF check destroys the benefit of using ADD over ADC.

This said, does anyone know of a handy document (or tool) to look up the number of clock cycles that a certain instruction requires?
Could not find that kind of info in the Intel manuals I checked...

Thanks for the idea!

Kind regards
Eddy



Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

Eddy

Roticv,

Am I understanding your code correctly?
Do you use paddq instead of paddd to include the carry bit?
Something like this?

movd mm0, [ecx+i]    '<--get dword of operand 1
movd mm2, [edx+i]    '<--get dword of operand 2
paddq mm2, mm0      'Add 2 quadwords
paddq mm2, mm1      'Add carry bit (generated previously)
movd [eax+i],mm2    'Store addition result in mm2
psrlq mm2, 32           'Shift carry bit to bit 0 for next addition


Is there a way to do this with paddd instead of paddq ?

Kind regards
Eddy

Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

MazeGen

Quote from: Eddy on May 27, 2005, 02:51:48 PM
This said, does anyone know of a handy document (or tool) to look up the number of clock cycles that a certain instruction requires?
Could not find that kind of info in the Intel manuals I checked...

There is really interesting document: http://www.agner.org/assem/pentopt.pdf

Quote from: Eddy on May 27, 2005, 02:51:48 PM
I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.

BTW, if you know how to measure the code speed, you can try the following modification. It may seem strange, but I remember that somebody said it is much quicker on P4. I don't have the proof, though.


    Add1:
    ! mov  edx, [eax+ecx*4] 'get DWORD of p1
    ! jnc over
    ! add  edx, 1 ' or inc edx
    over:
    ! add  edx, [ebx+ecx*4] 'add DWORD of p2
    ! mov  [esi+ecx*4], edx 'store result in pr
    ! inc  ecx
    ! jnz Add1

Eddy

Mazegen,

your code with the ADD instead of ADC does not give a shorter run time. Execution time is twice as long on my system (centrino).

Thanks for your suggestion though!

Kind regards
Eddy
Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

Posit

Another thing you could try is to first copy all of the first operand into the memory reserved for the result, then add the second operand to that memory. So the addition loop becomes



     ;ecx - contains length of 2nd operand
     ;esi - points to 2nd operand
     ;edi - points to result containing copy of 1st operand

     xor   eax,eax
AddDigit:
     mov  edx, [esi+eax*4]    ;get DWORD of p1
     adc  [edi+eax*4], edx    ;add to DWORD of p2 stored in pr
     add  eax,1
     loop AddDigit




Eddy

Posit,

Good idea!  But this way still takes about 250% of the original routine..  :(

Kind regards
Eddy
Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

roticv

Quote from: Eddy on May 27, 2005, 03:09:49 PM
Roticv,

Am I understanding your code correctly?
Do you use paddq instead of paddd to include the carry bit?
Something like this?

movd mm0, [ecx+i]    '<--get dword of operand 1
movd mm2, [edx+i]    '<--get dword of operand 2
paddq mm2, mm0      'Add 2 quadwords
paddq mm2, mm1      'Add carry bit (generated previously)
movd [eax+i],mm2    'Store addition result in mm2
psrlq mm2, 32           'Shift carry bit to bit 0 for next addition


Is there a way to do this with paddd instead of paddq ?

Kind regards
Eddy


Yes you are correct, the carry bit is stored in mm2/mm1 depending on which part of the loop you are referring to.

PS: I think the code was from mark larson

Eddy

Nice list of pentium opcodes and clock cycles:

http://home.zcu.cz/~dudacek/SOJ/manualy/80x86set.pdf


Kind regards
Eddy
Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

roticv

I think it is outdated because we are no longer using pentium and it does not apply to PIV.

Eddy

I couldn't find a more recent one... :(

Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

Mark Jones

Thanks for the link Eddy.

If it were possible to time single instructions with Michael's timing macros, I'd code an app to time every instruction. Then we could see timings for every processor.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

MazeGen