SSE2 replacement of ALU registers for 384 bit unsigned integers

Started by dsouza123, June 04, 2006, 05:09:23 AM

roticv

adc is slow; it is better to use MMX. I wonder how this fares against the SSE2 routine.

bnAdd2 proc bigintdes:dword,bigintsrc1,bigintsrc2
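;bigintdes = bigintsrc1 + bigintsrc2
;arguments are read via esp, so this assumes no automatic prologue (e.g. OPTION PROLOGUE:NONE)
N equ 12 ;number of 32-bit dwords (12 x 32 = 384 bits)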
mov eax, [esp+4]
mov ecx, [esp+8]
mov edx, [esp+12]
movd mm0, [ecx]
movd mm1, [edx]
paddq mm1, mm0
movd [eax], mm1
psrlq mm1, 32
i = 4
REPT N/2 - 1
movd mm0, [ecx+i]
movd mm2, [edx+i]
paddq mm2, mm0
paddq mm2, mm1
movd [eax+i],mm2
psrlq mm2, 32
i = i + 4
movd mm0, [ecx+i]
movd mm1, [edx+i]
paddq mm1, mm0
paddq mm1, mm2
movd [eax+i], mm1
psrlq mm1, 32
i = i + 4
ENDM
movd mm0,[ecx+i]
movd mm2,[edx+i]
paddq mm2, mm0
paddq mm2, mm1
movq [eax+i], mm2
retn 12
bnAdd2 endp
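
For readers who do not follow MMX assembly, the carry handling in the routine above can be modelled in portable C. This is only a sketch: the name bn_add_model, the returned carry, and the layout (least significant 32-bit limb first) are assumptions made for illustration, not part of the posted code. The idea is that each 32-bit limb is added in a 64-bit accumulator, the low half is stored, and a right shift by 32 leaves the carry for the next limb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the MMX routine: add two n-limb integers
   (32-bit limbs, least significant first) and return the final carry. */
static uint32_t bn_add_model(uint32_t *dst, const uint32_t *a,
                             const uint32_t *b, size_t n)
{
    uint64_t acc = 0;                 /* plays the role of mm1/mm2 */
    assert(dst && a && b);
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)a[i] + b[i]; /* movd + paddq: 32-bit values summed in 64 bits */
        dst[i] = (uint32_t)acc;       /* movd: store the low 32 bits */
        acc >>= 32;                   /* psrlq 32: only the carry (0 or 1) remains */
    }
    return (uint32_t)acc;
}
```

With n = 12 this covers the 384-bit case. One difference worth noting: the posted routine ends with movq, so it writes the final carry as an extra dword after the last limb (the destination needs room for 13 dwords), whereas the model returns the carry instead.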

Mark_Larson

Quote from: roticv on July 08, 2006, 01:02:29 PM
adc is slow; it is better to use MMX. I wonder how this fares against the SSE2 routine.


I actually did that earlier in the thread.

http://www.masm32.com/board/index.php?topic=4933.15


EDIT:

I modified your constant N to reflect the correct size in bytes of the big integer.

Here are some timings:

Quote
16 bytes
adc code   : 48
roticv MMX : 69

512 bytes
adc code   : 1460
roticv MMX : 3294

dsouza123

Instead of tackling multiple 128-bit adds as a chunk, the attached code does
a true 128-bit add with carry: it reads the carry flag at the start and sets it at the finish.
It uses only SSE2 registers, with no conditional code and no ALU registers,
which opens the possibility of running the ALU and SSE2 versions in parallel or interleaved.
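
Since the attachment is no longer available, here is a rough scalar model in C of what a flagless 128-bit add has to do. It is not the attached code: the name add128_lanes and the lane layout (x[0] least significant) are assumptions for illustration. Packed adds such as paddd wrap each lane independently and set no flags, so the carries between lanes have to be rebuilt from comparisons. The sketch does that with a "generate" bit (the lane overflowed) and a "propagate" bit (the lane is all ones and would pass an incoming carry on).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model: x[0] is the least significant 32-bit lane.
   Adds y and carry_in (0 or 1) into x and returns the carry out. */
static uint32_t add128_lanes(uint32_t x[4], const uint32_t y[4],
                             uint32_t carry_in)
{
    uint32_t sum[4], gen[4], prop[4];
    uint32_t c = carry_in;
    assert(carry_in <= 1);

    for (int i = 0; i < 4; i++) {
        sum[i]  = x[i] + y[i];            /* wraps like one paddd lane */
        gen[i]  = sum[i] < x[i];          /* lane overflowed: generates a carry */
        prop[i] = sum[i] == 0xFFFFFFFFu;  /* lane would pass an incoming carry on */
    }
    for (int i = 0; i < 4; i++) {
        x[i] = sum[i] + c;                /* apply the carry entering this lane */
        c = gen[i] | (prop[i] & c);       /* carry leaving this lane */
    }
    return c;
}
```

The loops have a fixed trip count of four and the comparisons only produce 0 or 1, so nothing in the data path depends on a branch. An SSE2 version would presumably get the same bits from packed compares and shuffles.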

I tried to reduce memory accesses and constants by keeping the values in SSE2 registers
and deriving new values from them with pshufd.  More optimizations are possible.

Perhaps there is some way of combining some methods used in the multiple chunks code.

Instead of using the carry flag, simply placing a 0 or 1 in the low byte of C1 at the start would also work,
keeping the carry flag free for the ALU version.
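
To illustrate passing the carry as an ordinary 0 or 1 value instead of going through the flag, here is a small C sketch. The type u128 and the functions add128_carry and add384 are hypothetical names for this example only. It chains three 128-bit adds into one 384-bit add, handing the carry from chunk to chunk as a plain integer.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical 128-bit chunk */

/* a += b + carry_in; returns the carry out (0 or 1). */
static unsigned add128_carry(u128 *a, const u128 *b, unsigned carry_in)
{
    assert(carry_in <= 1);
    uint64_t lo = a->lo + b->lo;
    unsigned c0 = lo < a->lo;     /* low half overflowed */
    lo += carry_in;
    c0 |= lo < carry_in;          /* adding the carry wrapped the low half to 0 */
    uint64_t hi = a->hi + b->hi;
    unsigned c1 = hi < a->hi;     /* high half overflowed */
    hi += c0;
    c1 |= hi < c0;                /* carry from the low half wrapped the high half */
    a->lo = lo;
    a->hi = hi;
    return c1;
}

/* 384-bit add as three chained 128-bit chunks, least significant first. */
static unsigned add384(u128 a[3], const u128 b[3], unsigned carry_in)
{
    unsigned c = carry_in;
    for (int i = 0; i < 3; i++)
        c = add128_carry(&a[i], &b[i], c);
    return c;
}
```

Because the carry lives in a normal variable here, the EFLAGS carry bit is never touched, which is the property that would let an adc-based routine run alongside.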

The testbed code is equivalent to

     stc              ; set carry, just for testing

     mov eax, B0+ 0   ; ALU version
     mov ebx, B0+ 4
     mov ecx, B0+ 8
     mov edx, B0+12
     adc A0+ 0, eax
     adc A0+ 4, ebx
     adc A0+ 8, ecx
     adc A0+12, edx


Along with Mark's MMX code there are now three versions using different register sets.

The newly released Core 2 CPUs have faster SSE2 performance,
so SSE2 versions of routines should benefit.

[attachment deleted by admin]