Using SIMD (MMX,SSE, SSE2, SSE3) for large integer arithmetic ?

Started by Eddy, October 19, 2005, 09:42:37 PM

Previous topic - Next topic

Eddy

Hi all,

For the purpose of fast large integer (more than 256 bits numbers) arithmetic calculations, I have been looking into the pentium SIMD extensions (MMX, SSE, SSE2, SSE3).

Much to my disappointment, the integer arithmetic commands have one large drawback: they do not signal an over- or underflow condition. In other words, they do not affect any of the flags, such as the Carry Flag.

Also, the shift commands do not preserve the bits that are shifted out !!

This seems to mostly destroy the benefits of the faster SIMD commands, operating on the larger datatypes.
Obviously, the SIMD extensions were not implemented with fast large integer arithmetic in mind.

As a testcase I revisited my 'faster loop' thread:
http://www.masmforum.com/simple/index.php?topic=1747.0


I compared the calculation speed of 2 loops:
- #1 using 'classical' pentium commands
- #2 using SIMD commands

This is the core code of both loops:
- #1 using 'classical' pentium commands

        Add1:
            mov  edx, [eax+esi]     'get LONG of h1
            adc  edx, [ebx+esi]     'add LONG of h2
            mov  [ecx+esi], edx     'store result in hr

            mov  edx, [eax+esi+4]   'get LONG of h1
            adc  edx, [ebx+esi+4]   'add LONG of h2
            mov  [ecx+esi+4], edx   'store result in hr

            ADD esi, 8
        jnz Add1


- #2 using SIMD commands

        Add1a:
            movd  mm0, [eax+esi]    'get LONG of h1
            movd  mm2, [ebx+esi]    'get LONG of h2
            paddq mm2, mm0
            paddq mm2, mm1
            movd [ecx+esi], mm2     'store result
            psrlq mm2, 32
           
            movd  mm0, [eax+esi+4]  'get LONG of h1
            movd  mm1, [ebx+esi+4]  'get LONG of h2
            paddq mm1, mm0
            paddq mm1, mm2
            movd [ecx+esi+4], mm1   'store result
            psrlq mm1, 32

            ADD esi, 8

        jnz Add1a


Perhaps I didn't optimise the code as much as I could, but nevertheless the second loop was significantly (1.5 times) slower than the first one, not to my surprise because of all the overhead to detect the overflow condition...

My guess is that the shift operations will not be much better.

So (finally) my question:
Does anybody have positive experiences implementing large integer arithmetic routines (like +, -, *, /, shift) using SIMD commands ?

Kind regards
Eddy
Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

Farabi

Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

Eddy

The non SIMD code is faster. Mainly because it handles overflows (Carry flag) automatically.
With the SIMD code you have to add 'a lot' of commands to detect and handle overflows.

The Intel manual clearly states that it is the responsability of the programmer to make sure that overflows are detected and handled properly.
Causing a lot of overhead .. :(

Note also that although PADDQ does QUADWORD addition, we can only use it here for DOUBLEWORD addition because otherwise we could not detect the carry bit ...
A real shame that Intel did not add Carry Flag handling (over and underflow) for the SIMD commands. But they will have had their reasons probably...


Kind regards
Eddy

Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--

Farabi

If PADDQ instruction is using the Floating Point format, PADDQ instruction is hopeless.
I think I will use SSE2 only to initialize the math table.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

Eddy

PADDQ does integer addition, not floating point!
For floating point arithmetic I am sure SIMD does a good job, although I haven't run any tests.

Kind regards
Eddy

Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--