The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Eddy on October 19, 2005, 09:42:37 PM

Title: Using SIMD (MMX,SSE, SSE2, SSE3) for large integer arithmetic ?
Post by: Eddy on October 19, 2005, 09:42:37 PM
Hi all,

For the purpose of fast large integer (more than 256 bits numbers) arithmetic calculations, I have been looking into the pentium SIMD extensions (MMX, SSE, SSE2, SSE3).

Much to my disappointment, the integer arithmetic commands have one large drawback: they do not signal an over- or underflow condition. In other words, they do not affect any of the flags, such as the Carry Flag.

Also, the shift commands do not preserve the bits that are shifted out !!

This seems to mostly destroy the benefits of the faster SIMD commands, operating on the larger datatypes.
Obviously, the SIMD extensions were not implemented with fast large integer arithmetic in mind.

As a testcase I revisited my 'faster loop' thread:
http://www.masmforum.com/simple/index.php?topic=1747.0


I compared the calculation speed of 2 loops:
- #1 using 'classical' pentium commands
- #2 using SIMD commands

This is the core code of both loops:
- #1 using 'classical' pentium commands

        Add1:
            mov  edx, [eax+esi]     'get LONG of h1
            adc  edx, [ebx+esi]     'add LONG of h2
            mov  [ecx+esi], edx     'store result in hr

            mov  edx, [eax+esi+4]   'get LONG of h1
            adc  edx, [ebx+esi+4]   'add LONG of h2
            mov  [ecx+esi+4], edx   'store result in hr

            ADD esi, 8
        jnz Add1


- #2 using SIMD commands

        Add1a:
            movd  mm0, [eax+esi]    'get LONG of h1
            movd  mm2, [ebx+esi]    'get LONG of h2
            paddq mm2, mm0
            paddq mm2, mm1
            movd [ecx+esi], mm2     'store result
            psrlq mm2, 32
           
            movd  mm0, [eax+esi+4]  'get LONG of h1
            movd  mm1, [ebx+esi+4]  'get LONG of h2
            paddq mm1, mm0
            paddq mm1, mm2
            movd [ecx+esi+4], mm1   'store result
            psrlq mm1, 32

            ADD esi, 8

        jnz Add1a


Perhaps I didn't optimise the code as much as I could, but nevertheless the second loop was significantly (1.5 times) slower than the first one, not to my surprise because of all the overhead to detect the overflow condition...

My guess is that the shift operations will not be much better.

So (finally) my question:
Does anybody have positive experiences implementing large integer arithmetic routines (like +, -, *, /, shift) using SIMD commands ?

Kind regards
Eddy
Title: Re: Using SIMD (MMX,SSE, SSE2, SSE3) for large integer arithmetic ?
Post by: Farabi on October 22, 2005, 08:23:33 AM
Which is fastest? SIMD or non SIMD function?
Title: Re: Using SIMD (MMX,SSE, SSE2, SSE3) for large integer arithmetic ?
Post by: Eddy on October 22, 2005, 08:38:15 AM
The non SIMD code is faster. Mainly because it handles overflows (Carry flag) automatically.
With the SIMD code you have to add 'a lot' of commands to detect and handle overflows.

The Intel manual clearly states that it is the responsability of the programmer to make sure that overflows are detected and handled properly.
Causing a lot of overhead .. :(

Note also that although PADDQ does QUADWORD addition, we can only use it here for DOUBLEWORD addition because otherwise we could not detect the carry bit ...
A real shame that Intel did not add Carry Flag handling (over and underflow) for the SIMD commands. But they will have had their reasons probably...


Kind regards
Eddy

Title: Re: Using SIMD (MMX,SSE, SSE2, SSE3) for large integer arithmetic ?
Post by: Farabi on October 22, 2005, 09:12:43 AM
If PADDQ instruction is using the Floating Point format, PADDQ instruction is hopeless.
I think I will use SSE2 only to initialize the math table.
Title: Re: Using SIMD (MMX,SSE, SSE2, SSE3) for large integer arithmetic ?
Post by: Eddy on October 22, 2005, 09:15:22 AM
PADDQ does integer addition, not floating point!
For floating point arithmetic I am sure SIMD does a good job, although I haven't run any tests.

Kind regards
Eddy