SSE2 replacement of ALU registers for 384 bit unsigned integers

Started by dsouza123, June 04, 2006, 05:09:23 AM

Previous topic - Next topic

Ossa

Quote from: asmfan on June 14, 2006, 08:24:44 AM
Of course it's the matter of little endian:)

Good point asmfan, I got the orders mixed up (must have been falling asleep). That was exactly what I was trying to fix... but I seem to have got a bit mixed up on the byte ordering and "unfixed" it.

Ossa
Website (very old): ossa.the-wot.co.uk

asmfan

But now we have a addition and subtraction algorithms on BIG numbers:)
Russia is a weird place

dsouza123

Working code, hopefully, I don't have a CPU that has SSE2 to test it,
for a pure SSE2 add of 512 bit unsigned integers,
increased it from 384 to 512 for symmetry/efficiency.

The code is a testbed, it works by generating combinations of numbers
adding them using the ALU with pairs of 4 mov/ 4 adc instructions then
adding them using SSE2,
after it compares the results and adds to a mismatch count if they aren't equal.
It displays a messagebox with the mismatch count when done.

The numbers are qwords with the top two bits  63:62 varying from 11 to 00
and the low bit 0 always 1, so a 128 bit SSE2 register 127:0 could be  11...1,11...1
or down to 00...1,00...1

Ideally it would use 16 bits, two for each of the high bits of all 16 qwords but
it might take too long, 2^32 combinations 65536 * 65536
so instead I used 2^20 combinations 1024 * 1024,
for the values of i and j in the nested repeat until loops.

Before the SSE2 code was added, using only ALU with 65536 for both loops
it took 11.5 minutes (1.2 Ghz) and was very slow to respond. 
Dropped them to 1024 and added multiple invoke sleep, 0
to cause it to yield the CPU more often.

Hopefully someone with an appropriate CPU with SSE2 could test it.

I am sure there better ways of doing a code harness, without sticking
a number of invoke sleep, 0 to remove the impact on responsiveness,
such as setting the code in a worker thread, with the main thread handling
system communication or some other effective methods.

An overview of the SSE2 add:

Computes logical operations then stores them.
(ie AND, OR, XOR of the two 512 bit unsigned integers 256 bits at a time.)
ADDs (without qword carries) then stores, the original two 512 bit numbers.
Performs combinations of logical operations with the four previous operations.
Creates carries for both 512 bits.
Shifts the carries 1 bit left.
Adds the carries to the ADDs.

The result should be same as the ALU version.

[attachment deleted by admin]

Ossa

Website (very old): ossa.the-wot.co.uk

Mark_Larson


  I also get CNT: 0.  It takes roughly 7 seconds to run.  Did you also try to see how fast the new one I wrote that supports arbitrary sizes?

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

dsouza123

Ossa and Mark,
Thanks for testing it.

Mark
Yes, I saw your optimized code, the alloc/16 threw me for a loop at first,
I was thinking bits and you were dealing with bytes,
now that I reread it it makes perfect sense.
Also the use of ebp, I thought you were doing some trick with the
stack, but now I see it was just a way to hold the counter in a register,
the other 6 normal ones were all in use.

When is it safe to use ebp ?

Your code is very flexible, small and quick versus
the fully unrolled hard coded offset version that only is quick.
I just copied and pasted then extended the working code I already had.

My biggest concern was primarily correctness of the SSE2 code
with a secondary issue of performance by minimizing moves from/to memory
and holding values in SSE2 registers.

So if your code was dropped in place of the hardcoded ALU version it would
be the code below. 

  mov  esi,[B0]
  mov  edi,[A0]

  push ebp
  mov  ebp, 4   ; 64 bytes/16 bytes per oword == 4 owords == 512 bits
  clc
looper:
  mov eax, [esi +  0]
  mov ebx, [esi +  4]
  mov ecx, [esi +  8]
  mov edx, [esi + 12]

  adc [edi +  0], eax
  adc [edi +  4], ebx
  adc [edi +  8], ecx
  adc [edi + 12], edx

  lea esi,[esi+16]
  lea edi,[edi+16]
   
  dec ebp
  jnz looper

  pop ebp

savage

Quote from: dsouza123 on June 16, 2006, 04:27:36 PM
When is it safe to use ebp ?

My answer would be whenever you are not using local variables or function parameters.

Ian_B

Quote from: savage on June 21, 2006, 02:50:38 PM
My answer would be whenever you are not using local variables or function parameters.

You mean when you are not referencing the function parameters by their friendly variable names... You can still use function parameters when you aren't using stack frames and EBP, but you will need to reference them directly by [ESP+offset] and keep track yourself of where on the stack they move to if you are PUSHing and POPping other values. It's not hard, just fiddly, but worth it if you need the extra register badly or you don't need to access the params much.

Ian_B

dsouza123

In reviewing the SSE2 instruction set, there is a mystery I haven't resolved.

There seems to be two parallel sets of instructions that have duplicate function.

ANDPD
ANDNPD
XORPD
ORPD

PAND
PANDN
PXOR
POR

Is there a performance advantage to one set over the other ?
Are they truely the same, 128 bit operations,
or are there some subtle functional differences between them ?

Mark_Larson

Quote from: dsouza123 on June 25, 2006, 11:18:34 PM
In reviewing the SSE2 instruction set, there is a mystery I haven't resolved.

There seems to be two parallel sets of instructions that have duplicate function.

ANDPD
ANDNPD
XORPD
ORPD

PAND
PANDN
PXOR
POR

Is there a performance advantage to one set over the other ?
Are they truely the same, 128 bit operations,
or are there some subtle functional differences between them ?

  They are the same instructions.  There is no subtle differences or performance advantages.  There are even more duplicated instructions.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

dsouza123

Thanks Mark

After success with SSE2 for addition, the next is SSE2 for subtraction.

Here are the truth tables used for addition and subtraction.
The addition tables should be correct because the addition code works.
The subtraction tables hopefully are correct but the subtraction code doesn't
work correctly half the time.  It may be errors in the code and not with the tables.
Or fairly unlikely the emulated system it was tested with.

Without a CPU capable of SSE2, I have tried a different solution.
The testing was done on BOCHS with an emulated AMD64 CPU which has SSE2.

The standard BOCHS is a Pentium III but it doesn't have SSE2, someone compiled
the latest 2.2.6 version with an Athlon64 as the emulated CPU.

The addition code that Mark and Ossa tested on real CPUs,
also returned a count of 0 mismatches under BOCHS in 32-bit Windows.
Giving me confidence in the BOCHS emulation.


  0  0  0  0  1  1  1  1  C Carry
  0  0  1  1  0  0  1  1  A0
  0  1  0  1  0  1  0  1  B0
-- -- -- -- -- -- -- --
00 01 01 10 01 10 10 11  D aDd
          ==    == == ==

  0  0  0  1  0  0  0  1  A And   <----
  0  1  0  0  0  1  0  0  N andN
  0  1  1  1  0  1  1  1  O Or    <----
  0  1  1  0  0  1  1  0  X Xor   <----
  0  1  1  0  1  0  0  1  D aDd   <----

  0  0  0  0  1  1  1  1  xXD
  0  0  0  0  0  1  1  1  aOxXD
  0  0  0  1  0  1  1  1  oAaOxXD
==========++====++=++=++==============

  0  0  0  0  1  1  1  1  B Borrow
  0  0  1  1  0  0  1  1  A0
  0  1  0  1  0  1  0  1  B0
-- -- -- -- -- -- -- --
00 11 01 00 11 10 00 11  U sUb
    ==       == ==    ==

  0  0  0  1  0  0  0  1  A And
  0  1  0  0  0  1  0  0  N andN  <----
  0  1  1  1  0  1  1  1  O Or
  0  1  1  0  0  1  1  0  X Xor   <----
  0  1  1  0  1  0  0  1  U sUb   <----

  0  0  0  0  1  1  1  1  xXU
  0  0  0  0  1  0  0  1  aUxXU     aOxXU
  0  1  0  0  1  1  0  1  oNaUxXU   aOxXU
====++=======++=++====++==============


Keys for the tables.

lower case for the operation:
x for xor, a for and, o for or

upper case for the result of operations on the original values:
A for And, O for Or, N for andN, X for Xor, U for sUb, D for aDd


The current code and executable that give a WRONG answer half the time is attached. 
I believe the comparison ALU subtract code is correct, the subtraction tables probably OK,
the SSE2 code likely has the error(s).

[attachment deleted by admin]

dsouza123

Error detected !

Carry propagation error, with the full SSE2 addition.

In 1 out of 2^64 possible conditions the carry isn't propagated.
I have found the SSE2 add will fail to produce a carry for one condition.

If the result of two qwords added together, without the incoming carry,
is all 1s, ie FFFF FFFF FFFF FFFF, then later after adding an incoming 1 carry
no outgoing 1 carry is produced.
The incorrect outgoing carry produced by this special condition is 0.

  Correct: incoming carry 0, produces an outgoing carry 0.
Incorrect: incoming carry 1, produces an outgoing carry 0.

The current qword will be correct.

When the incoming 1 carry is added, the qword result will correctly become 0.
The NEXT higher qword wont get an incoming 1 carry though.

An example using bytes instead of qwords.
0,1111 1111,1111 1111  A0
0,0000 0000,1111 1111  B0
---------------------
0,1111 1111,1111 1110  ADD without carries
0,1000 0000,0000 0000  XOR  (Only showing bits 15, 7)
0,0000 0000,1000 0000  xXD
0,1000 0000,1000 0000  OR
0,0000 0000,1000 0000  aOxXD
0,0000 0000,1000 0000  AND
0,0000 0000,1000 0000  oAaOxXD final step to calc carries
0,0000 0001,0000 0000  shl 1  put in the correct spots
0,1111 1111,1111 1110  ADD without carries ( calc'd earlier )
0,0000 0000,1111 1110  ADD with carries, result is INCORRECT

1,0000 0000,1111 1110  Correct result with incoming and outgoing carries calc'd

No pure SSE2 fix yet.

dsouza123

No pure SSE2 solution for ADD,
but the working version that uses the ALU to prevent the carry propagation error.
It works correctly with all the values I've tried.

Also tried to emulate SSE2 instructions using ALU instructions ( add adc add adc no carries between qwords),
unfortunately there is/are some bug(s) and it doesn't match either SSE2 or conventional ALU results.

I'll continue working on pure SSE2 multi dword addition until it works 100 percent. (No carry propagation error).

Any ideas/code to fix the carry propagation error, and/or the emulated SSE2 code are welcome.



[attachment deleted by admin]

EduardoS

First a sugestion, to test the logic use MMX instructions instead of SSE replacing qword for dwords, the logic should be the same,

Now about the carry propagation, i have an idea, let's say i have registers with n words of 2n bits and want sum two of them with carry propagation:
1) Sum both registers put the carry of each in the odds bits of a new registers (C size 2n), ex: the carry of first word go to bit 1, the carry of second word to bit 3, etc;

2) put 1 in the even bits of C where the corresponding words are all 1s (ie: byte=255, word=65535, etc), ex: if first word is all 1s put 1 in bit 0, etc;

3) add to C 101010... (a sequnce of "2" in base 4), not C and shift left C by 1;

4) add the even bits in C to the big registers with the result of first instruction (ignore the carry now).


Example:
(in hex, C in binary, n = 4)
1)
(67, C6, 89, A5) + (45, D7, 76, 86) = (AC, 9D, FF, 2B) -- The result withou carry propagation
C = (00100010) -- Carries

2)
C = (00100110) -- Carries and "all 1s"

3)
C = (11010000) -- C + 10101010
C = (00101111) -- Not C
C = (01011110) -- C << 1

4)
(01, 01, 01, 00) -- The value to add
(AD, 9E, 00, 2B) --  The final result, sum of two registers with carry propagation (two 2nn bits number).



I will try to post here the SSE implementention of the above, as soon my time permit...

dsouza123

Important, to run the attached program requires a CPU with SSE2 instruction support.

This is a testbed program that does both a pure SSE2 carry propagation fix and an ALU version of the fix.
The two versions should produce identical results but the SSE2 only works part way, the ALU works correctly.

After both have attempted to propagate the carry,
16 MessageBoxes are displayed  for dwords 15 downto 0
each shows the carry for SSE2, ALU; 16 (the number of dwords); the index of the current dword.

Finally a MessageBox with the Mismatch count; Total of SSE2 carries added; Total of ALU carries added.

It is nearly working, the carry for the SSE2 propagates a while before stopping.
The ALU version works correctly.

The algorithm takes a qword, if it is all bits set to 1,  FFFF FFFF FFFF FFFF, -1
then it's paired qword carry is ORed with the next higher qword carry, thus propagating the carry.

It doesn't matter if the current paired carry is 1 or 0 since it is ORed it will max out at 1
and if the paired carry is 0 it wont propagate the carry, but does no harm.

How the SSE2 version works, there doesn't seem to be a workable qword comparison SSE2 instruction
just FP Double which has issues with the evaluation of some bit patterns into NAN etc, so dwords are used.

The dwords have a problem, they aren't qwords, and can differ in evaluation, so after the dwords are compared
they are shuffled then ANDed, if they are both FFFF FFFF they remain FFFF FFFF else become 0.

The pcmpeqd instruction for SSE2 compairs paired dwords, if they are the same,
the destination's dword for each pair is set to FFFF FFFF else to 0000 0000.

With the dwords in xmm? indexed 3, 2, 1, 0 the dword shuffle instruction pshufd
with the byte size order operand == 0b1h, shuffles them to  2, 3, 0, 1

The AND of the two xmm registers will produce either all 1s or all 0s for each qword.

The carry qword is ANDed with the value qword and shifted then ORed with the next higher carry qword.
The instructions psrldq and pslldq shift an oword a multiple of eight bits at a time either right or left, 
like  shr eax, 8  or shl eax, 16 do for dwords.


The testbed program, source and exe are in the attached zip file.

Here is essence of the code, maybe someone can spot the subtle error.


.data
   A0  dd 16 dup (-1)  ; x0 is for values
   C0  dd 16 dup (-1)
   A1  dd 16 dup (0)   ; x1 is for carries
   C1  dd 16 dup (0)

   YY  dd -1,-1,-1,-1  ; all 1s  11111111111

.code
    mov A0, 0    ; result of lowest qword if it was all 1s and 1 added to it
    mov A0+4, 0
    mov C0, 0
    mov C0+4, 0
    mov [A1+8], 1  ; the carry was put in the next higher carry qword
    mov [C1+8], 1

    movdqa  xmm0, oword ptr [A1 +  0]
    movdqa  xmm1, oword ptr [A1 + 16]
    movdqa  xmm2, oword ptr [A1 + 32]
    movdqa  xmm3, oword ptr [A1 + 48]

    movdqa  xmm7, oword ptr [A0 +  0]
    pcmpeqd xmm7, oword ptr [YY +  0]  ; compare with all 1111  if == 1111, != 0000  per dword
    pshufd  xmm6, xmm7, 0b1h           ; swap dwords  3210 --> 2301
    andpd   xmm6, xmm7                 ; AND 32, 23  AND 10, 01   if same == 1111, diff 0000
    andpd   xmm6, xmm0                 ; C1 + 0
    psrldq  xmm6, 8
    orpd    xmm1, xmm6                 ; oword ptr [A1 + 16]  1a

    movdqa  xmm7, oword ptr [A0 + 16]
    pcmpeqd xmm7, oword ptr [YY +  0]  ; compare with all 1111  if == 1111, != 0000  per dword
    pshufd  xmm6, xmm7, 0b1h           ; swap dwords  3210 --> 2301
    andpd   xmm6, xmm7                 ; AND 32, 23  AND 10, 01   if same == 1111, diff 0000
    andpd   xmm6, xmm1                 ; C1 + 16
    movdqa  xmm5, xmm6
    pslldq  xmm5, 8
    orpd    xmm1, xmm5                 ; oword ptr [A1 + 16]  1b
    andpd   xmm6, xmm1
    psrldq  xmm6, 8
    orpd    xmm2, xmm6                 ; oword ptr [A1 + 32]  2a

    movdqa  xmm7, oword ptr [A0 + 32]
    pcmpeqd xmm7, oword ptr [YY +  0]  ; compare with all 1111  if == 1111, != 0000  per dword
    pshufd  xmm6, xmm7, 0b1h            ; swap dwords  3210 --> 2301
    andpd   xmm6, xmm7                 ; AND 32, 23  AND 10, 01   if same == 1111, diff 0000
    andpd   xmm6, xmm2                 ; C1 + 32
    movdqa  xmm5, xmm6
    pslldq  xmm5, 8
    orpd    xmm2, xmm5                 ; oword ptr [A1 + 32]  2b
    andpd   xmm6, xmm2
    psrldq  xmm6, 8
    orpd    xmm3, xmm6                 ; oword ptr [A1 + 48]  3a

    movdqa  xmm7, oword ptr [A0 + 48]
    pcmpeqd xmm7, oword ptr [YY +  0]  ; compare with all 1111  if == 1111, != 0000  per dword
    pshufd  xmm6, xmm7, 0b1h           ; swap dwords  3210 --> 2301
    andpd   xmm6, xmm7                 ; AND 32, 23  AND 10, 01   if same == 1111, diff 0000
    andpd   xmm6, xmm3                 ; C1 + 48
    pslldq  xmm6, 8
    orpd    xmm3, xmm6                 ; oword ptr [A1 + 48]  3b

    movdqa  oword ptr [A1 +  0], xmm0
    movdqa  oword ptr [A1 + 16], xmm1
    movdqa  oword ptr [A1 + 32], xmm2
    movdqa  oword ptr [A1 + 48], xmm3
    .endif

    nop
;==============================
    nop

    mov ecx, 8                                           ; if qword == -1  whatever is in C1+0 is ORed with C1+8
    .if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)     ;
      mov eax, [C1 + ecx+0]                              ; 0 or 1 doesn't matter
      or  [C1 + ecx+8], eax                              ; and [C1 + ecx+8],1  not needed if C1 is only 1 bit
    .endif
    nop
    add ecx, 8
    .if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
      mov eax, [C1 + ecx+0]
      or  [C1 + ecx+8], eax
    .endif
    add ecx, 8
    .if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
      mov eax, [C1 + ecx+0]
      or  [C1 + ecx+8], eax
    .endif
    add ecx, 8
    .if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
      mov eax, [C1 + ecx+0]
      or  [C1 + ecx+8], eax
    .endif
    add ecx, 8
    .if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
      mov eax, [C1 + ecx+0]
      or  [C1 + ecx+8], eax
    .endif
    add ecx, 8
    .if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
      mov eax, [C1 + ecx+0]
      or  [C1 + ecx+8], eax
    .endif
    .endif

[attachment deleted by admin]