Doing unsigned integer math operations on arrays of 12 unsigned dwords (384 bits),
such as addition, subtraction and shifts (multiplying/dividing by powers of 2),
using the ALU registers EAX, EBX etc. is straightforward.
Doing the same using SSE2 instructions could in theory increase performance,
because 128 bits, equivalent to 4 dwords, can be operated on at a time instead of only 32 bit chunks.
The following is addition using ALU instructions, then an incorrect first attempt with SSE2.
Carries aren't propagated past 64 bit chunks with SSE2, so something has to be done
to compensate for it.
The logical functions AND, OR, XOR, ANDNOT are available.
NOT can be simulated by subtracting each qword from a qword with all bits set;
the byte equivalent of NOT x is 255 - x.
Bit functions can be simulated by the logical functions AND, OR.
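A quick way to convince yourself of the NOT-by-subtraction identity; this is a Python sketch with ordinary integers standing in for xmm qwords, not SSE2 code:

```python
# NOT x == (all bits set) - x for any fixed width: subtracting from
# 0xFF...FF can never borrow, so a psubq from a register of all 1s
# yields the bitwise complement of each qword.
MASK64 = (1 << 64) - 1

def not_via_sub(x):
    return MASK64 - x  # models psubq from 0xFFFFFFFFFFFFFFFF

for x in (0, 1, 0xDEADBEEF, MASK64):
    assert not_via_sub(x) == (~x) & MASK64
print("ok")
```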
.data
ALIGN 16
a dd 3239879239, 1309340349, 2349087349, 4020030040
dd 239879239, 2023987439, 1234567890, 23978239
dd 1245409999, 3340340309, 343498392, 3020030040
b dd 1340983534, 3295798324, 3843298739, 1897349923
dd 2398721309, 1347204809, 4085034850, 2409872134
dd 2594597492, 2398792379, 1398349830, 4139823991
c dd 3239879239, 1309340349, 2349087349, 4020030040
dd 239879239, 2023987439, 1234567890, 23978239
dd 1245409999, 3340340309, 343498392, 3020030040
d dd 1340983534, 3295798324, 3843298739, 1897349923
dd 2398721309, 1347204809, 4085034850, 2409872134
dd 2594597492, 2398792379, 1398349830, 4139823991
e dd 12 dup (0) ; 384 bit unsigned values
f dd 12 dup (0)
.code
start:
;-------- 384 bit unsigned add using ALU registers in dword chunks
clc
mov eax, [c + 0]
mov ebx, [c + 4]
mov ecx, [c + 8]
mov edx, [c + 12]
adc [d + 0], eax
adc [d + 4], ebx
adc [d + 8], ecx
adc [d + 12], edx
mov eax, [c + 16]
mov ebx, [c + 20]
mov ecx, [c + 24]
mov edx, [c + 28]
adc [d + 16], eax
adc [d + 20], ebx
adc [d + 24], ecx
adc [d + 28], edx
mov eax, [c + 32]
mov ebx, [c + 36]
mov ecx, [c + 40]
mov edx, [c + 44]
adc [d + 32], eax
adc [d + 36], ebx
adc [d + 40], ecx
adc [d + 44], edx
;-------- 384 bit unsigned add with 128 bit oword chunks
; first attempt gives obviously INCORRECT results
movdqa xmm0, oword ptr [a + 0]
movdqa xmm1, oword ptr [a + 16]
movdqa xmm2, oword ptr [a + 32]
paddq xmm0, oword ptr [b + 0]
paddq xmm1, oword ptr [b + 16]
paddq xmm2, oword ptr [b + 32]
movdqa oword ptr [a + 0], xmm0
movdqa oword ptr [a + 16], xmm1
movdqa oword ptr [a + 32], xmm2
end start
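As a reference model, the adc chain above can be sketched in a few lines of Python (a model of the behavior, not the author's code): each dword limb is added together with the carry out of the previous limb, which is exactly what ADC computes.

```python
# Model of the 384 bit ALU add: 12 dword limbs, carry rippling upward.
def add384(a_limbs, b_limbs):
    out, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):   # limb 0 is least significant
        s = a + b + carry                # what adc computes
        out.append(s & 0xFFFFFFFF)
        carry = s >> 32                  # the carry flag for the next limb
    return out, carry

# cross-check against Python's big integers
def to_int(limbs):
    return sum(x << (32 * i) for i, x in enumerate(limbs))

a = [0xFFFFFFFF] * 12                    # all 1s: worst case for carries
b = [1] + [0] * 11
limbs, carry_out = add384(a, b)
assert to_int(limbs) == (to_int(a) + to_int(b)) & ((1 << 384) - 1)
assert carry_out == 1                    # the add overflows 384 bits
print("ok")
```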
Adding a value with only bits 0 and/or 64 set, out of bits 0 to 127,
would provide the carry/carries; determining which of the 3 possible versions
applies is still an issue.
Any ideas/code for doing any of the math operations ?
Perform several 384 bit operations in parallel, instead of trying to do one 384 bit operation serially, because these instructions are made with parallel execution in mind.
For example, work with two 64 bit lanes at a time, and maybe unroll it to process four 64 bit lanes at a time so all execution units work in parallel,
but that requires restructuring your data to be efficient.
For addition, do a packed compare less than, use the result to mask a 1,1,1,1,... constant, and do a second padd to add this carry,
once for each carry operation that is needed;
each 1 emulates a carry bit.
I doubt it will be faster with so much extra code to emulate carry operations.
Maybe if you mix ALU and SSE2 it helps, because they execute in different execution units ?
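The compare-and-mask suggestion rests on a standard identity: for unsigned addition modulo 2^n, a carry occurred exactly when the truncated sum is less than either operand. A Python sketch of just that identity (SSE2's packed compares are signed, so real code would need an extra bias, which this sketch ignores):

```python
# After s = (a + b) mod 2^64, an unsigned carry occurred iff s < a
# (equivalently s < b): the basis for detecting carries with a compare
# when no carry flag is available.
M64 = (1 << 64) - 1

def add_detect_carry(a, b):
    s = (a + b) & M64
    carry = 1 if s < a else 0   # the compare result, reduced to one bit
    return s, carry

assert add_detect_carry(M64, 1) == (0, 1)
assert add_detect_carry(2, 3) == (5, 0)
print("ok")
```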
After trying various methods, this is the most compact and least complicated.
I believe this will do the add of two 384 bit unsigned numbers with SSE2, but it is untested.
Is there another way to move bit 127 to bit 0 within an xmm register, other than by a 126 bit right shift ?
Example: if it was a byte instead of an oword, 1000 0000 to 0000 0001.
Remember the lower qword may have the high bit set, which is dealt with by the 126 bit right shift.
Could extract the high dword or high byte, ROL it, then place it in the low dword or low byte of a zeroed 128 bit temp,
but I am trying to use only SSE2 (or SSE if appropriate) opcodes.
.data
ALIGN 16
a dd 3239879239, 1309340349, 2349087349, 4020030040
dd 239879239, 2023987439, 1234567890, 23978239
dd 1245409999, 3340340309, 343498392, 3020030040
b dd 1340983534, 3295798324, 3843298739, 1897349923
dd 2398721309, 1347204809, 4085034850, 2409872134
dd 2594597492, 2398792379, 1398349830, 139823991 ; took off leading 4 in high dword
c dd 3239879239, 1309340349, 2349087349, 4020030040
dd 239879239, 2023987439, 1234567890, 23978239
dd 1245409999, 3340340309, 343498392, 3020030040
d dd 1340983534, 3295798324, 3843298739, 1897349923
dd 2398721309, 1347204809, 4085034850, 2409872134
dd 2594597492, 2398792379, 1398349830, 139823991 ; took off leading 4 in high dword
g dd 0, 2147483648, 0, 2147483648
.code
start:
; conventional ALU add of two 384 bit unsigned integers, a dword pair at a time
clc
mov eax, [d + 0]
mov ebx, [d + 4]
mov ecx, [d + 8]
mov edx, [d + 12]
adc [c + 0], eax
adc [c + 4], ebx
adc [c + 8], ecx
adc [c + 12], edx
mov eax, [d + 16]
mov ebx, [d + 20]
mov ecx, [d + 24]
mov edx, [d + 28]
adc [c + 16], eax
adc [c + 20], ebx
adc [c + 24], ecx
adc [c + 28], edx
mov eax, [d + 32]
mov ebx, [d + 36]
mov ecx, [d + 40]
mov edx, [d + 44]
adc [c + 32], eax
adc [c + 36], ebx
adc [c + 40], ecx
adc [c + 44], edx
; add the two 384 bit values, carries between qwords aren't done
; make xmm registers copy of A because it gets over written but
; needs to be used later in the ANDs to calculate the qword carries
movdqa xmm0, oword ptr [a + 0] ; a
movdqa xmm1, oword ptr [a + 16]
movdqa xmm2, oword ptr [a + 32]
movdqa xmm4, oword ptr [b + 0] ; b
movdqa xmm5, oword ptr [b + 16]
movdqa xmm6, oword ptr [b + 32]
paddq oword ptr [a + 0], xmm4 ; a = a + b
paddq oword ptr [a + 16], xmm5
paddq oword ptr [a + 32], xmm6
movdqa xmm3, oword ptr g ; 63:1,62..0:0 _ 63:1,62..0:0
andpd xmm0, xmm4 ; a = a and b
andpd xmm1, xmm5
andpd xmm2, xmm6
andpd xmm0, xmm3 ; a = a and 1...1...
andpd xmm1, xmm3
andpd xmm2, xmm3
; shl 384 bits, 1 will move the computed carries from all qwords to the correct spots
movdqa xmm7, xmm1 ; copy 1
psrldq xmm7, 126 ; get bit 127, put in bit 0
pslldq xmm2, 1 ; shl 2 by 1
paddq xmm2, xmm7 ; add bit 0 from 1
movdqa xmm7, xmm0 ; copy 0
psrldq xmm7, 126 ; get bit 127, put in bit 0
pslldq xmm1, 1 ; shl 1 by 1
paddq xmm1, xmm7 ; add bit 0
pslldq xmm0, 1 ; shl 0 by 1
; add the shifted qword carries to previous uncarried add of the two 384 bit values
paddq oword ptr [a + 32], xmm2
paddq oword ptr [a + 16], xmm1
paddq oword ptr [a + 0], xmm0
a should equal c, if the SSE2 code works correctly.
I guess it won't work...
The paddq won't pass the carry from the lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry out of the last bit...
I tried some ways to do it; my conclusion was that a complex method with SSE to manage every case will be slower than adc.
Quote: Is there another way to move bit 127 to bit 0, within a xmm register other than by a 126 bit right shift ?
Forgive me for not knowing more about the instructions, but what's wrong with shift, exactly?
Found out the SSE2 right and left oword shifts (psrldq/pslldq) are by byte, not bit, so the maximum shift is 15 bytes.
Workaround: since it is a single bit that is being shifted, shift by bytes, double, then shift by byte(s)
in the opposite direction.
move high bit 127 to low bit 0
10000000 00000000... before any ops, 1 == bit 127, the high bit in xmm register
00000000 10000000... after 1 byte shift right
00000001 00000000... after doubling
...00000000 00000001 after 15 byte shift right
move bit 63 to bit 64
00000000 00000000^10000000 before any ops ^ == qword boundary
00000000 10000000^00000000 after 1 byte shift left
00000001 00000000^00000000 after doubling
00000000 00000001^00000000 after 1 byte shift right
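The byte-shift workaround can be checked with a small Python model; whole-integer doubling stands in for paddq here, which is only valid because the single isolated bit never crosses a qword boundary during the doubling:

```python
# The bit 127 -> bit 0 workaround: psrldq 1 / paddq reg,reg / psrldq 15.
M128 = (1 << 128) - 1

def bit127_to_bit0(x):
    x >>= 8             # psrldq xmm, 1   (one byte right)
    x = (x * 2) & M128  # paddq xmm, xmm  (doubles each qword; the lone
                        #                  bit is at 119, far from a
                        #                  qword boundary, so plain
                        #                  doubling models it exactly)
    x >>= 120           # psrldq xmm, 15  (fifteen bytes right)
    return x

assert bit127_to_bit0(1 << 127) == 1
assert bit127_to_bit0(0) == 0
print("ok")
```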
The following replaces the previous code chunk for moving the computed carries.
; shl 384 bits, 1 will move the computed carries from all qwords to the correct spots
movdqa xmm7, xmm1 ; copy 1 -- get bit 127, put in bit 0
psrldq xmm7, 1 ; right shift 1 byte
paddq xmm7, xmm7 ; double, effect of shift left 1 bit
psrldq xmm7, 15 ; right shift 15 bytes
pslldq xmm2, 1 ; left shift 2 by 1 byte
paddq xmm2, xmm2 ; double, effect of shift left 1 bit
psrldq xmm2, 1
paddq xmm2, xmm7 ; add bit 0 from 1
movdqa xmm7, xmm0 ; copy 0 -- get bit 127, put in bit 0
psrldq xmm7, 1 ; right shift 1 byte
paddq xmm7, xmm7 ; double, effect of shift left 1 bit
psrldq xmm7, 15 ; right shift 15 bytes
pslldq xmm1, 1 ; left shift 1 by 1 byte
paddq xmm1, xmm1 ; double, effect of shift left 1 bit
psrldq xmm1, 1
paddq xmm1, xmm7 ; add bit 0
pslldq xmm0, 1 ; shift left 0 by 1 byte
paddq xmm0, xmm0 ; double, effect of shift left 1 bit
psrldq xmm0, 1
The technique relies on the property of AND indicating a carry only if both bits are 1.
I just realized that since 3 bits are being added, if either of the original bits at bit 63
is 1 and the carry from the addition of bits 62 is 1, then the ADD will also cause a carry
into bit 64, across the qword boundary.
I had dealt with this in an earlier method involving XOR of the original bits,
then XORing that with bit 63 of the ADD result:
if the first XOR == 1 and ADD bit 63 == 0, then a carry should have been created.
OR this last result with the AND of the original bits. That will always give the correct carry state
crossing the qword boundary. It will also have to be implemented for pairs of bit 127.
1 1 carry from bit add of bit - 1
0 1 original
1 0 original
=^==
0 0 both generate a carry ie 10 was the result of adding the three bits.
AND
0 0 1 1
0 1 0 1
= = = =
0 0 0 1
XOR
0 0 1 1
0 1 0 1
= = = =
0 1 1 0
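The rule derived above can be checked exhaustively against the full-adder truth table; a Python sketch:

```python
# carry out of a bit = (a AND b) OR ((a XOR b) AND NOT sum_bit),
# where sum_bit is that bit of the completed ADD.
def carry_out(a, b, sum_bit):
    return (a & b) | ((a ^ b) & (1 - sum_bit))

# exhaustive check against the reference full adder
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s = a ^ b ^ cin                        # sum bit
            reference = (a & b) | ((a ^ b) & cin)  # true carry out
            assert carry_out(a, b, s) == reference
print("ok")
```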
I think there isn't a more convenient way to do an addition / subtraction of operands of any size than the one described above
xor ecx, ecx
lea esi, src
lea edi, dst
clc
@@:
mov eax, [esi + ecx*4]
adc [edi + ecx*4], eax
inc ecx
cmp ecx, 12
jne @B
Quote from: asmfan on June 13, 2006, 07:15:04 AM
xor ecx, ecx
lea esi, src
lea edi, dst
clc
@@:
mov eax, [esi + ecx*4]
adc [edi + ecx*4], eax
inc ecx
cmp ecx, 12
jne @B
asmfan, have I missed something, or does that have a bug?
The carry, so nicely preserved through almost every instruction, will be destroyed by the
cmp instruction, making the whole thing useless.
Ossa
[edit] I think this should fix it though:
[edit(2)] eek... stupid mistake! [/edit(2)]
[edit(3)] OK, really got it this time... it was a bit fiddly to get the byte order right, but...
NUMBITS EQU 384
...
mov ecx, -(NUMBITS / 32)
mov esi, ((offset src) + (NUMBITS / 8))
mov edi, ((offset dest) + (NUMBITS / 8))
clc
@@:
mov eax, [esi + ecx * 4]
adc [edi + ecx * 4], eax
inc ecx
jnz @B
[/edit(3)]
[/edit]
Ossa, well done man :) I really missed it (only inc/dec don't alter CF)... nice code
Regards,
asmfan
heh, it's your code and idea really - a nice way to do it... I wouldn't have even tried to get away without altering the carry if you hadn't suggested it. (Reminds me of the SAS motto: "Who Dares Wins")
Ossa
I'd recommend keeping your original (ALU) code, which is unrolled nicely.
First idea wins imo :wink
Stan
Stan, but unrolling cannot be infinite... And looped code can be applied to REALLY big numbers ;) up to some GB... since it is universal
It depends if you want to stay with 384 bits. If you want to go somewhat higher than that, you could write a macro that unrolls the loop for you.
For larger numbers, you can unloop partially. (and handle end cases separately, if needed) An unroll of 2 or 4 will speed up the code noticeably.
Unroll by 2:
xor ecx, ecx
lea esi, nc
lea edi, nd
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx
inc ecx
cmp ecx, 6
jne @B
Unroll by 4:
xor ecx, ecx
lea esi, nc
lea edi, nd
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx
mov eax, [esi + ecx*8+8]
mov ebx, [esi + ecx*8+12]
adc [edi + ecx*8+8], eax
adc [edi + ecx*8+12], ebx
inc ecx
inc ecx
cmp ecx, 6
jne @B
AMD Athlon 64 3200+: (384 bit numbers)
16 cycles, original ALU code
45 cycles, asmfan's code
20 cycles, unrolled by 2
17 cycles, unrolled by 4
Stan
[attachment deleted by admin]
Stan, for the same reason as I stated for asmfan's code earlier, your code will not work - see fix above.
Ossa
Quote from: EduardoS on June 13, 2006, 02:07:20 AM
I guess it won't work...
The paddq won't pass the carry from the lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry out of the last bit...
I tried some ways to do it; my conclusion was that a complex method with SSE to manage every case will be slower than adc.
You can actually get this to work; I've done it when adding large integer numbers together. You keep only a 32-bit value in the 64-bit qword portion of the register. Then you do a PADDQ and get the carry in the high dword of the register. You right shift the register 32 bits to get the carry into the low position, and you just add it in every time through the loop. I am doing it through MMX but you can do it through SSE2 as well.
An unoptimized example:
mov esi,[src]
mov edi,[dst]
pxor mm7,mm7 ;zero out for carry register
mov ecx,[num_dwords]
looper:
movd mm0,[esi] ;move 32-bits from memory. This zero-extends the upper 32-bits to zero
movd mm1,[edi] ;move 32-bits from memory. This zero-extends the upper 32-bits to zero
paddq mm0,mm7 ;add carry to source register value, any carry overflows into upper 32-bits
paddq mm0,mm1 ;add src + dst, overflow into upper 32-bits if there is a carry
movq mm7,mm0 ;save results for next carry
psrlq mm7,32 ;shift the qword right so the carry is in the low dword
movd [edi],mm0 ;save the low 32 bits of the result in the destination
add esi,4
add edi,4
sub ecx,1
jnz looper
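A Python model of the technique in the code above (with the qword shift written as psrlq, since that is the shift being modeled): only 32 significant bits live in each 64-bit lane, so the lane addition can never overflow and the carry simply appears in the upper half, ready to be shifted down and re-added next iteration.

```python
# 32 significant bits per 64-bit lane: the lane add cannot overflow,
# so the carry lands in the upper half of the lane.
def add_bignum_32in64(src, dst):
    """src, dst: lists of 32-bit limbs, least significant first."""
    out, carry = [], 0                 # carry models mm7
    for s, d in zip(src, dst):
        t = s + carry                  # paddq mm0, mm7 (cannot overflow)
        t += d                         # paddq mm0, mm1
        out.append(t & 0xFFFFFFFF)     # movd [edi], mm0
        carry = t >> 32                # psrlq mm7, 32
    return out

assert add_bignum_32in64([0xFFFFFFFF, 0], [1, 0]) == [0, 1]
print("ok")
```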
Now some _correct_ code:
unrolled by 2:
mov ecx, -6
lea esi, [nc + 384/8]
lea edi, [nd + 384/8]
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx
inc ecx
jnz @B
unrolled by 4:
mov ecx, -6
lea esi, [nc + 384/8]
lea edi, [nd + 384/8]
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx
mov eax, [esi + ecx*8+8]
mov ebx, [esi + ecx*8+12]
adc [edi + ecx*8+8], eax
adc [edi + ecx*8+12], ebx
inc ecx
inc ecx
jnz @B
and correct code goes with correct results:
16 cycles, original ALU code
32 cycles, ossa's code
14 cycles, unrolled by 2
16 cycles, unrolled by 4
Press any key to exit...
It seems like the unroll by 2 goes faster than the unroll by 4.
[edit] and now that I notice, it seems to be even faster than completely unrolled code [/edit]
Stan
I strongly recommend using Ossa's last posted code as the only correct working ALU looped code here. Of course you can unroll it, but the idea! Adding 2 GB variables ;) nice... It can also be done for subtraction...
I recommend unrolling by 2, because it doubles your speed without bloating your code.
I think it's a matter of the cache line, whether we should unroll by 2 or 4... or maybe 8 :) And the more we unroll, the fewer sizes we can handle: unrolled by 4 we can only add variables whose size is divisible by 16 bytes, while the rolled version can handle anything divisible by 4...
"To unroll or not to unroll, that is the question" (don't remember who said this ;)
If you want to process other sizes, you can handle the end cases separately or make sure there is enough space to perform the last iteration.
At some point, unrolling code will no longer speed up your code. That's why I always set up benchmarks and see what performs best.
Quote from: Mark_Larson on June 13, 2006, 05:05:53 PM
Quote from: EduardoS on June 13, 2006, 02:07:20 AM
I guess it won't work...
The paddq won't pass the carry from the lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry out of the last bit...
I tried some ways to do it; my conclusion was that a complex method with SSE to manage every case will be slower than adc.
You can actually get this to work; I've done it when adding large integer numbers together. You keep only a 32-bit value in the 64-bit qword portion of the register. Then you do a PADDQ and get the carry in the high dword of the register. You right shift the register 32 bits to get the carry into the low position, and you just add it in every time through the loop. I am doing it through MMX but you can do it through SSE2 as well.
An unoptimized example:
mov esi,[src]
mov edi,[dst]
pxor mm7,mm7 ;zero out for carry register
mov ecx,[num_dwords]
looper:
movd mm0,[esi] ;move 32-bits from memory. This zero-extends the upper 32-bits to zero
movd mm1,[edi] ;move 32-bits from memory. This zero-extends the upper 32-bits to zero
paddq mm0,mm7 ;add carry to source register value, any carry overflows into upper 32-bits
paddq mm0,mm1 ;add src + dst, overflow into upper 32-bits if there is a carry
movq mm7,mm0 ;save results for next carry
psrlq mm7,32 ;shift the qword right so the carry is in the low dword
movd [edi],mm0 ;save the low 32 bits of the result in the destination
add esi,4
add edi,4
sub ecx,1
jnz looper
Can't we combine ALU code with MMX/SSE2 code for ultimate speed ? Don't they execute in different pipes ?
Daydreamer, I think we cannot combine them, because we obtain the carry flag sequentially while adding parts of a number... however... who knows
daydreamer, combining MMX and ALU would give a big speed up, but there is a dependency on the carry flag. It might still be possible to do, I'll think about it. You could get all the stuff set up except for the ADD or PADDQ, and that might work.
I also modified the original code as posted by dsouza123 to support an arbitrary size of big integers. His code only supports 384-bit ( 48 bytes). To get the ADC's to work from loop to loop I use LEA to modify the ESI and EDI registers. And I use DEC on the loop counter ( INC/DEC don't modify the Carry flag). I tested the code on a 1 megabyte sized number. On my system it takes 3357599 cycles to do that. It's a 3.2 Ghz Intel processor. That's 1.0492496875 milliseconds.
original proc near
mov esi,[src_ptr]
mov edi,[dst_ptr]
push ebp
mov ebp,ALLOC_SIZE/16
clc
looper:
mov eax, [esi + 0]
mov ebx, [esi + 4]
mov ecx, [esi + 8]
mov edx, [esi + 12]
adc [edi + 0], eax
adc [edi + 4], ebx
adc [edi + 8], ecx
adc [edi + 12], edx
lea esi,[esi+16]
lea edi,[edi+16]
dec ebp
jnz looper
pop ebp
ret
original endp
Mark,
small "correction" (it's a matter of preference):
original proc near
mov esi,[src_ptr]
mov edi,[dst_ptr]
add esi, ALLOC_SIZE - 16 ; These can be absorbed into the preceding
add edi, ALLOC_SIZE - 16 ; lines if the pointers are offsets to statics
push ebp
mov ebp, ALLOC_SIZE / 16
clc
looper:
mov edx, [esi + 12]
mov ecx, [esi + 8]
mov ebx, [esi + 4]
mov eax, [esi + 0]
adc [edi + 12], edx
adc [edi + 8], ecx
adc [edi + 4], ebx
adc [edi + 0], eax
lea esi,[esi - 16]
lea edi,[edi - 16]
dec ebp
jnz looper
pop ebp
ret
original endp
This gets the correct byte order (like for words, dwords, etc). Obviously you could store it the other way around, but this just conforms to the standard.
I forgot that lea doesn't modify flags... makes the solution much cleaner (and faster I assume?).
Ossa
Quick question:
Is there a good program that can assemble fragments of code and test their clock count on the fly?
Or just a program that calculates the predicted clock cycles while taking all cache factors into account?
What's the best way to quickly test out code like this?
Savage, on the fly? Not that I know of. Sounds like a great side project though, for someone with time to tackle it! Two ways to do it statically though: include MichaelW's timers.asm:
include timers.asm
.code
counter_begin 1000000, HIGH_PRIORITY_CLASS
; instructions to time
counter_end
print ustr$(eax)
print " cycles",13,10
or use Petroizki's ptimers.inc and/or proctimers.inc, and do something like:
include proctimers.inc
.code
PTIMER_PROC_LOOP_COUNT MyFunction
; clocks returned as qword in qwPROCTIMER_RESULT
invoke dwtoa,dword ptr [qwPROCTIMER_RESULT],addr szTemp
print offset szTemp
Quote from: savage on June 13, 2006, 10:22:20 PM
Quick question:
Is there a good program that can assemble fragments of code and test their clock count on the fly?
Or just a program that calculates the predicted clock cycles while taking all cache factors into account?
What's the best way to quickly test out code like this?
I actually have my own code I rolled to do timings. However I use MichaelW's because a lot of other people do, so the timings should be comparable. For small snippets of code (like under 100 cycles) there's little difference between timing harnesses; however there's greater fluctuation once you get into larger units of time. The above code took 1 millisecond to run on a 1 megabyte number. In this case 1 megabyte = 256K dwords, because we handle a dword at a time with the ADC.
Quote
I forgot that lea doesn't modify flags... makes the solution much cleaner (and faster I assume?).
Ossa
Yep, much faster. And INC/DEC modify flags just not the carry flag. So that doesn't disturb the carry flag that gets set up by the ADC's.
Ossa, your last posted code won't work correctly for addition, because the carry has to move from the LSBs (least significant bits) towards the MSBs. Your code is really for subtraction: if you change ADC to SBB it will work perfectly, correctly catching the CF.
Of course it's a matter of little endian :)
Quote from: asmfan on June 14, 2006, 08:24:44 AM
Of course it's the matter of little endian:)
Good point asmfan, I got the orders mixed up (must have been falling asleep). That was exactly what I was trying to fix... but I seem to have got a bit mixed up on the byte ordering and "unfixed" it.
Ossa
But now we have both addition and subtraction algorithms for BIG numbers :)
Working code, hopefully; I don't have a CPU with SSE2 to test it.
It is a pure SSE2 add of 512 bit unsigned integers;
increased from 384 to 512 bits for symmetry/efficiency.
The code is a testbed: it works by generating combinations of numbers,
adding them using the ALU with pairs of 4 mov / 4 adc instructions, then
adding them using SSE2;
afterwards it compares the results and increments a mismatch count if they aren't equal.
It displays a MessageBox with the mismatch count when done.
The numbers are qwords with the top two bits (63:62) varying from 11 to 00
and the low bit 0 always 1, so a 128 bit SSE2 register (127:0) could be anything from 11...1,11...1
down to 00...1,00...1.
Ideally it would use 16 bits, two for each of the high bits of all 16 qwords, but
that might take too long, 2^32 combinations (65536 * 65536),
so instead I used 2^20 combinations (1024 * 1024)
for the values of i and j in the nested repeat-until loops.
Before the SSE2 code was added, using only the ALU with 65536 for both loops,
it took 11.5 minutes (1.2 GHz) and was very slow to respond.
Dropped them to 1024 and added multiple invoke Sleep, 0 calls
to make it yield the CPU more often.
Hopefully someone with an appropriate SSE2-capable CPU can test it.
I am sure there are better ways of doing a code harness than sticking in
a number of invoke Sleep, 0 calls to reduce the impact on responsiveness,
such as putting the code in a worker thread, with the main thread handling
system communication, or some other effective method.
An overview of the SSE2 add:
Computes logical operations then stores them.
(ie AND, OR, XOR of the two 512 bit unsigned integers 256 bits at a time.)
ADDs (without qword carries) then stores, the original two 512 bit numbers.
Performs combinations of logical operations with the four previous operations.
Creates carries for both 512 bits.
Shifts the carries 1 bit left.
Adds the carries to the ADDs.
The result should be same as the ALU version.
[attachment deleted by admin]
I get "cnt: 0", so i guess it works fine.
Ossa
I also get CNT: 0. It takes roughly 7 seconds to run. Did you also check how fast the new version I wrote that supports arbitrary sizes is ?
Ossa and Mark,
Thanks for testing it.
Mark
Yes, I saw your optimized code. The alloc/16 threw me for a loop at first,
I was thinking bits and you were dealing with bytes;
now that I reread it, it makes perfect sense.
Also the use of ebp: I thought you were doing some trick with the
stack, but now I see it was just a way to hold the counter in a register,
since the other 6 general purpose registers were all in use.
When is it safe to use ebp ?
Your code is flexible, small and quick, versus
the fully unrolled hardcoded-offset version that is only quick.
I just copied and pasted, then extended, the working code I already had.
My biggest concern was primarily the correctness of the SSE2 code,
with performance a secondary issue, addressed by minimizing moves from/to memory
and holding values in SSE2 registers.
So if your code were dropped in place of the hardcoded ALU version, it would
be the code below.
mov esi,[B0]
mov edi,[A0]
push ebp
mov ebp, 4 ; 64 bytes/16 bytes per oword == 4 owords == 512 bits
clc
looper:
mov eax, [esi + 0]
mov ebx, [esi + 4]
mov ecx, [esi + 8]
mov edx, [esi + 12]
adc [edi + 0], eax
adc [edi + 4], ebx
adc [edi + 8], ecx
adc [edi + 12], edx
lea esi,[esi+16]
lea edi,[edi+16]
dec ebp
jnz looper
pop ebp
Quote from: dsouza123 on June 16, 2006, 04:27:36 PM
When is it safe to use ebp ?
My answer would be whenever you are not using local variables or function parameters.
Quote from: savage on June 21, 2006, 02:50:38 PM
My answer would be whenever you are not using local variables or function parameters.
You mean when you are not referencing the function parameters by their friendly variable names... You can still use function parameters when you aren't using stack frames and EBP, but you will need to reference them directly by [ESP+offset] and keep track yourself of where on the stack they move to if you are PUSHing and POPping other values. It's not hard, just fiddly, but worth it if you need the extra register badly or you don't need to access the params much.
Ian_B
In reviewing the SSE2 instruction set, there is a mystery I haven't resolved.
There seems to be two parallel sets of instructions that have duplicate function.
ANDPD
ANDNPD
XORPD
ORPD
PAND
PANDN
PXOR
POR
Is there a performance advantage to one set over the other ?
Are they truly the same 128 bit operations,
or are there some subtle functional differences between them ?
Quote from: dsouza123 on June 25, 2006, 11:18:34 PM
In reviewing the SSE2 instruction set, there is a mystery I haven't resolved.
There seems to be two parallel sets of instructions that have duplicate function.
ANDPD
ANDNPD
XORPD
ORPD
PAND
PANDN
PXOR
POR
Is there a performance advantage to one set over the other ?
Are they truly the same 128 bit operations,
or are there some subtle functional differences between them ?
They are the same instructions. There are no subtle differences or performance advantages. There are even more duplicated instructions.
Thanks Mark
After success with SSE2 for addition, the next step is SSE2 for subtraction.
Here are the truth tables used for addition and subtraction.
The addition tables should be correct, because the addition code works.
The subtraction tables are hopefully correct, but the subtraction code doesn't
work correctly half the time. It may be errors in the code and not in the tables,
or, fairly unlikely, in the emulated system it was tested with.
Without a CPU capable of SSE2, I have tried a different solution:
the testing was done on BOCHS with an emulated AMD64 CPU, which has SSE2.
The standard BOCHS is a Pentium III, which doesn't have SSE2, but someone compiled
the latest 2.2.6 version with an Athlon64 as the emulated CPU.
The addition code that Mark and Ossa tested on real CPUs
also returned a count of 0 mismatches under BOCHS in 32-bit Windows,
giving me confidence in the BOCHS emulation.
0 0 0 0 1 1 1 1 C Carry
0 0 1 1 0 0 1 1 A0
0 1 0 1 0 1 0 1 B0
-- -- -- -- -- -- -- --
00 01 01 10 01 10 10 11 D aDd
== == == ==
0 0 0 1 0 0 0 1 A And <----
0 1 0 0 0 1 0 0 N andN
0 1 1 1 0 1 1 1 O Or <----
0 1 1 0 0 1 1 0 X Xor <----
0 1 1 0 1 0 0 1 D aDd <----
0 0 0 0 1 1 1 1 xXD
0 0 0 0 0 1 1 1 aOxXD
0 0 0 1 0 1 1 1 oAaOxXD
==========++====++=++=++==============
0 0 0 0 1 1 1 1 B Borrow
0 0 1 1 0 0 1 1 A0
0 1 0 1 0 1 0 1 B0
-- -- -- -- -- -- -- --
00 11 01 00 11 10 00 11 U sUb
== == == ==
0 0 0 1 0 0 0 1 A And
0 1 0 0 0 1 0 0 N andN <----
0 1 1 1 0 1 1 1 O Or
0 1 1 0 0 1 1 0 X Xor <----
0 1 1 0 1 0 0 1 U sUb <----
0 0 0 0 1 1 1 1 xXU
0 0 0 0 1 0 0 1 aUxXU aOxXU
0 1 0 0 1 1 0 1 oNaUxXU aOxXU
====++=======++=++====++==============
Keys for the tables.
lower case for the operation:
x for xor, a for and, o for or
upper case for the result of operations on the original values:
A for And, O for Or, N for andN, X for Xor, U for sUb, D for aDd
The current code and executable, which give a WRONG answer half the time, are attached.
I believe the comparison ALU subtract code is correct and the subtraction tables are probably OK;
the SSE2 code most likely has the error(s).
[attachment deleted by admin]
Error detected !
Carry propagation error in the full SSE2 addition:
in 1 out of 2^64 possible conditions the carry isn't propagated.
I have found the SSE2 add will fail to produce a carry for one condition.
If the result of two qwords added together, without the incoming carry,
is all 1s, ie FFFF FFFF FFFF FFFF, then adding an incoming 1 carry
produces no outgoing 1 carry.
The outgoing carry produced for this special condition is incorrectly 0.
Correct: incoming carry 0 produces an outgoing carry 0.
Incorrect: incoming carry 1 produces an outgoing carry 0.
The current qword will be correct:
when the incoming 1 carry is added, the qword result correctly becomes 0.
The NEXT higher qword won't get an incoming 1 carry though.
An example using bytes instead of qwords.
0,1111 1111,1111 1111 A0
0,0000 0000,1111 1111 B0
---------------------
0,1111 1111,1111 1110 ADD without carries
0,1000 0000,0000 0000 XOR (Only showing bits 15, 7)
0,0000 0000,1000 0000 xXD
0,1000 0000,1000 0000 OR
0,0000 0000,1000 0000 aOxXD
0,0000 0000,1000 0000 AND
0,0000 0000,1000 0000 oAaOxXD final step to calc carries
0,0000 0001,0000 0000 shl 1 put in the correct spots
0,1111 1111,1111 1110 ADD without carries ( calc'd earlier )
0,0000 0000,1111 1110 ADD with carries, result is INCORRECT
1,0000 0000,1111 1110 Correct result with incoming and outgoing carries calc'd
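The failure can be reproduced with a small Python model, using 8-bit lanes in place of qwords: one pass of the computed-carry add handles a carry into a lane, but not a carry that must ripple through a lane whose no-carry sum is all 1s.

```python
# 8-bit lanes standing in for qwords.
M = 0xFF
a = [0xFF, 0xFF, 0x00]   # least significant lane first: 0x00FFFF
b = [0xFF, 0x00, 0x00]   # 0x0000FF

sums = [(x + y) & M for x, y in zip(a, b)]       # ADD without carries
gens = [(x + y) >> 8 for x, y in zip(a, b)]      # computed lane carries
# one shifted add of the computed carries (the SSE2 scheme):
one_pass = [(s + c) & M for s, c in zip(sums, [0] + gens[:-1])]

# lane 1's no-carry sum was 0xFF, so the carry it absorbs should
# ripple onward, but the one-pass scheme stops there:
assert one_pass == [0xFE, 0x00, 0x00]
assert one_pass != [0xFE, 0x00, 0x01]   # 0x00FFFF + 0x0000FF = 0x0100FE
print("ok")
```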
No pure SSE2 fix yet.
No pure SSE2 solution for ADD,
but here is a working version that uses the ALU to prevent the carry propagation error.
It works correctly with all the values I've tried.
Also tried to emulate the SSE2 instructions using ALU instructions (add adc add adc, no carries between qwords);
unfortunately there is/are some bug(s) and it doesn't match either the SSE2 or the conventional ALU results.
I'll continue working on pure SSE2 multi dword addition until it works 100 percent (no carry propagation error).
Any ideas/code to fix the carry propagation error, and/or the emulated SSE2 code, are welcome.
[attachment deleted by admin]
First a suggestion: to test the logic, use MMX instructions instead of SSE, replacing qwords with dwords; the logic should be the same.
Now about the carry propagation, I have an idea. Say I have registers holding n words of 2n bits each and want to sum two of them with carry propagation:
1) Sum both registers and put the carry of each word in the odd bits of a new register C (size 2n bits): the carry of the first word goes to bit 1, the carry of the second word to bit 3, etc;
2) Put 1 in the even bits of C where the corresponding result words are all 1s (ie byte = 255, word = 65535, etc): if the first word is all 1s, put 1 in bit 0, etc;
3) Add 101010... to C (a sequence of "2"s in base 4), NOT C, and shift C left by 1;
4) Add the even bits of C to the big register with the result of the first step (ignore the carry now).
Example:
(in hex, C in binary, n = 4)
1)
(67, C6, 89, A5) + (45, D7, 76, 86) = (AC, 9D, FF, 2B) -- The result without carry propagation
C = (00100010) -- Carries
2)
C = (00100110) -- Carries and "all 1s"
3)
C = (11010000) -- C + 10101010
C = (00101111) -- Not C
C = (01011110) -- C << 1
4)
(01, 01, 01, 00) -- The value to add
(AD, 9E, 00, 2B) -- The final result, the sum of the two registers with carry propagation (two 2n*n bit numbers).
I will try to post the SSE implementation of the above here, as soon as my time permits...
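In the meantime, here is a Python sketch of the four steps (8-bit lanes, my own helper names), checked against the worked example above:

```python
# 8-bit lanes; per lane i, C holds "generate" at bit 2i+1 and
# "result all 1s" (propagate) at bit 2i.  Adding ...101010 to C then
# ripples all the lane carries in one ordinary addition; generate and
# propagate can never both be set, which keeps the trick sound.
LANE = 0xFF

def eduardo_add(a_lanes, b_lanes):       # least significant lane first
    n = len(a_lanes)
    sums = [(a + b) & LANE for a, b in zip(a_lanes, b_lanes)]
    C = 0
    for i, (a, b, s) in enumerate(zip(a_lanes, b_lanes, sums)):
        if a + b > LANE:                 # step 1: carry out of lane i
            C |= 1 << (2 * i + 1)
        if s == LANE:                    # step 2: lane sum is all 1s
            C |= 1 << (2 * i)
    mask = (1 << (2 * n)) - 1
    pattern = sum(0b10 << (2 * i) for i in range(n))  # 1010...10
    C = (C + pattern) & mask             # step 3: ripple the carries
    C = (~C) & mask
    C = (C << 1) & mask
    return [(s + ((C >> (2 * i)) & 1)) & LANE         # step 4
            for i, s in enumerate(sums)]

# the worked example above: 0x67C689A5 + 0x45D77686 = 0xAD9E002B
assert eduardo_add([0xA5, 0x89, 0xC6, 0x67],
                   [0x86, 0x76, 0xD7, 0x45]) == [0x2B, 0x00, 0x9E, 0xAD]
print("ok")
```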
Important: to run the attached program requires a CPU with SSE2 instruction support.
This is a testbed program that does both a pure SSE2 carry propagation fix and an ALU version of the fix.
The two versions should produce identical results, but the SSE2 one only works part way; the ALU one works correctly.
After both have attempted to propagate the carry,
16 MessageBoxes are displayed for dwords 15 downto 0
each shows the carry for SSE2 and ALU; 16 (the number of dwords); and the index of the current dword.
Finally a MessageBox with the Mismatch count; Total of SSE2 carries added; Total of ALU carries added.
It is nearly working: the SSE2 carry propagates for a while before stopping.
The ALU version works correctly.
The algorithm takes a qword; if all of its bits are set to 1 (FFFF FFFF FFFF FFFF, i.e. -1)
then its paired carry qword is ORed into the next higher carry qword, thus propagating the carry.
It doesn't matter whether the current paired carry is 1 or 0: since it is ORed it will max out at 1,
and if the paired carry is 0 it won't propagate the carry, but does no harm.
How the SSE2 version works: there doesn't seem to be a workable qword comparison instruction in SSE2,
only FP double compares, which have issues evaluating some bit patterns as NaN etc., so dwords are used.
The dwords have a problem: they aren't qwords and can differ in evaluation, so after the dwords are compared
they are shuffled and then ANDed; if both are FFFF FFFF they remain FFFF FFFF, else they become 0.
The pcmpeqd instruction for SSE2 compares paired dwords: if they are the same,
the destination's dword for each pair is set to FFFF FFFF, else to 0000 0000.
With the dwords in xmm? indexed 3, 2, 1, 0, the dword shuffle instruction pshufd
with the byte-size order operand 0B1h shuffles them to 2, 3, 0, 1.
The AND of the two xmm registers will produce either all 1s or all 0s for each qword.
The carry qword is ANDed with the value qword and shifted then ORed with the next higher carry qword.
The instructions psrldq and pslldq shift an oword right or left by a multiple of eight bits at a time,
like shr eax, 8 or shl eax, 16 do for dwords.
The testbed program, source and exe are in the attached zip file.
Here is the essence of the code; maybe someone can spot the subtle error.
.data
A0 dd 16 dup (-1) ; x0 is for values
C0 dd 16 dup (-1)
A1 dd 16 dup (0) ; x1 is for carries
C1 dd 16 dup (0)
YY dd -1,-1,-1,-1 ; all 1s 11111111111
.code
mov A0, 0 ; result of lowest qword if it was all 1s and 1 added to it
mov A0+4, 0
mov C0, 0
mov C0+4, 0
mov [A1+8], 1 ; the carry was put in the next higher carry qword
mov [C1+8], 1
movdqa xmm0, oword ptr [A1 + 0]
movdqa xmm1, oword ptr [A1 + 16]
movdqa xmm2, oword ptr [A1 + 32]
movdqa xmm3, oword ptr [A1 + 48]
movdqa xmm7, oword ptr [A0 + 0]
pcmpeqd xmm7, oword ptr [YY + 0] ; compare with all 1111 if == 1111, != 0000 per dword
pshufd xmm6, xmm7, 0b1h ; swap dwords 3210 --> 2301
andpd xmm6, xmm7 ; AND 32, 23 AND 10, 01 if same == 1111, diff 0000
andpd xmm6, xmm0 ; C1 + 0
psrldq xmm6, 8
orpd xmm1, xmm6 ; oword ptr [A1 + 16] 1a
movdqa xmm7, oword ptr [A0 + 16]
pcmpeqd xmm7, oword ptr [YY + 0] ; compare with all 1111 if == 1111, != 0000 per dword
pshufd xmm6, xmm7, 0b1h ; swap dwords 3210 --> 2301
andpd xmm6, xmm7 ; AND 32, 23 AND 10, 01 if same == 1111, diff 0000
andpd xmm6, xmm1 ; C1 + 16
movdqa xmm5, xmm6
pslldq xmm5, 8
orpd xmm1, xmm5 ; oword ptr [A1 + 16] 1b
andpd xmm6, xmm1
psrldq xmm6, 8
orpd xmm2, xmm6 ; oword ptr [A1 + 32] 2a
movdqa xmm7, oword ptr [A0 + 32]
pcmpeqd xmm7, oword ptr [YY + 0] ; compare with all 1111 if == 1111, != 0000 per dword
pshufd xmm6, xmm7, 0b1h ; swap dwords 3210 --> 2301
andpd xmm6, xmm7 ; AND 32, 23 AND 10, 01 if same == 1111, diff 0000
andpd xmm6, xmm2 ; C1 + 32
movdqa xmm5, xmm6
pslldq xmm5, 8
orpd xmm2, xmm5 ; oword ptr [A1 + 32] 2b
andpd xmm6, xmm2
psrldq xmm6, 8
orpd xmm3, xmm6 ; oword ptr [A1 + 48] 3a
movdqa xmm7, oword ptr [A0 + 48]
pcmpeqd xmm7, oword ptr [YY + 0] ; compare with all 1111 if == 1111, != 0000 per dword
pshufd xmm6, xmm7, 0b1h ; swap dwords 3210 --> 2301
andpd xmm6, xmm7 ; AND 32, 23 AND 10, 01 if same == 1111, diff 0000
andpd xmm6, xmm3 ; C1 + 48
pslldq xmm6, 8
orpd xmm3, xmm6 ; oword ptr [A1 + 48] 3b
movdqa oword ptr [A1 + 0], xmm0
movdqa oword ptr [A1 + 16], xmm1
movdqa oword ptr [A1 + 32], xmm2
movdqa oword ptr [A1 + 48], xmm3
.endif
nop
;==============================
nop
mov ecx, 8 ; if qword == -1 whatever is in C1+0 is ORed with C1+8
.if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1) ;
mov eax, [C1 + ecx+0] ; 0 or 1 doesn't matter
or [C1 + ecx+8], eax ; and [C1 + ecx+8],1 not needed if C1 is only 1 bit
.endif
nop
add ecx, 8
.if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
mov eax, [C1 + ecx+0]
or [C1 + ecx+8], eax
.endif
add ecx, 8
.if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
mov eax, [C1 + ecx+0]
or [C1 + ecx+8], eax
.endif
add ecx, 8
.if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
mov eax, [C1 + ecx+0]
or [C1 + ecx+8], eax
.endif
add ecx, 8
.if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
mov eax, [C1 + ecx+0]
or [C1 + ecx+8], eax
.endif
add ecx, 8
.if ([C0 + ecx+0] == -1) && ([C0 + ecx+4] == -1)
mov eax, [C1 + ecx+0]
or [C1 + ecx+8], eax
.endif
.endif
[attachment deleted by admin]
adc is slow. Better to use MMX. I wonder how this fares against the SSE2 routine.
bnAdd2 proc bigintdes:dword,bigintsrc1,bigintsrc2
;bigintdes = bigint1 + bigint2
N equ 12
mov eax, [esp+4]
mov ecx, [esp+8]
mov edx, [esp+12]
movd mm0, [ecx]
movd mm1, [edx]
paddq mm1, mm0
movd [eax], mm1
psrlq mm1, 32
i = 4
REPT N/2 - 1
movd mm0, [ecx+i]
movd mm2, [edx+i]
paddq mm2, mm0
paddq mm2, mm1
movd [eax+i],mm2
psrlq mm2, 32
i = i + 4
movd mm0, [ecx+i]
movd mm1, [edx+i]
paddq mm1, mm0
paddq mm1, mm2
movd [eax+i], mm1
psrlq mm1, 32
i = i + 4
ENDM
movd mm0,[ecx+i]
movd mm2,[edx+i]
paddq mm2, mm0
paddq mm2, mm1
movq [eax+i], mm2
emms ; clear MMX state so FPU code works after the call
retn 12
bnAdd2 endp
Quote from: roticv on July 08, 2006, 01:02:29 PM
adc is slow. Better to use MMX. I wonder how this fares against the SSE2 routine.
I actually did that earlier in the thread.
http://www.masm32.com/board/index.php?topic=4933.15
EDIT:
I modified your constant N to correctly reflect the size of the big integer.
here are some timings:
Quote
16 bytes
adc code : 48
roticv MMX : 69
512 bytes
adc code : 1460
roticv MMX : 3294
Instead of tackling multiple 128 bit adds as a chunk, the attached code does
a true 128 bit add with carry, getting the carry flag at the start and setting it at the finish,
using only SSE2 registers, without conditional code or ALU register usage,
presenting the possibility of using the ALU and SSE2 in parallel or interleaved.
I tried to reduce memory accesses and constants, keeping the values in SSE2 registers
and transforming them to new values using pshufd. More optimizations are possible.
Perhaps there is some way of combining some methods used in the multiple chunks code.
Instead of using the carry flag, just placing a 0 or 1 in the low byte of C1 at the start would also work,
keeping the carry flag free for the ALU version.
The testbed code is equivalent to
stc ; set carry, just for testing
mov eax, B0+ 0 ; ALU version
mov ebx, B0+ 4
mov ecx, B0+ 8
mov edx, B0+12
adc A0+ 0, eax
adc A0+ 4, ebx
adc A0+ 8, ecx
adc A0+12, edx
Along with Mark's MMX code there are now three versions using different register sets.
The newly released CPUs, Core 2 architecture, have faster SSE2 performance
so SSE2 versions of routines should benefit.
[attachment deleted by admin]