SSE2 replacement of ALU registers for 384 bit unsigned integers

Started by dsouza123, June 04, 2006, 05:09:23 AM

Previous topic - Next topic

Mark_Larson

Quote from: EduardoS on June 13, 2006, 02:07:20 AM
I guess it won't work...
The paddq won't pass carry from lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry for the last bit...

I tryed some ways to do it, my conclusion was that a complex method with SSE to manage every case will be slower than adc.

You can actually get this to work.  I've done it on adding large integer numbers together.  You only keep a 32-bit value in the 64-bit qword portion of the register.  Then you do a PADDQ and get the carry in the high dword of the register.  You left shift the register 32-bits to get the carry in the low position.  You just add it every time in the loop. I am doing it through MMX but you can do it through SSE2 as well.

An unoptimized example:


   mov esi,[src]
   mov edi,[dst]
   pxor mm7,mm7   ;zero out for carry register
   mov ecx,[num_dwords]
looper:
  movd  mm0,[esi]       ;move 32-bits from memory.  This zero-extends the upper 32-bits to zero
  movd  mm1,[edi]       ;move 32-bits from memory.  This zero-extends the upper 32-bits to zero
  paddq mm0,mm7     ;add carry to source register value, any carry overflows into upper 32-bits
  paddq mm0,mm1    ;add src + dst, overflow into upper 32-bits if there is a carry
  movq  mm7,mm0     ;save results for next carry
  psrlw  mm7,32       ;shift it so the carry is in the lowest bit.
  movq [dst],mm0    ;save result in destination
  add   esi,4
  add   edi,4
  sub   ecx,1
  jnz    looper
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

stanhebben

Now some _correct_ code:

unrolled by 2:

mov ecx, -6
lea esi, [nc + 384/8]
lea edi, [nd + 384/8]
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx
inc ecx
jnz @B


unrolled by 4:

mov ecx, -6
lea esi, [nc + 384/8]
lea edi, [nd + 384/8]
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx

mov eax, [esi + ecx*8+8]
mov ebx, [esi + ecx*8+12]
adc [edi + ecx*8+8], eax
adc [edi + ecx*8+12], ebx

inc ecx
inc ecx
jnz @B


and correct code goes with correct results:

16 cycles, original ALU code
32 cycles, ossa's code
14 cycles, unrolled by 2
16 cycles, unrolled by 4
Press any key to exit...


It seems like the unroll by 2 goes faster than the unroll by 4.

[edit] and now that I notice, it seems to be even faster than completely unrolled code [/edit]

Stan

asmfan

I strongly recommend to use Ossa's last posted code as the only correct working ALU looped code here. Of course you can unroll it, but the idea! Adding 2 GB variables;) nice... Also it can be done for the subtraction also...
Russia is a weird place

stanhebben

I recommend unrolling by 2, because it doubles your speed without bloating your code.

asmfan

I think it's the matter of cache line. Whether we should unroll by 2 or 4... or maybe 8:) And the fact that the more we unroll the less possible numbers we can use to add. If we unroll by 4 we won't be able to add undivisible by 16 byte in size variable, when unrolled version can handle divisible by 4...
"To unroll and not to unroll that is the question" (don't remember who said this;)
Russia is a weird place

stanhebben

If you want to process undivisible numbers, you can handle end cases or make sure there is enough space to perform the last iteration.

At some point, unrolling code will no longer speed up your code. That's why I always set up benchmarks and see what performs best.

daydreamer

Quote from: Mark_Larson on June 13, 2006, 05:05:53 PM
Quote from: EduardoS on June 13, 2006, 02:07:20 AM
I guess it won't work...
The paddq won't pass carry from lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry for the last bit...

I tryed some ways to do it, my conclusion was that a complex method with SSE to manage every case will be slower than adc.

You can actually get this to work.  I've done it on adding large integer numbers together.  You only keep a 32-bit value in the 64-bit qword portion of the register.  Then you do a PADDQ and get the carry in the high dword of the register.  You left shift the register 32-bits to get the carry in the low position.  You just add it every time in the loop. I am doing it through MMX but you can do it through SSE2 as well.

An unoptimized example:


   mov esi,[src]
   mov edi,[dst]
   pxor mm7,mm7   ;zero out for carry register
   mov ecx,[num_dwords]
looper:
  movd  mm0,[esi]       ;move 32-bits from memory.  This zero-extends the upper 32-bits to zero
  movd  mm1,[edi]       ;move 32-bits from memory.  This zero-extends the upper 32-bits to zero
  paddq mm0,mm7     ;add carry to source register value, any carry overflows into upper 32-bits
  paddq mm0,mm1    ;add src + dst, overflow into upper 32-bits if there is a carry
  movq  mm7,mm0     ;save results for next carry
  psrlw  mm7,32       ;shift it so the carry is in the lowest bit.
  movq [dst],mm0    ;save result in destination
  add   esi,4
  add   edi,4
  sub   ecx,1
  jnz    looper

cant we combine ALU code mixed with MMX/SSE2 code for ultimate speed?, dont they execute in different pipes?

asmfan

Daydreamer, i think we cannot combine them because we obtain the carry flag sequentially adding parts of a number... however... who knows
Russia is a weird place

Mark_Larson

  daydreamer combining MMX and ALU would give a big speed up.  But there is a dependency on the carry flag.  It might still be possible to do, I'll think about it.  You could get all the stuff set up except for the ADD or PADDQ, and that might work.

I also modified the original code as posted by dsouza123 to support an arbitrary size of big integers.  His code only supports 384-bit ( 48 bytes).   To get the ADC's to work from loop to loop I use LEA to modify the ESI and EDI registers.  And I use DEC on the loop counter ( INC/DEC don't modify the Carry flag).  I tested the code on a 1 megabyte sized number.  On my system it takes 3357599 cycles to do that.  It's a 3.2 Ghz Intel processor.  That's 1.0492496875 milliseconds.





original proc near
mov esi,[src_ptr]
mov edi,[dst_ptr]
   
push            ebp
mov ebp,ALLOC_SIZE/16
clc
looper:
mov eax, [esi +  0]
mov ebx, [esi +  4]
mov ecx, [esi +  8]
mov edx, [esi + 12]

adc [edi +  0], eax
adc [edi +  4], ebx
adc [edi +  8], ecx
adc [edi + 12], edx

lea esi,[esi+16]
lea edi,[edi+16]
   
dec ebp
jnz looper
pop ebp

ret
original endp
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Ossa

Mark,

small "correction" (it's a matter of preference):

original proc near
mov esi,[src_ptr]
mov edi,[dst_ptr]

add esi, ALLOC_SIZE - 16 ; These can be absorbed into the preceding
add edi, ALLOC_SIZE - 16 ; lines if the pointer are offsets to statics
   
push ebp
mov ebp, ALLOC_SIZE / 16
clc
looper:
mov edx, [esi + 12]
mov ecx, [esi +  8]
mov ebx, [esi +  4]
mov eax, [esi +  0]

adc [edi + 12], edx
adc [edi +  8], ecx
adc [edi +  4], ebx
adc [edi +  0], eax

lea esi,[esi - 16]
lea edi,[edi - 16]
   
dec ebp
jnz looper
pop ebp

ret
original endp


This gets the correct byte order (like for words, dwords, etc). Obviously you could store it the other way around, but this just conforms to the standard.

I forgot that lea doesn't modify flags... makes the solution much cleaner (and faster I assume?).

Ossa
Website (very old): ossa.the-wot.co.uk

savage

Quick question:

Is there a good program that can assemble fragments of code and test their clock count on the fly?
Or just a program that calculates the predicted clock cycles while taking all cache factors into account?

What's the best way to quickly test out code like this?

Mark Jones

Savage, on the fly? Not that I know of. Sounds like a great side project though, for someone with time to tackle it! Two ways to do it statically though: include MichaelW's timers.asm:


include timers.asm
.code
counter_begin 1000000, HIGH_PRIORITY_CLASS
    ; instructions to time
counter_end
    print ustr$(eax)
    print " cycles",13,10


or use Petroizki's ptimers.inc and/or proctimers.inc, and do something like:


include proctimers.inc
.code
    PTIMER_PROC_LOOP_COUNT MyFunction
; clocks returned as qword in qwPROCTIMER_RESULT
    invoke dwtoa,dword ptr [qwPROCTIMER_RESULT],addr szTemp
    print offset szTemp
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Mark_Larson

Quote from: savage on June 13, 2006, 10:22:20 PM
Quick question:

Is there a good program that can assemble fragments of code and test their clock count on the fly?
Or just a program that calculates the predicted clock cycles while taking all cache factors into account?

What's the best way to quickly test out code like this?

I actually have my own code I rolled to do timings.  However I use MichaelW's because a lot of other people do, and so therefore the timings should be similar.  For small snippets of code ( like under 100 cycles) there's little difference in timing code.  However there's greater fluctuation once you get into larger units of time.  The above code took 1 millisecond to run a 1Megabyte number.  In this case I mean 1Megabyte = 256KB dwords, because we handle a dword at a time with the ADC.

Quote
I forgot that lea doesn't modify flags... makes the solution much cleaner (and faster I assume?).

Ossa

Yep, much faster.  And INC/DEC modify flags just not the carry flag.  So that doesn't disturb the carry flag that gets set up by the ADC's. 
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

asmfan

Ossa, your last posted code won't work correct on addition because we have a carry while adding from LSB (least significant bits) to MSB. Your code is for subtraction so if you change ADC to SBB it will work perfectly, correctly catching the CF.
Russia is a weird place

asmfan

Russia is a weird place