SSE2 replacement of ALU registers for 384 bit unsigned integers

Mark_Larson · June 13, 2006, 05:05:53 PM

Quote from: EduardoS on June 13, 2006, 02:07:20 AM
I guess it won't work...
The paddq won't pass carry from lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry for the last bit...

I tryed some ways to do it, my conclusion was that a complex method with SSE to manage every case will be slower than adc.

You can actually get this to work. I've done it on adding large integer numbers together. You only keep a 32-bit value in the 64-bit qword portion of the register. Then you do a PADDQ and get the carry in the high dword of the register. You left shift the register 32-bits to get the carry in the low position. You just add it every time in the loop. I am doing it through MMX but you can do it through SSE2 as well.

An unoptimized example:

Code Select


   mov esi,[src]
   mov edi,[dst]
   pxor mm7,mm7   ;zero out for carry register
   mov ecx,[num_dwords]
looper:
  movd  mm0,[esi]       ;move 32-bits from memory.  This zero-extends the upper 32-bits to zero
  movd  mm1,[edi]       ;move 32-bits from memory.  This zero-extends the upper 32-bits to zero
  paddq mm0,mm7     ;add carry to source register value, any carry overflows into upper 32-bits
  paddq mm0,mm1    ;add src + dst, overflow into upper 32-bits if there is a carry
  movq  mm7,mm0     ;save results for next carry
  psrlw  mm7,32       ;shift it so the carry is in the lowest bit.
  movq [dst],mm0    ;save result in destination
  add   esi,4
  add   edi,4
  sub   ecx,1
  jnz    looper

stanhebben · June 13, 2006, 05:20:23 PM

Now some _correct_ code:

unrolled by 2:

Code Select


	mov ecx, -6
	lea esi, [nc + 384/8]
	lea edi, [nd + 384/8]
	clc
@@:
	mov eax, [esi + ecx*8]
	mov	ebx, [esi + ecx*8+4]
	adc [edi + ecx*8], eax
	adc [edi + ecx*8+4], ebx
	inc ecx
	jnz @B

unrolled by 4:

Code Select


	mov ecx, -6
	lea esi, [nc + 384/8]
	lea edi, [nd + 384/8]
	clc
@@:
	mov eax, [esi + ecx*8]
	mov	ebx, [esi + ecx*8+4]
	adc [edi + ecx*8], eax
	adc [edi + ecx*8+4], ebx
	
	mov eax, [esi + ecx*8+8]
	mov ebx, [esi + ecx*8+12]
	adc [edi + ecx*8+8], eax
	adc [edi + ecx*8+12], ebx
	
	inc ecx
	inc ecx
	jnz @B

and correct code goes with correct results:

Code Select


16 cycles, original ALU code
32 cycles, ossa's code
14 cycles, unrolled by 2
16 cycles, unrolled by 4
Press any key to exit...

It seems like the unroll by 2 goes faster than the unroll by 4.

[edit] and now that I notice, it seems to be even faster than completely unrolled code [/edit]

Stan

asmfan · June 13, 2006, 05:23:50 PM

I strongly recommend to use Ossa's last posted code as the only correct working ALU looped code here. Of course you can unroll it, but the idea! Adding 2 GB variables;) nice... Also it can be done for the subtraction also...

stanhebben · June 13, 2006, 05:25:59 PM

I recommend unrolling by 2, because it doubles your speed without bloating your code.

asmfan · June 13, 2006, 05:44:15 PM

I think it's the matter of cache line. Whether we should unroll by 2 or 4... or maybe 8:) And the fact that the more we unroll the less possible numbers we can use to add. If we unroll by 4 we won't be able to add undivisible by 16 byte in size variable, when unrolled version can handle divisible by 4...
"To unroll and not to unroll that is the question" (don't remember who said this;)

stanhebben · June 13, 2006, 05:58:24 PM

If you want to process undivisible numbers, you can handle end cases or make sure there is enough space to perform the last iteration.

At some point, unrolling code will no longer speed up your code. That's why I always set up benchmarks and see what performs best.

daydreamer · June 13, 2006, 06:05:57 PM

Quote from: Mark_Larson on June 13, 2006, 05:05:53 PM
Quote from: EduardoS on June 13, 2006, 02:07:20 AM
I guess it won't work...
The paddq won't pass carry from lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry for the last bit...

I tryed some ways to do it, my conclusion was that a complex method with SSE to manage every case will be slower than adc.

You can actually get this to work. I've done it on adding large integer numbers together. You only keep a 32-bit value in the 64-bit qword portion of the register. Then you do a PADDQ and get the carry in the high dword of the register. You left shift the register 32-bits to get the carry in the low position. You just add it every time in the loop. I am doing it through MMX but you can do it through SSE2 as well.

An unoptimized example:

Code Select Expand
mov esi,[src] mov edi,[dst] pxor mm7,mm7 ;zero out for carry register mov ecx,[num_dwords] looper: movd mm0,[esi] ;move 32-bits from memory. This zero-extends the upper 32-bits to zero movd mm1,[edi] ;move 32-bits from memory. This zero-extends the upper 32-bits to zero paddq mm0,mm7 ;add carry to source register value, any carry overflows into upper 32-bits paddq mm0,mm1 ;add src + dst, overflow into upper 32-bits if there is a carry movq mm7,mm0 ;save results for next carry psrlw mm7,32 ;shift it so the carry is in the lowest bit. movq [dst],mm0 ;save result in destination add esi,4 add edi,4 sub ecx,1 jnz looper

cant we combine ALU code mixed with MMX/SSE2 code for ultimate speed?, dont they execute in different pipes?

asmfan · June 13, 2006, 06:53:43 PM

Daydreamer, i think we cannot combine them because we obtain the carry flag sequentially adding parts of a number... however... who knows

Mark_Larson · June 13, 2006, 09:07:56 PM

daydreamer combining MMX and ALU would give a big speed up. But there is a dependency on the carry flag. It might still be possible to do, I'll think about it. You could get all the stuff set up except for the ADD or PADDQ, and that might work.

I also modified the original code as posted by dsouza123 to support an arbitrary size of big integers. His code only supports 384-bit ( 48 bytes). To get the ADC's to work from loop to loop I use LEA to modify the ESI and EDI registers. And I use DEC on the loop counter ( INC/DEC don't modify the Carry flag). I tested the code on a 1 megabyte sized number. On my system it takes 3357599 cycles to do that. It's a 3.2 Ghz Intel processor. That's 1.0492496875 milliseconds.

Code Select


original	proc	near
	mov		esi,[src_ptr]
	mov		edi,[dst_ptr]
   
	push            ebp
	mov		ebp,ALLOC_SIZE/16
	clc
looper:
	mov eax, [esi +  0]
	mov ebx, [esi +  4]
	mov ecx, [esi +  8]
	mov edx, [esi + 12]

	adc [edi +  0], eax
	adc [edi +  4], ebx
	adc [edi +  8], ecx
	adc [edi + 12], edx

	lea	esi,[esi+16]
	lea	edi,[edi+16]
   
	dec		ebp
	jnz		looper
	pop	ebp	

	ret
original	endp

Ossa · June 13, 2006, 09:57:01 PM

Mark,

small "correction" (it's a matter of preference):

Code Select

original proc near
	mov esi,[src_ptr]
	mov edi,[dst_ptr]
	
	add esi, ALLOC_SIZE - 16	; These can be absorbed into the preceding
	add edi, ALLOC_SIZE - 16	; lines if the pointer are offsets to statics
	   
	push ebp
	mov ebp, ALLOC_SIZE / 16
	clc
looper:
	mov edx, [esi + 12]
	mov ecx, [esi +  8]
	mov ebx, [esi +  4]
	mov eax, [esi +  0]
	
	adc [edi + 12], edx
	adc [edi +  8], ecx
	adc [edi +  4], ebx
	adc [edi +  0], eax
	
	lea esi,[esi - 16]
	lea edi,[edi - 16]
	   
	dec ebp
	jnz looper
	pop ebp
	
	ret
original endp

This gets the correct byte order (like for words, dwords, etc). Obviously you could store it the other way around, but this just conforms to the standard.

I forgot that lea doesn't modify flags... makes the solution much cleaner (and faster I assume?).

Ossa

savage · June 13, 2006, 10:22:20 PM

Quick question:

Is there a good program that can assemble fragments of code and test their clock count on the fly?
Or just a program that calculates the predicted clock cycles while taking all cache factors into account?

What's the best way to quickly test out code like this?

Mark Jones · June 14, 2006, 12:24:07 AM

Savage, on the fly? Not that I know of. Sounds like a great side project though, for someone with time to tackle it! Two ways to do it statically though: include MichaelW's timers.asm:

Code Select


include timers.asm
.code
counter_begin 1000000, HIGH_PRIORITY_CLASS
    ; instructions to time
counter_end
    print ustr$(eax)
    print " cycles",13,10

or use Petroizki's ptimers.inc and/or proctimers.inc, and do something like:

Code Select


include proctimers.inc
.code
    PTIMER_PROC_LOOP_COUNT MyFunction
; clocks returned as qword in qwPROCTIMER_RESULT
    invoke dwtoa,dword ptr [qwPROCTIMER_RESULT],addr szTemp
    print offset szTemp

Mark_Larson · June 14, 2006, 01:00:06 AM

Quote from: savage on June 13, 2006, 10:22:20 PM
Quick question:

Is there a good program that can assemble fragments of code and test their clock count on the fly?
Or just a program that calculates the predicted clock cycles while taking all cache factors into account?

What's the best way to quickly test out code like this?

I actually have my own code I rolled to do timings. However I use MichaelW's because a lot of other people do, and so therefore the timings should be similar. For small snippets of code ( like under 100 cycles) there's little difference in timing code. However there's greater fluctuation once you get into larger units of time. The above code took 1 millisecond to run a 1Megabyte number. In this case I mean 1Megabyte = 256KB dwords, because we handle a dword at a time with the ADC.

Quote
I forgot that lea doesn't modify flags... makes the solution much cleaner (and faster I assume?).

Ossa

Yep, much faster. And INC/DEC modify flags just not the carry flag. So that doesn't disturb the carry flag that gets set up by the ADC's.

asmfan · June 14, 2006, 08:19:27 AM

Ossa, your last posted code won't work correct on addition because we have a carry while adding from LSB (least significant bits) to MSB. Your code is for subtraction so if you change ADC to SBB it will work perfectly, correctly catching the CF.

asmfan · June 14, 2006, 08:24:44 AM

Of course it's the matter of little endian:)

News:

SSE2 replacement of ALU registers for 384 bit unsigned integers

stanhebben

stanhebben

stanhebben

savage