Arbitrary-size integer addition

AeroASM · August 08, 2008, 06:02:02 PM

I realise that this is probably a pretty standard piece of code, and for this reason I'm not expecting a huge discussion of my code (although that would be nice), but I would be really grateful if one or two of the experienced optimizers out there could spare the time to take a look and tell me if its ok.

Code Select


badd proc in1:dword,in2:dword,output:dword,bytes:dword

;in1, in2, output are pointers to arrays of size 'bytes', which must be a multiple of 16
;in1 and in2 are the numbers to be added together and stored in output

push ebx
push esi
push edi

mov ebx,in1
mov esi,in2
mov edi,output
mov ecx,bytes
shr ecx,4

loopstart:

;this is mov eax,[ebx] / adc eax,[esi] / mov [edi],eax unrolled four times and reordered.

mov eax,dword ptr [ebx]
adc eax,dword ptr [esi]
mov edx,dword ptr [ebx+4]
mov [edi],eax
adc edx,dword ptr [esi+4]
mov [edi+4],edx
mov eax,dword ptr [ebx+8]
adc eax,dword ptr [esi+8]
mov edx,dword ptr [ebx+12]
mov [edi+8],eax
adc edx,dword ptr [esi+12]
mov [edi+12],edx

lea ebx,[ebx+16] ;lea instead of add to preserve carry flag
lea esi,[esi+16]
lea edi,[edi+16]
lea ecx,[ecx-1]
jecxz @F ;this preserves the carry flag, unlike sub ecx,1/jnz loopstart
jmp loopstart
@@:

pop edi
pop esi
pop ebx
rcl eax,1 ;grab the carry flag to return in eax
and eax,1
ret

badd endp

My timing for adding two 1024-byte numbers together is around 550-600 clock cycles, which is around 9 cycles per iteration. The loop has 18 opcodes, so I guess this demonstrates the dual pipeline nicely. (I have an Intel Core 2 Duo T5250 1.5GHz.)

As for improvements, I thought about prefetching, then decided that memory access was probably not a bottleneck here. I also thought about MMX/SSE (specifically using paddq), but I would have to work out the carries separately, which I reckon would cancel out any speedup. But I am new to optimisation so I could well be wrong.

Neo · August 09, 2008, 06:36:30 AM

That's a really interesting idea of how to save the carry. A few points:
1) The "dword ptr " shouldn't be necessary. Do you have any outstanding "assume" lines?
2) I'd put a comment noting that because the number of bytes must be a multiple of 16, the carry flag will be zero after the "shr ecx,4". If the carry flag might be 1 there, it would cause a one-off error.
3) One thing I'd do for (hopefully) improving primary cache line coherency is:

Code Select

mov eax,[ebx]
mov edx,[ebx+4]
adc eax,[esi]
adc edx,[esi+4]
mov [edi],eax
mov eax,[ebx+8]
mov [edi+4],edx
mov edx,[ebx+12]
adc eax,[esi+8]
adc edx,[esi+12]
mov [edi+8],eax
mov [edi+12],edx

It's just switching between the arrays half as much, so it shouldn't affect any other caches. With another 2 free registers, you could get all four loads from [ebx], then all four from [esi], then all stores to [edi], but then you'd probably need to go to 64-bit mode, in which case using the 64-bit registers would make more sense anyway.

Hope that helps :U

AeroASM · August 09, 2008, 09:39:42 AM

OK thanks a lot for the reply.
1) I realised this as I posted; I'm not really sure why I put them in.
2) In previous versions I had had an xor eax,eax, knowing that it would reset CF, then I took it out and I did realise that shr ecx,4 would reset it, but I had certainly forgotten by the time I posted. Thanks for pointing that out before it becomes a difficult bug!
3) I tried your ordering and it slowed down by about 20/30 cycles (for 1024 bytes). I think that its because the adc's depend on each other, so in my ordering i tried to separate them as much as possible. It doesn't help that the arrays I'm using in my timing program aren't aligned (I can't figure out how to align stuff in C).

If I free up ebp and esp, I do have another 2 free registers. I tried using these two, but I couldn't find an ordering faster than what I had before, in which case using ebp and esp is unnecessary.

Mark_Larson · August 09, 2008, 12:58:47 PM

Quote from: AeroASM on August 08, 2008, 06:02:02 PM
As for improvements, I thought about prefetching, then decided that memory access was probably not a bottleneck here. I also thought about MMX/SSE (specifically using paddq), but I would have to work out the carries separately, which I reckon would cancel out any speedup. But I am new to optimisation so I could well be wrong.

I have an optimization website. It might help you out. It has 60 tricks broken up into "beginner", "intermediate", and "advanced".

http://www.mark.masmcode.com/

Mark_Larson · August 09, 2008, 03:14:56 PM

I wrote an SSE2 version. I need to make sure it works correctly. I am getting 278 - 281 cycles.

Mark

RuiLoureiro · August 09, 2008, 06:35:10 PM

Quote from: Mark_Larson on August 09, 2008, 12:58:47 PM
I have an optimization website. It might help you out. It has 60 tricks broken up into "beginner", "intermediate", and "advanced".

http://www.mark.masmcode.com/

Hi Mark Larson,
Its just to say thank you for your work !
Rui

Mark_Larson · August 09, 2008, 08:16:42 PM

Quote from: RuiLoureiro on August 09, 2008, 06:35:10 PM
Quote from: Mark_Larson on August 09, 2008, 12:58:47 PM
I have an optimization website. It might help you out. It has 60 tricks broken up into "beginner", "intermediate", and "advanced".

http://www.mark.masmcode.com/

Hi Mark Larson,
Its just to say thank you for your work !
Rui

Thanks! I am glad you find it useful :bg

Mark_Larson · August 10, 2008, 01:53:37 PM

I cut and pasted the code inline with the main procedure. You should make it a macro. It will shave a few cycles off the time. I forgot to save the carry bit in EAX in my original timings. My new timings with that fix are 277-281 cycles on my core 2 duo. It is optimized specifically for the core 2 duo, since you mentioned you had one. If you have any questions feel free to ask. I used different buffer names than you. So I also included my data section so you can see what variable names I used. Good luck :bg

Code Select

.data

ARR_SIZE	equ		1024
align 16
	src1	dd		ARR_SIZE dup (0)
	src2	dd		ARR_SIZE dup (0)
	dest	dd		ARR_SIZE dup (0)
.code
LOOP_COUNT		equ			1000000
.686p
option casemap:none
.xmm

mov		eax,0ffffffffh
mov		[src1],eax
mov		[src2],eax
counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
;	push edi
;ARR_SIZE needs to be a multiple of 32.
	lea eax,src1
	lea edx,src2
	lea edi,dest
	mov ecx,ARR_SIZE SHR 5
	
  	pxor	xmm7,xmm7		;carry flag
my_loop:
	movdqa	xmm0,[eax]		;in1
	movdqa	xmm1,[edx]		;in2
	paddq	xmm0,xmm1
	paddq	xmm0,xmm7
	movdqa	[edi],xmm0
	psrldq	xmm0,32		;shift right 32 bits to put the carry flag in as the lowest byte
	movdqa	xmm7,[eax+16]		;in1
	movdqa	xmm1,[edx+16]		;in2
add	eax,32
add	edx,32
	paddq	xmm7,xmm1
	paddq	xmm7,xmm0
	movdqa	[edi+16],xmm7
sub		ecx,1
lea	edi,[edi+32]
	psrldq	xmm7,32		;shift right 32 bits to put the carry flag in as the lowest byte
	jnz		my_loop
	movd	eax,xmm7		;return the carry flag in EAX
;	pop	edi
counter_end

AeroASM · August 10, 2008, 03:00:12 PM

Thank you for taking the time to write this code. However, I have been having a few problems with it.

When I was writing my routine I first wrote a C function that did the same, to use as a correctness check. My C function agrees with my asm routine, but disagrees with your routine. I'm not saying that your code is wrong, but there is definitely a discrepancy.

I don't really understand your method for dealing with the carries, in particular the line psrldq xmm0,32. Firstly the Intel manual says that the immediate operand for psrldq is the number of bytes to be shifted, but your comment says that this is supposed to shift right 32 bits. I tried replacing it with psrldq xmm0,4 but the discrepancy got worse.

Secondly, I am at a loss as to where the carry flag is coming from. psrldq shifts in zeroes, and no indication of overflow can come from the original contents of xmm0.

Mark_Larson · August 10, 2008, 07:42:29 PM

I had orginally done a different shift that didn't do the whole register. That's the issue. I'[ll fix it

Mark_Larson · August 10, 2008, 08:10:32 PM

I changed the shifts from psrldq to psrlq, and now it runs faster. 235-241 cycles

Code Select


.data

ARR_SIZE	equ		1024
align 16
	src1	dd		ARR_SIZE dup (0)
	src2	dd		ARR_SIZE dup (0)
	dest	dd		ARR_SIZE dup (0)
.code
LOOP_COUNT		equ			1000000
.686p
option casemap:none
.xmm

mov		eax,0ffffffffh
mov		[src1],eax
mov		[src2],eax
counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
;	push edi
;ARR_SIZE needs to be a multiple of 32.
	lea eax,src1
	lea edx,src2
	lea edi,dest
	mov ecx,ARR_SIZE SHR 5

  	pxor	xmm7,xmm7		;carry flag
my_loop:
	movdqa	xmm0,[eax]		;in1
	movdqa	xmm1,[edx]		;in2
	paddq	xmm0,xmm1
	paddq	xmm0,xmm7
	movdqa	[edi],xmm0
	psrlq	xmm0,32		;shift right 32 bits to put the carry flag in as the lowest byte
	
	movdqa	xmm7,[eax+16]		;in1
	movdqa	xmm1,[edx+16]		;in2
add	eax,32
add	edx,32
	paddq	xmm7,xmm1
	paddq	xmm7,xmm0
	movdqa	[edi+16],xmm7
sub		ecx,1
lea	edi,[edi+32]
	psrlq	xmm7,32		;shift right 32 bits to put the carry flag in as the lowest byte

	jnz		my_loop
	
	movd	eax,xmm7		;return the carry flag in EAX

;	pop	edi
counter_end

News:

Arbitrary-size integer addition

AeroASM

AeroASM

AeroASM