News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Arbitrary-size integer addition

Started by AeroASM, August 08, 2008, 06:02:02 PM

Previous topic - Next topic

AeroASM

I realise that this is probably a pretty standard piece of code, and for this reason I'm not expecting a huge discussion of my code (although that would be nice), but I would be really grateful if one or two of the experienced optimizers out there could spare the time to take a look and tell me if its ok.


badd proc in1:dword,in2:dword,output:dword,bytes:dword

;in1, in2, output are pointers to arrays of size 'bytes', which must be a multiple of 16
;in1 and in2 are the numbers to be added together and stored in output

push ebx
push esi
push edi

mov ebx,in1
mov esi,in2
mov edi,output
mov ecx,bytes
shr ecx,4

loopstart:

;this is mov eax,[ebx] / adc eax,[esi] / mov [edi],eax unrolled four times and reordered.

mov eax,dword ptr [ebx]
adc eax,dword ptr [esi]
mov edx,dword ptr [ebx+4]
mov [edi],eax
adc edx,dword ptr [esi+4]
mov [edi+4],edx
mov eax,dword ptr [ebx+8]
adc eax,dword ptr [esi+8]
mov edx,dword ptr [ebx+12]
mov [edi+8],eax
adc edx,dword ptr [esi+12]
mov [edi+12],edx

lea ebx,[ebx+16] ;lea instead of add to preserve carry flag
lea esi,[esi+16]
lea edi,[edi+16]
lea ecx,[ecx-1]
jecxz @F ;this preserves the carry flag, unlike sub ecx,1/jnz loopstart
jmp loopstart
@@:

pop edi
pop esi
pop ebx
rcl eax,1 ;grab the carry flag to return in eax
and eax,1
ret

badd endp


My timing for adding two 1024-byte numbers together is around 550-600 clock cycles, which is around 9 cycles per iteration. The loop has 18 opcodes, so I guess this demonstrates the dual pipeline nicely. (I have an Intel Core 2 Duo T5250 1.5GHz.)

As for improvements, I thought about prefetching, then decided that memory access was probably not a bottleneck here. I also thought about MMX/SSE (specifically using paddq), but I would have to work out the carries separately, which I reckon would cancel out any speedup. But I am new to optimisation so I could well be wrong.

Neo

That's a really interesting idea of how to save the carry.  A few points:
1) The "dword ptr " shouldn't be necessary.  Do you have any outstanding "assume" lines?
2) I'd put a comment noting that because the number of bytes must be a multiple of 16, the carry flag will be zero after the "shr ecx,4".  If the carry flag might be 1 there, it would cause a one-off error.
3) One thing I'd do for (hopefully) improving primary cache line coherency is:
mov eax,[ebx]
mov edx,[ebx+4]
adc eax,[esi]
adc edx,[esi+4]
mov [edi],eax
mov eax,[ebx+8]
mov [edi+4],edx
mov edx,[ebx+12]
adc eax,[esi+8]
adc edx,[esi+12]
mov [edi+8],eax
mov [edi+12],edx

It's  just switching between the arrays half as much, so it shouldn't affect any other caches.  With another 2 free registers, you could get all four loads from [ebx], then all four from [esi], then all stores to [edi], but then you'd probably need to go to 64-bit mode, in which case using the 64-bit registers would make more sense anyway.

Hope that helps  :U

AeroASM

OK thanks a lot for the reply.
1) I realised this as I posted; I'm not really sure why I put them in.
2) In previous versions I had had an xor eax,eax, knowing that it would reset CF, then I took it out and I did realise that shr ecx,4 would reset it, but I had certainly forgotten by the time I posted. Thanks for pointing that out before it becomes a difficult bug!
3) I tried your ordering and it slowed down by about 20/30 cycles (for 1024 bytes). I think that its because the adc's depend on each other,  so in my ordering i tried to separate them as much as possible. It doesn't help that the arrays I'm using in my timing program aren't aligned (I can't figure out how to align stuff in C).

If I free up ebp and esp, I do have another 2 free registers. I tried using these two, but I couldn't find an ordering faster than what I had before, in which case using ebp and esp is unnecessary.

Mark_Larson

Quote from: AeroASM on August 08, 2008, 06:02:02 PM
As for improvements, I thought about prefetching, then decided that memory access was probably not a bottleneck here. I also thought about MMX/SSE (specifically using paddq), but I would have to work out the carries separately, which I reckon would cancel out any speedup. But I am new to optimisation so I could well be wrong.

  I have an optimization website.  It might help you out.  It has 60 tricks broken up into "beginner", "intermediate", and "advanced".

http://www.mark.masmcode.com/
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

I wrote an SSE2 version.  I need to make sure it works correctly.  I am getting 278 - 281 cycles.

Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

RuiLoureiro

Quote from: Mark_Larson on August 09, 2008, 12:58:47 PM
  I have an optimization website.  It might help you out.  It has 60 tricks broken up into "beginner", "intermediate", and "advanced".

http://www.mark.masmcode.com/

Hi Mark Larson,
                     Its just to say thank you for your work !
Rui

Mark_Larson

Quote from: RuiLoureiro on August 09, 2008, 06:35:10 PM
Quote from: Mark_Larson on August 09, 2008, 12:58:47 PM
  I have an optimization website.  It might help you out.  It has 60 tricks broken up into "beginner", "intermediate", and "advanced".

http://www.mark.masmcode.com/

Hi Mark Larson,
                     Its just to say thank you for your work !
Rui

  Thanks!  I am glad you find it useful  :bg
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

  I cut and pasted the code inline with the main procedure.  You should make it a macro.  It will shave a few cycles off the time.  I forgot to save the carry bit in EAX in my original timings.  My new timings with that fix are 277-281 cycles on my core 2 duo.  It is optimized specifically for the core 2 duo, since you mentioned you had one.  If you have any questions feel free to ask.  I used different buffer names than you.  So I also included my data section so you can see what variable names I used.  Good luck  :bg

.data

ARR_SIZE equ 1024
align 16
src1 dd ARR_SIZE dup (0)
src2 dd ARR_SIZE dup (0)
dest dd ARR_SIZE dup (0)
.code
LOOP_COUNT equ 1000000
.686p
option casemap:none
.xmm

mov eax,0ffffffffh
mov [src1],eax
mov [src2],eax
counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
; push edi
;ARR_SIZE needs to be a multiple of 32.
lea eax,src1
lea edx,src2
lea edi,dest
mov ecx,ARR_SIZE SHR 5

  pxor xmm7,xmm7 ;carry flag
my_loop:
movdqa xmm0,[eax] ;in1
movdqa xmm1,[edx] ;in2
paddq xmm0,xmm1
paddq xmm0,xmm7
movdqa [edi],xmm0
psrldq xmm0,32 ;shift right 32 bits to put the carry flag in as the lowest byte
movdqa xmm7,[eax+16] ;in1
movdqa xmm1,[edx+16] ;in2
add eax,32
add edx,32
paddq xmm7,xmm1
paddq xmm7,xmm0
movdqa [edi+16],xmm7
sub ecx,1
lea edi,[edi+32]
psrldq xmm7,32 ;shift right 32 bits to put the carry flag in as the lowest byte
jnz my_loop
movd eax,xmm7 ;return the carry flag in EAX
; pop edi
counter_end
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

AeroASM

Thank you for taking the time to write this code. However, I have been having a few problems with it.

When I was writing my routine I first wrote a C function that did the same, to use as a correctness check. My C function agrees with my asm routine, but disagrees with your routine. I'm not saying that your code is wrong, but there is definitely a discrepancy.

I don't really understand your method for dealing with the carries, in particular the line psrldq xmm0,32. Firstly the Intel manual says that the immediate operand for psrldq is the number of bytes to be shifted, but your comment says that this is supposed to shift right 32 bits. I tried replacing it with psrldq xmm0,4 but the discrepancy got worse.

Secondly, I am at a loss as to where the carry flag is coming from. psrldq shifts in zeroes, and no indication of overflow can come from the original contents of xmm0.

Mark_Larson


  I had orginally done a different shift that didn't do the whole register.  That's the issue.  I'[ll fix it
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

 I changed the shifts from psrldq to psrlq, and now it runs faster.  235-241 cycles


.data

ARR_SIZE equ 1024
align 16
src1 dd ARR_SIZE dup (0)
src2 dd ARR_SIZE dup (0)
dest dd ARR_SIZE dup (0)
.code
LOOP_COUNT equ 1000000
.686p
option casemap:none
.xmm

mov eax,0ffffffffh
mov [src1],eax
mov [src2],eax
counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
; push edi
;ARR_SIZE needs to be a multiple of 32.
lea eax,src1
lea edx,src2
lea edi,dest
mov ecx,ARR_SIZE SHR 5

  pxor xmm7,xmm7 ;carry flag
my_loop:
movdqa xmm0,[eax] ;in1
movdqa xmm1,[edx] ;in2
paddq xmm0,xmm1
paddq xmm0,xmm7
movdqa [edi],xmm0
psrlq xmm0,32 ;shift right 32 bits to put the carry flag in as the lowest byte

movdqa xmm7,[eax+16] ;in1
movdqa xmm1,[edx+16] ;in2
add eax,32
add edx,32
paddq xmm7,xmm1
paddq xmm7,xmm0
movdqa [edi+16],xmm7
sub ecx,1
lea edi,[edi+32]
psrlq xmm7,32 ;shift right 32 bits to put the carry flag in as the lowest byte

jnz my_loop

movd eax,xmm7 ;return the carry flag in EAX

; pop edi
counter_end
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm