First, hello to everyone. I just joined this forum.
I'd like some optimization advice from you experts. I have a routine in which the bottleneck is a memcpy(). The amount of data to be copied is smallish, usually a maximum of a couple hundred bytes. I have observed that 13% of the copies are only 4 bytes long, and it occurred to me that it would be faster to just do a simple 32-bit copy in those cases, bypassing memcpy() entirely. The problem with this scheme is that the extra jump negates the savings of a simple copy vs. a 4-byte call to memcpy().
This is the original code, as compiled by Visual Studio 2003:
memcpy(alphaData + offset, aData, aLength);
01D39F7D mov eax,dword ptr [this]
01D39F80 mov edx,ecx
01D39F82 shr ecx,2
01D39F85 add edi,eax
01D39F87 rep movs dword ptr [edi],dword ptr [esi]
01D39F89 mov ecx,edx
01D39F8B and ecx,3
01D39F8E rep movs byte ptr [edi],byte ptr [esi]
And with my attempted improvement:
if (aLength == 4)
01C69E5B cmp esi,4
01C69E5E jne gfxImageFrame::SetAlphaData+0AAh (1C69E6Ah)
*((unsigned int *)(alphaData + offset)) = *((unsigned int *)aData);
01C69E60 mov ecx,dword ptr [ecx]
01C69E62 mov edx,dword ptr [this]
01C69E65 mov dword ptr [edi+edx],ecx
else
01C69E68 jmp gfxImageFrame::SetAlphaData+0C5h (1C69E85h)
memcpy(alphaData + offset, aData, aLength);
01C69E6A mov eax,dword ptr [this]
01C69E6D mov ecx,esi
01C69E6F mov esi,dword ptr [aData]
01C69E72 mov edx,ecx
01C69E74 shr ecx,2
01C69E77 add edi,eax
01C69E79 rep movs dword ptr [edi],dword ptr [esi]
01C69E7B mov ecx,edx
01C69E7D and ecx,3
01C69E80 rep movs byte ptr [edi],byte ptr [esi]
(Yes, I'm aware that the duplicate addition ("alphaData + offset") should be hoisted out to a single common place. I've done that since taking the above snippets, but it doesn't change the behavior I'm seeing.)
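For reference, the hoisted form is roughly this (assuming alphaData is a byte pointer; "dst" is just a name used here for illustration, the other identifiers are from the snippets above):

unsigned char *dst = alphaData + offset;    // compute the destination once
if (aLength == 4)
    *((unsigned int *)dst) = *((unsigned int *)aData);
else
    memcpy(dst, aData, aLength);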
This code will only be run on Pentium3+ systems, so conditional moves are OK. In the 4-byte case I am replacing the 7 instructions of inline memcpy() code with a single move dword. That's good. However, I am not realizing any performance gains due to the extra compare-and-jump code. That's bad.
Any advice on how to get better performance from a known-length avoidance of memcpy()?
Thanks.
You might be losing speed due to memory alignment.
Hi swsnyder,
Welcome on board. The VC code is basically a REP MOVSD style of memory copy, which sits in the middle range of speed for a short known-length copy. You have a number of choices: a manually coded copy using integer registers and incremented pointers, handling DWORDs then BYTEs if the count is not a multiple of 4; or MMX or SSE if you don't mind being restricted to a later machine, which is not much of a problem these days.
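To give you an idea of the SSE route, here is a minimal sketch using SSE2 intrinsics. Note the assumptions: this needs SSE2 (Pentium 4 or later), a step beyond your Pentium III baseline, and the helper name and layout are just for illustration rather than anything from your code.

#include <emmintrin.h>   // SSE2 intrinsics

// Hypothetical helper: 16-byte unaligned loads/stores for the bulk,
// plain byte loop for the tail.
void sse2_copy(unsigned char *dst, const unsigned char *src, unsigned int len)
{
    unsigned int i = 0;
    for ( ; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    for ( ; i < len; ++i)    // trailing 1-15 bytes
        dst[i] = src[i];
}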
Tell us what the actual byte count is for the copy and it may be simple enough to write one. One thing you will have to be careful with is how you use assembler code for such a short procedure in your VC code. The VC compiler performs optimisations of its own, and the register usage you choose may not match what the compiler uses, which will increase your call overhead with register preservation.
I would tend to use a separate module coded in MASM rather than inline assembler, as it is less likely to interfere with the compiler's internal optimisation.
I remember reading that the newest versions of MSVC no longer allow inline assembly (for 64-bit targets, at least). With good reason, I think.
Making a hand-coded version of the rep movs copy can certainly improve speed here, because you are working with short run lengths (rep movs is better suited to longer copies):
; length in ecx
; source in esi
; destination in edi
mov edx, ecx            ; save the original byte count for the tail loop
shr ecx, 2              ; ecx = number of whole dwords
_loop1:                 ; copy dwords from the end toward the start
sub ecx, 1
jc _loop2               ; borrow set once ecx has reached 0
mov eax, [esi + 4*ecx]
mov [edi + 4*ecx], eax
jmp _loop1
_loop2:                 ; copy the trailing 1-3 bytes, if any
test edx, 3
jz _done
sub edx, 1
movzx eax, byte ptr [esi + edx]
mov [edi + edx], al
jmp _loop2
_done:
With long lengths and misaligned addresses, you can handle the first few bytes separately to bring the data to a 4-byte boundary, but it seems like you're working with short run lengths, so that won't help much. And because of those short lengths, there isn't much more that can be done.
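In case that alignment handling is ever useful for your longer copies, a rough C sketch of what I mean (hypothetical helper, not from your code; the unaligned dword reads from the source are fine on x86):

#include <stddef.h>

void aligned_copy(unsigned char *dst, const unsigned char *src, size_t len)
{
    // peel off leading bytes until the destination is 4-byte aligned
    while (len && ((size_t)dst & 3)) {
        *dst++ = *src++;
        --len;
    }
    // dword copies for the bulk
    for ( ; len >= 4; len -= 4, dst += 4, src += 4)
        *(unsigned int *)dst = *(const unsigned int *)src;
    // trailing 1-3 bytes
    while (len--)
        *dst++ = *src++;
}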
If that doesn't solve your bottleneck, then you will have to optimize more than just the memcpy itself.