Started by swsnyder, August 28, 2006, 02:08:24 PM

First, Hello to everyone.  I just joined this forum.

I'd like some optimization advice from some of you experts.  I have a routine in which the bottleneck is a memcpy().  The amount of data to be copied is smallish, usually a couple hundred bytes at most.  I have observed that 13% of the copies are exactly 4 bytes long, and it occurred to me that it would be faster to just do a simple 32-bit copy in those cases, bypassing memcpy() entirely.  The problem with this scheme is that the extra jump negates the savings of a simple copy vs. a 4-byte call to memcpy.

This is the original code, as compiled by Visual Studio 2003:

    memcpy(alphaData + offset, aData, aLength);
01D39F7D  mov         eax,dword ptr [this]
01D39F80  mov         edx,ecx
01D39F82  shr         ecx,2
01D39F85  add         edi,eax
01D39F87  rep movs    dword ptr [edi],dword ptr [esi]
01D39F89  mov         ecx,edx
01D39F8B  and         ecx,3
01D39F8E  rep movs    byte ptr [edi],byte ptr [esi]

And with my attempted improvement:

    if (aLength == 4)
01C69E5B  cmp         esi,4
01C69E5E  jne         gfxImageFrame::SetAlphaData+0AAh (1C69E6Ah)
        *((unsigned int *)(alphaData + offset)) = *((unsigned int *)aData);
01C69E60  mov         ecx,dword ptr [ecx]
01C69E62  mov         edx,dword ptr [this]
01C69E65  mov         dword ptr [edi+edx],ecx
01C69E68  jmp         gfxImageFrame::SetAlphaData+0C5h (1C69E85h)
        memcpy(alphaData + offset, aData, aLength);
01C69E6A  mov         eax,dword ptr [this]
01C69E6D  mov         ecx,esi
01C69E6F  mov         esi,dword ptr [aData]
01C69E72  mov         edx,ecx
01C69E74  shr         ecx,2
01C69E77  add         edi,eax
01C69E79  rep movs    dword ptr [edi],dword ptr [esi]
01C69E7B  mov         ecx,edx
01C69E7D  and         ecx,3
01C69E80  rep movs    byte ptr [edi],byte ptr [esi]

(Yes, I'm aware that the duplicate addition ("alphaData + offset") should be moved out to a common single place.  I've done that since taking the above snippets of code, but that doesn't change the behavior I'm seeing.)

This code will only be run on Pentium3+ systems, so conditional moves are OK.  In the 4-byte case I am replacing the 7 instructions of inline memcpy() code with a single dword move.  That's good.  However, I am not realizing any performance gains because of the extra compare-and-jump code.  That's bad.

Any advice on how to get better performance from a known-length avoidance of memcpy()?
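
For reference, the 4-byte special case described above can also be written without the pointer casts. This is only a sketch: the helper name copy_small is made up for illustration, and it assumes an optimizing compiler that expands a constant-size memcpy() inline, which is common but not guaranteed. It does not remove the compare-and-jump; it only avoids the alignment and aliasing assumptions of the cast version.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original code): copy len bytes,
   taking a fixed-size path when len is exactly 4. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t len)
{
    if (len == 4) {
        /* Constant size: an optimizing compiler can usually turn this
           into one 32-bit load and one 32-bit store. */
        memcpy(dst, src, 4);
    } else {
        /* General case: variable-length copy, as in the original code. */
        memcpy(dst, src, len);
    }
}
```

A call such as copy_small(alphaData + offset, aData, aLength) would stand in for the original memcpy() line, assuming both pointers are byte pointers.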



You might be losing speed due to memory alignment.


Hi swsnyder,

Welcome on board. The VC code is basically a REP MOVSD style memory copy, which sits in the middle range of speed for a short copy of known length. You have a number of choices. One is a manually coded copy using integer registers and incremented pointers, copying DWORDs first and then BYTEs if the count is not a multiple of 4. Another is MMX or SSE, if you don't mind being restricted to later machines, which is not much of a problem these days.
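
The DWORD-then-BYTE copy with incremented pointers could look roughly like this in C. It is a sketch, not tuned code: the function name copy_dword_then_byte is invented for illustration, and whether it beats the compiler's inline REP MOVSD on a given machine has to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: copy len bytes as 32-bit words, then leftover bytes. */
static void copy_dword_then_byte(unsigned char *dst, const unsigned char *src,
                                 size_t len)
{
    /* Copy whole 32-bit words first, advancing both pointers. */
    while (len >= 4) {
        uint32_t word;
        memcpy(&word, src, 4);  /* fixed-size copies avoid unaligned casts */
        memcpy(dst, &word, 4);
        src += 4;
        dst += 4;
        len -= 4;
    }
    /* Then copy the 0 to 3 remaining bytes one at a time. */
    while (len > 0) {
        *dst++ = *src++;
        len--;
    }
}
```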

Tell us what the actual byte count is for the copy and it may be simple enough to write one. One thing you will have to be careful with is how you use assembler code for such a short procedure in your VC code. The VC compiler performs optimisations of its own, and the registers you choose may not match the ones the compiler uses, which increases your call overhead because registers have to be preserved.

I would tend to use a separate module coded in MASM rather than inline assembler, as it will be less inclined to mess up the internal compiler optimisation.
I remember reading that the newest versions of MSVC don't allow inline assembly anymore. With good reason, I think.

A hand-coded replacement for the rep movs can certainly improve speed here, because you are working with short run lengths and rep movs is better suited to longer ones:

  ; length in ecx
  ; source in esi
  ; destination in edi
  mov edx, ecx
  shr ecx, 2
_loop1:               ; copy whole dwords, last to first
  sub ecx, 1
  jc _loop2
  mov eax, [esi + 4*ecx]
  mov [edi + 4*ecx], eax
  jmp _loop1
_loop2:               ; copy the 0 to 3 trailing bytes, last to first
  test edx, 3
  jz _done
  sub edx, 1
  movzx eax, byte ptr [esi + edx]
  mov [edi + edx], al
  jmp _loop2
_done:

With long lengths and odd addresses, you can handle the first few bytes separately to make the data 4-byte aligned, but it seems like you're working with short run lengths, so that won't help much. And because of those short lengths, not much more than this can be done.
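
For the longer-copy case, the number of leading bytes to handle separately can be computed from the address. A small sketch in C, assuming a compiler that provides uintptr_t in <stdint.h>; the helper name bytes_to_align4 is made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes must be copied one at a time
   before pointer p reaches a 4-byte boundary (0 if already aligned). */
static size_t bytes_to_align4(const void *p)
{
    return (size_t)((4u - ((uintptr_t)p & 3u)) & 3u);
}
```

A copy routine would copy that many bytes individually (or fewer, if the total length is shorter), and then switch to dword moves.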

If that doesn't solve your bottleneck, then you will have to optimize more than the memcpy itself.