The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on June 26, 2008, 03:55:53 PM

Title: Unrolling unsuccessful
Post by: jj2007 on June 26, 2008, 03:55:53 PM
Unrolling unsuccessful for the following snippet (a mem2mem stringcopy) - any explanation?

Fast, 480 cycles
align 4
@@:
  mov ebx, [eax] ; get next 4 bytes
  mov ecx, ebx
  sub ecx, mcMagic
  ; lea ecx,[ebx-mcMagic] ; subtract 1 from each byte
  test ecx, 80808080h ; same as and ecx, 80808080h
  jnz @F ; jump if zero byte
  cmp edi, edx ; safe copy...
  jle mcDone
  mov [edx], ebx ; store a dword
  add eax, 4
  add edx, 4 ; inc destination
  jmp @B


Slow, 560 cycles
align 4
@@:
  mov ebx, [eax] ; get next 4 bytes
  mov ecx, ebx
  sub ecx, 01010101h
  ; lea ecx,[ebx-01010101h] ; subtract 1 from each byte
  test ecx, 80808080h ; same as and ecx, 80808080h
  jnz @F ; jump if zero byte
  cmp edi, edx ; safe copy...
  jle mcDone
  mov [edx], ebx ; store a dword

  mov ebx, [eax+4] ; get next 4 bytes
  mov ecx, ebx
  sub ecx, mcMagic
  ; lea ecx,[ebx-mcMagic] ; subtract 1 from each byte
  test ecx, 80808080h ; same as and ecx, 80808080h
  jnz @F ; jump if zero byte
  ; cmp edi, edx ; dropped: safe copy test No. 2
  ; jle mcDone
  mov [edx+4], ebx ; store a dword

  add eax, 8
  add edx, 8 ; inc destination
  jmp @B

Title: Re: Unrolling unsuccessful
Post by: NightWare on June 26, 2008, 10:45:28 PM
Hi, from my experience:
the size of the code has a small impact on the cache
with "align 4" the loop can start at 4 different offsets within a 16-byte block, and some are slower in a speed test (so you should test both versions with align 16)
the "2nd" loop is possibly not well aligned
a mem access using [r+i] is a bit slower than [r] (an extra add is produced in hardware)

now, i understand why you have removed a useless add eax,4/add edx,4 , but why removing cmp edi,edx ? (coz i suppose edi contain startaddress+size), or maybe you've added an extra dword for the mem space ?

Title: Re: Unrolling unsuccessful
Post by: jj2007 on June 26, 2008, 11:37:36 PM
Quote from: NightWare on June 26, 2008, 10:45:28 PM
why removing cmp edi,edx ? (coz i suppose edi contain startaddress+size), or maybe you've added an extra dword for the mem space ?

Yes, that was the idea - a general purpose buffer with an extra dword. But the code slowed down instead of speeding up...
Title: Re: Unrolling unsuccessful
Post by: lingo on June 26, 2008, 11:56:39 PM
You can try this: :lol
pxor MM1, MM1
movq MM0, qword ptr [ebx]
movq MM2, qword ptr [ebx+8]
@@:
movq [eax+edx], MM0
movq [eax+edx+8], MM2
add edx, 16
pcmpeqb MM0, MM1
pcmpeqb MM2, MM1
packsswb MM0, MM0
packsswb MM2, MM2
movd ecx, MM0
movd edi, MM2
movq MM0, qword ptr [ebx+edx]
movq MM2, qword ptr [ebx+edx+8]
test ecx, ecx
jne @1a3
test edi, edi
je @b
bsf edi, edi
shr edi, 2
lea eax, [edx+edi-8] ; eax->strlen without  "0"
jmp @1a4
@1a3:
bsf ecx, ecx
shr ecx, 2
lea eax, [edx+ecx-16] ; eax->strlen without  "0"
@1a4:


Title: Re: Unrolling unsuccessful
Post by: jj2007 on June 26, 2008, 11:59:33 PM
Hi Lingo, good to see you are still visiting this friendly place!
The usual question: Will it work on arbitrarily non-aligned data? Right now I am too tired to test it immediately, sorry  :bg
Title: Re: Unrolling unsuccessful
Post by: hutch-- on June 27, 2008, 12:15:41 AM
Usually memory copy algos run into the brick wall of how many memory reads and writes they have to do, and this is generally the limiting factor rather than instruction choice or coding efficiency. Where you can do it, reading and writing in larger data sizes improves the speed by reducing the number of memory reads and writes, but you then run into problems with the alignment of the data to be read or written.

If it is reasonable amounts of data, doing a short byte copy to align the start of the larger reads and write works, but for short data it's of no advantage. To complicate matters, it depends on whether the data is in close cache or not: if it is, the speed will be reasonable; if it's not, you will take a big speed hit. You can solve this problem with SSE with temporal reads and non temporal writes.
Title: Re: Unrolling unsuccessful
Post by: jj2007 on June 27, 2008, 08:05:42 AM
Quote from: hutch-- on June 27, 2008, 12:15:41 AM
doing a short byte copy to align the start of the larger reads and write works

How would you do that if your source is on an odd, your target on an even address?
Title: Re: Unrolling unsuccessful
Post by: Mark_Larson on June 27, 2008, 09:00:52 PM
Quote from: jj2007 on June 27, 2008, 08:05:42 AM
Quote from: hutch-- on June 27, 2008, 12:15:41 AM
doing a short byte copy to align the start of the larger reads and write works

How would you do that if your source is on an odd, your target on an even address?

the lowest bit is set if it is odd, and it is 0 if it is even, so you can use the "TEST" instruction.  TEST lets you test one bit of a register, which is useful in this case.  It sets the zero flag if the bit is NOT set, and clears the zero flag if the bit is set.  Thus if the JNZ after the TEST is taken, the zero flag was clear, so the address is odd.


;esi holds source address
   test   esi, 01h
   jnz       is_odd
;handle even
Title: Re: Unrolling unsuccessful
Post by: jj2007 on June 27, 2008, 09:16:18 PM
Quote from: Mark_Larson on June 27, 2008, 09:00:52 PM
JJ: How would you do that  [align the start of the larger reads and write] if your source is on an odd, your target on an even address?
the lowest bit is set if it is odd

Mark, thanxalot for the hint. It seems I have to deliver a concrete example:
- target is at 40200h (we used .data and aligned it to a dword)
- source comes from somewhere out of the blue and is at 42203h
So how could we achieve a dword-aligned memcopy?