Unrolling unsuccessful

jj2007 · June 26, 2008, 03:55:53 PM

Unrolling unsuccessful for the following snippet (a mem2mem stringcopy) - any explanation?

Fast, 480 cycles

align 4
@@:
  mov ebx, [eax]			; get next 4 bytes
  mov ecx, ebx
  sub ecx, mcMagic
  ; lea ecx,[ebx-mcMagic]	; subtract 1 from each byte
  test ecx, 80808080h		; same as and ecx, 80808080h
  jnz @F					; jump if zero byte
  cmp edi, edx				; safe copy...
  jle mcDone
  mov [edx], ebx			; store a dword
  add eax, 4
  add edx, 4				; inc destination
  jmp @B

Slow, 560 cycles

align 4
@@:
  mov ebx, [eax]			; get next 4 bytes
  mov ecx, ebx
  sub ecx, 01010101h
  ; lea ecx,[ebx-01010101h]	; subtract 1 from each byte
  test ecx, 80808080h		; same as and ecx, 80808080h
  jnz @F					; jump if zero byte
  cmp edi, edx				; safe copy...
  jle mcDone
  mov [edx], ebx			; store a dword

  mov ebx, [eax+4]		; get next 4 bytes
  mov ecx, ebx
  sub ecx, mcMagic
  ; lea ecx,[ebx-mcMagic]	; subtract 1 from each byte
  test ecx, 80808080h		; same as and ecx, 80808080h
  jnz @F					; jump if zero byte
  ; cmp edi, edx				; dropped: safe copy test No. 2
  ; jle mcDone
  mov [edx+4], ebx			; store a dword

  add eax, 8
  add edx, 8				; inc destination
  jmp @B
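
For readers unfamiliar with the trick, the zero-byte test used in both loops can be sketched in C. This is only an illustration: the function names are hypothetical, and mcMagic is assumed to be 01010101h, as the commented-out lea suggests. The sub/test pair used in the loops also fires for some dwords that contain no zero byte (for example when a byte is 81h or above), so the code at the @F label presumably re-checks byte by byte. The exact variant adds an "and not x" term.

```c
#include <stdint.h>

/* Mirrors the loop's test: sub ecx, 01010101h / test ecx, 80808080h.
   Nonzero result means "a zero byte MAY be present". */
static int loop_test(uint32_t x)
{
    return ((x - 0x01010101u) & 0x80808080u) != 0;
}

/* Exact variant: the extra & ~x removes hits caused by bytes whose
   own high bit is set, so nonzero means a zero byte IS present. */
static int has_zero_byte(uint32_t x)
{
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}
```

For plain ASCII text the two tests agree, which is probably why the shorter form is good enough inside the loop.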

NightWare · June 26, 2008, 10:45:28 PM

Hi, from my experience:
Code size has a small impact on the cache.
"align 4" can put the loop at four different offsets within a 16-byte block, and some are slower in a speed test (so you should test both versions with align 16).
The "2nd" loop is possibly not well aligned.
A memory access using [r+i] is a bit slower than [r] (an extra add is performed in hardware).

Now, I understand why you removed the redundant add eax,4 / add edx,4, but why remove cmp edi,edx? (I suppose edi contains startaddress+size.) Or maybe you've added an extra dword to the buffer?

jj2007 · June 26, 2008, 11:37:36 PM

Quote from: NightWare on June 26, 2008, 10:45:28 PM
why remove cmp edi,edx? (I suppose edi contains startaddress+size.) Or maybe you've added an extra dword to the buffer?

Yes, that was the idea - a general purpose buffer with an extra dword. But the code slowed down instead of speeding up...

lingo · June 26, 2008, 11:56:39 PM

You can try this: :lol

	pxor	MM1, MM1		
	movq	MM0, qword ptr [ebx]
	movq	MM2, qword ptr [ebx+8]
@@:
	movq	[eax+edx], MM0
	movq	[eax+edx+8], MM2
	add	edx, 16
	pcmpeqb	MM0, MM1
	pcmpeqb	MM2, MM1			
	packsswb	MM0, MM0			
	packsswb	MM2, MM2			
	movd	ecx, MM0			
	movd	edi, MM2
	movq	MM0, qword ptr [ebx+edx]
	movq	MM2, qword ptr [ebx+edx+8]	
	test	ecx, ecx			
	jne	@1a3				
	test	edi, edi			
	je	@b				
	bsf	edi, edi			
	shr	edi, 2				
	lea	eax, [edx+edi-8]			; eax->strlen without  "0"
	jmp	@1a4					
@1a3:			
	bsf	ecx, ecx
	shr	ecx, 2
	lea	eax, [edx+ecx-16]			; eax->strlen without  "0"
@1a4:

jj2007 · June 26, 2008, 11:59:33 PM

Hi Lingo, good to see you are still visiting this friendly place!
The usual question: Will it work on arbitrarily non-aligned code? Right now I am too tired to test it immediately, sorry :bg

hutch-- · June 27, 2008, 12:15:41 AM

Usually memory copy algos run into the brick wall of how many memory reads and writes they have to do, and this is generally the limiting factor rather than instruction choice or coding efficiency. Where you can do it, reading and writing in larger data sizes improves the speed by reducing the number of memory reads and writes, but you then run into problems with the alignment of the data to be read or written.

If it is a reasonable amount of data, doing a short byte copy to align the start of the larger reads and writes works, but for short data it's of no advantage. To complicate matters, it depends on whether the data is in close cache or not: if it is, the speed will be reasonable; if it's not, you will take a big speed hit. You can solve this problem with SSE, using temporal reads and non-temporal writes.
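
A minimal C sketch of the "short byte copy first" idea, assuming a copy of known length rather than a zero-terminated string. The helper name copy_align_dst is hypothetical. It aligns only the destination; the source is read through memcpy, which a compiler typically turns into a plain (possibly unaligned) dword load.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, aligning the destination first. */
static void copy_align_dst(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* Head: single bytes until dst sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Body: one dword per step; dst is aligned here, src may not be. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);   /* load that tolerates an unaligned src */
        memcpy(dst, &w, 4);   /* store to the aligned destination */
        src += 4;
        dst += 4;
        n -= 4;
    }
    /* Tail: the remaining 0..3 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

For short copies the head and tail loops dominate, which matches the remark that the alignment step brings no advantage on short data.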

jj2007 · June 27, 2008, 08:05:42 AM

Quote from: hutch-- on June 27, 2008, 12:15:41 AM
doing a short byte copy to align the start of the larger reads and write works

How would you do that if your source is on an odd, your target on an even address?

Mark_Larson · June 27, 2008, 09:00:52 PM

Quote from: jj2007 on June 27, 2008, 08:05:42 AM
Quote from: hutch-- on June 27, 2008, 12:15:41 AM
doing a short byte copy to align the start of the larger reads and write works

How would you do that if your source is on an odd, your target on an even address?

The lowest bit is set if the address is odd and clear if it is even, so you can use the TEST instruction. TEST ANDs its operands without storing the result, so it lets you check a single bit of a register, which is useful in this case. It sets the zero flag if the bit is NOT set and clears the zero flag if the bit is set. So if the JNZ after the TEST is taken, the zero flag was clear and the address is odd.

;esi holds source address
   test   esi, 01h
   jnz       is_odd
;handle even

jj2007 · June 27, 2008, 09:16:18 PM

Quote from: Mark_Larson on June 27, 2008, 09:00:52 PM
JJ: How would you do that [align the start of the larger reads and write] if your source is on an odd, your target on an even address?

the lowest bit is set if it is odd

Mark, thanxalot for the hint. It seems I have to deliver a concrete example:
- target is at 40200h (we used .data and aligned it to a dword)
- source comes from somewhere out of the blue and is at 42203h
So how could we achieve a dword-aligned memcopy?
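
The arithmetic behind this example can be worked through in C. This is a sketch for illustration, not an answer given in the thread, and the function names are hypothetical. It shows that the source needs one head byte to reach a dword boundary, that the target needs none, and that the two can never be dword-aligned at the same step because their distance is not a multiple of 4.

```c
#include <stdint.h>

/* Bytes to copy singly before addr reaches a 4-byte boundary. */
static uint32_t head_bytes(uint32_t addr)
{
    return (4u - (addr & 3u)) & 3u;
}

/* Both pointers advance in lockstep, so they can be dword-aligned
   together only if their distance is a multiple of 4. */
static int can_align_both(uint32_t src, uint32_t dst)
{
    return ((src - dst) & 3u) == 0;
}
```

With source 42203h and target 40200h, head_bytes gives 1 and 0 and can_align_both is false, so at most one side can be aligned and the other has to be accessed unaligned (or merged with shifts).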
