Unrolling unsuccessful

Started by jj2007, June 26, 2008, 03:55:53 PM


jj2007

Unrolling unsuccessful for the following snippet (a mem2mem stringcopy) - any explanation?

Fast, 480 cycles
align 4
@@:
  mov ebx, [eax] ; get next 4 bytes
  mov ecx, ebx
  sub ecx, mcMagic
  ; lea ecx,[ebx-mcMagic] ; subtract 1 from each byte
  test ecx, 80808080h ; same as and ecx, 80808080h
  jnz @F ; jump if zero byte
  cmp edi, edx ; safe copy...
  jle mcDone
  mov [edx], ebx ; store a dword
  add eax, 4
  add edx, 4 ; inc destination
  jmp @B


Slow, 560 cycles
align 4
@@:
  mov ebx, [eax] ; get next 4 bytes
  mov ecx, ebx
  sub ecx, 01010101h
  ; lea ecx,[ebx-01010101h] ; subtract 1 from each byte
  test ecx, 80808080h ; same as and ecx, 80808080h
  jnz @F ; jump if zero byte
  cmp edi, edx ; safe copy...
  jle mcDone
  mov [edx], ebx ; store a dword

  mov ebx, [eax+4] ; get next 4 bytes
  mov ecx, ebx
  sub ecx, mcMagic
  ; lea ecx,[ebx-mcMagic] ; subtract 1 from each byte
  test ecx, 80808080h ; same as and ecx, 80808080h
  jnz @F ; jump if zero byte
  ; cmp edi, edx ; dropped: safe copy test No. 2
  ; jle mcDone
  mov [edx+4], ebx ; store a dword

  add eax, 8
  add edx, 8 ; inc destination
  jmp @B
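
For readers who want to check the zero-byte test outside the loop, here is a minimal sketch in C. It assumes mcMagic equals 01010101h, as the second listing suggests; the function names are made up for illustration. The quick test used in both loops can also fire on bytes above 80h, which is harmless if the code at the jnz target then checks byte by byte.

```c
#include <stdint.h>

/* Quick test, as in the loops above: nonzero when some byte of x is zero,
   but also when some byte is above 80h (a false positive). */
static uint32_t maybe_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & 0x80808080u;
}

/* Exact variant: the extra "& ~x" discards bytes whose top bit was
   already set before the subtraction. */
static uint32_t has_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}
```

For plain ASCII text both tests agree; they differ only for dwords that contain bytes in the range 81h..FFh.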


NightWare

hi, from my experience:
the size of the code has a small impact on the cache
there are 4 possible positions with "align 4", and some are slower in a speed test (so you should test both versions with align 16)
the "2nd" loop is possibly not well aligned
a mem access using [r+i] is a bit slower than [r] (an extra add is produced in hardware)

now, i understand why you removed the redundant add eax,4/add edx,4, but why remove cmp edi,edx? (i suppose edi contains startaddress+size), or maybe you've added an extra dword for the mem space?


jj2007

Quote from: NightWare on June 26, 2008, 10:45:28 PM
why remove cmp edi,edx? (i suppose edi contains startaddress+size), or maybe you've added an extra dword for the mem space?

Yes, that was the idea - a general purpose buffer with an extra dword. But the code slowed down instead of speeding up...

lingo

You can try this: :lol
pxor MM1, MM1
movq MM0, qword ptr [ebx]
movq MM2, qword ptr [ebx+8]
@@:
movq [eax+edx], MM0
movq [eax+edx+8], MM2
add edx, 16
pcmpeqb MM0, MM1
pcmpeqb MM2, MM1
packsswb MM0, MM0
packsswb MM2, MM2
movd ecx, MM0
movd edi, MM2
movq MM0, qword ptr [ebx+edx]
movq MM2, qword ptr [ebx+edx+8]
test ecx, ecx
jne @1a3
test edi, edi
je @b
bsf edi, edi
shr edi, 2
lea eax, [edx+edi-8] ; eax = string length, not counting the terminating zero
jmp @1a4
@1a3:
bsf ecx, ecx
shr ecx, 2
lea eax, [edx+ecx-16] ; eax = string length, not counting the terminating zero
@1a4:
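
The pcmpeqb / packsswb / bsf / shr 2 sequence above can be followed step by step with a small C emulation of one 8-byte block. This is only a sketch to illustrate the bit arithmetic, not the MMX code itself, and the function names are hypothetical. packsswb saturates each 16-bit word to a signed byte, so a zero in the low byte of a pair gives 7Fh, a zero in the high byte gives 80h, and two zeros give FFh. bsf then returns bit 8k or 8k+7 for pair k, and shifting right by 2 turns that into byte index 2k or 2k+1.

```c
#include <stdint.h>

/* Emulates pcmpeqb against zero followed by packsswb on one 8-byte block.
   Returns the low 32 bits of the packed result (what movd would read). */
static uint32_t packed_zero_mask(const unsigned char *p)
{
    uint32_t mask = 0;
    for (int w = 0; w < 4; w++) {
        /* pcmpeqb: a byte becomes FFh if it was zero, else 00h */
        int lo = (p[2 * w] == 0) ? 0xFF : 0x00;
        int hi = (p[2 * w + 1] == 0) ? 0xFF : 0x00;
        /* read the byte pair as a signed 16-bit word */
        int word = (hi << 8) | lo;
        if (word >= 0x8000)
            word -= 0x10000;
        /* packsswb: signed saturation to the range -128..127 */
        int sat = word > 127 ? 127 : (word < -128 ? -128 : word);
        mask |= (uint32_t)(sat & 0xFF) << (8 * w);
    }
    return mask;
}

/* Index of the first zero byte in the 8-byte block, or -1 if there is none. */
static int first_zero_index(const unsigned char *p)
{
    uint32_t mask = packed_zero_mask(p);
    if (mask == 0)
        return -1;
    int bit = 0;
    while (((mask >> bit) & 1u) == 0)   /* bsf: lowest set bit */
        bit++;
    return bit >> 2;                    /* shr 2: bit position to byte index */
}
```

In the MMX loop the same index is then corrected by -16 or -8 because edx has already been advanced past the 16 bytes just examined.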



jj2007

Hi Lingo, good to see you are still visiting this friendly place!
The usual question: Will it work on arbitrarily misaligned data? Right now I am too tired to test it immediately, sorry  :bg

hutch--

Usually memory copy algos run into the brick wall of how many memory reads and writes they have to do, and this is generally the limiting factor rather than instruction choice or coding efficiency. Where you can do it, reading and writing in larger data sizes improves the speed by reducing the number of memory reads and writes, but you then run into problems with the alignment of the data to be read or written.

If it is a reasonable amount of data, doing a short byte copy to align the start of the larger reads and writes works, but for short data it's of no advantage. To complicate matters, it depends on whether the data is in close cache or not: if it is, the speed will be reasonable; if it's not, you will take a big speed hit. You can solve this problem with SSE, using temporal reads and non-temporal writes.
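
A minimal sketch of the "short byte copy, then larger reads and writes" idea, written in C. The helper name copy_dst_aligned is hypothetical, and the choice to align the destination rather than the source is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes, aligning the destination first. */
static void copy_dst_aligned(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* Short byte copy until dst sits on a 4-byte boundary */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Dword copy: dst is aligned now, src may still be misaligned */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);   /* load that tolerates a misaligned source */
        memcpy(dst, &w, 4);   /* store to the aligned destination */
        src += 4;
        dst += 4;
        n -= 4;
    }
    /* Remaining tail bytes */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

For short strings the byte loops dominate, which matches the point above that the alignment step only pays off for larger amounts of data.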

jj2007

Quote from: hutch-- on June 27, 2008, 12:15:41 AM
doing a short byte copy to align the start of the larger reads and write works

How would you do that if your source is on an odd, your target on an even address?

Mark_Larson

Quote from: jj2007 on June 27, 2008, 08:05:42 AM
Quote from: hutch-- on June 27, 2008, 12:15:41 AM
doing a short byte copy to align the start of the larger reads and write works

How would you do that if your source is on an odd, your target on an even address?

The lowest bit is set if the address is odd and clear if it is even, so you can use the TEST instruction. TEST lets you test a single bit of a register, which is useful in this case. It sets the zero flag if the bit is NOT set and clears the zero flag if the bit is set. So if the JNZ after the TEST is taken, the zero flag was clear and the address is odd.


;esi holds source address
   test   esi, 01h
   jnz       is_odd
;handle even
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

jj2007

Quote from: Mark_Larson on June 27, 2008, 09:00:52 PM
JJ: How would you do that  [align the start of the larger reads and write] if your source is on an odd, your target on an even address?
the lowest bit is set if it is odd

Mark, thanxalot for the hint. It seems I have to deliver a concrete example:
- target is at 40200h (we used .data and aligned it to a dword)
- source comes from somewhere out of the blue and is at 42203h
So how could we achieve a dword-aligned memcopy?
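
The arithmetic for these two addresses can be checked with a short C sketch (the helper names are hypothetical). The source needs 1 lead byte to reach a dword boundary and the target needs 0, and because both pointers advance in lockstep, their difference modulo 4 stays at 3 for the whole copy. So no number of lead bytes aligns both sides at once; a copy can only align one side and accept misaligned access on the other, or combine shifted dwords.

```c
#include <stdint.h>

/* Bytes to copy singly before addr reaches a 4-byte boundary. */
static unsigned lead_bytes(uint32_t addr)
{
    return (4u - (addr & 3u)) & 3u;
}

/* Difference of the two addresses modulo 4.
   Advancing both pointers by the same amount never changes it. */
static unsigned relative_offset(uint32_t src, uint32_t dst)
{
    return (src - dst) & 3u;
}
```

With src = 42203h and dst = 40200h, lead_bytes gives 1 and 0, and relative_offset gives 3 both before and after the lead byte is copied.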