Hello, I tried to improve on the lstrcpy function but only had marginal results. It appears that INC/DEC is faster than ADD/SUB on AMD processors. Can anyone suggest any further refinements? These results are with Windows XP Pro SP1, AMD XP 1800+
128-byte string copy timing results:
lstrcpy : 432 clocks
SzCpy1 (movsb/cmp) : 540 clocks
SzCpy2 (mov/cmp/inc) : 540 clocks
SzCpy3 (mov/cmp/add) : 666 clocks
SzCpy4 (mov/test/inc): 344 clocks
SzCpy5 (mov/test/add): 479 clocks
[attachment deleted by admin]
These are the results I got, Mark.
128-byte string copy timing results:
lstrcpy : 755 clocks
SzCpy1 (movsb/cmp) : 1185 clocks
SzCpy2 (mov/cmp/inc) : 582 clocks
SzCpy3 (mov/cmp/add) : 454 clocks
SzCpy4 (mov/test/inc): 593 clocks
SzCpy5 (mov/test/add): 445 clocks
I am a bit busy at the moment, but if you have the time, will you plug this MASM32 library module into your test piece to see how it works? It uses an index in the addressing mode and ends up with a lower loop instruction count.
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * -----------------------------------------------
copied length minus terminator is returned in EAX
----------------------------------------------- *
align 16
szCopy proc src:DWORD,dst:DWORD
push ebp
mov edx, [esp+8]
mov ebp, [esp+12]
xor ecx, ecx ; ensure no previous content in ECX
mov eax, -1
@@:
add eax, 1
mov cl, [edx+eax]
mov [ebp+eax], cl
test cl, cl
jnz @B
pop ebp
ret 8
szCopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
The box is a 2.8 GHz Prescott PIV.
Hi Mark-
The Athlon is very sensitive to alignment. I tried some alignment tests. Before the changes, my times were virtually identical to yours. Here are my results with alignment changes.
(By the way, the last two, SzCpy4 and SzCpy5, copied from destination to source; I changed them for my tests.)
lstrcpy : 432 clocks
SzCpy1 (movsb/cmp) : 540 clocks
SzCpy2 (mov/cmp/inc) : 526 clocks
SzCpy3 (mov/cmp/add) : 408 clocks
SzCpy4 (mov/test/inc): 325 clocks
SzCpy5 (mov/test/add): 342 clocks
[attachment deleted by admin]
Hi Mark,
Here are the results from my machine (AMD Athlon 500 MHz):
128-byte string copy timing results:
lstrcpy : 378 clocks
SzCpy1 (movsb/cmp) : 536 clocks
SzCpy2 (mov/cmp/inc) : 535 clocks
SzCpy3 (mov/cmp/add) : 660 clocks
SzCpy4 (mov/test/inc): 341 clocks
SzCpy5 (mov/test/add): 421 clocks
Here are the times on my Sempron 2.4
128-byte string copy timing results:
lstrcpy : 398 clocks
SzCpy1 (movsb/cmp) : 532 clocks
SzCpy2 (mov/cmp/inc) : 534 clocks
SzCpy3 (mov/cmp/add) : 655 clocks
SzCpy4 (mov/test/inc): 339 clocks
SzCpy5 (mov/test/add): 424 clocks
My Celeron:
128-byte string copy timing results:
lstrcpy : 608 clocks
SzCpy1 (movsb/cmp) : 1366 clocks
SzCpy2 (mov/cmp/inc) : 644 clocks
SzCpy3 (mov/cmp/add) : 511 clocks
SzCpy4 (mov/test/inc): 575 clocks
SzCpy5 (mov/test/add): 456 clocks
P3-500, including the procedure that Hutch posted:
lstrcpy : 672 clocks
SzCpy1 (movsb/cmp) : 654 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc): 402 clocks
SzCpy5 (mov/test/add): 402 clocks
szCopy : 289 clocks
And some strange and not very useful timings for an AMD K5:
lstrcpy : 531 clocks
SzCpy1 (movsb/cmp) : 984 clocks
SzCpy2 (mov/cmp/inc) : 984 clocks
SzCpy3 (mov/cmp/add) : 984 clocks
SzCpy4 (mov/test/inc): 984 clocks
SzCpy5 (mov/test/add): 984 clocks
:lol
WinXP Pro SP2 / Pentium 4 560J, 3.6 GHz (Prescott), including Hutch's procedure
and my procedure SzCpy7.
It is listed first because I want to use the zeroes from the
destination buffer...:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h
SzCpy7 proc SzDest:DWORD, SzSource:DWORD
mov ecx, [esp+8] ; ecx = source
mov edx, [esp+4] ; edx= destination
push ebx
xor eax, eax
mov bl, [ecx] ; read byte
@@:
add [edx+eax], bl ; I use the fact ->the destination
lea eax, [eax+1] ; buffer is zero filled!!!
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
pop ebx
ret 2*4
SzCpy7 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Microsoft (R) Macro Assembler Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: SzCpy.asm
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
Press any key to continue . . .
SzCpy/lstrcpy speed comparison by Mark Jones (gzscuqn02ATsneakemailD0Tcom)
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy7 (mov/add/lea->Lingo): 172 clocks
lstrcpy : 646 clocks
SzCpy1 (movsb/cmp) : 1236 clocks
SzCpy2 (mov/cmp/inc) : 667 clocks
SzCpy3 (mov/cmp/add) : 532 clocks
SzCpy4 (mov/test/inc): 602 clocks
SzCpy5 (mov/test/add): 453 clocks
SzCpy6 (mov/test/add->Hutch): 545 clocks
Press ENTER to exit...
Regards,
Lingo
[attachment deleted by admin]
Athlon64 2800+
128-byte string copy timing results:
SzCpy7 (mov/add/lea->Lingo): 157 clocks
lstrcpy : 327 clocks
SzCpy1 (movsb/cmp) : 714 clocks
SzCpy2 (mov/cmp/inc) : 574 clocks
SzCpy3 (mov/cmp/add) : 712 clocks
SzCpy4 (mov/test/inc): 301 clocks
SzCpy5 (mov/test/add): 432 clocks
SzCpy6 (mov/test/add->Hutch): 303 clocks
Here is a quick play; this one is faster on my PIV by a ratio of 270 to 200 ms against the library version of szCopy. The idea was register alternation with the source and destination addresses and a 4-times unroll. The first dropped the times a little and, in conjunction with the unroll, dropped some more.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
szCopyx proc src:DWORD,dst:DWORD
push ebx
push esi
push edi
xor ecx, ecx
mov esi, src
mov edi, dst
mov ebx, src
mov edx, dst
mov eax, -4
align 4
@@:
add eax, 4
mov cl, [esi+eax]
mov [edi+eax], cl
test cl, cl
jz @F
mov cl, [ebx+eax+1]
mov [edx+eax+1], cl
test cl, cl
jz @F
mov cl, [esi+eax+2]
mov [edi+eax+2], cl
test cl, cl
jz @F
mov cl, [ebx+eax+3]
mov [edx+eax+3], cl
test cl, cl
jnz @B
@@:
pop edi
pop esi
pop ebx
ret
szCopyx endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
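For readers following along in a higher-level language, the unrolled loop above can be sketched in C (an illustrative sketch only; `szcopy_unrolled4` is a made-up name, and a compiler will not reproduce the exact register alternation):

```c
#include <stddef.h>

/* Illustrative C version of the 4x-unrolled copy loop: each slot
   copies one byte and exits as soon as the zero terminator is stored,
   so the loop branch is taken only once per four bytes. */
void szcopy_unrolled4(const char *src, char *dst)
{
    size_t i = 0;
    for (;;) {
        if ((dst[i]     = src[i])     == 0) return;
        if ((dst[i + 1] = src[i + 1]) == 0) return;
        if ((dst[i + 2] = src[i + 2]) == 0) return;
        if ((dst[i + 3] = src[i + 3]) == 0) return;
        i += 4;
    }
}
```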
Wow, we have some interesting results so far! Lingo's algorithm didn't seem to slow down any when the destination string was not all zeroes. Added is Hutch's 4x unroll:
XP SP1/AMD XP 1800+
128-byte string copy timing results:
lstrcpy : 433 clocks
SzCpy1 (movsb/cmp) MJ : 540 clocks
SzCpy2 (mov/cmp/inc) MJ : 541 clocks
SzCpy3 (mov/cmp/add) MJ : 666 clocks
SzCpy4 (mov/test/inc) MJ : 344 clocks
SzCpy5 (mov/test/add) MJ : 477 clocks
SzCopy (mov/test/add) Hutch : 367 clocks
SzCpy7 (mov/add/lea) Lingo : 148 clocks
szCopyx (4x unroll) Hutch : 300 clocks
Press ENTER to exit...
[attachment deleted by admin]
The latest one gives me
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
masm32.lib(DW2A.obj) : error LNK2001: unresolved external symbol _wsprintfA
SzCpy.exe : fatal error LNK1120: 1 unresolved externals
How about moving a word at a time?
align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD
xor ecx, ecx
mov eax, -4
mov ebx, SzSource
mov edx, SzDest
align 4
@@:
add eax,4
mov cx, [ebx+eax]
test cl, cl
jz move1 ; done, go move zero byte
mov [edx+eax], cx ; save both
test ch,ch
jz @f ; all done4
mov cx, [ebx+eax+2]
test cl, cl
jz move2
mov [edx+eax+2], cx
test ch,ch
jnz @b
@@:
ret
move1:
mov [edx+eax],cl
ret
move2:
mov [edx+eax+2],cl
ret
SzCpy12 endp
runs in 255 on my Athlon
p.s.
Did anyone try Lingo's routine? It doesn't give me the correct results; I must be doing something wrong.
Mark, you missed the trick that Lingo was trying to do. If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results. He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
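In other words, Lingo replaces the store with an addition so that one instruction both writes the byte and sets the zero flag, which is only correct when every destination byte starts at zero. A rough C sketch of the idea (`add_copy` is a hypothetical name, not Lingo's code):

```c
#include <stddef.h>

/* Sketch of Lingo's trick: dst[i] += src[i] equals a plain copy only
   when dst[i] was 0 beforehand, and the sum is 0 exactly when the
   terminator has just been written. */
void add_copy(const char *src, unsigned char *dst)
{
    size_t i = 0;
    do {
        dst[i] = (unsigned char)(dst[i] + src[i]); /* add, not mov */
    } while (src[i++] != 0);
}
```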
Lingo, a couple of points. I haven't tried any of this, just things that could probably make it faster.
1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles). So timing-wise, if you change your ADD below to a MOV it should theoretically perform the same. Both MOV and ADD go through the same execution unit (ALU), so they should run at the same speed. If not, that would be really surprising. That will also get rid of your dependency on having an all-zero destination buffer.
2) "mov bl, [ecx]" is going to cause a partial register stall. Even on the P4 you can still get "false partial register stalls". Change it to a MOVZX, or add a "xor ebx,ebx" before the move.
3) LEA is the slowest way to increment a number on a P4; ADD/SUB is the fastest. Also try INC/DEC.
lea eax, [eax+1] ; buffer is zero filled!!!
mov bl, [ecx+eax] ; read byte
4) The above code snippet has a stall (a read-after-write dependency) on the previous line. Instead, try switching the two lines and using "ecx+eax+1" for the memory address. That should free up a stall.
5) comment below.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h - I am clueless why you added 3 NOPs before the procedure; this makes it unaligned. For the P4 that is not a big deal, but for non-P4s it is going to cause a loss in performance.
SzCpy7 proc SzDest:DWORD, SzSource:DWORD
6) Advanced trick - try interleaving two instances of the code that use different register sets to avoid stalls. You just need to free up enough registers. If you'd like me to explain this in more detail, just ask.
I'll post a bit of it. I have not verified this runs faster, but it should help break up some stalls. I also used your original code, so I didn't make any of the changes I suggested above.
;ecx is source
;edi is destination - changed this from EDX to EDI, to free up a byte accessible register.
;esi is the same as eax, just a different offset
;all new code left justified
xor eax, eax
mov esi,1
mov bl, [ecx] ; read byte
mov dl,[ecx+1]
@@:
add [edi+eax], bl ; I use the fact ->the destination
add [edi+esi],dl
lea eax, [eax+1] ; buffer is zero filled!!!
lea esi,[esi+1]
mov bl, [ecx+eax] ; read byte
mov dl,[ecx+esi]
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
Results for W2K SP4 / Athlon64 3000+....
128-byte string copy timing results:
lstrcpy : 355 clocks
SzCpy1 (movsb/cmp) MJ : 663 clocks
SzCpy2 (mov/cmp/inc) MJ : 534 clocks
SzCpy3 (mov/cmp/add) MJ : 661 clocks
SzCpy4 (mov/test/inc) MJ : 281 clocks
SzCpy5 (mov/test/add) MJ : 402 clocks
SzCopy (mov/test/add) Hutch : 282 clocks
SzCpy7 (mov/add/lea) Lingo : 161 clocks
szCopyx (4x unroll) Hutch : 230 clocks
Edit:
Same box, but WindowsXP 64bit beta:
lstrcpy : 311 clocks
SzCpy1 (movsb/cmp) MJ : 661 clocks
SzCpy2 (mov/cmp/inc) MJ : 532 clocks
SzCpy3 (mov/cmp/add) MJ : 659 clocks
SzCpy4 (mov/test/inc) MJ : 279 clocks
SzCpy5 (mov/test/add) MJ : 401 clocks
SzCopy (mov/test/add) Hutch : 281 clocks
SzCpy7 (mov/add/lea) Lingo : 161 clocks
szCopyx (4x unroll) Hutch : 230 clocks
I wonder why lstrcpy gets faster? I expected it to be slower because of WOW...
QuoteMark, you missed the trick that Lingo was trying to do. If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results. He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
Duh... I get it now. You can't test it multiple times, so the normal test-a-million-times loop won't work. No way to test effectively.
Quote1) on the P4 that you ran this code on both MOV and ADD run in the same time ( 0.5 cycles). So timing-wise if you change your add below to a MOV it should theoretically perform the same. Both MOV and ADD go through the same execution unit ( ALU), so it should run in the same speed. If it doesn't, that would be really surprising. That will also get rid of your dependency on having an all 0 destination register.
Then the jnz @B won't have a condition to test (from the add).
Adding any form of zeroing the output or comparing bl puts it in the same speed range as the rest.
Mark Jones, :lol
Here is faster (unrolled by 8) variant:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h
SzCpy8 proc SzDest:DWORD, SzSource:DWORD
mov ecx, [esp+8] ; ecx = source
mov edx, [esp+4] ; edx= destination
push ebx
xor eax, eax
mov bl, [ecx] ; read byte
@@:
add [edx+eax], bl ; write byte
mov bl, [ecx+eax+1] ; read byte
jz @F ; if zero exit
add [edx+eax+1], bl ; write byte
mov bl, [ecx+eax+2] ; read byte
jz @F ; if zero exit
add [edx+eax+2], bl ; write byte
mov bl, [ecx+eax+3] ; read byte
jz @F ; if zero exit
add [edx+eax+3], bl ; write byte
mov bl, [ecx+eax+4] ; read byte
jz @F ; if zero exit
add [edx+eax+4], bl ; write byte
mov bl, [ecx+eax+5] ; read byte
jz @F ; if zero exit
add [edx+eax+5], bl ; write byte
mov bl, [ecx+eax+6] ; read byte
jz @F ; if zero exit
add [edx+eax+6], bl ; write byte
mov bl, [ecx+eax+7] ; read byte
jz @F ; if zero exit
add [edx+eax+7], bl ; write byte
lea eax, [eax+8]
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
@@:
pop ebx
ret 2*4
SzCpy8 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Regards,
Lingo
Jimg, :lol
Quote
Duh... I get it now. You can't test it multiple times, so the normal test-a-million-times loop won't work. No way to test effectively.
; -----------------------------------------------
print addr SzSzCpy8
xor ecx, ecx
mov edx, 128
mov eax, offset String2
@@:
mov [eax+edx-4], ecx
add edx, -4
jnz @B
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke SzCpy8, addr String2, addr String1
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250
; -----------------------------------------------
If you add zero or other value to the register
the time is the same and the test is correct.
Regards,
Lingo
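Lingo's harness re-zeroes the destination before every timed call so the zero-filled precondition holds on each pass. The same shape in C, as a rough sketch (all names here are placeholders, not the actual MASM32 macros):

```c
#include <string.h>
#include <stddef.h>

/* add_copy_stub stands in for the add-based routine under test
   (a placeholder, not Lingo's actual code). */
static void add_copy_stub(const char *src, unsigned char *dst)
{
    size_t i = 0;
    do { dst[i] = (unsigned char)(dst[i] + src[i]); } while (src[i++] != 0);
}

/* Sketch of the fix: wipe the destination before every iteration so
   the zero-filled precondition holds on each timed pass.  iters plays
   the role of LOOP_COUNT in the real harness. */
void bench_zeroed(const char *src, unsigned char *dst, size_t len, int iters)
{
    for (int i = 0; i < iters; i++) {
        memset(dst, 0, len);        /* restore the precondition */
        add_copy_stub(src, dst);    /* the timed call */
    }
}
```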
Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
No they don't. They force it to be unaligned. They were put after an "align 16". The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary. On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
QuoteIf you add zero or other value to the register
the time is the same and the test is correct.
Logically, you are correct. Realistically, since everything seems to have some strange effect on timing, I want to see the result correct after the million times loop!
After more testing, I am convinced that the repeated overflow condition is affecting the test somehow, but it's just too many steps to trace by hand. Changing the add to a mov followed by a test bl,bl shouldn't double the execution time. What's going on here?
What do you think about this one?
szCopyP proc src:DWORD,dst:DWORD
mov ecx,src ; ecx = source
mov edx,dst ; edx = destination
xor eax,eax
@@:
add al,[ecx] ; read byte
jz isByte ; Result = 0?
shl eax,8 ; shift al > ah
add al,[ecx+1] ; read byte
jz isWord
shl eax,8
add al,[ecx+2]
jz isByteWord
shl eax,8
add al,[ecx+3]
jz isDWord
mov [edx],eax
add ecx,4 ; inc pointers
add edx,4
jmp @B
isByte:
mov [edx],al
jmp done
isWord:
mov [edx],ax
jmp done
isByteWord:
mov [edx+1],ax
shr eax,16
mov [edx],al
jmp done
isDWord:
mov [edx],eax
done:
Ret
szCopyP EndP
Results (W2K SP4 / Athlon64 3000+):
128-byte string copy timing results:
lstrcpy : 354 clocks
SzCpy1 (movsb/cmp) MJ : 663 clocks
SzCpy2 (mov/cmp/inc) MJ : 534 clocks
SzCpy3 (mov/cmp/add) MJ : 661 clocks
SzCpy4 (mov/test/inc) MJ : 281 clocks
SzCpy5 (mov/test/add) MJ : 403 clocks
SzCopy (mov/test/add) Hutch : 282 clocks
SzCpy7 (mov/add/lea) Lingo : 161 clocks
szCopyx (4x unroll) Hutch : 230 clocks
szCopyP (DWORD mov) Phoenix : 250 clocks
Not too bad, I think?
Phoenix, :lol
QuoteNot too bad, I think?
Not too bad, but you must move the @@ label before the xor eax, eax, and please include my last procedure in your test...
Regards,
Lingo
Lingo's technique is an interesting one, and while it is not properly general purpose, it will be very useful where you preallocate a zero-filled buffer and then repeatedly append byte data at the end of the last write. To benchmark it you would have to allocate a large enough zero-filled buffer, then progressively append string data to the end of it for about half a second; that should give a reasonable idea of how fast it is in that context against the others.
Hutch, :lol
Quote
"..for a duration of about a half a second and you should get a reasonable idea of how fast it is.."
If we use
invoke HeapAlloc, hHeap, dwFlags, dwBytes
it is very interesting how you measured "a half a second" difference
between the usage of the two parameters:
dwFlags = 0
and
dwFlags = HEAP_ZERO_MEMORY
Regards,
Lingo
Lingo,
The comment was to make the test piece large enough so the test would run for about half a second so that the test results were more reliable. They don't improve in accuracy much over that.
Lingo,
QuoteNot too bad, but you must move the @@ label before the xor eax, eax
:red of course, you are right.... but with the fix, the result is the same: 250 clocks (?)
An align 16 returns the result: 216 clocks...
The result for SzCpy8 is 161 clocks on my box, but...
did you check the resulting strings from SzCpy7 and SzCpy8? I added
print addr String2 after your procs, and the result seems not to be the expected one.
Edit: The result string is Ë♂VýºÓ▬©↑¶‗áÓz<∟♦∟♦
Regards, Phoenix
Quote from: Jimg on May 10, 2005, 06:39:12 PM
Quote1) on the P4 that you ran this code on both MOV and ADD run in the same time ( 0.5 cycles). So timing-wise if you change your add below to a MOV it should theoretically perform the same. Both MOV and ADD go through the same execution unit ( ALU), so it should run in the same speed. If it doesn't, that would be really surprising. That will also get rid of your dependency on having an all 0 destination register.
Then the jnz @B won't have a condition to test (from the add).
Adding any form of zeroing the output or comparing bl puts it in the same speed range as the rest.
I guess I wasn't clear enough. I wanted to change it to use ECX and MOVZX, and then do a JECXZ. Not sure of the speed difference.
This uses a MOV instead of an ADD, and a MOVZX instead of an 8-bit move to get rid of partial register stalls. And I got rid of the LEA, since I don't need to preserve the flags and LEA is slow. That's 3 of my points above.
;esi points to the source
xor eax, eax
movzx ecx,byte ptr [esi] ; read byte
@@:
mov [edx+eax], cl ; I use the fact ->the destination
; lea eax, [eax+1] ; buffer is zero filled!!!
add eax,8 ; we can do that because we don't have to worry about the flags ,since we are using JECXZ
jecxz @F
movzx ecx, byte ptr [esi+eax] ; read byte
jmp @B ; keep going
Lingo, you can use a REPEAT macro to make the code smaller. I didn't unroll it the full 8 times using REPEAT due to the fact you have the LEA in the middle of the sequence. You could technically swap LEA and the line after 8, and do the full 8 lines of unrolling. This also lets you easily play with different amounts of unrolling without having to do cut and pasting, and modifying offsets ( ewww).
@@:
UNROLL_AMT equ 7
looper = 0
REPEAT UNROLL_AMT
add [edx+eax+looper], bl ; write byte
mov bl, [ecx+eax+looper+1] ; read byte
jz @F ; if zero exit
looper = looper + 1
endm
add [edx+eax+UNROLL_AMT], bl ; write byte
lea eax, [eax+UNROLL_AMT+1]
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
Mark_Larson-
Quote;esi points to the source
xor eax, eax
movzx ecx,byte ptr [esi] ; read byte
@@:
mov [edx+eax], cl ; I use the fact ->the destination
; lea eax, [eax+1] ; buffer is zero filled!!!
add eax,8 ; we can do that because we don't have to worry about the flags ,since we are using JECXZ
jecxz @F
movzx ecx, byte ptr [esi+eax] ; read byte
jmp @B ; keep going
Looks ok, except add eax,8 needs to be add eax,1, doesn't it?
I used this-
align 16
db 8 dup (0) ; 8=411
SzCpy16 proc SzDest:DWORD, SzSource:DWORD
mov edx, SzDest
mov esi, SzSource
xor eax, eax
movzx ecx,byte ptr [esi] ; read byte
@@:
mov [edx+eax], cl
add eax,1
jecxz @F
movzx ecx, byte ptr [esi+eax] ; read byte
jmp @B ; keep going
@@:
ret
SzCpy16 endp
best time was 411 cycles.
You are right about the lea though. This mod of lingo runs in 130 cycles
align 16
db 3 dup (0)
SzCpy17 proc uses ebx SzDest:DWORD, SzSource:DWORD
mov ecx, SzSource ; ecx = source
mov edx, SzDest ; edx= destination
xor eax, eax
mov bl, [ecx] ; read byte
sub eax,1
@@:
add eax, 1 ;[eax+1]
add [edx+eax], bl ; write byte
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
ret
SzCpy17 endp
but I still can't figure out why the add should be three times faster than a mov followed by test bl,bl
I played with the unrolled version a little more; the register alternation was not making it any faster, so I reduced the entry and exit overhead and removed the stack frame, and it is showing a little faster on my PIV. Note that this version, like most of the rest, does not require any zero-filled buffer and can be considered general purpose.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopyx proc src:DWORD,dst:DWORD
mov [esp-4], esi
xor eax, eax
mov esi, [esp+4]
mov edx, [esp+8]
mov ecx, -4
align 4
@@:
add ecx, 4
mov al, [esi+ecx]
mov [edx+ecx], al
test al, al
jz @F
mov al, [esi+ecx+1]
mov [edx+ecx+1], al
test al, al
jz @F
mov al, [esi+ecx+2]
mov [edx+ecx+2], al
test al, al
jz @F
mov al, [esi+ecx+3]
mov [edx+ecx+3], al
test al, al
jnz @B
@@:
mov esi, [esp-4]
ret 8
szCopyx endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Hutch-
Best I could get with your latest is 284. Other than Lingo's and its mods, the best I've found is 255 with-
align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD
xor ecx, ecx
mov eax, -4
mov ebx, SzSource
mov edx, SzDest
align 4
@@:
add eax,4
mov cx, [ebx+eax]
test cl, cl
jz move1 ; done, go move zero byte
mov [edx+eax], cx ; save both
test ch,ch
jz @f ; all done4
mov cx, [ebx+eax+2]
test cl, cl
jz move2
mov [edx+eax+2], cx
test ch,ch
jnz @b
@@:
ret
move1:
mov [edx+eax],cl
ret
move2:
mov [edx+eax+2],cl
ret
SzCpy12 endp
Still tweaking. Hutch's latest runs in 343 cycles. I made some changes to use REPEAT macros and use MOVZX. With the same number of unrolls and MOVZX it runs in 284 cycles.
hutch's code above 343
4 unrolls movzx 284
8 unrolls movzx 284
16 unrolls movzx 283
32 unrolls movzx 282
64 unrolls movzx 271
128 unrolls movzx 262 - a 128-byte string is being tested, so further unrolling yields no speedup.
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopyx2 proc src:DWORD,dst:DWORD
mov [esp-4], esi
xor eax, eax
mov esi, [esp+4]
mov edx, [esp+8]
mov ecx, -UNROLL_AMTx
align 4
@@:
UNROLL_AMTx equ 4
add ecx, UNROLL_AMTx
looper = 0
REPEAT UNROLL_AMTx
movzx eax, byte ptr [esi+ecx+looper]
mov [edx+ecx+looper], al
test al, al
jz @F
looper = looper + 1
endm
jmp @B
@@:
mov esi, [esp-4]
ret 8
szCopyx2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Here is the fastest variant: :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 8Dh, 0A4h,24h, 0, 0, 0, 0, 8Dh, 0A4h, 24h, 0, 0, 0, 0, 90h
SzCpy15 proc SzDest:DWORD, SzSource:DWORD
push ebx
xor eax, eax
mov ebx, [esp+2*4+1*4] ; ebx = source
mov edx, [esp+1*4+1*4] ; edx= destination
movq MM0, [ebx]
pxor MM7, MM7
@@:
movq [edx+eax], MM0
add eax, 8
pcmpeqb MM0, MM7
packsswb MM0, MM0
movd ecx, MM0
movq MM0, [ebx+eax]
jecxz @B
pop ebx
emms ; femms for AMD
ret 2*4
SzCpy15 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
"Note that this version like most of the rest does not require any zero filled buffer and can be considered general purpose."
Regards,
Lingo
Lingo,
It looks good; how does it perform on unaligned string data as source and destination, and lengths that are not on 4-byte boundaries?
Mark,
The MOVZX mod looks like a good one, I will have a play with it a bit later.
LATER : Yes, this method is noticeably faster; the test piece is now twice as fast as the version in the library. What I will be interested to see is if this mod works on AMD hardware as well.
Quote from: Mark_Larson on May 10, 2005, 08:28:11 PM
Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
No they don't. They force it to be unaligned. They were put after an "align 16". The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary. On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
Check it in a debugger; the entry point is 3 bytes off, but the inner loop is aligned to 16 bytes.. :eek
This is the version that has used Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopyx proc src:DWORD,dst:DWORD
mov [esp-4], esi
mov esi, [esp+4]
mov edx, [esp+8]
mov ecx, -4
align 4
@@:
add ecx, 4
movzx eax, BYTE PTR [esi+ecx]
mov [edx+ecx], al
test al, al
jz @F
movzx eax, BYTE PTR [esi+ecx+1]
mov [edx+ecx+1], al
test al, al
jz @F
movzx eax, BYTE PTR [esi+ecx+2]
mov [edx+ecx+2], al
test al, al
jz @F
movzx eax, BYTE PTR [esi+ecx+3]
mov [edx+ecx+3], al
test al, al
jnz @B
@@:
mov esi, [esp-4]
ret 8
szCopyx endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
The next one is for novelty value; it's slow but it works. It requires a DWORD-aligned target buffer.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
xorcopy proc src:DWORD,dst:DWORD
push edi
push ebx
mov edx, src
mov edi, dst
mov ecx, -4
xor ebx, ebx
align 4
@@:
add ecx, 4
mov DWORD PTR [edi+ecx], ebx
xor al, al
xor al, [edx+ecx]
xor [edi+ecx], al
jz @F
xor al, al
xor al, [edx+ecx+1]
xor [edi+ecx+1], al
jz @F
xor al, al
xor al, [edx+ecx+2]
xor [edi+ecx+2], al
jz @F
xor al, al
xor al, [edx+ecx+3]
xor [edi+ecx+3], al
jnz @B
@@:
pop ebx
pop edi
ret
xorcopy endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
I didn't have very long to work on this, but it seems to do well even with a stack frame and with no specific alignment. I didn't have time to include the last 4 (or 5?) versions posted in my timing test.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str2 db 128 dup(0)
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
invoke SzCopy_dw,ADDR str1,ADDR str2
print ustr$(eax)
print chr$(13,10)
print ADDR str1
print chr$("<",13,10)
print ADDR str2
print chr$("<",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR str1,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
mov eax,lpsrc
mov edx,lpdest
sub eax,4
sub edx,4
@@:
mov ebx,[eax+4] ; get next 4 bytes
add eax,4
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
add edx,4
test ecx,80808080h
jnz @F ; jump if zero byte
mov [edx],ebx
jmp @B
@@:
mov cl,[eax]
add eax,1
mov [edx],cl
add edx,1
test cl,cl
jnz @B
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
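The dword null-scan that SzCopy_dw borrows from Agner Fog can be written out in C. Note that the listing's `test ecx,80808080h` is the looser form of the test: it can report a false positive for bytes >= 80h, which is harmless here because the byte loop that follows finds the real terminator. A sketch of both forms (function names are mine, for illustration):

```c
#include <stdint.h>

/* Strict form of the dword null-byte test: nonzero iff some byte
   of v is zero. */
int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* Looser form, matching the listing (no ~v term): may also fire for
   bytes >= 0x80, so it only says "a zero byte might be in here". */
int maybe_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}
```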
This was after I reversed the order of the arguments for szCopyx, szCopyP, and SzCopy_dw, and running on a P3:
lstrcpy : 671 clocks
SzCpy1 (movsb/cmp) : 653 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc) : 401 clocks
SzCpy5 (mov/test/add) : 401 clocks
szCopyx (hutch unroll 4): 289 clocks
SzCpy8 (lingo unroll 8) : 147 clocks
szCopyP (Phoenix) : 1261 clocks
szCopy_dw : 162 clocks
It's funny how things get revisited on this board. We've had lots of discussion about string length algorithms, including using MMX and SSE2. Check this thread. http://www.old.masmforum.com/viewtopic.php?t=2632. I even posted a version that used MMX and SSE2 in that thread ( along with some other people who did MMX). My account had gotten messed up, so all my posts will be from "hutch--", but the account type will say "guest". Whereas the real ones from hutch-- say "Site Admin" and have the big red crest with the gold lion.
Hutch, as far as alignment goes, I posted a trick to get around that when we did either a string length or string copy routine. Basically you do a byte routine until the data is at the appropriate alignment, and then you switch to the faster method.
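That align-then-go-wide pattern might look like this in C (a sketch under the assumption that reading the rest of the aligned dword containing the terminator is acceptable, as it is for the asm versions in this thread; `aligned_copy` is an invented name):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* aligned_copy sketches the pattern: byte loop until src is 4-byte
   aligned, dword loop while no terminator is near, byte loop to
   finish.  Returns the copied length without the terminator. */
size_t aligned_copy(const char *src, char *dst)
{
    size_t n = 0;
    while (((uintptr_t)(src + n) & 3u) != 0) {   /* align the source */
        if ((dst[n] = src[n]) == 0) return n;
        n++;
    }
    for (;;) {                                   /* dword at a time */
        uint32_t v;
        memcpy(&v, src + n, 4);                  /* aligned 4-byte load */
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;                               /* a zero byte is inside */
        memcpy(dst + n, &v, 4);
        n += 4;
    }
    while ((dst[n] = src[n]) != 0)               /* byte tail */
        n++;
    return n;
}
```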
Hutch-
QuoteThis is the version that has used Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.
On my Athlon, it runs at 284 with best alignment. Exactly the same as your previous routine, i.e. MOVZX doesn't seem to be any faster or slower on AMD chips.
Lingo-
Your last version is excellent. Runs in 112 cycles (with three extra bytes inserted after the align 16). Better than twice as fast as any other so far. Good Job :U
Now if I only understood what the heck it was doing :wink
Later.....
Ok, now I see what's going on. Too bad that it requires, in the worst case, at least 7 extra bytes in the destination string. :(
This is the SSE2 version that I posted on that other thread. The test string length was 228 bytes instead of the 128 bytes we are using here, so keep that in mind when comparing timings. It also doesn't require that you pad the end with 8 or 16 zeros, but it does require the strings to have 16-byte alignment. I'll see if I can dig up the MMX version later, because I didn't post it to the forums; I checked. But for the MMX version we can do a quick conversion and see how many cycles per byte we are talking about, because I give timing info for both the MMX and SSE2 versions.
168 cycles / 228 bytes ≈ 0.737 cycles per byte for the MMX version I need to dig up off my home computer (I didn't post it to the thread).
68 cycles / 228 bytes ≈ 0.298 cycles per byte for the SSE2 version I posted below.
So theoretically the MMX version should do 128 bytes in 168 * 128 / 228 ≈ 94.3 cycles,
and the SSE2 version in 68 * 128 / 228 ≈ 38.2 cycles.
Obviously putting an SSE2 version of szCopy into the library is out of the question; not as many people could use it. But I think doing an MMX version won't be a problem. I don't know anyone who has a CPU that doesn't support MMX.
Quote
Donkey posted some strlen code under the GoAsm forum. I shaved 48 cycles off his code by doing something similar to what I did with your stuff, Michael. I just now converted his MMX code to SSE2; it won't work on a P3 or below. The MMX version of Donkey's code that I changed to run faster on the P4 finds the strlen of a 228-byte string in 168 cycles. The version below using SSE2 finds it in 68 cycles!!! So a BIG improvement.
Code:
szLenSSE2 proc near pString:ptr byte
mov eax,[pString] ; mov the offset of the string into eax
pxor xmm0,xmm0 ;remove dependencies
pxor xmm2,xmm2 ;remove dependencies
pxor xmm1,xmm1 ;fill XMM reg 1 with NULLs
pxor xmm3,xmm3 ;FILL XMM reg 3 with NULLs
;align 16 ; align for code isn't needed on the P4
; The following loop starts at exactly 16 bytes into the proc
; it is also exactly 16 bytes long, this makes for optimal execution
szlen_loop:
movdqa xmm0,[eax] ; mov 16 bytes of the string into XMM reg 0
pshufd xmm2,[eax+16],0e4h ; mov the next 16 bytes of the string into XMM reg 2
pcmpeqb xmm0,xmm1 ; cmp each byte with NULL, set the byte to FFh if equal
pcmpeqb xmm2,xmm3 ; cmp each byte with NULL, set the byte to FFh if equal
pmovmskb ecx,xmm0 ; This moves the most signf. bit of each byte to ecx
add eax,32 ; increment the string address
pmovmskb edx,xmm2 ; This moves the most signf. bit of each byte to edx
or ecx,ecx ; Was any bit set ?
jnz end_loop
or edx,edx
jz szlen_loop ; no zero was found, so continue the loop
mov ecx,edx ; Move EDX into ECX for the BSF later
add eax,16 ; the NULL is in the second 16-byte block
end_loop:
sub eax,[pString] ; subtract the original address
bsf ecx,ecx ; find out which bit is set (where the NULL is)
sub eax,32 ; subtract the last increment from eax
add eax,ecx ; add the number of bytes to the NULL
RET
szLenSSE2 endp
I got my version down to 127 cycles on a P3, using 486-compatible code. I have no way to test on a P4.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str2 db 128 dup(0)
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
invoke SzCopy_dw,ADDR str1,ADDR str2
print ustr$(eax)
print chr$(13,10)
print ADDR str1
print chr$("<",13,10)
print ADDR str2
print chr$("<",13,10,13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR str1,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
mov eax,lpsrc
mov edx,lpdest
sub eax,4
sub edx,4
@@:
REPEAT 8 ; unroll x 8
mov ebx,[eax+4] ; get next 4 bytes
add eax,4
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
add edx,4
test ecx,80808080h ; test for any -1 bytes
jnz @F ; jump if zero byte present
mov [edx],ebx ; otherwise copy 4 bytes
ENDM
jmp @B
@@:
mov cl,[eax] ; finish up moving bytes
add eax,1
mov [edx],cl
add edx,1
test cl,cl
jnz @B
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
[attachment deleted by admin]
MichaelW
That's a real nice routine. Unfortunately, it gets much slower if there is a high ascii character near the beginning of the string :( Maybe if you replace the last part with one of the other routines when high ascii is found?
Michael, here are the timings on my work P4. I believe it's a 1.5
127
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <
148 clocks
Press enter to exit...
I'll take a stab at speeding it up some too. Yours is as generic as Hutch's, and faster.
EDIT: I haven't tried this since I have a P4, but Michael, have you tried adding an align for the inner loop and for the procedure?
I tried a number of things, and a few of them gave me some speedups.
First, the last part re-reads the last dword a byte at a time when it is already in a register.
hi_mom:
shr ebx,8
@@:
;the dword is already IN a register, so why re-read it from memory?
; mov cl,[eax] ; finish up moving bytes
; add eax,1
mov [edx],bl
add edx,1
test bl,bl
jnz hi_mom
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
Second, I tried unrolling the loop 128 times with the original algorithm (not including any other changes), and I got it from 148 cycles down to 114 cycles. So if we pick some middle ground, we can probably get a decent speedup from this.
Third, we can try removing the Epilogue and Prologue. That is usually good for a few cycles.
Fourth, I re-wrote the inner loop to work differently. It now handles two dwords in semi-parallel. It runs about 8 cycles faster than the original code, and with all the other stuff I posted above, it adds up.
SzCopy_dw proc uses ebx esi edi lpsrc:DWORD,lpdest:DWORD ; esi and edi are used below, so preserve them
mov eax,lpsrc
mov edx,lpdest
@@:
REPEAT 128 ; unroll x 128
mov ebx,[eax] ; get next 4 bytes
mov esi,[eax+4]
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
add eax,4
test ecx,80808080h ; test for any -1 bytes
jnz @F ; jump if zero byte present
lea edi,[esi-01010101h] ; subtract 1 from each byte
add eax,4
mov [edx],ebx ; otherwise copy 4 bytes
add edx,4
test edi,80808080h ; test for any -1 bytes
jnz @F ; jump if zero byte present
mov [edx],esi ; otherwise copy 4 bytes
add edx,4
ENDM
jmp @B
@@:
sub eax,4
@@:
mov cl,[eax] ; finish up moving bytes
add eax,1
mov [edx],cl
add edx,1
test cl,cl
jnz @B
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
Fifth, for non-P4 processors aligning the inner loop should hopefully give a speed up.
Hutch :lol
"Lingo,
It looks good, how does it perform on
unaligned string data as source and destination and lengths that are not 4 byte boundaries ?"
My algos are for ASM professionals rather than for newbies...
MichaelW :lol
1. This is buliaNaza's algo rather than A.Fog's algo
A.Fog's algo has 2 additional instructions in the main loop:
XOR EBX, -1 ; invert all bytes
AND ECX, EBX ; and these two
I know it very well because a long time ago I improved buliaNaza's algo:
http://board.win32asmcommunity.net/index.php?topic=8330.msg77056#msg77056
2. As
Jimg stated above
Quote
.... Unfortunately, it gets much slower if there is a high ascii character near the beginning of the string
Maybe if you replace the last part with one of the other routines when high ascii is found?
That is true. "High ascii characters" are characters above ASCII code 80h, and
if we have them in our string your algo will drop into the "slow" last part, copying one byte at a time rather than a dword...
Regards,
Lingo
:bg
Lingo,
> My algos are for ASM professionals rather than for newbies...
I gather this means that it does not handle byte-aligned data and does not handle lengths that are not multiples of 4. This is not a criticism, as the algo looks good and it should be fast as the testing shows, but it's not a general-purpose algorithm, because byte data is usually of random length. One of the normal tasks for zero-terminated string copy is to append to the end of a buffer where the location is not 4-byte aligned and the string being copied to it is not a multiple of 4 in length.
Compliments though, for its task range it is a very good algo design.
Jim,
just for my reference, what model Athlon are you testing on? My last ancient AMD died recently and I have a Sempron 2.4 in its place, which delivers similar timing spacing to the two PIVs I use, so it's worth me knowing this for writing general-purpose code that is basically mixed-model algo design.
Hey Hutch--, there was one other problem with Lingo's code. It writes in multiples of 8 bytes and doesn't handle string sizes that are not divisible by 8, so it copies extra data past the end of the source string into the destination. In addition, if the destination string was dynamically allocated you might be writing to unowned memory and cause a crash. In either case you will be writing past the destination string unless it is always 7 or more bytes longer than the source.
Here's the MMX code. I accidentally posted a string length routine in the previous post. I haven't finished optimizing this yet, but I figured I'd post it to see if anyone can use it for their own ideas. It does properly handle string sizes that are not divisible by 8. It runs in 107 cycles on my P4 for the 128-byte sample string.
szCopyMMX proc near uses esi src_str:DWORD, dst_str:DWORD ; esi is clobbered below, so preserve it
mov eax,[src_str]
mov esi,[dst_str]
align 16
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
pcmpeqb mm1,mm0
add eax,8
pmovmskb ecx,mm1
or ecx,ecx
jnz finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
finish_rest:
sub eax,8
byte_copy:
movzx ecx,byte ptr [eax]
add eax,1
mov [esi],cl
add esi,1
test cl,cl
jz zero_byte
jmp byte_copy
zero_byte:
; emms
RET
szCopyMMX endp
Mark,
This is looking good. Very few machines lack MMX capability, so it would be close to general purpose and certainly fast enough. Now if I understand correctly, the use of MOVQ would require at least 8-byte-aligned memory in both source and destination, so it cannot be used with byte-aligned data.
Jim,
Thanks for pointing out my stupidity :bg As it turns out, and as I see Lingo has now pointed out, that is what the extra instructions in Agner Fog's algorithm were for. Using another method after the first high-ascii character did not seem too promising, so I added the two additional instructions from Agner Fog's algorithm; the result bytes are now 80h for a null byte and zero for any other byte value.
Mark,
Thanks for the suggestions. Running on a P3, with the sample strings, the optimum unroll is 8. Higher values slow the code down. The current code copies the sample string in 170 cycles. The dword loop still contains a dependency, but I can't see any reasonable way to eliminate it.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
strx db 80h,"ample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str2 db 128 dup(0)
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
invoke SzCopy_dw,ADDR str1,ADDR str2
print ustr$(eax)
print chr$(13,10)
print ADDR str1
print chr$("<",13,10)
print ADDR str2
print chr$("<",13,10,13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR str1,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR strx,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to detect the terminating null, without
; detecting the first high-ascii character as a null, was
; adapted from the MASM32 STRLEN procedure, which was adapted
; from an algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
; Corrected another mistake that MASM was actually ignoring:
; SzCopy_dw proc uses ebx edi lpsrc:DWORD,lpdest:DWORD
SzCopy_dw proc lpsrc:DWORD,lpdest:DWORD
push ebx
mov eax,[esp+8] ; lpsrc
mov edx,[esp+12] ; lpdest
align 4
@@:
REPEAT 8 ; unroll x 8
mov ebx,[eax] ; get next 4 bytes
add eax,4
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
not ebx ; invert all bytes
and ecx,ebx ; and each byte with inversion
not ebx ; restore bytes
and ecx,80808080h ; bytes=80h for null,otherwise 0
jnz @F ; jump if zero byte present
mov [edx],ebx ; copy 4 bytes
add edx,4
ENDM
jmp @B
@@:
mov [edx],bl ; move first byte
add edx,1
test bl,bl
jz @F ; jump if was null
mov [edx],bh ; move next byte
add edx,1
test bh,bh
jz @F ; jump if was null
shr ebx,16
mov [edx],bl ; move next byte
add edx,1
test bl,bl
jz @F ; jump if was null
mov [edx],bh ; move last byte
@@:
mov eax,edx ; return length w/o null
sub eax,[esp+12] ; lpdest
pop ebx
ret 8
SzCopy_dw endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
[attachment deleted by admin]
Quote from: hutch-- on May 12, 2005, 04:13:06 AM
Mark,
This is looking good, very few machines don't have MMX capacity so it would be close to general purpose and certainly fast enough. Now if I understand correctly, the use of MOVQ would require at least 8 byte aligned memory in both source and destination so it cannot be used with byte aligned data.
It doesn't require 8-byte-aligned source and destination strings; it just runs slower, like a dword algorithm on unaligned dword data. Only SSE and SSE2 require aligned source and destination, otherwise they generate an exception. I also got it down to 86 cycles last night after posting it. I need to check for bugs and make sure it works correctly before posting the changes.
Michael-
Good job. The best non-fpu routine so far (that works in all conditions). On my athlon, the best was 7 repeats,7=162 cycles, 8=171, 9=168, 6=168, 4=166, 3=188.
Mark_Larson
Mark - is it ever possible to get a page fault or other error reading the string with movq, when there is only 1 byte left and it happens to be at the end of one's address space? Or is there always enough extra space that it could never happen?
Quote from: Jimg on May 12, 2005, 02:00:08 PM
Michael-
Good job. The best non-fpu routine so far (that works in all conditions). On my athlon, the best was 7 repeats,7=162 cycles, 8=171, 9=168, 6=168, 4=166, 3=188.
Jim, MMX is also non-fpu. I think you meant "486 compatible" code.
Hutch, what about having two routines, an szCopy and an szCopyMMX, and letting the programmer pick which one to use? The programmer best knows the minimum processor his code has to support.
I did some bench testing of unaligned MMX. I used the current version I have and wrote some fancy batch files to auto-check all the cases. The current version I have runs in 87 cycles at work, so I am going to use that as my baseline. I am using Michael's posted code, except I changed the timing macros to use REALTIME.
I made some changes to the ASM file and to the batch file that builds the code. First let's look at the changes to the assembler code. You pass DO_TESTING to ML.EXE using the /D switch to make it do the "special" build that forces misalignment. You also pass FRED_VALUE to ML.EXE with values ranging from 1 to 7. So the code snippet below gets both DO_TESTING and FRED_VALUE from the command line via the /D switch. Because of the "ALIGN 16" right before the DB, this forces the string to be 1 to 7 bytes off of a 16-byte alignment.
align 16
ifdef DO_TESTING
db FRED_VALUE dup (0)
endif ; ifdef DO_TESTING
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
I created 7 different batch files that each pass in the values ranging from 1 to 7 for FRED_VALUE. Here is an example of setting FRED_VALUE=2, and it also turns DO_TESTING on.
\masm32\bin\ml /D"DO_TESTING=1" /D"FRED_VALUE=2" /c /coff "szcopytest.asm"
So now let's look at some timings. I ran each one 5 times and took the fastest time running at REALTIME priority.
1 byte off of 16 byte alignment 275 clocks
2 byte off of 16 byte alignment 266 clocks
3 byte off of 16 byte alignment 283 clocks
4 byte off of 16 byte alignment 256 clocks
5 byte off of 16 byte alignment 284 clocks
6 byte off of 16 byte alignment 282 clocks
7 byte off of 16 byte alignment 284 clocks
8 byte off of 16 byte alignment 87 - ran this one manually outside of the process to verify it goes back to being the same speed.
8 bytes off of a 16 byte alignment means it's 8 byte aligned, which is all MMX requires for speed.
So the moral of the story: ALIGN, ALIGN, ALIGN (which is exactly what I said on my webpage). However, the penalty for an unaligned dword is a lot smaller than for unaligned MMX.
MarkL-
QuoteJim, MMX is also non-fpu. I think you meant "486 compatible" code.
:red (too old I guess)
pmovmskb is an instruction requiring .xmm pseudo, correct? (xmm is not sse, right? it's so confusing ::) )
Also - I tried inserting various numbers of bytes before the test string to observe the effect. On an Athlon, there was minimal difference with or without the extra bytes. I was using the routine you entered above, which runs at about 150 cycles; where is the new one that runs in 57 cycles?
.mmx = MMX as normal
.xmm = MMX, SSE and SSE2
Quote from: Jimg on May 12, 2005, 04:10:19 PM
pmovmskb is an instruction requiring .xmm pseudo, correct? (xmm is not sse, right? it's so confusing ::) )
Yep pmovmskb was added with SSE support. If I wanted to make it exclusively MMX I'd have to change that to some other instruction.
XMM is for both SSE and SSE2. Things get confusing because when Intel added SSE support they also added additional MMX instructions, and they did a similar thing with SSE2, adding some new MMX instructions along with the new double-precision floating point support. So when I say "SSE" support I mean both the MMX instructions they added for SSE as well as the single-precision floating point instructions. You have to use .xmm to use "pmovmskb", even though it's not a single-precision floating point instruction, because it was added as part of the P3's "SSE support". Make sense? The way you tell is to pull up book 2 (the instruction set reference) and go to Appendix B; they have all the instructions broken up by type. The ones added with "SSE" support, whether floating point or MMX, are under the sub-heading "SSE" in Appendix B. Very handy. I use that all the time to see which processors support which instructions.
Quote
Also - I tried inserting various numbers of bytes before the test string to observe the effect. On an Athlon, there was minimal difference with or without the extra bytes. I was using the routine you entered above, which runs at about 150 cycles; where is the new one that runs in 57 cycles?
Thanks for trying that. I had already planned on switching to AMD with my next purchase, but that's another reason why. From a programmer's perspective, AMD is a lot better. It's 87 cycles, and it hasn't been fully checked for bugs, so I hate posting. It isn't very professional, and people usually jump all over you :). I guess I can post it if everyone realizes I haven't finished checking it for bugs yet.
Quote from: AeroASM on May 12, 2005, 05:33:20 PM
.mmx = MMX as normal
.xmm = MMX, SSE and SSE2
This is correct. I use .xmm in all my programs and it enables MMX, SSE, and SSE2, which is just a lot easier. This, in my opinion, is a bug: try the following, and it'll give you errors if you try doing SSE or SSE2 code. The reason is that .xmm and .mmx aren't additive; only the last one found in the code takes effect. I found that out the hard way, so I just use .xmm in all my code to be extra sure. You also need the ".xmm" AFTER your "option casemap" or it will start complaining that the code isn't in all CAPITAL letters (ewww).
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.xmm
.mmx ; ignores .xmm and just enables MMX support.
Take the code with a grain of salt, since I have not finished debugging it yet. The only string I have tested it with is the one in Michael's test code. It works fine with that, but with all the "special cases" I added, I need to do a lot more testing before it's ready.
I am only showing the portion I changed. I modified the second loop, which copies byte by byte when there are fewer than 8 bytes left to copy.
Because of PCMPEQB and PMOVMSKB, each bit in ECX is 1 where the corresponding byte was a 0, and 0 everywhere else. So BSF lets me find which bit is set in a single instruction, which corresponds to where the 0 is. If there is more than one 0 in a row, the first one is the one we care about, and BSF scans from the low end of the DWORD, so it finds the very first one (which is the only one we care about). After determining which bit is set, I jump to special-case code to handle the different numbers of bytes left.
bsf ecx,ecx
cmp ecx,7
je do_7
cmp ecx,6
je do_6
cmp ecx,5
je do_5
cmp ecx,4
je do_4
cmp ecx,3
je do_3
cmp ecx,2
je do_2
jmp do_1 ; ecx is 0 or 1 here; falling into do_7 would be wrong
do_7:
mov ecx,[eax-8]
mov edx,[eax+4-8] ;really copy 8 bytes to include the 0
mov [esi],ecx
mov [esi+4],edx
ret 8 ; ret
do_6:
mov ecx,[eax-8]
movzx edx,word ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dx
mov byte ptr [esi+6],0
ret 8 ; ret
do_5:
mov ecx,[eax-8]
movzx edx,byte ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dl
mov byte ptr [esi+5],0
ret 8 ; ret
do_4:
mov ecx,[eax-8]
mov [esi],ecx
mov byte ptr [esi+4],0
ret 8 ; ret
do_3:
movzx ecx,word ptr[eax-8]
movzx edx,byte ptr[eax+2-8]
mov [esi],cx
mov [esi+2],dl
mov byte ptr [esi+3],0
ret 8 ; ret
do_2:
movzx ecx,word ptr[eax-8]
mov [esi],cx ; the two bytes were loaded but never stored
mov byte ptr [esi+2],0
ret 8 ; ret
do_1:
cmp ecx,1
jne do_0
movzx ecx,word ptr [eax-8] ; one byte plus its terminator
mov [esi],cx
ret 8 ; ret
do_0:
mov byte ptr [esi],0 ; empty string: just the terminator
ret 8 ; ret
QuoteThanks for trying that. I had already planned on switching to AMD with my next purchase, but that's another reason why. From a programmer's perspective, AMD is a lot better.
The Athlon works well enough for me, but it's really sensitive to code alignment, which is much more difficult to fix than data alignment, imho. And I can't get anywhere near your 87 cycles. The code change you just presented dropped it from 149 to about 132. I'm using an Athlon XP 3000+ at 2.16 GHz with 1 GB of RAM. Perhaps I have something else amiss.
Quote from: Jimg on May 13, 2005, 12:58:30 AM
QuoteThanks for trying that. I had already planned on switching to AMD with my next purchase, but that's another reason why. From a programmer's perspective, AMD is a lot better.
The Athlon works well enough for me, but it's really sensitive to code alignment, which is much more difficult to fix than data alignment, imho. And I can't get anywhere near your 87 cycles. The code change you just presented dropped it from 149 to about 132. I'm using an Athlon XP 3000+ at 2.16 GHz with 1 GB of RAM. Perhaps I have something else amiss.
In general the P4 runs MMX/SSE/SSE2 a lot faster than other processors. Intel actually did a good job with it. I also did one other thing to speed it up. I got rid of the epilogue and prologue. I forgot to post that. It usually saves some cycles. In case I didn't post it, align the inner loop to 16 bytes, and see how well it does.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[esp+4]
mov esi,[esp+8]
szCopyMMX endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: Mark_Larson on May 13, 2005, 01:28:31 AM
In general the P4 runs MMX/SSE/SSE2 a lot faster than other processors. Intel actually did a good job with it. I also did one other thing to speed it up. I got rid of the epilogue and prologue. I forgot to post that. It usually saves some cycles. In case I didn't post it, align the inner loop to 16 bytes, and see how well it does.
There is only a cycle or two difference when using the prologue/epilogue code on the Athlon. Certainly not enough to bother with.
After extensive testing, inserting this runs the fastest-
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,src_str
mov esi,dst_str
jmp qword_copy
db 7 dup (0) ;7=159
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
.
.
.
Quote
That is equivalent to doing an "ALIGN 16" on the inner loop, which is what I kept trying to get you to try :) Here's why it is the same as an align (without the jump of course). Let's look at all the code before the first loop; it consists of exactly two lines of code.
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[src_str]
mov esi,[dst_str]
;THIS gets translated into the following. The first two lines are the "prologue"
; the second two lines are the first two lines of the code. You will notice that it
; also gives the opcodes for the instructions.
00401260 55 push ebp
00401261 8B EC mov ebp,esp
00401263 8B 45 08 mov eax,dword ptr [ebp+8]
00401266 8B 75 0C mov esi,dword ptr [ebp+0Ch]
push ebp = 55h
mov ebp,esp = 8Bh ECh
mov eax,dword ptr [ebp+8] = 8Bh 45h 08h
mov esi,dword ptr [ebp+0Ch] = 8Bh 75h 0Ch
If you add up the sizes of the opcodes in those 4 instructions, it comes out to 9 bytes. Now if you remember, you added 7 bytes just before the label "qword_copy"; 9+7 = 16 bytes. Since the first line of the routine is 16-byte aligned, this forces the "qword_copy" label to also be 16-byte aligned. It's just easier to use ALIGN 16 in front of the label.
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[src_str]
mov esi,[dst_str]
align 16
qword_copy:
Right. But it measures 8 cycles longer to work through the nop's rather than the jump.
Quote from: Jimg on May 13, 2005, 04:43:59 AM
Right. But it measures 8 cycles longer to work through the nop's rather than the jump.
That's a common fallacy people have about how ALIGN works in assemblers. They don't always stick NOPs in there to "pad" to a certain alignment; they use different-sized "NOP"s for different numbers of pad bytes. One of the 7-byte ones they use looks like this. That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that. Basically, when I say "NOP" in regards to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction. The instruction below does nothing, but it uses up 7 bytes.
00401269 8D A4 24 00 00 00 00 lea esp,[esp]
So if it is coming out 8 cycles faster I'd try a few things.
1) Make sure your timings are accurate to within 8 cycles. Try two different things to make sure: move the priority class from HIGH to REALTIME, and bump up the number of loops by a factor of 10. I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more. It should take 5 to 10 or more seconds to run with the extra loops and REALTIME priority class. If it's still running a lot faster than that, bump up the number of loops again. It takes a long time to run, but it will give you an extremely accurate number of cycles.
2) Do the ALIGN 16 at the loop but add a JMP right before it. That does exactly the same thing you are doing with the JMP and DBs. You did TWO things to try and make it faster, not one, so I am removing one of them to see which one made it faster. The advantage of ALIGN is that if I go back and add any other instruction between the routine's entry and the first loop, it will still work, whereas you'd have to re-tweak the number of DBs to get it to work right. Make sense? I just woke up, so my brain isn't awake yet, so I am not sure I am explaining it right.
jmp qword_copy
align 16
qword_copy:
Getting back to optimization, also make sure the strings are aligned as well. I do it on a 16-byte boundary in case I do SSE/SSE2 later.
align 16
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
str2 db 128 dup(0),0
Quote from: Mark_Larson on May 13, 2005, 01:23:25 PM
That's a common fallacy people have about how ALIGN works in assemblers. They don't always stick NOPs in there to "pad" to a certain alignment; they use different-sized "NOP"s for different numbers of pad bytes. One of the 7-byte ones they use looks like this. That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that. Basically, when I say "NOP" in regards to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction. The instruction below does nothing, but it uses up 7 bytes.
00401269 8D A4 24 00 00 00 00 lea esp,[esp]
Yes, that's exactly the code being inserted.
QuoteSo if it is coming out 8 cycles faster I'd try a few things.
1) Make sure your timings are accurate to within 8 cycles. Try two different things to make sure: move the priority class from HIGH to REALTIME, and bump up the number of loops by a factor of 10. I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more. It should take 5 to 10 or more seconds to run with the extra loops and REALTIME priority class. If it's still running a lot faster than that, bump up the number of loops again. It takes a long time to run, but it will give you an extremely accurate number of cycles.
Yes, I was using Realtime, and went from 1000000 to 10000000 with no difference.
Quote
2) Do the ALIGN 16 at the loop but add a JMP right before it. That does exactly the same thing you are doing with the JMP and DBs. You did TWO things to try and make it faster, not one, so I am removing one of them to see which one made it faster. The advantage of ALIGN is that if I go back and add any other instruction between the routine's entry and the first loop, it will still work, whereas you'd have to re-tweak the number of DBs to get it to work right. Make sense? I just woke up, so my brain isn't awake yet, so I am not sure I am explaining it right.
jmp qword_copy
align 16
qword_copy:
Ok, I tried that. Still slower. In this case, it inserted the instruction
05 00000000 ADD EAX,0
Quote
Getting back to optimization, also make sure the strings are aligned as well. I do it on a 16-byte boundary in case I do SSE/SSE2 later.
align 16
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
str2 db 128 dup(0),0
Yes, I did that also. I've attached my test code so you can see what I'm doing. Sorry for the mess, it's a work in progress :wink
My results are printed in the description when the program runs.
align 16 only - 158
jmp and db 7 dup (0) - 149
jmp and align 16 - 159
The only explanation I can think of is that the Athlon loads the lea esp,[esp] instruction in the prefetch and thinks about it for awhile, but the zeros require no thought :dazzled:
[attachment deleted by admin]
For shoots and grins I compared both the "JMP and DB" and the "JMP and ALIGN" to the normal code execution time. They all execute at the same speed on the P4, which was what I was expecting. I know the P4 extremely well, but I don't know AMD as well, since I have never owned an AMD. I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst. Have you tried either to see if you can find a clue in there?
Quote
I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst. Have you tried either to see if you can find a clue in there?
Nope. I'll take a look tonight.
Found the optimization PDF, and an online HTML page for optimization. The one I grabbed was for the AMD64; it should be similar to the Athlon XP.
http://63.204.158.36/amd/optimization/wwhelp/wwhimpl/js/html/wwhelp.htm
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
I think I found the slowdown in the central loop for AMD. I'll post more shortly.
EDIT: Here's the update.
One of the things I do when trying to optimize for a specific processor is write down timing information (latency, execution unit, etc.). For AMD you don't want to use instructions that go through VectorPath. If you notice, PMOVMSKB uses VectorPath.
qword_copy:
pxor mm1,mm1 ; DirectPath FADD/FMUL 2
movq mm0,[eax] ; DirectPath FADD/FMUL/FSTORE 4
pcmpeqb mm1,mm0 ; DirectPath FADD/FMUL 4
add eax,8 ; DirectPath ALU 1
pmovmskb ecx,mm1 ; VectorPath FADD/FMUL 3
or ecx,ecx ; DirectPath ALU 1
jnz finish_rest ; DirectPath ALU 1
movq [esi],mm0 ; DirectPath FSTORE 2
add esi,8 ; DirectPath ALU 1
jmp qword_copy ; DirectPath ALU 1
finish_rest:
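For readers who don't live in MMX, here is what the scan loop above computes, sketched in portable C (the function name is mine, not from the thread): PCMPEQB against an all-zero register turns each byte into FFh when it equals zero, and PMOVMSKB collapses the top bit of each byte into one mask bit, so a nonzero mask means the 8-byte group contains the terminator.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates PCMPEQB-against-zero followed by PMOVMSKB on one 8-byte
   group: bit i of the result is set exactly when byte i is zero. */
static unsigned zero_byte_mask(const uint8_t b[8])
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        if (b[i] == 0)        /* pcmpeqb byte -> FFh, top bit set */
            mask |= 1u << i;  /* pmovmskb gathers the top bits */
    return mask;
}
```

The `jnz finish_rest` then fires on any nonzero mask, and BSF on the mask (used in the endings posted later in the thread) yields the index of the first zero byte.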
You can try using PSADBW instead. It is DirectPath and runs in 3 cycles. It subtracts two MMX registers, taking the absolute value, and then sums the bytes in a register. By making the second register all 0's, you basically get a way of doing a horizontal sum within a register (COOL!). After the PCMPEQB every single byte in MM1 is going to be an "FF" or a "00", nothing else. So after the PSADBW you can move the result into a CPU register using MOVD. If the result is not 7F8h then you know you have a zero-value in there somewhere. The 7F8h comes from adding 8 FF's together, which is the result if all the bytes are non-zero. I have not verified the code works, so it might need testing.
pcmpeqb mm1,mm0 ; DirectPath FADD/FMUL 4
; you can actually move the PXOR outside the loop.
pxor mm2,mm2 ; DirectPath FADD/FMUL 2
add eax,8 ; DirectPath ALU 1
; pmovmskb ecx,mm1 ; VectorPath FADD/FMUL 3
psadbw mm1,mm2 ; DirectPath FADD 3
movd ecx,mm1 ; Double 4
; or ecx,ecx ; DirectPath ALU 1
cmp ecx,7F8h
; jnz finish_rest ; DirectPath ALU 1
jb finish_rest
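To sanity-check the 7F8h constant in the snippet above, here is the PSADBW arithmetic emulated in portable C (a sketch of the byte math only, not the MMX code; the function name is mine). Each byte of the compare mask is FFh or 00h, and summing eight FFh bytes gives 8 * 255 = 7F8h; note that the sum reaches 7F8h only when every byte of the mask is FFh, which is why a later post in the thread switches to a plain nonzero test on the result.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates PCMPEQB against zero followed by PSADBW against an all-zero
   register: build the FFh/00h compare mask, then take the sum of
   absolute differences from zero, i.e. the sum of the mask bytes. */
static unsigned mask_sad(const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (b[i] == 0) ? 0xFFu : 0x00u;
    return sum; /* 0x7F8 only when all eight data bytes are zero */
}
```

With no zero byte in the group the sum is 0, a single zero byte gives 0xFF, and so on, so any nonzero sum signals that the terminator is in this group.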
You can also use PACKUSWB, which is also DirectPath and takes 2 cycles, and follow that by a MOVD to a CPU register and compare it to 0FFFFFFFFh, which is what it should be if no bytes are 0. I also have not verified this works, so take it with a grain of salt.
pcmpeqb mm1,mm0 ; DirectPath FADD/FMUL 4
add eax,8 ; DirectPath ALU 1
; pmovmskb ecx,mm1 ; VectorPath FADD/FMUL 3
packuswb mm1,mm2 ; DirectPath FADD/FMUL 2
movd ecx,mm1 ; Double 4
; or ecx,ecx ; DirectPath ALU 1
cmp ecx,0FFFFFFFFh
jnz finish_rest ; DirectPath ALU 1
Quite entertaining, but several times slower :'(
Edited-
Scratch that, I need to do more testing.
Ok, I think I got it now.
First example:
pxor mm1,mm1
pxor mm2,mm2
qword_copy:
movq mm0,[eax]
pcmpeqb mm1,mm0 ;DirectPath FADD/FMUL 4
add eax,8 ; DirectPath ALU 1
; pmovmskb ecx,mm1 VectorPath FADD/FMULl 3
psadbw mm1,mm2 ;DirectPath FADD 3
movd ecx,mm1 ;Double 4
or ecx,ecx ; DirectPath ALU 1
; cmp ecx,7F8h
jnz finish_rest ;DirectPath ALU 1
; jb finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
the cmp ecx,7F8h/jb always jumped so I changed it as above.
runs in 156 vs. previous best of 149
Example 2:
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
pcmpeqb mm1,mm0 ;DirectPath FADD/FMUL 4
add eax,8 ;DirectPath ALU 1
; pmovmskb ecx,mm1 VectorPath FADD/FMULl 3
packuswb mm1,mm2 ;DirectPath FADD/FMUL 2
movd ecx,mm1 ;Double 4
or ecx,ecx ;DirectPath ALU 1
; cmp ecx,0FFFFFFFFh
jnz finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
I either don't understand how PACKUSWB is supposed to work or it doesn't work in this context.
packuswb of the word 7700h gives 00h, not 77h
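As far as I can tell the culprit is saturation. PACKUSWB packs signed words to unsigned bytes, so any compare-mask word whose high byte is FFh is a negative word and saturates to 00h, losing the match. PACKSSWB (the signed pack, which appears in code later in the thread) keeps every nonzero mask word nonzero. Here is a portable C sketch of the per-word rules, under my reading of the pack instructions' semantics (the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Per-word saturation rules:
   PACKUSWB: signed 16-bit word -> unsigned byte (clamp to 0..255)
   PACKSSWB: signed 16-bit word -> signed byte  (clamp to -128..127) */
static uint8_t packus_word(int16_t w)
{
    if (w < 0)   return 0x00;
    if (w > 255) return 0xFF;
    return (uint8_t)w;
}

static uint8_t packss_word(int16_t w)
{
    if (w < -128) return 0x80; /* -128 */
    if (w > 127)  return 0x7F;
    return (uint8_t)(int8_t)w;
}
```

A compare-mask word can only be 0000h, 00FFh, FF00h, or FFFFh. PACKUSWB maps those to 00h, FFh, 00h, 00h, silently dropping two of the three "found it" cases, while PACKSSWB maps them to 00h, 7Fh, 80h, FFh, all distinguishable from zero.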
I analyzed your new ending, and here is the same thing rewritten as an alternate ending for you. Probably slightly slower on strings that are not a multiple of 4 bytes including the 0-byte terminator. Runs in 130 here.
finish_rest:
bsf ecx,ecx
cmp ecx,4
jb lowerx
mov edx,[eax-8]
mov [esi],edx
lowerx:
mov edx,[eax+ecx-11] ; misaligned 3/4 of the time but a lot less code.
mov [esi+ecx-3],edx
ret
93 cycles with MMX
96 cycles with XMM
Why longer with XMM?
Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz
[attachment deleted by admin]
I just realized my previous post of a different ending could corrupt the string preceding the destination for source strings less than 4 bytes long. Not good at all :(
Aero,
Did you try to run the code that you posted? On my system (a P3) the MMX procedure hangs, and when it does, with REALTIME_PRIORITY_CLASS, it takes Windows down with it. This is why I have recently been posting code with HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, to help the user recover from any mistakes I may have made.
BTW, you could save at least some of us some time and effort if you would indicate that the code requires MASM 7, or add-on macro support for the earlier versions.
Ok, here's an ending without the bugs:
finish_rest:
bsf ecx,ecx
cmp ecx,3
jb lowerx
mov edx,[eax-8] ; save first 4 bytes
mov [esi],edx
mov edx,[eax+ecx-11] ; do the rest
mov [esi+ecx-3],edx ; faster to do it than test
ret
lowerx: ; was 2 or 1 or 0
test cl,cl
jnz ItsOneOrTwo
mov byte ptr [esi],0 ; was zero, just save terminator
ret
ItsOneOrTwo:
movzx edx,word ptr[eax-8] ; either xx0? or x0??
mov [esi],dx
cmp ecx,2 ; ecx still holds the BSF index (the movzx must not clobber it)
je do_2
ret
do_2: ; xx0? ????
mov byte ptr [esi+2],0
ret ; ret
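For anyone trying to follow the offsets in the ending above, here is the same tail logic in portable C (a sketch with my own names; memcpy stands in for the dword moves, and idx is the BSF result, i.e. the offset of the zero byte within the 8-byte block just scanned, with the source pointer already advanced 8 past the block start):

```c
#include <assert.h>
#include <string.h>

/* dst: where the current 8-byte group would have been stored
   src: source pointer already advanced 8 past the block start
   idx: offset of the terminator within the block (0..7)       */
static void copy_tail(char *dst, const char *src, unsigned idx)
{
    const char *blk = src - 8;               /* start of the scanned block */
    if (idx >= 3) {
        memcpy(dst, blk, 4);                 /* first dword */
        memcpy(dst + idx - 3, blk + idx - 3, 4); /* overlapping last dword, includes the 0 */
    } else if (idx == 0) {
        dst[0] = '\0';                       /* empty: just the terminator */
    } else {                                 /* idx is 1 or 2 */
        memcpy(dst, blk, 2);                 /* either xx0? or x0?? */
        if (idx == 2)
            dst[2] = '\0';
    }
}
```

The two dword stores overlap whenever idx is below 7, which is the "misaligned 3/4 of the time but a lot less code" trade mentioned earlier in the thread; the write never goes past the terminator's position plus the dword that covers it.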
AeroASM, :lol
You can use my zip file...
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment *MMX by Lingo*
align 16
@@:
movq MM1, [ecx+eax]
movq [edx+eax-8], MM0
@@2:
movq MM0, [ecx+eax]
pcmpeqb MM1, MM7
packsswb MM1, MM1
movd ebx, MM1
test bl, bl
lea eax, [eax+8]
je @B
@@:
movzx ebx, byte ptr [ecx+eax-8]
add eax, 1
test bl, bl
mov [edx+eax-9], bl
jnz @B
mov ebx, [esp+2*4]
ret 2*4
align 16
SzCpy10 proc SzDest:DWORD, SzSource:DWORD
mov ecx, [esp+2*4] ; ecx = source
xor eax, eax
mov edx, [esp+1*4] ; edx= destination
pxor MM7, MM7
movq MM1, [ecx]
mov [esp+2*4], ebx ; save ebx register
je @@2
SzCpy10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Here are the results:
P4 3.6GHz Prescott
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy10 (Lingo -> MMX): 120 clocks
SzCpy19 (Mark Larson ->MMX):171 clocks
SzCpy11 (Lingo-> SSE): 116 clocks
SzCpy18 (Mark Larson ->SSE):132 clocks
Press ENTER to exit...
Regards,
Lingo
[attachment deleted by admin]
MichaelW: I did test it many times, and it takes about 15 seconds each time on a 1.5GHz.
Aero,
Correction, the procedure does not hang, but with REALTIME_PRIORITY_CLASS Windows did. After a change to HIGH_PRIORITY_CLASS the procedure takes ~40 seconds to run (P3-500), so perhaps this was just too long for REALTIME_PRIORITY_CLASS.
I added a function test using your source string, and another one that substituted a high-ascii character at the start of the source string, and in both cases the string copied OK. The cycle count for the MMX version was a uniform 139. The XMM version generated errors, so I'm guessing it contains at least one instruction that my P3 does not support.
Quote from: lingo on May 16, 2005, 02:35:51 AM
SzCpy10 (Lingo -> MMX): 120 clocks
SzCpy19 (Mark Larson ->MMX):171 clocks
SzCpy11 (Lingo-> SSE): 116 clocks
SzCpy18 (Mark Larson ->SSE):132 clocks
Press ENTER to exit...
Regards,
Lingo
Three things Lingo.
1) You modified my code, yet still list the timings as mine. If you are going to benchmark my code and list it as my code, don't change the code (the SzCpy18 routine). Not to mention this throws off any other comparative timings people are doing with my code.
2) I made an update on the previous page to my code. With the update my code runs in 87 cycles. You didn't pick that up either.
3) You need to run your code in realtime with a few more loops. There's no way the SSE code I posted is running in 132 cycles on your P4, considering the original timing without the update was running at about 100 cycles. When I updated your timing code to use REALTIME and more loops, the code I posted last week runs in 87 cycles, as I was expecting.
EDIT: Here are the timings after I changed my code back to its original form and added the code update from last week, along with the more accurate timings. Now when I run it multiple times, I only get a 1-cycle variance.
128-byte string copy timing results:
SzCpy10 (Lingo -> MMX): 116 clocks
SzCpy19 (Mark Larson ->MMX):119 clocks
SzCpy11 (Lingo-> SSE): 81 clocks
SzCpy18 (Mark Larson ->SSE):87 clocks
Good morning Mark-
May I have a full copy of your latest 87-clock version? The bits and pieces needed to construct it are confusing me. Also, doesn't Lingo's version still suffer from the problem of writing outside the boundary of the destination string if it's exactly the size to receive the source string, and it is not a multiple of 8 bytes?
Mark Larson, :(
Quote
1) You modified my code, yet still list the timings as mine.....
No, I didn't modify it because your code is not so valuable for me and it is not in my style..
I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page
Quote
Why longer with XMM?
Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz
So it should be Mark Larson@AeroASM or not.. ?..
Quote
2) I made an update on the previous page to my code. With the update my code runs in 87 cycles.
Congratulations
Quote
3) You need to run your code in realtime with a few more loops. There's no way the SSE code I posted is running in 132 cycles on your P4
"There's no way the SSE code I posted"
I don't know your "original" code
I just use "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page
" is running in 132 cycles on your P4"
It is the truth... There are people here with P4 Prescott and WinXP Pro SP2 and they can test my file too...
Regards,
Lingo
Quote from: lingo on May 16, 2005, 05:14:38 PM
There is an error in usage of source and destination parameters
in the beginning of the "your" or AeroASM code and that is the reason
for "Why longer with XMM?" question.
It is my implementation of Mark's algorithm.
What is the error?
I have always found some humour in the direction that such topics develop and how far they wander from the original design. The first examples from Mark Jones were general purpose, byte aligned on both source and destination, and could be used under almost any circumstances, which is what unaligned string data requires.
I now see code designs that require complicated starting alignment correction on the source yet still require aligned targets, which reduces the algos to more or less novelties for general purpose byte aligned string data.
I would like to have the luxury of performing OWORD data transfers in an efficient manner, but the latest hardware only barely supports it with a fudge in 64 bit, and all of the later hardware requires at least natural alignment of the data size to run properly.
I would like to thank those who helped with the BYTE aligned code testing, as I have ended up with an algo that is about 90% faster than the original I had in the library and it still satisfies the criterion of being a general purpose algorithm.
:lol
Quote
Hey Hutch--, there was one other problem with Lingo's code. It writes in multiples of 8 bytes, and doesn't handle non-divisible-by-8 string sizes. So it copies extra data past the end of the source string to the destination string. In addition, if the destination string was dynamically allocated you might be writing to unowned memory and cause a crash. In either case you will be writing to variables that exist past the destination string if it isn't always 7 or more bytes longer than the source.
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string? (We don't know the length of the source string.)
2. How long is the buffer with the destination string and how was it allocated? Is it equal to the source buffer or not?
3. Who decided the length of the destination buffer should equal the length of the source string (not the buffer)? (We don't know the length of the source string.)
A. Mark Larson, and I see a lack of programmer's experience here... Why?
Because as a result we must write additional slow code (to copy the last bytes byte by byte). Who is guilty of the slower code?
Mark Larson and all the people that believe him... :lol
Regards,
Lingo
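Whatever one thinks of the tone, the sizing question is real: a copier that always stores whole 8-byte groups can write up to 7 bytes past the terminator, so the destination must be sized for that. A one-line sketch of the rule (the function name is mine, not from the thread):

```c
#include <assert.h>

/* Round len+terminator up to the next multiple of the store width
   (8 here), giving the minimum safe destination size for a copier
   that always writes whole qwords. len excludes the 0 terminator. */
static unsigned long safe_dst_size(unsigned long len)
{
    return ((len + 1) + 7) & ~7UL;
}
```

For the 128-byte test string this gives 136 bytes, i.e. the destination in the benchmark needs 7 bytes of slack after the terminator for the qword versions to be safe.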
Here is the latest incarnation of szCopyXMMX. MMX runs in 81 and XMM in 98 on my Pentium M 1.5GHz (have I said that too many times?)
I am still puzzling over alignment: align szCopyMMX to 16 slows it down to 107.
I am also still puzzling over why tf the XMM is far slower than the MMX.
Aero,
I think it's because internally a 32 bit processor emulates 64 and 128 bit data transfers in 32 bit chunks.
Actually the later Pentiums have 64 bit data buses but I think you are right. Also I think that XMM is not yet very well developed and refined.
AeroASM, :lol
It is the result from your "old" file szCopyXMMX.asm
Quote
C:\0A1>szcopy~1.exe
173 cycles with MMX
132 cycles with XMM
Regards,
Lingo
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Quote from: AeroASM on May 16, 2005, 06:16:21 PM
Here is the latest incarnation of szCopyXMMX. MMX runs in 81 and XMM in 98 on my Pentium M 1.5GHz (have I said that too many times?)
I am still puzzling over alignment: align szCopyMMX to 16 slows it down to 107.
I am also still puzzling over why tf the XMM is far slower than the MMX.
That's because the latency on the P4 for SSE2 is a lot worse than MMX. You can't convert it and always expect it to run faster; there are additional tricks you have to do. I'll take a look at it later, I'm swamped at work at the moment. The general trick is to do TWO of whatever you are doing in the main loop to help break dependencies and give a speed up.
Quote from: lingo on May 16, 2005, 05:14:38 PM
Quote3) You need to run your code in realtime with a few more loops. There's no way the SSE code I posted is running in 132 cycles on your P4
It is the truth... There are people here with P4 Prescott and WinXP Pro SP2 and they can test my file too...
You missed the point Lingo. I compiled and ran your code multiple times and I got large differences in the number of cycles it took to execute. In your own SSE code, I saw a 12 cycle variance. If you are going to do code optimization for low cycle count procedures you shouldn't have more than a 1 or 2 cycle variance.
Quote from: lingo on May 16, 2005, 05:14:38 PM
Mark Larson, :(
Quote
1) You modified my code, yet still list the timings as mine.....
No, I didn't modify it because your code is not so valuable for me and it is not in my style..
I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page
That's cuz Aero made changes to my code. So when you post timings, you need to post that it's my code but modified by Aero.
JimG, here's my latest. I haven't had much chance to tweak since I've been super busy lately. I should be able to drop another 10-15 cycles maybe more after playing with it. I wanted to try PSADBW in place of PMOVMSKB on P4 and see if it's faster. And a few other things.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[esp+4]
mov esi,[esp+8]
align 16
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
pcmpeqb mm1,mm0
add eax,8
pmovmskb ecx,mm1
or ecx,ecx
jnz finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
finish_rest:
;if 0
bsf ecx,ecx
cmp ecx,7
je do_7
cmp ecx,6
je do_6
cmp ecx,5
je do_5
cmp ecx,4
je do_4
cmp ecx,3
je do_3
cmp ecx,2
je do_2
cmp ecx,1
je do_1
mov byte ptr [esi],0 ; index was 0: empty string, just store the terminator
ret 8 ; ret
do_7:
mov ecx,[eax-8]
mov edx,[eax+4-8] ;really copy 8 bytes to include the 0
mov [esi],ecx
mov [esi+4],edx
ret 8 ; ret
do_6:
mov ecx,[eax-8]
movzx edx,word ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dx
mov byte ptr [esi+6],0
ret 8 ; ret
do_5:
mov ecx,[eax-8]
movzx edx,byte ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dl
mov byte ptr [esi+5],0
ret 8 ; ret
do_4:
mov ecx,[eax-8]
mov [esi],ecx
mov byte ptr [esi+4],0
ret 8 ; ret
do_3:
movzx ecx,word ptr[eax-8]
movzx edx,byte ptr[eax+2-8]
mov [esi],cx
mov [esi+2],dl
mov byte ptr [esi+3],0
ret 8 ; ret
do_2:
movzx ecx,word ptr[eax-8]
mov [esi],cx ; store the two characters (this store was missing)
mov byte ptr [esi+2],0
ret 8 ; ret
do_1:
movzx ecx,byte ptr [eax-8]
mov [esi],cl
mov byte ptr [esi+1],0 ; don't forget the terminator
ret 8 ; ret
szCopyMMX endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: AeroASM on May 15, 2005, 05:06:03 PM
93 cycles with MMX
96 cycles with XMM
Why longer with XMM?
Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz
Took me a while to run your code, because I was so busy. I forgot earlier that you said you had a Pentium 4 M. They are more like a P3 in optimization than a P4. A lot of the instructions with long cycle counts on a P4 are lower latency on a P4 M. However you still might want to try doing 2 per loop to break dependencies. I'll play with it if I get a chance.
117 cycles with MMX
97 cycles with XMM
Your modified version of my MMX routine runs 10 cycles slower on my P4 than my original code. But I am sure it runs faster on your P4 M or you wouldn't have posted it. I am better at optimizing for a standard P4, since that's what I have :)
Not Pentium 4-M, but Pentium M. After the P3, Intel souped it up to make the P4, added mobility (battery saving) stuff and made the Pentium M, then put some of the mobility stuff on the P4 to make the P4-M (AFAIK).
You can save clocks by organising the qword loop like this:
Instead of this:
;preliminary stuff
qword_copy:
;stuff 1
jnz somewhere_else
;stuff 2
jmp qword_copy
somewhere_else:
Put this:
;preliminary stuff
jmp copy_start
qword_copy:
;stuff 2
copy_start:
;stuff 1
jz qword_copy
somewhere_else:
This is faster because each pass through the loop executes only the single conditional jump at the bottom, instead of falling through the jnz and then taking the unconditional jmp back.
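The same transformation can be seen outside of asm (a C sketch of my own; the summing body is arbitrary, only the branch structure matters): rotating a top-tested loop into a bottom-tested one entered by a single jump removes the unconditional branch from every iteration.

```c
#include <assert.h>

/* Top-tested shape: in asm this costs a conditional branch plus an
   unconditional jmp back on every iteration. */
static int sum_top_tested(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Rotated shape: jump once to the test, then each iteration falls
   through the body and takes a single conditional branch. */
static int sum_rotated(const int *a, int n)
{
    int s = 0, i = 0;
    goto loop_test;      /* the one-time "jmp copy_start" */
loop_body:
    s += a[i];           /* "stuff 2" */
    i++;
loop_test:
    if (i < n)           /* "stuff 1" */
        goto loop_body;  /* single conditional branch per iteration */
    return s;
}
```

Both shapes compute the same result; optimizing compilers perform exactly this rotation on hot loops for the same reason.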
The Pentium 4-M and Pentium M are the same processor. I just like using "Pentium 4 M" for people who aren't familiar with the different processor lines, so they can see it's a P4 part (has SSE2 support). Though it was designed more from a P3 than a P4, whereas the P4 is a significant departure from the P3. Intel has gotten worse and worse with their processors. Prescott has even longer instruction latencies (eww) than the original P4, and slower L1 and L2 cache access times. One of the things that Intel has always beaten AMD on is L1 cache access latency (2 clock cycles versus AMD's 3). The current Prescotts have 3 cycles of L1 cache latency. The just-released Intel parts are even worse with 4 cycles of L1 cache latency. So reads from memory, even if they are in the L1 cache, are getting more expensive.
I swear they are not, but I am probably mistaken.
Currently there are no desktop systems based off a Pentium M (which is what you were saying). Tom's Hardware modified a desktop motherboard to plug in one of the Pentium M chips so they could use it on the desktop. That's the closest a desktop version has come. The Pentium 4 M and Pentium M are the same processor. Intel has made several comments in the press that they will eventually release a Pentium M for the desktop, but none has come out yet. Because of my job I have to know low-level hardware very well.
You might find this link interesting. I did a search on "Pentium 4-M" on Google. Tom's Hardware is comparing 4 mobile (read: for laptops) Pentium 4-M parts (it calls them by that exact name).
http://www.tomshardware.com/mobile/20030212/
I also did a search on "Pentium M", and found another link on Tom's Hardware doing reviews of laptop systems, where they call it "Pentium M"
http://www.tomshardware.com/mobile/20030205/
Both reviews are on the same site, and both reviews use both names.
still slow (without preserving esi) :lol
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy10 - > Lingo -> MMX: 123 clocks
szCopyMMX ->Mark Larson ->MMX: 135 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 113 clocks
szCopyMMX1 - > Mark Larson -> MMX-> Fast: 124 clocks
Press ENTER to exit...
[attachment deleted by admin]
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Would someone please address this for the clueless?
Quote from: Jimg on May 17, 2005, 02:21:41 PM
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Would someone please address this for the clueless?
I missed this the first time it was posted. I've been super busy at work and working a lot of overtime. That always happens when you are at a new place, because you spend a lot of time ramping up. So that cuts down on my ability to optimize; I can't spend as much time on it as I'd like to.
I have heard of cases where reading past the end of a buffer can cause a fault ( dynamically allocated buffer). However, I have not experienced it for myself. And Agner Fog's string length routine actually reads past the end of the buffer since it reads data a dword at a time. In general you should be much safer doing a read past the end of the buffer versus doing a write. I'd never do a write that could potentially go past the end of the buffer. In the Intel Optimization manual I believe it says something about reads going past a page boundary causing a fault. But I could be misremembering. It's been a while, and I've never observed the behavior. Agner Fog's strlen routine has been in use in MASM32 for some time. Hutch-- made some modifications, and I've never heard of anyone reporting a problem. So if you do read past the end of some data, just be aware that there could be a problem, and be cautious.
Quote from: lingo on May 16, 2005, 10:11:22 PM
still slow (without preserving esi) :lol
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy10 - > Lingo -> MMX: 123 clocks
szCopyMMX ->Mark Larson ->MMX: 135 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 113 clocks
szCopyMMX1 - > Mark Larson -> MMX-> Fast: 124 clocks
Press ENTER to exit...
Your code is using SSE2 not SSE. SSE doesn't support the full 16 byte ALU instructions such as MOVDQA, PXOR, etc. I had been limiting myself to SSE since that's what I thought you were using, and I wanted to keep apples to apples. I just converted my SSE code to use SSE2 like yours uses, and now I am running in 62 cycles. Yours runs in 76 cycles. You can do the same conversion yourself on my code by simply converting the instruction to use SSE2 ( movdqa, etc).
For those that want a visual aid, here's the main loop.
align 16
dqword_copy:
movdqa xmm0,[eax]
pxor xmm1,xmm1
pcmpeqb xmm1,xmm0
add eax,16
pmovmskb ecx,xmm1
test ecx,ecx
jnz finish_rest
movdqa [esi],xmm0
add esi,16
jmp dqword_copy
I don't know what you're talking about? :naughty:
In my zip file I have just MMX code...
Where is your zip file with tests, so I can understand what you mean?
Reading past the end of a buffer (under Windows 2000 SP4) causes no problems that I can see, at least without a debugger. I didn't have time to test buffers allocated by other means, or writes past the end.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.486 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
teststr db 4097 dup('X'),0
lpMem dd 0
junkstr db 0,0,0,0,0
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov eax,input("Read past end of statically allocated buffer, press enter to continue...")
invoke StrLen,ADDR teststr
mov ebx,eax
print chr$("Returned string length :")
print ustr$(ebx)
print chr$(13,10)
mov ebx,OFFSET teststr
add ebx,4097+16
mov eax,[ebx]
invoke GlobalAlloc,GMEM_FIXED or GMEM_ZEROINIT,4096
mov lpMem,eax
invoke GlobalSize,lpMem
mov ebx,eax
print chr$("Size of dynamically allocated buffer :")
print ustr$(ebx)
print chr$(13,10)
invoke memfill,lpMem,4096,'LLIF'
mov eax,input("Read past end of dynamically allocated buffer, press enter to continue...")
mov ebx,lpMem
add ebx,4095
mov eax,[ebx]
mov edx,[ebx+16]
mov DWORD PTR junkstr,eax
print ADDR junkstr
free lpMem
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Quote from: lingo on May 17, 2005, 05:16:39 PM
Where is your zip file with tests to understand what you mean?
For timing purposes I added my routine to your code, so that we used the same test data. I just modified the timing to use REALTIME and added 10 times as many loops. That way I can get more accurate benchmarking against your code since we are using the same test data.
Quote from: lingo on May 17, 2005, 05:16:39 PM
I don't know what you talking about? :naughty:
In my zip file I have just MMX code...
Not true. Here's the text you print for your szCpy11 routine:
SzSzCpy11 DB "SzCpy11 (Lingo-> SSE): ",0
So you are saying it is just SSE. However here is some code from your szCpy11 routine.
comment * XMM by Lingo*
align 16
@@:
movdqa [edx+eax],XMM0
add eax,16
@@XM:
pcmpeqb XMM1, [ecx+eax]
movdqa XMM0, [ecx+eax]
MOVDQA is an SSE2 instruction, not SSE. Doing PCMPEQB with an XMM register is an SSE2 instruction, not SSE. Doing any of the old MMX instructions on an XMM register instead of an MMX register requires SSE2 support. If you are unsure in the future, check Appendix B in the second volume of the P4 instruction manual. It gives a complete listing of what instructions were added with each type of support; it has a section for all the MMX instructions, SSE, and SSE2, so you can clearly see what processor you need when you use a given instruction. If someone without SSE2 support tries to run your code, it'll die the big death (invalid opcode). That means most people with AMDs, since unless you have an AMD64 you won't have SSE2 support, and anyone with a P3 or older Intel processor.
Bla, bla... blah..... :naughty:
I'm afraid you mean my older zip file with AeroASM's code in it...
Use my last zip file (named StillSlow.zip) at this page (page 7)
Where is your zip file with tests, so all the people can understand how slow your code is?
No file, just bla, bla blah :naughty:
Quote from: lingo on May 17, 2005, 08:35:08 PM
Bla, bla... blah..... :naughty:
I'm afraid you mean my older zip file with AeroASM's code in it...
Use my last zip file (named StillSlow.zip) at this page (page 7)
Where is your zip file with tests, so all the people can understand how slow your code is?
No file, just bla, bla blah :naughty:
I got better things to do than waste my time on someone who is rude and arrogant. You are just upset because my SSE2 code is faster than yours. Grow up. And BTW the timings on my computer for the latest code you posted are the same speed. And you still haven't gotten it right, your latest code says it's MMX, and it's not. It's SSE. If you spent more time listening instead of being rude, you would have already learned that fact.
Quote from: Jimg on May 17, 2005, 02:21:41 PM
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Would someone please address this for the clueless?
Well, I had to answer my own question....
; assemble and link parameters
;Assemble=/c /coff /Cp /nologo /Sn
;Link=/SUBSYSTEM:WINDOWS /RELEASE /VERSION:4.0
.586
.model flat, stdcall
option casemap :none ; case sensitive
include user32.inc
includelib user32.lib
include kernel32.inc
includelib kernel32.lib
.data
bln db ' ',0
db 4087 dup (0)
done db 'done',0
badstring db '0'
.code
Program:
invoke GetModuleHandle, 0
mov edx,offset badstring
@@: mov eax,[edx] ; get 4 bytes as most of these routines are doing
invoke MessageBox,0,addr done,addr bln,0
invoke ExitProcess, eax
end Program
The mov eax,[edx] bombs because it tries to read past its own address space.
This could be a very hard problem to debug.
So, what to do about all these general purpose routines?
[attachment deleted by admin]
Could the routine used at compile time depend on the alignment of the data structures? Say we made a macro called szCopy which would choose between a general-purpose and custom algorithm(s) at compile time. If the data is not aligned then a default byte-wise copy algorithm is used, whereas if the coder aligned all the referenced data to DWORD offsets, then an MMX routine would be used instead? Just an idle thought, I know very little about macro capability. :red
I am locking this topic because of a number of insults that have been passed in it, not because the topic has not been interesting or useful.
This is a technical forum with rules about good manners among members and this is not going to change. Please respect this, as the alternative drives skilled and useful people away, which is a loss for everyone.