Well, if the subject sounds familiar: I am reviving an old thread (http://www.masm32.com/board/index.php?topic=1589.msg12747#msg12747).
The algos there are fast, but they are mmx and therefore trash the FPU (I like the FPU). So I thought of adapting one of them, actually an algo by Lingo, to produce an XMM version. And to make it more realistic, I introduced spoilers:
.data
align 16
spoil1 db 1, 2, 3 ; badly aligned source
String1 DB "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMMNnOoP",\
On my Core 2 Celeron M, differences in timings are there but not dramatic. What was more dramatic was the silent bye-bye when I removed the spoilers...
With the spoilers, the XMM version works just fine and leaves the FPU in peace. When I remove them, then both Lingo's and my adapted algo crash miserably with exception #5 at movq xmm0, qword ptr [ecx+eax]
Anybody interested to have a look into this? I also suspect that my version could be a lot improved...
[attachment deleted by admin]
It seems to overwrite the counter used by counter_end - this is always 39383736h, so the counter never gets to 0. I don't get any sort of exception.
Thanks Sinsi - that makes sense. In the meantime, I made up another xmm version:
comment * based on MMX Fast by Mark Larson *
align 16 ; seems to have little influence, the nop makes it two cycles faster ;-)
nop
szCopyXMM proc dest:DWORD, src:DWORD
mov eax,[esp+8]
mov esi,[esp+4]
align 16
qword_copy1b:
pxor xmm1, xmm1
movups xmm0, oword ptr [eax]
pcmpeqb xmm1, xmm0
add eax, 8+8
pmovmskb ecx, xmm1
or ecx,ecx
jnz finish_rest1
movups oword ptr [esi], xmm0
add esi, 8+8
jmp qword_copy1b
finish_rest1:
ret 8
szCopyXMM endp
512-byte string copy timing results:
szCopyXMM -> jj -> xmm: 278 clocks
szCopyMMX -> Mark Larson -> MMX: 312 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 284 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 309 clocks
szCopy 1076 clocks
lstrcpy 1202 clocks
SzCpy10 - > Lingo -> MMX: 283 clocks
MbCopy -> jj -> xmm: 384 clocks
Now one problem is that, aligned or not, these algos work in fixed-size chunks (128 bits, i.e. 16 bytes, for the XMM version). So there are problems with small strings...
hi,
after a quick look, I think the problem is caused by "test bl,bl" (in your routine and in Lingo's) -> There are 4 packed bytes after "packsswb", so you have to test all 4 bytes with "test ebx,ebx".
regards, qWord
Thanks, qword - I am afraid it keeps choking. But the other one seems to work just fine, also for small strings and bad alignment. However, it needs a zero delimiter at the end - see mov byte ptr [esi], 0 below
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * based on MMX Fast by Mark Larson *
align 16 ; seems to have little influence, the nop makes it two cycles faster ;-)
nop
szCopyXMM proc dest:DWORD, src:DWORD
mov eax,[esp+8]
mov esi,[esp+4]
align 16
qword_copy1b:
pxor xmm1, xmm1
movups xmm0, oword ptr [eax]
pcmpeqb xmm1, xmm0
add eax, 8+8
pmovmskb ecx, xmm1
or ecx,ecx
jnz finish_rest1
movups oword ptr [esi], xmm0
add esi, 8+8
jmp qword_copy1b
finish_rest1:
mov byte ptr [esi], 0
ret 8
szCopyXMM endp
OK, I made a bit of cleanup and am satisfied with this version:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * inspired by "MMX Fast" by Mark Larson *
; align 16 ; seems to have NO influence
szCopyXMM proc dest:DWORD, src:DWORD
push esi
push edi
mov edi, [esp+4+8]
mov esi, [esp+8+8]
push ecx ; preserve another valuable register
@@:
pxor xmm1, xmm1
movups xmm0, oword ptr [esi]
pcmpeqb xmm1, xmm0
pmovmskb ecx, xmm1 ; Move Byte Mask To Integer - a fantastic instruction!
test ecx,ecx
jnz @F
movups oword ptr [edi], xmm0
add esi, 16
add edi, 16
jmp @B
@@:
.Repeat
lodsb ; relatively slow
stosb ; tail cleanup
.Until al==0
mov eax, edi ; a stringcat routine might need this one
pop ecx ; restore ecx
pop edi
pop esi
ret 8 ; cleanup
szCopyXMM endp
Testing the 16-byte boundary looks fine:
Source=B23456789012345
Dest=B23456789012345
Source=C234567890123456
Dest=C234567890123456
Source=D2345678901234567
Dest=D2345678901234567
512-byte string copy timing results (aligned):
len of source string = 512
len of szCopyXMM: 55
szCopyXMM -> jj -> xmm: 298 clocks
szCopyMMX -> Mark Larson -> MMX: 312 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 308 clocks
szCopy 1053 clocks
lstrcpy 1184 clocks
SzCpy10 - > Lingo -> MMX: 283 clocks
MbCopy -> jj -> xmm: 380 clocks
Three times as fast as szCopy, 55 bytes short, and does not trash the FPU. The only caveat is that your puter should be less than seven years old :green
[attachment deleted by admin]
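For readers who think in C rather than MASM, the loop of szCopyXMM can be sketched with SSE2 intrinsics. This is only a model under stated assumptions, not the posted routine: the name copy_xmm_sketch is made up, _mm_loadu_si128 and _mm_storeu_si128 stand in for movups, and _mm_cmpeq_epi8 plus _mm_movemask_epi8 stand in for pcmpeqb and pmovmskb. Like the assembly, it can read up to 15 bytes past the terminator, so the source buffer has to be readable that far.

```c
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define SKETCH_HAVE_SSE2 1
#endif

/* Hypothetical C model of szCopyXMM (name and signature are assumptions).
   Returns a pointer one past the copied terminator, which is where edi
   ends up after the lodsb/stosb tail in the assembly version. */
char *copy_xmm_sketch(char *dest, const char *src)
{
#ifdef SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();               /* pxor xmm1, xmm1 */
    for (;;) {
        /* unaligned 16-byte load, the role movups plays in the asm */
        __m128i chunk = _mm_loadu_si128((const __m128i *)src);
        /* pcmpeqb + pmovmskb: one mask bit per byte that equals zero */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            break;                                           /* terminator is in this chunk */
        _mm_storeu_si128((__m128i *)dest, chunk);
        src += 16;
        dest += 16;
    }
#endif
    /* tail: byte-by-byte copy up to and including the terminator */
    while ((*dest++ = *src++) != '\0') { }
    return dest;
}
```

On a build without SSE2 only the byte loop runs, so the result is the same and only the speed differs.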
jj, glad to see you play with simd stuff :bg
but,
1. can you explain me why pxor xmm1,xmm1 is IN the loop ?
2. for unaligned data, look at lddqu instruction
Who can explain these results ?
512-byte string copy timing results:
len of source string = 512
len of szCopyXMM: 52
szCopyXMM -> jj -> xmm: 2085 clocks
szCopyMMX -> Mark Larson -> MMX: 323 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 324 clocks
szCopy 1556 clocks
lstrcpy 1573 clocks
SzCpy10 - > Lingo -> MMX: 285 clocks
MbCopy -> jj -> xmm: 284 clocks
I think the problem might be the processor it's running on. This is what I get on my P3:
len of source string = 512
len of szCopyXMM: 52
szCopyXMM -> jj -> xmm: 2090 clocks
szCopyMMX -> Mark Larson -> MMX: 319 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 281 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 308 clocks
szCopy 2078 clocks
lstrcpy 2384 clocks
SzCpy10 - > Lingo -> MMX: 285 clocks
MbCopy -> jj -> xmm: 282 clocks
Or not. If I comment out the tail cleanup code then szCopyXMM runs in 11 cycles and the procedure fails the function tests, implying that most or all of the work is being done by the tail cleanup code.
After more tests I think the problem is my processor. On a P3 I think pmovmskb and pcmpeqb are limited to the MMX registers. I don't see any errors when I assemble, but on the first iteration of the loop ECX is always 0FFh, when it should be 0 up to the last loop.
Or not exactly. Assembling the code with ML 6.14, 6.15, and 7.00 I get:
004019AF 660FEFC9 pxor mm1,mm1
004019B3 0F1006 movups xmm0,[esi]
004019B6 660F74C8 pcmpeqb mm1,mm0
004019BA 660FD7C9 pmovmskb cx,mm1
And with 6.15 and 7.00 the code generates an illegal instruction exception somewhere further down (in MbCopy). So there is a problem with the version of ML, but if that were fixed then there would be a problem with the processor not supporting some of the instructions.
Quote from: NightWare on February 08, 2009, 03:29:17 AM
jj, glad to see you play with simd stuff :bg
but,
1. can you explain me why pxor xmm1,xmm1 is IN the loop ?
Because I shamelessly copied that from Mark's code :bg
Quote
2. for unaligned data, look at lddqu instruction
Yields the same timings, is 2 bytes longer (55->57 bytes), and decreases the maximum age of your puter.
movdqu and movups produce exactly the same timings. I chose movups below (2 bytes shorter than movdqu), but maybe there are differences by processor type. Anyway, thanks a lot for the hint about lddqu; it made me find movups/movdqu, which both drastically improve the timings for the non-aligned strings:
512-byte string copy timing results:
len of source string = 512
alignment: offset src=4202611, dest=4203173
len of szCopyXMM: 55
szCopyXMM -> jj -> xmm: 484 clocks
szCopyMMX -> Mark Larson -> MMX: 474 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 474 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 439 clocks
szCopy 1053 clocks
lstrcpy 1214 clocks
SzCpy10 - > Lingo -> MMX: 476 clocks
MbCopy -> jj -> xmm: 560 clocks
There are some that are a few clocks faster, but remember they trash the FPU.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * inspired by "MMX Fast" by Mark Larson *
; align 16 ; seems to have NO influence
szCopyXMM proc dest:DWORD, src:DWORD
push esi
push edi
mov edi, [esp+4+8]
mov esi, [esp+8+8]
push ecx ; preserve another valuable register
pxor xmm1, xmm1
@@:
movups xmm0, [esi]
pcmpeqb xmm1, xmm0
pmovmskb ecx, xmm1 ; Move Byte Mask To Integer - a fantastic instruction!
test ecx,ecx
jnz @F
movups [edi], xmm0
add esi, 16
add edi, 16
jmp @B
@@:
.Repeat
lodsb ; relatively slow
stosb ; tail cleanup
.Until al==0
mov eax, edi ; a stringcat routine might need this one
pop ecx ; restore ecx
pop edi
pop esi
ret 8 ; cleanup
szCopyXMM endp
Finally, as to the "strange" timings:
Try assembling the code with ML 9.0 or with JWasm. EDIT: Here are the tiny differences between the code generated by masm 6.14 and the others.
You might google for "size override" optimization 66h (http://www.google.it/search?num=50&hl=en&newwindow=1&safe=off&q=%22size+override%22+optimization+66h&btnG=Search)
ml v614
004019C0  0FEFC9    pxor mm1, mm1
004019C3  0F1006    movups xmm0, dqword ptr [esi]
004019C6  0F74C8    pcmpeqb mm1, mm0
004019C9  0FD7C9    pmovmskb ecx, mm1
004019CC  85C9      test ecx, ecx
004019CE  75 0B     jne short SzCpy.004019DB
004019D0  0F1107    movups dqword ptr [edi], xmm0
ml v9
004019C8  660FEFC9  pxor xmm1, xmm1
004019CC  0F1006    movups xmm0, dqword ptr [esi]
004019CF  660F74C8  pcmpeqb xmm1, xmm0
004019D3  660FD7C9  pmovmskb ecx, xmm1
004019D7  85C9      test ecx, ecx
004019D9  75 0B     jne short SzCpy.004019E6
004019DB  0F1107    movups dqword ptr [edi], xmm0
JWasm
004019C8  660FEFC9  pxor xmm1, xmm1
004019CC  0F1006    movups xmm0, dqword ptr [esi]
004019CF  660F74C8  pcmpeqb xmm1, xmm0
004019D3  660FD7C9  pmovmskb ecx, xmm1
004019D7  85C9      test ecx, ecx
004019D9  75 0B     jne short SzCpy.004019E6
004019DB  0F1107    movups dqword ptr [edi], xmm0
[attachment deleted by admin]
Must be...
I used the new 'ml' and then the old 'link' and the differences are...
512-byte string copy timing results:
len of source string = 512
len of szCopyXMM: 55
szCopyXMM -> jj -> xmm: 358 clocks
szCopyMMX -> Mark Larson -> MMX: 323 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 325 clocks
szCopy 1564 clocks
lstrcpy 1571 clocks
SzCpy10 - > Lingo -> MMX: 284 clocks
MbCopy -> jj -> xmm: 476 clocks
Quote from: askm on February 08, 2009, 10:39:21 AM
Must be...
I used the new 'ml' and then the old 'link' and the differences are...
Yes, it's the three missing size override 66h bytes. Jwasm works fine, too.
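The difference is easy to see at the byte level. The following C sketch is an illustration only: simd_operand_bits is a hypothetical helper that knows just the three opcodes discussed in this thread (pxor 0F EF, pcmpeqb 0F 74, pmovmskb 0F D7) and reports whether a byte sequence is the 64-bit MMX form or the 66h-prefixed 128-bit SSE2 form. It is not a real disassembler and ignores everything after the opcode byte.

```c
#include <stddef.h>

/* Hypothetical helper: returns 64 for the MMX encoding (0F xx),
   128 for the SSE2 encoding (66 0F xx), and 0 for anything it
   does not recognise. Only pxor, pcmpeqb and pmovmskb are known. */
int simd_operand_bits(const unsigned char *code, size_t len)
{
    size_t i = 0;
    int has66 = 0;

    if (len > 0 && code[0] == 0x66) {   /* operand-size prefix present */
        has66 = 1;
        i = 1;
    }
    if (len < i + 2 || code[i] != 0x0F) /* need the 0F escape plus an opcode byte */
        return 0;

    switch (code[i + 1]) {
    case 0xEF:  /* pxor     */
    case 0x74:  /* pcmpeqb  */
    case 0xD7:  /* pmovmskb */
        return has66 ? 128 : 64;
    default:
        return 0;
    }
}
```

Fed the bytes 0F EF C9 it answers 64 (pxor mm1, mm1), and fed 66 0F EF C9 it answers 128 (pxor xmm1, xmm1), which matches the two listings above.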
Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.
I was looking through the original thread and noticed a few references to my string functions, but for some reason I never bothered to post any code, probably because I assumed I had already posted them in other threads. Here are the functions from strings.lib that Mark Larson referred to. They have mostly been dissected and rewritten over the years by people like Mark, who took them and vastly improved them, but for what it's worth...
lszLenMMX/lszLenMMXW
NOTE: These functions require a Pentium 3 or better with SSE instructions
Calculates the length of a string, the string should be aligned.
lszLenMMXW is a Unicode variant.
Parameters:
pString = Pointer to a null terminated string
Returns the length of the supplied string not including the NULL terminator
lszCopyMMX
NOTE: This function requires a Pentium 3 or better with SSE instructions
Copies a zero terminated string using the MMX registers (not preserved)
Parameters:
Dest = Pointer to destination buffer
Source = Pointer to source string
Returns the address of the destination buffer
lszCopyMMX FRAME lpDest,lpSource
uses esi,edi
mov esi,[lpSource]
mov edi,[lpDest]
mov ecx,esi
and ecx,15
rep movsb
nop
pxor mm0,mm0
nop
pxor mm1,mm1
nop
:
movq mm0,[esi]
movq mm2,[esi]
pcmpeqb mm2,mm1
pmovmskb ecx,mm2
or ecx,ecx
jnz >
movq [edi],mm0
add edi, 8
add esi, 8
jmp <
:
emms
; Do the remainder
bsf ecx,ecx
rep movsb
mov [edi],cl
mov eax,edi
sub eax,[lpDest]
ret
ENDF
lszLenMMX FRAME pString
mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes
pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes
: ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqb mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <
sub eax,[pString]
bsf ecx,ecx
sub eax,8
add eax,ecx
emms
RET
ENDF
lszLenMMXW FRAME pString
mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes
pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes
: ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqw mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <
sub eax,[pString]
bsf ecx,ecx
sub eax,8
add eax,ecx
shr eax,1
emms
RET
ENDF
Quote from: MichaelW on February 08, 2009, 12:21:48 PM
Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.
That is what I get with the ml614 version, see below. With JWasm and ML 9.0, this drops to 471 cycles.
I attach the latest version with the two executables.
szCopyXMM -> jj -> xmm: 2084 clocks
szCopyMMX -> Mark Larson -> MMX: 474 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 477 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 438 clocks
szCopy 1053 clocks
lstrcpy 1216 clocks
SzCpy10 - > Lingo -> MMX: 480 clocks
MbCopy -> jj -> xmm: 478 clocks
[attachment deleted by admin]
For szCpyV614:
szCopyXMM -> jj -> xmm: 2082 clocks
szCopyMMX -> Mark Larson -> MMX: 632 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 697 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 624 clocks
szCopy 2076 clocks
lstrcpy 2704 clocks
SzCpy10 - > Lingo -> MMX: 701 clocks
MbCopy -> jj -> xmm: 697 clocks
These are basically the results I got for the version where I manually added the prefixes and assembled with 6.14.
For szCpyV9:
szCopyXMM -> jj -> xmm: 2082 clocks
szCopyMMX -> Mark Larson -> MMX: 633 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 697 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 624 clocks
szCopy 2078 clocks
lstrcpy 2703 clocks
SzCpy10 - > Lingo -> MMX: 700 clocks
MbCopy -> jj -> xmm:
There is no count for the last procedure because it generates:
Exception number: c000001d (illegal instruction)
And these are basically the results I got for the versions that I assembled with 6.15 and 7.00, which BTW added the prefixes.
I would think Intel would ensure that any instruction, which would run on the processor, would produce the same result as on the later processors. I would be interested to see if other P3 processors have this problem.
Interesting. Exactly the same clocks as szCopy, but a lot slower than the MMX versions. Can anybody post results for an AMD, or other Intel processors?
Michael, the 66h seems to have a function similar to nop. Could you try replacing the 66h with a number of nops?
Just found this, indicating it's a known problem of early Pentiums:
Many compilers for IA-32 generate "repne scasb" in order
to find the length of a given C string. However, it is possible
to implement strlen (and many other string functions) using
the SSE2 instruction set: pcmpeqb + pmovmskb until there
is a set bit then bsf to find its index. On Core2 it is roughly
9.3 times faster and about 6.5 times faster on Pentium 4.
http://www.mydatabasesupport.com/forums/arch/252748-fast-string-functions.html
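The recipe in that quote (compare 16 bytes against zero, extract the byte mask, then find the lowest set bit) might look like this in C. It is a sketch under assumptions, not the code the quoted post measured: strlen_sse2_sketch and lowest_set_bit are invented names, the load is unaligned for simplicity, and the routine may read up to 15 bytes past the terminator, so it is only safe on buffers that are readable that far.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define SKETCH_HAVE_SSE2 1

/* Portable stand-in for bsf: index of the lowest set bit of a nonzero mask. */
static unsigned lowest_set_bit(unsigned mask)
{
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++index;
    }
    return index;
}
#endif

/* Hypothetical strlen model: pcmpeqb + pmovmskb per 16-byte chunk, then bsf. */
size_t strlen_sse2_sketch(const char *s)
{
#ifdef SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)   /* bit n set means byte n of the chunk is zero */
            return (size_t)(p - s) + lowest_set_bit(mask);
        p += 16;
    }
#else
    /* scalar fallback for builds without SSE2 */
    size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
#endif
}
```

A production routine would normally align the pointer first so that a 16-byte read can never cross into an unmapped page; that step is left out here to keep the correspondence with the three instructions visible.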
66h is the operand-size prefix, or per Intel the operand-size override prefix:
Quote
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can be the default; use of the prefix selects the non-default size. Use of 66H followed by 0FH is treated as a mandatory prefix by some SSE/SSE2/SSE3 instructions. Other use of the 66H prefix with MMX/SSE/SSE2/SSE3 instructions is reserved; such use may cause unpredictable behavior.
I knew it was used for the integer instructions, but I have never before noticed it on an MMX or SSE instruction, although it does make some sense that they would use it to specify the register size.
I have verified that with the prefixes in place, and the encoding exactly as MLv9 produced, on the first execution:
pmovmskb ecx, xmm1
Sets ecx to 0FFh, so the following conditional jump is always taken, and the tail cleanup code performs the copy operation. The code looks correct according to the Intel references, but on my P3 it does not work as it is documented to work.
Am I the only cheapskate here that is still running a P3?
Quote from: MichaelW on February 08, 2009, 05:33:38 PM
pmovmskb ecx, xmm1
Sets ecx to 0FFh, so the following conditional jump is always taken, and the tail cleanup code performs the copy operation.
Which is kind of a convenient bug, allowing a soft fall through... :-)
Quote
Am I the only cheapskate here that is still running a P3?
Don't be desperate, the attached version should be fine for you. The code should look familiar to you, just search for 80808080h ...
Timings for a P4:
alignment: offset src=4210803, dest=4211365
Source len=512
1109 clocks for szCopyXMM
1113 clocks for szCopyMMX
1158 clocks for SzCpy11
1087 clocks for szCopyMMX1
1180 clocks for SzCpy10
1711 clocks for szCopy
2164 clocks for lstrcpy
Source len=511
1096 clocks for szCopyXMM
1098 clocks for szCopyMMX
1160 clocks for SzCpy11, result NOT CORRECT
1082 clocks for szCopyMMX1, result NOT CORRECT
1165 clocks for SzCpy10
1699 clocks for szCopy
2166 clocks for lstrcpy
Source len=15
75 clocks for szCopyXMM
58 clocks for szCopyMMX
47 clocks for SzCpy11, result NOT CORRECT
31 clocks for szCopyMMX1, result NOT CORRECT
49 clocks for SzCpy10
70 clocks for szCopy
147 clocks for lstrcpy
[attachment deleted by admin]
Quote from: MichaelW on February 08, 2009, 05:33:38 PM
Am I the only cheapskate here that is still running a P3?
Hi,
Hardly. While I have newer machines, _at home_ I mostly use
my PIII or Pentium systems. And I have an HP 200LX that uses
an 80186 in my pocket. I did once throw away an IBM PC... I
really ought to get rid of some of the older ones.
Regards,
Steve
Updated with some refinements - see Core Duo (Celeron M) timings below.
I included crt_strcpy into the testbed - it's remarkably fast for short strings.
An additional test for a global "CanSSE2" variable would cost ca. 3 cycles extra. My goal was a fast general purpose library routine that does not trash the FPU. The current szCopyXMM is pretty good on the Core2, but I would be grateful for more timings, especially on the old Pentiums and AMDs. The algo is a bit weak on very short strings (0-15 bytes), due to the overhead. A rough estimate says a well commented MASM source has an average line length of 30 or so...
Note: All algos (except the Masm32 szCopy) have been harmonised to copy dest, source, consistent with mov dest, src.
The attached executable was assembled with JWasm.
alignment: offset src=4214899, dest=4215461
len of szCopyXMM: 118
Source len=512
465 clocks for szCopyXMM
456 clocks for szCopyMMX mmx, trashes FPU
481 clocks for SzCpy10 mmx, trashes FPU
1054 clocks for szCopy
1098 clocks for lstrcpy
668 clocks for crt_strcpy
Source len=511
454 clocks for szCopyXMM
434 clocks for szCopyMMX mmx, trashes FPU
480 clocks for SzCpy10 mmx, trashes FPU
1055 clocks for szCopy
1086 clocks for lstrcpy
684 clocks for crt_strcpy
Source len=128
139 clocks for szCopyXMM
119 clocks for szCopyMMX mmx, trashes FPU
126 clocks for SzCpy10 mmx, trashes FPU
285 clocks for szCopy
304 clocks for lstrcpy
171 clocks for crt_strcpy
Source len=127
129 clocks for szCopyXMM
114 clocks for szCopyMMX mmx, trashes FPU
126 clocks for SzCpy10 mmx, trashes FPU
286 clocks for szCopy
304 clocks for lstrcpy
171 clocks for crt_strcpy
Source len=31
56 clocks for szCopyXMM
46 clocks for szCopyMMX mmx, trashes FPU
52 clocks for SzCpy10 mmx, trashes FPU
96 clocks for szCopy
112 clocks for lstrcpy
51 clocks for crt_strcpy
Source len=17
33 clocks for szCopyXMM
38 clocks for szCopyMMX mmx, trashes FPU
42 clocks for SzCpy10 mmx, trashes FPU
54 clocks for szCopy
94 clocks for lstrcpy
32 clocks for crt_strcpy
Source len=15
79 clocks for szCopyXMM
46 clocks for szCopyMMX mmx, trashes FPU
39 clocks for SzCpy10 mmx, trashes FPU
48 clocks for szCopy
89 clocks for lstrcpy
31 clocks for crt_strcpy
[attachment deleted by admin]
Athlon XP 2600+ (2.13GHz)
alignment: offset src=4214899, dest=4215461
len of szCopyXMM: 118
Source len=512
688 clocks for szCopyXMM
451 clocks for szCopyMMX mmx, trashes FPU
455 clocks for SzCpy10 mmx, trashes FPU
1827 clocks for szCopy
1794 clocks for lstrcpy
742 clocks for crt_strcpy
Source len=511
683 clocks for szCopyXMM
420 clocks for szCopyMMX mmx, trashes FPU
455 clocks for SzCpy10 mmx, trashes FPU
1822 clocks for szCopy
1789 clocks for lstrcpy
740 clocks for crt_strcpy
Source len=128
198 clocks for szCopyXMM
142 clocks for szCopyMMX mmx, trashes FPU
136 clocks for SzCpy10 mmx, trashes FPU
479 clocks for szCopy
483 clocks for lstrcpy
206 clocks for crt_strcpy
Source len=127
201 clocks for szCopyXMM
138 clocks for szCopyMMX mmx, trashes FPU
136 clocks for SzCpy10 mmx, trashes FPU
475 clocks for szCopy
479 clocks for lstrcpy
204 clocks for crt_strcpy
Source len=31
62 clocks for szCopyXMM
43 clocks for szCopyMMX mmx, trashes FPU
40 clocks for SzCpy10 mmx, trashes FPU
137 clocks for szCopy
151 clocks for lstrcpy
58 clocks for crt_strcpy
Source len=17
43 clocks for szCopyXMM
48 clocks for szCopyMMX mmx, trashes FPU
31 clocks for SzCpy10 mmx, trashes FPU
88 clocks for szCopy
104 clocks for lstrcpy
39 clocks for crt_strcpy
Source len=15
41 clocks for szCopyXMM
36 clocks for szCopyMMX mmx, trashes FPU
26 clocks for SzCpy10 mmx, trashes FPU
81 clocks for szCopy
97 clocks for lstrcpy
36 clocks for crt_strcpy
Thanks, Sinsi. Interesting that the algo performs much better for the 15 bytes string than on the Core2. And crt_strcpy is also remarkably good all over the place.
EDIT: Here is the innermost loop of crt_strcpy. Interesting ::)
77C160C1 8917 mov dword ptr [edi], edx
77C160C3 83C7 04 add edi, 4
77C160C6 BA FFFEFE7E mov edx, 7EFEFEFF
77C160CB 8B01 mov eax, dword ptr [ecx]
77C160CD 03D0 add edx, eax
77C160CF 83F0 FF xor eax, FFFFFFFF
77C160D2 33C2 xor eax, edx
77C160D4 8B11 mov edx, dword ptr [ecx]
77C160D6 83C1 04 add ecx, 4
77C160D9 A9 00010181 test eax, 81010100
77C160DE 74 E1 je short msvcrt.77C160C1
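The two constants in that loop implement a well-known "is there a zero byte in this dword" test. Here is a small C rendering, with invented function names (may_contain_zero_byte, has_zero_byte_exact) and the arithmetic taken from the listing: add 7EFEFEFFh, xor with the inverted value, then test against 81010100h. The test never misses a zero byte, but it can raise a false alarm when the top byte has bit 7 set, so an exact per-byte check is still needed once the loop exits.

```c
#include <stdint.h>

/* Candidate test as in the crt_strcpy loop: returns 0 only when none of
   the four bytes can be zero. A nonzero result means "look closer". */
int may_contain_zero_byte(uint32_t v)
{
    uint32_t t = (v + 0x7EFEFEFFu) ^ ~v;   /* add edx,eax / xor eax,-1 / xor eax,edx */
    return (t & 0x81010100u) != 0;         /* test eax, 81010100h */
}

/* Exact check, byte by byte, for use after the candidate test fires. */
int has_zero_byte_exact(uint32_t v)
{
    return (v & 0x000000FFu) == 0 ||
           (v & 0x0000FF00u) == 0 ||
           (v & 0x00FF0000u) == 0 ||
           (v & 0xFF000000u) == 0;
}
```

For example, 44434241h ("ABCD") passes as clean, 44004241h is flagged, and 80434241h is flagged even though it contains no zero byte. For plain 7-bit ASCII text that last case cannot occur.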
Quote from: MichaelW on February 08, 2009, 05:33:38 PM
Am I the only cheapskate here that is still running a P3?
Well, not quite a P3 here but a PIV, a Sempron and an Athlon 64 X2, but the Sempron is the only one I do any dev work on, the A64 is for work and the PIV is just a file server.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
simplecopy proc dest:DWORD, src:DWORD
push ebx
mov ecx,[esp+8]
mov edx,[esp+12]
sub ebx,ebx
@@: mov al,[edx+ebx]
mov [ecx+ebx],al
inc ebx
test al,al
jnz @b
pop ebx
ret 8
simplecopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
23 bytes, works on a 386
Source len=512
535 clocks for szCopyXMM
444 clocks for szCopyMMX mmx, trashes FPU
469 clocks for SzCpy10 mmx, trashes FPU
559 clocks for szCopy
1071 clocks for lstrcpy
505 clocks for crt_strcpy
559 clocks for simplecopy
Source len=511
544 clocks for szCopyXMM
441 clocks for szCopyMMX mmx, trashes FPU
470 clocks for SzCpy10 mmx, trashes FPU
554 clocks for szCopy
1071 clocks for lstrcpy
505 clocks for crt_strcpy
554 clocks for simplecopy
Source len=128
145 clocks for szCopyXMM
134 clocks for szCopyMMX mmx, trashes FPU
125 clocks for SzCpy10 mmx, trashes FPU
174 clocks for szCopy
301 clocks for lstrcpy
136 clocks for crt_strcpy
172 clocks for simplecopy
Source len=127
141 clocks for szCopyXMM
123 clocks for szCopyMMX mmx, trashes FPU
124 clocks for SzCpy10 mmx, trashes FPU
167 clocks for szCopy
299 clocks for lstrcpy
135 clocks for crt_strcpy
168 clocks for simplecopy
Source len=31
67 clocks for szCopyXMM
57 clocks for szCopyMMX mmx, trashes FPU
58 clocks for SzCpy10 mmx, trashes FPU
95 clocks for szCopy
108 clocks for lstrcpy
44 clocks for crt_strcpy
94 clocks for simplecopy
Source len=17
65 clocks for szCopyXMM
50 clocks for szCopyMMX mmx, trashes FPU
34 clocks for SzCpy10 mmx, trashes FPU
53 clocks for szCopy
66 clocks for lstrcpy
26 clocks for crt_strcpy
53 clocks for simplecopy
Source len=15
91 clocks for szCopyXMM
44 clocks for szCopyMMX mmx, trashes FPU
37 clocks for SzCpy10 mmx, trashes FPU
46 clocks for szCopy
59 clocks for lstrcpy
22 clocks for crt_strcpy
46 clocks for simplecopy
Quote from: sinsi on February 10, 2009, 03:14:32 AM
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
simplecopy proc dest:DWORD, src:DWORD
push ebx
mov ecx,[esp+8]
mov edx,[esp+12]
sub ebx,ebx
@@: mov al,[edx+ebx]
mov [ecx+ebx],al
inc ebx
test al,al
jnz @b
pop ebx
ret 8
simplecopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
23 bytes, works on a 386
Cute. Celeron M:
Source len=511
454 clocks for szCopyXMM
1052 clocks for simplecopy
Source len=17
32 clocks for szCopyXMM
53 clocks for simplecopy
Source len=15
79 clocks for szCopyXMM
47 clocks for simplecopy
I am working on the ultimate solution:
.if len(src)>=32
invoke szCopyXMM, ...
.else
invoke simplecopy, ...
.endif
But jokes apart: Did you change puter? Your previous timings look a lot different.
EDIT: Sinsi, I was about to accuse you of pulling my leg because the timings are virtually identical to those of the Masm32lib szCopy algo, but nope, the algos are not identical:
szCopy proc src:DWORD,dst:DWORD
push ebp
push esi
mov edx, [esp+12]
mov ebp, [esp+16]
mov eax, -1
mov esi, 1
@@:
add eax, esi
movzx ecx, BYTE PTR [edx+eax]
mov [ebp+eax], cl
test ecx, ecx
jnz @B
pop esi
pop ebp
ret 8
szCopy endp
While chasing the ultimate lstrcpy replacement, I stumbled over an interesting question: In real life, source strings can be aligned to dwords, destinations can be aligned, but we can rarely do both simultaneously. My test bed says that no alignment at all is a no-no, but then I decided to compare two versions of the same algo, one pre-aligning the source, the other the destination. To my surprise, there is a difference (MbCopy=src aligned, MbCopyD=dest aligned, timings for a P4):
Source len=512
1254 clocks for MbCopy
1022 clocks for MbCopyD
Source len=63
178 clocks for MbCopy
139 clocks for MbCopyD
Source len=55
163 clocks for MbCopy
134 clocks for MbCopyD
Source len=48
150 clocks for MbCopy
128 clocks for MbCopyD
Source len=42
144 clocks for MbCopy
119 clocks for MbCopyD
Source len=37
138 clocks for MbCopy
105 clocks for MbCopyD
Source len=15
38 clocks for MbCopy
62 clocks for MbCopyD <----- the exception
Source len=7
29 clocks for MbCopy
31 clocks for MbCopyD
Now, is that a well-known phenomenon, and are there established rules to follow??
Here is the algo:
NoAlign= 0 ; clearly not a good option, but you can test it here
DestAlign= 0 ; choose if you want to align the source or the destination
MbCopy proc dest:DWORD, src:DWORD
push edi
mov edi, [esp+8]
mov ecx, [esp+12]
if NoAlign
jmp mbcMain ; neither source nor dest alignment??
endif
if DestAlign
test edi, 3 ; edi=destination address
else
test ecx, 3 ; ecx=source address
endif
je mbcMain ; dword aligned
@@: mov al, byte ptr [ecx] ; a byte from src
inc ecx
test al, al
mov byte ptr [edi], al ; does not change the flag, so we can say bye if al was zero
je mbcBye
inc edi
if DestAlign
test edi, 3 ; edi=destination address
else
test ecx, 3 ; ecx=source address
endif
jne @B
jmp mbcMain
; align 16 no good, costs cycles
@@: ; ------------ innermost loop ------------
mov [edi], eax
add edi, 4
mbcMain:
mov eax, 07EFEFEFFh
mov edx, [ecx]
add eax, edx
xor edx, eax
xor edx, 0FFFFFFFFh
mov eax, [ecx]
add ecx, 4
test edx, 81010100h
je @B ; ------------ innermost loop ------------
test al, al
je mbc1
test ah, ah
je mbc2
test eax, 00FF0000h
je mbc3
mbc4:
mov [edi], eax
jmp mbcBye
mbc3:
mov byte ptr [edi+2], 0
mbc2:
mov word ptr [edi], ax
mbc1:
mov byte ptr [edi], al
mbcBye:
mov edx, [esp+8] ; return start of buffer
pop edi
ret 8
MbCopy endp
By the way, on the P4 the algo beats my previously posted XMM/SSE2 algo hands down. Full testbed attached.
[attachment deleted by admin]
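Why source and destination can rarely both be aligned comes down to address arithmetic. The sketch below uses two made-up helpers: bytes_to_dword_boundary gives the number of single-byte copies the prologue of a routine like MbCopy needs before its chosen pointer is dword aligned, and can_align_both says whether both pointers reach a boundary after the same number of bytes. The sample addresses in the note after the code are the src/dest offsets printed by one of the test runs earlier in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading bytes to copy one at a time until addr is a
   multiple of 4 (0 if it already is). Hypothetical helper. */
size_t bytes_to_dword_boundary(uintptr_t addr)
{
    return (size_t)((4u - (addr & 3u)) & 3u);
}

/* Both pointers advance in lockstep, so they can only be aligned
   together if they sit at the same offset within a dword. */
int can_align_both(uintptr_t src, uintptr_t dest)
{
    return ((src ^ dest) & 3u) == 0;
}
```

With src=4202611 and dest=4203173 the source needs 1 byte and the destination needs 3, so whichever one the prologue aligns, the other stays misaligned for the whole dword loop.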
Quote from: jj2007 on February 10, 2009, 04:25:35 PM
While chasing the ultimate lstrcpy replacement...
Careful JJ, you know that such a thing is impossible, right? :toothy
Quote from: Mark Jones on February 10, 2009, 04:50:22 PM
Quote from: jj2007 on February 10, 2009, 04:25:35 PM
While chasing the ultimate lstrcpy replacement...
Careful JJ, you know that such a thing is impossible, right? :toothy
Hmmm... like infinity, right? But I am approaching zero cycles asymptotically. Go ahead, post your timings :bg
Quote from: NightWare on February 08, 2009, 03:29:17 AM
jj, glad to see you play with simd stuff :bg
Bad news: I have given up. SSE2 is just too slow, see my postings above for the P4 and below for the Core2... :wink
Source len=512
712 clocks for szCopyXMM
1098 clocks for szCopy
641 clocks for MbCopy
745 clocks for MbCopyD
Source len=63
210 clocks for szCopyXMM
201 clocks for szCopy
107 clocks for MbCopy
106 clocks for MbCopyD
MbCopy does not need any of these strange new registers ;-)
Quote from: sinsi on February 10, 2009, 03:14:32 AM
23 bytes, works on a 386
ebx isn't necessary :wink
align 16
simplecopy proc dest:DWORD, src:DWORD
mov ecx,src
mov edx,dest
sub edx,ecx
@@: mov al,[ecx]
mov [ecx+edx],al
inc ecx
test al,al
jnz @b
ret
simplecopy endp
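The trick is to keep only one moving pointer: edx holds the constant distance dest - src, so [ecx+edx] always addresses the destination byte that corresponds to [ecx]. A C rendering might look like this; the name simplecopy_offset is invented, and the distance is computed on uintptr_t values because subtracting pointers into two different arrays is undefined in C (converting the sum back to a pointer is implementation-defined, though it behaves as expected on flat-memory targets like the ones discussed here).

```c
#include <stdint.h>

/* Hypothetical C version of the single-pointer copy loop. */
char *simplecopy_offset(char *dest, const char *src)
{
    uintptr_t delta = (uintptr_t)dest - (uintptr_t)src;  /* sub edx, ecx */
    const char *p = src;
    char c;

    do {
        c = *p;                                   /* mov al, [ecx]     */
        *(char *)((uintptr_t)p + delta) = c;      /* mov [ecx+edx], al */
        ++p;                                      /* inc ecx           */
    } while (c != '\0');                          /* test al, al / jnz */

    return dest;
}
```

A C compiler is free to pick its own addressing scheme, so this shows the idea, not a way to get the same machine code.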
Quote from: NightWare on February 10, 2009, 10:45:52 PM
Quote from: sinsi on February 10, 2009, 03:14:32 AM
23 bytes, works on a 386
ebx isn't necessary :wink
Sinsi, how dare you waste 3 bytes without any need??? :dazzled:
It's cute, and competitive, too - same timings as the library szCopy. While implementing this, I stumbled over a very odd behaviour of the chr$ macro. It's on line 190 of the attached source:
Invoke Main
.listall
print chr$(13,10,9, " --------------",13,10)
; MsgBox 0, str$(eax), offset txHi, MB_OK
MsgBox 0, str$(eax), chr$("Hello"), MB_OK ; GARBAGE instead of Hello
.nolist
The title of the MsgBox contains the test string, not "Hello". Excerpt from the list file:
00000000 1 .data
00000024 1 *_TEXT ENDS
00000804 1 *_DATA SEGMENT
1 *ASSUME CS:ERROR
00000804 48656C6C6F00 1 ??001C db "Hello",0
00000000 1 .code
0000080A 1 *_DATA ENDS
00000024 1 *_TEXT SEGMENT
1 *ASSUME CS:FLAT
invoke MessageBoxA,0,reparg(ADDR ??001B),reparg(OFFSET ??001C),MB_OK
= A 1 quot SUBSTR <ADDR ??001B>,1,1
1 .data
1 ??001D db ADDR ??001B,0
1 .code
1 EXITM <ADDR ??001D>
invoke MessageBoxA,0,ADDR ??001B,reparg(OFFSET ??001C),MB_OK
= O 1 quot SUBSTR <OFFSET ??001C>,1,1
1 .data
1 ??001E db OFFSET ??001C,0
1 .code
1 EXITM <ADDR ??001E>
00000024 invoke MessageBoxA,0,ADDR ??001B,OFFSET ??001C,MB_OK
00000024 6A00 * push MB_OK
00000026 6800000000 * push OFFSET ??001C
0000002B 6800000000 * push offset ??001B
00000030 6A00 * push 0
I tried with ml 6.14, ml 9.0 and JWasm, and they all show the same behaviour. Any clue what could cause this? I thought nothing was more straightforward than chr$()...
[attachment deleted by admin]
Quote
But jokes apart: Did you change puter? Your previous timings look a lot different
Sorry jj, those timings were on my real computer (q6600), not the athlon.
Quote
Sinsi, how dare you waste 3 bytes without any need???
It was the beer goggles...
NightWare: very clever.
heh, I hate you.
Quote
The title of the MsgBox contains the test string, not "Hello".
It works correctly for me using MASM32 v9 or v10, ML 6.14, 6.15, or 7.00.
Quote from: sinsi on February 10, 2009, 11:58:38 PM
NightWare: very clever. heh, I hate you.
well, in fact here (if i remember well) a small correction is necessary => Jdoe : very clever :wink
Quote from: MichaelW on February 11, 2009, 12:26:42 AM
Quote
The title of the MsgBox contains the test string, not "Hello".
It works correctly for me using MASM32 v9 or v10, ML 6.14, 6.15, or 7.00.
Odd. Very odd. Even the executable works?? Thanks for testing, Michael...
This is what I get at the bottom - everything after OK has no right to be there:
Source len=7
35 clocks for szCopyXMM
23 clocks for simplecopy
23 clocks for simplecopyNW
24 clocks for szCopy
21 clocks for crt_strcpy
19 clocks for MbCopy
22 clocks for MbCopyD
--- OK ---
gHhIiJjKkLlMMNnOoPpQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMMNnOoPpQqRrSsTtUuV
xYyZz Now I Know My ABC's, Won't You Come Play
(On Windows XP SP2, Celeron M)
Yep, garbage here too (ml614,615,7,8)
Quote from: sinsi on February 11, 2009, 01:03:04 AM
Yep, garbage here too (ml614,615,7,8)
So it's not just me...
Thanks, I was about to throw my puter into the garbage. :green
String1 length=1024
String2 length=16*32
Buffer overrun?
Quote
Even the executable works??
I didn't test the EXE, I just tested the few lines of code that you posted with the expectation that the problem was somehow the combination of macros. Now that I do test the complete code, I get garbage, but it's obviously part of the test string. If I pad the end of the data section with:
pad db 200 dup(0)
Then the problem goes away, so something is overwriting the message box title.
Quote from: MichaelW on February 11, 2009, 01:32:31 AM
If I pad the end of the data section with:
pad db 200 dup(0)
Then the problem goes away, so something is overwriting the message box title.
Thanxalot, Sinsi & Michael. That was a typical noob error - I somehow assumed that since the MsgBox comes last in the code, it could not have been overwritten by "previous" code. Plain wrong, of course: the MsgBox title lives in .data, which is already in place before any code runs, so an earlier copy that runs past the end of its buffer can overwrite it.
I keep learning... :bg
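For anyone who wants to see the effect in isolation, here is a minimal sketch, assuming MASM32 with masm32rt.inc. The names SrcStr, DestBuf and MsgTitle and the sizes are made up for illustration and are not taken from the attachment; plain lstrcpy stands in for the copy algos under test.

```asm
include \masm32\include\masm32rt.inc

.data
SrcStr   db 32 dup("x"), 0      ; 33 bytes including the zero terminator
DestBuf  db 16 dup(0)           ; hypothetical destination, deliberately too small
MsgTitle db "Hello", 0          ; assembled directly behind DestBuf

.code
start:
    ; lstrcpy does no bounds checking: it writes all 33 bytes,
    ; so everything past the first 16 lands on MsgTitle and beyond
    invoke lstrcpy, addr DestBuf, addr SrcStr
    ; the title was fine in the exe image, but it has been overwritten by now
    invoke MessageBox, 0, addr DestBuf, addr MsgTitle, MB_OK
    invoke ExitProcess, 0
end start
```

The title should come up as a run of x characters instead of Hello, although the MessageBox call is the last thing in the source. Putting a pad between DestBuf and MsgTitle only moves the damage somewhere else; the real cure is a destination buffer that is big enough for the source.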
Lives depend on correct and speedy code
everyday. I need not elaborate.
Let those lives diminished due to code not do
so in vain. QA.
Quote from: askm on February 11, 2009, 04:17:21 AM
Lives depend on correct and speedy code
everyday. I need not elaborate.
Let those lives diminished due to code not do
so in vain. QA.
Quote
>>> heart monitor - Beep,beep,beep... <<<
Get the patient's readouts, stat
>>>Indexing files - Please wait<<<<
We need those damn readouts
>>>Indexing files - Please wait<<<<
Use task manager to shut down the damn indexing program !!!
>>>The process is being debugged, access denied<<<
>>> heart monitor - Beeeeeeeeeeeeeee... <<<
I hope my life never depends on Windows :eek
I'm sure I read in a MS EULA that it wasn't to be used in 'nuclear reactor control systems' or 'hospital intensive care systems'
Quote from: sinsi on February 11, 2009, 06:47:20 AM
I'm sure I read in a MS EULA that it wasn't to be used in 'nuclear reactor control systems' or 'hospital intensive care systems'
I believe that's the JAVA license, it's included in the EULA of some Windows distributions.
Quote
7. NOTE ON JAVA SUPPORT. THE SOFTWARE PRODUCT MAY CONTAIN SUPPORT FOR PROGRAMS WRITTEN IN JAVA.
JAVA TECHNOLOGY IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED, OR INTENDED FOR USE OR RESALE AS ON-LINE CONTROL EQUIPMENT IN HAZARDOUS ENVIRONMENTS REQUIRING FAIL-SAFE PERFORMANCE, SUCH AS IN THE OPERATION OF NUCLEAR FACILITIES, AIRCRAFT NAVIGATION OR COMMUNICATION SYSTEMS, AIR TRAFFIC CONTROL, DIRECT LIFE SUPPORT MACHINES, OR WEAPONS SYSTEMS, IN WHICH THE FAILURE OF JAVA TECHNOLOGY COULD LEAD DIRECTLY TO DEATH, PERSONAL INJURY, OR SEVERE PHYSICAL OR ENVIRONMENTAL DAMAGE.
The code failed when
it realized it was MS dependent.
I put a MS copyright in the data
section and tried again. Don't
forget the $ symbol. It helps
lstrcpy concurrency.
QA saves
Folks, you are taking this too seriously. I never recommended the usage of the MybuggyCopy algo in intensive health care systems :naughty:
Stick out your tongue and say 'ah '.
I wish no ill health on anyone.
Not 'lives depend' in the narrow sense.
That narrowly implies that ALL code that appears on
these pages can't be compiled elsewhere
for whatever purposes their 'lives' depend.
There was MS-free assembly code before MS you know.
The processor came first, or more to the point,
logic was the progenitor. MS is not Big Brother.
Besides MS runs afoul of its own "corporate operating environment(s)",
but its health is maintained. That's EULA hypocrisy, isn't it?
Now patient you sit on the table here and I'll test your reflexes.
You have come here complaining of difficulty copying strings eh ?
I'll prescribe a copyright to clear it up. Can I get a trial pack doc ?
http://www.theregister.co.uk/2009/01/20/sheffield_conficker/
"The decision to disble automatic security updates was taken during Christmas week after PCs in an operating theatre rebooted mid-surgery. Conficker was detected on December 29" [sic]
Oops!
Hi All:
Translation: This code was written by lawyers and does Not Work on Microsoft systems without UPS.
And Microsoft systems don't work when the power is OFF!
It was only tested when the power was on.
I knew it only takes 45 minutes to re-boot Windows if a UPS radar system fails!
Also assuming you Have No problems re-booting?
Regards herge