lstrcpy vs szCopy

MichaelW · February 08, 2009, 02:34:51 PM

For szCpyV614:


szCopyXMM -> jj   ->               xmm: 2082 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 632 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 697 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 624 clocks
                                szCopy  2076 clocks
                                lstrcpy 2704 clocks
SzCpy10        - > Lingo ->        MMX: 701 clocks
MbCopy     -> jj   ->              xmm: 697 clocks

These are basically the results I got for the version where I manually added the prefixes and assembled with 6.14.

For szCpyV9:



szCopyXMM -> jj   ->               xmm: 2082 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 633 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 697 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 624 clocks
                                szCopy  2078 clocks
                                lstrcpy 2703 clocks
SzCpy10        - > Lingo ->        MMX: 700 clocks
MbCopy     -> jj   ->              xmm:

There is no count for the last procedure because it generates:

Exception number: c000001d (illegal instruction)

And these are basically the results I got for the versions that I assembled with 6.15 and 7.00, which BTW added the prefixes.

I would think Intel would ensure that any instruction that runs on this processor produces the same result as it does on later processors. I would be interested to see whether other P3 processors have this problem.

jj2007 · February 08, 2009, 03:13:28 PM

Interesting. Exactly the same clocks as szCopy, but a lot slower than the MMX versions. Can anybody post results for an AMD, or other Intel processors?

Michael, the 66h seems to have a function similar to nop. Could you try replacing the 66h with a number of nops?

jj2007 · February 08, 2009, 03:47:05 PM

Just found this, indicating it's a known problem of early Pentiums:

Many compilers for IA-32 generate "repne scasb" in order
to find the length of a given C string. However, it is possible
to implement strlen (and many other string functions) using
the SSE2 instruction set: pcmpeqb + pmovmaskb until there
is a set bit then bsf to find its index. On Core2 it is roughly
9.3 times faster and about 6.5 times faster on Pentium 4.

http://www.mydatabasesupport.com/forums/arch/252748-fast-string-functions.html
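
For readers who want to see the quoted pcmpeqb + pmovmskb + bsf idea spelled out, here is a minimal C sketch using SSE2 intrinsics. It is not code from this thread: the name `sse2_strlen` is made up, and it assumes GCC or Clang on an x86 CPU with SSE2 (`__builtin_ctz` stands in for bsf). It will not run on a P3.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, assuming GCC/Clang and an SSE2-capable x86 CPU. */
static size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Round down to a 16-byte boundary: an aligned 16-byte load
       never straddles a page boundary. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned skip = (unsigned)(s - p);

    /* pcmpeqb + pmovmskb: one mask bit per byte that equals zero. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= 0xFFFFu << skip;  /* ignore bytes in front of the string */

    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }

    /* bsf: the lowest set bit is the offset of the first zero byte. */
    return (size_t)((p - s) + (ptrdiff_t)__builtin_ctz(mask));
}
```

The first block is loaded from the aligned address at or below `s`, so the bytes before the string have to be masked out before looking for the terminator.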

MichaelW · February 08, 2009, 05:33:38 PM

66h is the operand-size prefix, or per Intel the operand-size override prefix:

Quote
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can be the default; use of the prefix selects the non-default size. Use of 66H followed by 0FH is treated as a mandatory prefix by some SSE/SSE2/SSE3 instructions. Other use of the 66H prefix with MMX/SSE/SSE2/SSE3 instructions is reserved; such use may cause unpredictable behavior.

I knew it was used for the integer instructions, but I have never before noticed it on an MMX or SSE instruction, although it does make some sense that they would use it to specify the register size.

I have verified that with the prefixes in place, and the encoding exactly as MLv9 produced, on the first execution:

pmovmskb ecx, xmm1

Sets ecx to 0FFh, so the following conditional jump is always taken, and the tail cleanup code performs the copy operation. The code looks correct according to the Intel references, but on my P3 it does not work as it is documented to work.

Am I the only cheapskate here that is still running a P3?

jj2007 · February 09, 2009, 11:38:01 AM

Quote from: MichaelW on February 08, 2009, 05:33:38 PM
pmovmskb ecx, xmm1
Sets ecx to 0FFh, so the following conditional jump is always taken, and the tail cleanup code performs the copy operation.

Which is kind of a convenient bug, allowing a soft fall through... :-)

Quote
Am I the only cheapskate here that is still running a P3?

Don't despair, the attached version should be fine for you. The code should look familiar to you, just search for 80808080h ...

Timings for a P4:


 alignment: offset src=4210803, dest=4211365

Source len=512
1109     clocks for szCopyXMM
1113     clocks for szCopyMMX
1158     clocks for SzCpy11
1087     clocks for szCopyMMX1
1180     clocks for SzCpy10
1711     clocks for szCopy
2164     clocks for lstrcpy

Source len=511
1096     clocks for szCopyXMM
1098     clocks for szCopyMMX
1160     clocks for SzCpy11, result NOT CORRECT
1082     clocks for szCopyMMX1, result NOT CORRECT
1165     clocks for SzCpy10
1699     clocks for szCopy
2166     clocks for lstrcpy

Source len=15
75       clocks for szCopyXMM
58       clocks for szCopyMMX
47       clocks for SzCpy11, result NOT CORRECT
31       clocks for szCopyMMX1, result NOT CORRECT
49       clocks for SzCpy10
70       clocks for szCopy
147      clocks for lstrcpy

[attachment deleted by admin]

FORTRANS · February 09, 2009, 03:03:45 PM

Quote from: MichaelW on February 08, 2009, 05:33:38 PM
Am I the only cheapskate here that is still running a P3?

Hi,

Hardly. While I have newer machines, _at home_ I mostly use
my PIII or Pentium systems. And I have an HP 200LX that uses
an 80186 in my pocket. I did once throw away an IBM PC... I
really ought to get rid of some of the older ones.

Regards,

Steve

jj2007 · February 10, 2009, 12:01:49 AM

Updated with some refinements - see Core Duo (Celeron M) timings below.
I included crt_strcpy in the testbed - it's remarkably fast for short strings.

An additional test for a global "CanSSE2" variable would cost ca. 3 cycles extra. My goal was a fast general purpose library routine that does not trash the FPU. The current szCopyXMM is pretty good on the Core2, but I would be grateful for more timings, especially on the old Pentiums and AMDs. The algo is a bit weak on very short strings (0-15 bytes), due to the overhead. A rough estimate says a well commented MASM source has an average line length of 30 or so...

Note: All algos (except the Masm32 szCopy) have been harmonised to copy dest, source, consistent with mov dest, src.
The attached executable was assembled with JWasm.
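
As an illustration of the global "CanSSE2" test mentioned above, here is a rough C sketch of a one-time CPU check followed by a cheap per-call branch. All names (`can_sse2`, `detect_sse2`, `copy_dispatch`, and so on) are invented for the example, the SSE2 routine is only a stand-in, and the detection relies on the GCC/Clang builtin `__builtin_cpu_supports`, so treat it as a sketch rather than the testbed's actual code.

```c
#include <stddef.h>

/* Hypothetical global mirroring the "CanSSE2" flag; set once at startup. */
static int can_sse2;

static void detect_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    can_sse2 = __builtin_cpu_supports("sse2") != 0;
#else
    can_sse2 = 0;  /* unknown toolchain or CPU: stay on the safe path */
#endif
}

/* Plain byte loop that runs on any CPU. */
static char *copy_plain(char *dest, const char *src)
{
    char *d = dest;
    while ((*d++ = *src++) != '\0') { }
    return dest;
}

/* Stand-in for an SSE2 routine such as szCopyXMM; here it just forwards. */
static char *copy_sse2(char *dest, const char *src)
{
    return copy_plain(dest, src);
}

/* The per-call cost is one load and one conditional branch. */
static char *copy_dispatch(char *dest, const char *src)
{
    return can_sse2 ? copy_sse2(dest, src) : copy_plain(dest, src);
}
```

Call `detect_sse2()` once at program start, then use `copy_dispatch(dest, src)` everywhere. A function pointer set at startup would be an alternative to the flag test.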


 alignment: offset src=4214899, dest=4215461
 len of szCopyXMM: 118

Source len=512
465      clocks for szCopyXMM
456      clocks for szCopyMMX   mmx, trashes FPU
481      clocks for SzCpy10     mmx, trashes FPU
1054     clocks for szCopy
1098     clocks for lstrcpy
668      clocks for crt_strcpy

Source len=511
454      clocks for szCopyXMM
434      clocks for szCopyMMX   mmx, trashes FPU
480      clocks for SzCpy10     mmx, trashes FPU
1055     clocks for szCopy
1086     clocks for lstrcpy
684      clocks for crt_strcpy

Source len=128
139      clocks for szCopyXMM
119      clocks for szCopyMMX   mmx, trashes FPU
126      clocks for SzCpy10     mmx, trashes FPU
285      clocks for szCopy
304      clocks for lstrcpy
171      clocks for crt_strcpy

Source len=127
129      clocks for szCopyXMM
114      clocks for szCopyMMX   mmx, trashes FPU
126      clocks for SzCpy10     mmx, trashes FPU
286      clocks for szCopy
304      clocks for lstrcpy
171      clocks for crt_strcpy

Source len=31
56       clocks for szCopyXMM
46       clocks for szCopyMMX   mmx, trashes FPU
52       clocks for SzCpy10     mmx, trashes FPU
96       clocks for szCopy
112      clocks for lstrcpy
51       clocks for crt_strcpy

Source len=17
33       clocks for szCopyXMM
38       clocks for szCopyMMX   mmx, trashes FPU
42       clocks for SzCpy10     mmx, trashes FPU
54       clocks for szCopy
94       clocks for lstrcpy
32       clocks for crt_strcpy

Source len=15
79       clocks for szCopyXMM
46       clocks for szCopyMMX   mmx, trashes FPU
39       clocks for SzCpy10     mmx, trashes FPU
48       clocks for szCopy
89       clocks for lstrcpy
31       clocks for crt_strcpy

[attachment deleted by admin]

sinsi · February 10, 2009, 12:41:12 AM

Athlon XP 2600+ (2.13GHz)



 alignment: offset src=4214899, dest=4215461
 len of szCopyXMM: 118

Source len=512
688	 clocks for szCopyXMM
451	 clocks for szCopyMMX 	mmx, trashes FPU
455	 clocks for SzCpy10 	mmx, trashes FPU
1827	 clocks for szCopy
1794	 clocks for lstrcpy
742	 clocks for crt_strcpy

Source len=511
683	 clocks for szCopyXMM
420	 clocks for szCopyMMX 	mmx, trashes FPU
455	 clocks for SzCpy10 	mmx, trashes FPU
1822	 clocks for szCopy
1789	 clocks for lstrcpy
740	 clocks for crt_strcpy

Source len=128
198	 clocks for szCopyXMM
142	 clocks for szCopyMMX 	mmx, trashes FPU
136	 clocks for SzCpy10 	mmx, trashes FPU
479	 clocks for szCopy
483	 clocks for lstrcpy
206	 clocks for crt_strcpy

Source len=127
201	 clocks for szCopyXMM
138	 clocks for szCopyMMX 	mmx, trashes FPU
136	 clocks for SzCpy10 	mmx, trashes FPU
475	 clocks for szCopy
479	 clocks for lstrcpy
204	 clocks for crt_strcpy

Source len=31
62	 clocks for szCopyXMM
43	 clocks for szCopyMMX 	mmx, trashes FPU
40	 clocks for SzCpy10 	mmx, trashes FPU
137	 clocks for szCopy
151	 clocks for lstrcpy
58	 clocks for crt_strcpy

Source len=17
43	 clocks for szCopyXMM
48	 clocks for szCopyMMX 	mmx, trashes FPU
31	 clocks for SzCpy10 	mmx, trashes FPU
88	 clocks for szCopy
104	 clocks for lstrcpy
39	 clocks for crt_strcpy

Source len=15
41	 clocks for szCopyXMM
36	 clocks for szCopyMMX 	mmx, trashes FPU
26	 clocks for SzCpy10 	mmx, trashes FPU
81	 clocks for szCopy
97	 clocks for lstrcpy
36	 clocks for crt_strcpy

jj2007 · February 10, 2009, 01:23:22 AM

Thanks, Sinsi. Interesting that the algo performs much better for the 15-byte string than on the Core2. And crt_strcpy is also remarkably good all over the place.

EDIT: Here is the innermost loop of crt_strcpy. Interesting ::)


77C160C1               8917                          mov dword ptr [edi], edx
77C160C3               83C7 04                       add edi, 4
77C160C6               BA FFFEFE7E                   mov edx, 7EFEFEFF
77C160CB               8B01                          mov eax, dword ptr [ecx]
77C160CD               03D0                          add edx, eax
77C160CF               83F0 FF                       xor eax, FFFFFFFF
77C160D2               33C2                          xor eax, edx
77C160D4               8B11                          mov edx, dword ptr [ecx]
77C160D6               83C1 04                       add ecx, 4
77C160D9               A9 00010181                   test eax, 81010100
77C160DE               74 E1                         je short msvcrt.77C160C1
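
The loop above is a zero-byte test on a whole dword. A minimal C sketch of the same arithmetic follows (the function name `maybe_has_zero` is made up for illustration). It mirrors the add / xor / xor / test sequence with the constants 7EFEFEFFh and 81010100h. Note that the test can fire without any zero byte when the top byte is 80h or higher, so a caller still has to check the four bytes individually after a hit.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when the dword MAY contain a zero byte. */
static int maybe_has_zero(uint32_t x)
{
    uint32_t sum = x + 0x7EFEFEFFu;          /* add edx, eax              */
    uint32_t t   = (~x) ^ sum;               /* xor eax, -1; xor eax, edx */
    return (t & 0x81010100u) != 0;           /* test eax, 81010100h       */
}
```

For example, `maybe_has_zero(0x64636261)` ("abcd") is false and `maybe_has_zero(0x00636261)` ("abc" plus terminator) is true, but `maybe_has_zero(0x80636261)` is also true although no byte is zero.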

donkey · February 10, 2009, 02:33:51 AM

Quote from: MichaelW on February 08, 2009, 05:33:38 PM
Am I the only cheapskate here that is still running a P3?

Well, not quite a P3 here but a PIV, a Sempron and an Athlon 64 X2, but the Sempron is the only one I do any dev work on, the A64 is for work and the PIV is just a file server.

sinsi · February 10, 2009, 03:14:32 AM



OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
simplecopy proc dest:DWORD, src:DWORD
    push ebx
    mov ecx,[esp+8]
    mov edx,[esp+12]
    sub ebx,ebx
 @@: mov al,[edx+ebx]
    mov [ecx+ebx],al
    inc ebx
    test al,al
    jnz @b
    pop ebx
    ret 8
simplecopy endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef

23 bytes, works on a 386



Source len=512
535      clocks for szCopyXMM
444      clocks for szCopyMMX   mmx, trashes FPU
469      clocks for SzCpy10     mmx, trashes FPU
559      clocks for szCopy
1071     clocks for lstrcpy
505      clocks for crt_strcpy
559      clocks for simplecopy

Source len=511
544      clocks for szCopyXMM
441      clocks for szCopyMMX   mmx, trashes FPU
470      clocks for SzCpy10     mmx, trashes FPU
554      clocks for szCopy
1071     clocks for lstrcpy
505      clocks for crt_strcpy
554      clocks for simplecopy

Source len=128
145      clocks for szCopyXMM
134      clocks for szCopyMMX   mmx, trashes FPU
125      clocks for SzCpy10     mmx, trashes FPU
174      clocks for szCopy
301      clocks for lstrcpy
136      clocks for crt_strcpy
172      clocks for simplecopy

Source len=127
141      clocks for szCopyXMM
123      clocks for szCopyMMX   mmx, trashes FPU
124      clocks for SzCpy10     mmx, trashes FPU
167      clocks for szCopy
299      clocks for lstrcpy
135      clocks for crt_strcpy
168      clocks for simplecopy

Source len=31
67       clocks for szCopyXMM
57       clocks for szCopyMMX   mmx, trashes FPU
58       clocks for SzCpy10     mmx, trashes FPU
95       clocks for szCopy
108      clocks for lstrcpy
44       clocks for crt_strcpy
94       clocks for simplecopy

Source len=17
65       clocks for szCopyXMM
50       clocks for szCopyMMX   mmx, trashes FPU
34       clocks for SzCpy10     mmx, trashes FPU
53       clocks for szCopy
66       clocks for lstrcpy
26       clocks for crt_strcpy
53       clocks for simplecopy

Source len=15
91       clocks for szCopyXMM
44       clocks for szCopyMMX   mmx, trashes FPU
37       clocks for SzCpy10     mmx, trashes FPU
46       clocks for szCopy
59       clocks for lstrcpy
22       clocks for crt_strcpy
46       clocks for simplecopy

jj2007 · February 10, 2009, 07:57:38 AM

Quote from: sinsi on February 10, 2009, 03:14:32 AM
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
simplecopy proc dest:DWORD, src:DWORD
    push ebx
    mov ecx,[esp+8]
    mov edx,[esp+12]
    sub ebx,ebx
 @@: mov al,[edx+ebx]
    mov [ecx+ebx],al
    inc ebx
    test al,al
    jnz @b
    pop ebx
    ret 8
simplecopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
23 bytes, works on a 386

Cute. Celeron M:


Source len=511
454      clocks for szCopyXMM
1052     clocks for simplecopy

Source len=17
32       clocks for szCopyXMM
53       clocks for simplecopy

Source len=15
79       clocks for szCopyXMM
47       clocks for simplecopy

I am working on the ultimate solution:


.if len(src)>=32
    invoke szCopyXMM, ...
.else
    invoke simplecopy, ...
.endif

But jokes apart: Did you change puter? Your previous timings look a lot different.

EDIT: Sinsi, I was about to accuse you of pulling my leg because the timings are virtually identical to those of the Masm32lib szCopy algo, but no, the two routines are not identical:


szCopy proc src:DWORD,dst:DWORD

    push ebp
    push esi

    mov edx, [esp+12]
    mov ebp, [esp+16]
    mov eax, -1
    mov esi, 1

  @@:
    add eax, esi
    movzx ecx, BYTE PTR [edx+eax]
    mov [ebp+eax], cl
    test ecx, ecx
    jnz @B

    pop esi
    pop ebp

    ret 8

szCopy endp

jj2007 · February 10, 2009, 04:25:35 PM

While chasing the ultimate lstrcpy replacement, I stumbled over an interesting question: In real life, source strings can be aligned to dwords, and destinations can be aligned, but we can rarely do both simultaneously. My test bed says that skipping alignment altogether is a no-no, but then I decided to compare two versions of the same algo, one pre-aligning the source, the other the destination. To my surprise, there is a difference (MbCopy = source aligned, MbCopyD = destination aligned, timings for a P4):


Source len=512
1254     clocks for MbCopy
1022     clocks for MbCopyD

Source len=63
178      clocks for MbCopy
139      clocks for MbCopyD

Source len=55
163      clocks for MbCopy
134      clocks for MbCopyD

Source len=48
150      clocks for MbCopy
128      clocks for MbCopyD

Source len=42
144      clocks for MbCopy
119      clocks for MbCopyD

Source len=37
138      clocks for MbCopy
105      clocks for MbCopyD

Source len=15
38       clocks for MbCopy
62       clocks for MbCopyD              <----- the exception

Source len=7
29       clocks for MbCopy
31       clocks for MbCopyD

Now, is that a well-known phenomenon, and are there established rules to follow??

Here is the algo:



NoAlign=	0	; clearly not a good option, but you can test it here
DestAlign=	0	; choose if you want to align the source or the destination
MbCopy proc dest:DWORD, src:DWORD
	push edi
	mov edi, [esp+8]
	mov ecx, [esp+12]
	if NoAlign
		jmp mbcMain 			; neither source nor dest alignment??
	endif
	if DestAlign
	  test edi, 3					; edi=destination address
	else
	  test ecx, 3					; ecx=source address
	endif
	je mbcMain					; dword aligned
@@:	mov al, byte ptr [ecx]		; a byte from src
	inc ecx
	test al, al
	mov byte ptr [edi], al		; does not change the flag, so we can say bye if al was zero
	je mbcBye
	inc edi
	if DestAlign
	  test edi, 3					; edi=destination address
	else
	  test ecx, 3					; ecx=source address
	endif
	jne @B
	jmp mbcMain
	; align 16 no good, costs cycles

@@:				; ------------ innermost loop ------------
	mov [edi], eax
	add edi, 4
mbcMain:	
	mov eax, 07EFEFEFFh
	mov edx, [ecx]
	add eax, edx
	xor edx, eax
	xor edx, 0FFFFFFFFh
	mov eax, [ecx]
	add ecx, 4
	test edx, 81010100h
	je @B			; ------------ innermost loop ------------

	test al, al
	je mbc1
	test ah, ah
	je mbc2
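	test eax, 00FF0000h
	je mbc3
	test eax, 0FF000000h	; bit 31 of the test also fires for a top byte >= 80h
	jne @B				; no zero byte in this dword: store it and keep scanning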

mbc4:	
	mov [edi], eax
	jmp mbcBye
mbc3:	
	mov byte ptr [edi+2], 0
mbc2:	
	mov word ptr [edi], ax
mbc1:
	mov byte ptr [edi], al
mbcBye:	
	mov edx, [esp+8]	; return start of buffer
	pop edi
	ret 8
MbCopy endp
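
For comparison, here is a rough C sketch of the source-aligned variant. It is not a translation of the routine above: the name `copy_src_aligned` is invented, and it swaps in the exact zero-byte test built on 80808080h instead of the 7EFEFEFFh test, so no per-byte recheck for false positives is needed. One point in favour of aligning the source: an aligned 4-byte load cannot cross a page boundary, whereas the destination-aligned variant reads unaligned dwords that may reach up to 3 bytes past the terminator.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical C sketch of a dword copy that pre-aligns the source. */
static char *copy_src_aligned(char *dest, const char *src)
{
    char *d = dest;

    /* Byte loop until the source address is a multiple of 4. */
    while ((uintptr_t)src & 3) {
        if ((*d++ = *src++) == '\0')
            return dest;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, src, 4);            /* aligned load from the source   */
        /* Exact test: nonzero only if some byte of w is zero. */
        if ((w - 0x01010101u) & ~w & 0x80808080u)
            break;
        memcpy(d, &w, 4);              /* store may be unaligned         */
        src += 4;
        d += 4;
    }

    /* Tail: copy the last 1 to 4 bytes including the terminator. */
    while ((*d++ = *src++) != '\0') { }
    return dest;
}
```

Like the assembly version, the dword load may read a few bytes beyond the terminator, so strictly speaking the sketch relies on the source buffer being readable up to the next dword boundary.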

By the way, on the P4 the algo beats my previously posted XMM/SSE2 algo hands down. Full testbed attached.

[attachment deleted by admin]

Mark Jones · February 10, 2009, 04:50:22 PM

Quote from: jj2007 on February 10, 2009, 04:25:35 PM
While chasing the ultimate lstrcpy replacement...

Careful JJ, you know that such a thing is impossible, right? :toothy

jj2007 · February 10, 2009, 05:28:26 PM

Quote from: Mark Jones on February 10, 2009, 04:50:22 PM
Quote from: jj2007 on February 10, 2009, 04:25:35 PM
While chasing the ultimate lstrcpy replacement...

Careful JJ, you know that such a thing is impossible, right? :toothy

Hmmm... like infinity, right? But I am approaching zero cycles asymptotically. Go ahead, post your timings :bg
