lstrcpy vs szCopy

Started by jj2007, February 07, 2009, 11:02:33 PM


MichaelW

For szCpyV614:

szCopyXMM -> jj   ->               xmm: 2082 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 632 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 697 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 624 clocks
                                szCopy  2076 clocks
                                lstrcpy 2704 clocks
SzCpy10        - > Lingo ->        MMX: 701 clocks
MbCopy     -> jj   ->              xmm: 697 clocks

These are basically the results I got for the version where I manually added the prefixes and assembled with 6.14.

For szCpyV9:

szCopyXMM -> jj   ->               xmm: 2082 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 633 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 697 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 624 clocks
                                szCopy  2078 clocks
                                lstrcpy 2703 clocks
SzCpy10        - > Lingo ->        MMX: 700 clocks
MbCopy     -> jj   ->              xmm:

There is no count for the last procedure because it generates:

Exception number: c000001d (illegal instruction)

And these are basically the results I got for the versions that I assembled with 6.15 and 7.00, which BTW added the prefixes themselves.

I would think Intel would ensure that any instruction that runs on this processor produces the same result as on the later processors. I would be interested to see if other P3 processors have this problem.
eschew obfuscation

jj2007

Interesting. Exactly the same clocks as szCopy, but a lot slower than the MMX versions. Can anybody post results for an AMD, or other Intel processors?

Michael, the 66h seems to have a function similar to nop. Could you try replacing the 66h with a number of nops?

jj2007

Just found this, indicating it's a known problem of early Pentiums:

Many compilers for IA-32 generate "repne scasb" in order
to find the length of a given C string. However, it is possible
to implement strlen (and many other string functions) using
the SSE2 instruction set: pcmpeqb + pmovmaskb until there
is a set bit then bsf to find its index. On Core2 it is roughly
9.3 times faster and about 6.5 times faster on Pentium 4.

http://www.mydatabasesupport.com/forums/arch/252748-fast-string-functions.html
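For readers who prefer C to opcodes, here is a minimal sketch of the technique the quote describes, using the SSE2 intrinsics that map to pcmpeqb and pmovmskb. The names sse2_strlen and lowest_set_bit are made up for illustration, the bit scan is a plain loop standing in for bsf, and the routine deliberately reads whole aligned 16-byte blocks, so it can touch bytes before and after the string inside the same block. That is harmless on x86 in practice but is outside what standard C guarantees. A byte loop is used where SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>              /* SSE2 intrinsics */

/* Index of the lowest set bit; mask must be nonzero (stand-in for bsf). */
static unsigned lowest_set_bit(unsigned mask)
{
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Hypothetical name: length of a zero-terminated string, 16 bytes per step. */
size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t addr = (uintptr_t)s;
    const char *block = (const char *)(addr & ~(uintptr_t)15); /* round down */
    unsigned skip = (unsigned)(addr & 15);  /* bytes of the block before s */

    /* pcmpeqb + pmovmskb: one mask bit per byte, set where the byte is 0 */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)block), zero));
    mask &= 0xFFFFu << skip;                /* ignore bytes in front of s */

    while (mask == 0) {
        block += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)block), zero));
    }
    return (size_t)((uintptr_t)block + lowest_set_bit(mask) - addr);
}
#else
/* Fallback when the compiler is not targeting SSE2. */
size_t sse2_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
#endif
```

Rounding the pointer down to a 16-byte boundary keeps every load aligned, which is why the first mask has to drop the bits that belong to bytes in front of the string.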

MichaelW

66h is the operand-size prefix, or per Intel the operand-size override prefix:
Quote
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can be the default; use of the prefix selects the non-default size. Use of 66H followed by 0FH is treated as a mandatory prefix by some SSE/SSE2/SSE3 instructions. Other use of the 66H prefix with MMX/SSE/SSE2/SSE3 instructions is reserved; such use may cause unpredictable behavior.

I knew it was used for the integer instructions, but I have never before noticed it on an MMX or SSE instruction, although it does make some sense that they would use it to specify the register size.

I have verified that with the prefixes in place, and the encoding exactly as MLv9 produced, on the first execution:

pmovmskb ecx, xmm1

Sets ecx to 0FFh, so the following conditional jump is always taken, and the tail cleanup code performs the copy operation. The code looks correct according to the Intel references, but on my P3 it does not work as it is documented to work.

Am I the only cheapskate here that is still running a P3?

jj2007

Quote from: MichaelW on February 08, 2009, 05:33:38 PM
pmovmskb ecx, xmm1
Sets ecx to 0FFh, so the following conditional jump is always taken, and the tail cleanup code performs the copy operation.

Which is kind of a convenient bug, allowing a soft fall through... :-)


Quote
Am I the only cheapskate here that is still running a P3?

Don't despair, the attached version should be fine for you. The code should look familiar to you, just search for 80808080h ...

Timings for a P4:
alignment: offset src=4210803, dest=4211365

Source len=512
1109     clocks for szCopyXMM
1113     clocks for szCopyMMX
1158     clocks for SzCpy11
1087     clocks for szCopyMMX1
1180     clocks for SzCpy10
1711     clocks for szCopy
2164     clocks for lstrcpy

Source len=511
1096     clocks for szCopyXMM
1098     clocks for szCopyMMX
1160     clocks for SzCpy11, result NOT CORRECT
1082     clocks for szCopyMMX1, result NOT CORRECT
1165     clocks for SzCpy10
1699     clocks for szCopy
2166     clocks for lstrcpy

Source len=15
75       clocks for szCopyXMM
58       clocks for szCopyMMX
47       clocks for SzCpy11, result NOT CORRECT
31       clocks for szCopyMMX1, result NOT CORRECT
49       clocks for SzCpy10
70       clocks for szCopy
147      clocks for lstrcpy

[attachment deleted by admin]

FORTRANS

Quote from: MichaelW on February 08, 2009, 05:33:38 PM
Am I the only cheapskate here that is still running a P3?

Hi,

   Hardly.  While I have newer machines, _at home_ I mostly use
my PIII or Pentium systems.  And I have an HP 200LX that uses
an 80186 in my pocket.  I did once throw away an IBM PC...  I
really ought to get rid of some of the older ones.

Regards,

Steve

jj2007

Updated with some refinements - see Core Duo (Celeron M) timings below.
I included crt_strcpy into the testbed - it's remarkably fast for short strings.

An additional test for a global "CanSSE2" variable would cost ca. 3 cycles extra. My goal was a fast general purpose library routine that does not trash the FPU. The current szCopyXMM is pretty good on the Core2, but I would be grateful for more timings, especially on the old Pentiums and AMDs. The algo is a bit weak on very short strings (0-15 bytes), due to the overhead. A rough estimate says a well commented MASM source has an average line length of 30 or so...
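A rough idea of how such a "CanSSE2" flag could be filled in, sketched in C rather than MASM. It assumes GCC or Clang on x86, where <cpuid.h> provides __get_cpuid; SSE2 support is reported in CPUID leaf 1, EDX bit 26. The function names detect_sse2 and can_sse2 are invented for this sketch, and on other targets the check simply reports 0.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>          /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Returns 1 if CPUID reports SSE2 (leaf 1, EDX bit 26), otherwise 0. */
int detect_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;           /* leaf 1 not available */
    return (int)((edx >> 26) & 1u);
#else
    return 0;               /* not an x86 build: take the non-SSE2 path */
#endif
}

/* Global in the spirit of "CanSSE2": -1 means "not probed yet". */
int CanSSE2 = -1;

/* Probe once, then answer from the cached flag on every later call. */
int can_sse2(void)
{
    if (CanSSE2 < 0)
        CanSSE2 = detect_sse2();
    return CanSSE2;
}
```

A copy routine would test the cached flag before taking the XMM path, which is the "ca. 3 cycles" of overhead mentioned above; the CPUID call itself is far too slow to repeat per call.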

Note: All algos (except the Masm32 szCopy) have been harmonised to copy dest, source, consistent with mov dest, src.
The attached executable was assembled with JWasm.

alignment: offset src=4214899, dest=4215461
len of szCopyXMM: 118

Source len=512
465      clocks for szCopyXMM
456      clocks for szCopyMMX   mmx, trashes FPU
481      clocks for SzCpy10     mmx, trashes FPU
1054     clocks for szCopy
1098     clocks for lstrcpy
668      clocks for crt_strcpy

Source len=511
454      clocks for szCopyXMM
434      clocks for szCopyMMX   mmx, trashes FPU
480      clocks for SzCpy10     mmx, trashes FPU
1055     clocks for szCopy
1086     clocks for lstrcpy
684      clocks for crt_strcpy

Source len=128
139      clocks for szCopyXMM
119      clocks for szCopyMMX   mmx, trashes FPU
126      clocks for SzCpy10     mmx, trashes FPU
285      clocks for szCopy
304      clocks for lstrcpy
171      clocks for crt_strcpy

Source len=127
129      clocks for szCopyXMM
114      clocks for szCopyMMX   mmx, trashes FPU
126      clocks for SzCpy10     mmx, trashes FPU
286      clocks for szCopy
304      clocks for lstrcpy
171      clocks for crt_strcpy

Source len=31
56       clocks for szCopyXMM
46       clocks for szCopyMMX   mmx, trashes FPU
52       clocks for SzCpy10     mmx, trashes FPU
96       clocks for szCopy
112      clocks for lstrcpy
51       clocks for crt_strcpy

Source len=17
33       clocks for szCopyXMM
38       clocks for szCopyMMX   mmx, trashes FPU
42       clocks for SzCpy10     mmx, trashes FPU
54       clocks for szCopy
94       clocks for lstrcpy
32       clocks for crt_strcpy

Source len=15
79       clocks for szCopyXMM
46       clocks for szCopyMMX   mmx, trashes FPU
39       clocks for SzCpy10     mmx, trashes FPU
48       clocks for szCopy
89       clocks for lstrcpy
31       clocks for crt_strcpy

[attachment deleted by admin]

sinsi

Athlon XP 2600+ (2.13GHz)

alignment: offset src=4214899, dest=4215461
len of szCopyXMM: 118

Source len=512
688 clocks for szCopyXMM
451 clocks for szCopyMMX mmx, trashes FPU
455 clocks for SzCpy10 mmx, trashes FPU
1827 clocks for szCopy
1794 clocks for lstrcpy
742 clocks for crt_strcpy

Source len=511
683 clocks for szCopyXMM
420 clocks for szCopyMMX mmx, trashes FPU
455 clocks for SzCpy10 mmx, trashes FPU
1822 clocks for szCopy
1789 clocks for lstrcpy
740 clocks for crt_strcpy

Source len=128
198 clocks for szCopyXMM
142 clocks for szCopyMMX mmx, trashes FPU
136 clocks for SzCpy10 mmx, trashes FPU
479 clocks for szCopy
483 clocks for lstrcpy
206 clocks for crt_strcpy

Source len=127
201 clocks for szCopyXMM
138 clocks for szCopyMMX mmx, trashes FPU
136 clocks for SzCpy10 mmx, trashes FPU
475 clocks for szCopy
479 clocks for lstrcpy
204 clocks for crt_strcpy

Source len=31
62 clocks for szCopyXMM
43 clocks for szCopyMMX mmx, trashes FPU
40 clocks for SzCpy10 mmx, trashes FPU
137 clocks for szCopy
151 clocks for lstrcpy
58 clocks for crt_strcpy

Source len=17
43 clocks for szCopyXMM
48 clocks for szCopyMMX mmx, trashes FPU
31 clocks for SzCpy10 mmx, trashes FPU
88 clocks for szCopy
104 clocks for lstrcpy
39 clocks for crt_strcpy

Source len=15
41 clocks for szCopyXMM
36 clocks for szCopyMMX mmx, trashes FPU
26 clocks for SzCpy10 mmx, trashes FPU
81 clocks for szCopy
97 clocks for lstrcpy
36 clocks for crt_strcpy
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Thanks, Sinsi. Interesting that the algo performs much better on the 15-byte string than it does on the Core2. And crt_strcpy is also remarkably good all over the place.

EDIT: Here is the innermost loop of crt_strcpy. Interesting ::)

77C160C1               8917                          mov dword ptr [edi], edx
77C160C3               83C7 04                       add edi, 4
77C160C6               BA FFFEFE7E                   mov edx, 7EFEFEFF
77C160CB               8B01                          mov eax, dword ptr [ecx]
77C160CD               03D0                          add edx, eax
77C160CF               83F0 FF                       xor eax, FFFFFFFF
77C160D2               33C2                          xor eax, edx
77C160D4               8B11                          mov edx, dword ptr [ecx]
77C160D6               83C1 04                       add ecx, 4
77C160D9               A9 00010181                   test eax, 81010100
77C160DE               74 E1                         je short msvcrt.77C160C1
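What the loop does with 7EFEFEFF and 81010100 can be written out in C as a small sketch (function names are made up here). Adding 7EFEFEFFh produces a carry out of every byte that is nonzero; the xor/not sequence then exposes the carries, and the mask picks the bits that show a missing carry. The test is cheap but not exact: it also fires when the top byte is 80h, so a hit has to be confirmed byte by byte.

```c
#include <stdint.h>

/* Mirrors the listing: nonzero if x MAY contain a zero byte. */
uint32_t may_have_zero_byte(uint32_t x)
{
    uint32_t sum = x + 0x7EFEFEFFu;   /* add edx, eax (wraps modulo 2^32) */
    uint32_t t = ~x ^ sum;            /* xor eax, FFFFFFFF / xor eax, edx */
    return t & 0x81010100u;           /* test eax, 81010100 */
}

/* Exact check, used to confirm a hit from the fast test. */
int has_zero_byte(uint32_t x)
{
    for (int i = 0; i < 4; i++) {
        if (((x >> (8 * i)) & 0xFFu) == 0)
            return 1;
    }
    return 0;
}
```

For example, 41414141h passes through without a hit, 41410041h is flagged because of its zero byte, and 80414141h is flagged although it contains no zero byte at all.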

donkey

Quote from: MichaelW on February 08, 2009, 05:33:38 PM
Am I the only cheapskate here that is still running a P3?

Well, not quite a P3 here but a PIV, a Sempron and an Athlon 64 X2, but the Sempron is the only one I do any dev work on, the A64 is for work and the PIV is just a file server.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

sinsi


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
simplecopy proc dest:DWORD, src:DWORD
    push ebx
    mov ecx,[esp+8]
    mov edx,[esp+12]
    sub ebx,ebx
@@: mov al,[edx+ebx]
    mov [ecx+ebx],al
    inc ebx
    test al,al
    jnz @b
    pop ebx
    ret 8
simplecopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

23 bytes, works on a 386


Source len=512
535      clocks for szCopyXMM
444      clocks for szCopyMMX   mmx, trashes FPU
469      clocks for SzCpy10     mmx, trashes FPU
559      clocks for szCopy
1071     clocks for lstrcpy
505      clocks for crt_strcpy
559      clocks for simplecopy

Source len=511
544      clocks for szCopyXMM
441      clocks for szCopyMMX   mmx, trashes FPU
470      clocks for SzCpy10     mmx, trashes FPU
554      clocks for szCopy
1071     clocks for lstrcpy
505      clocks for crt_strcpy
554      clocks for simplecopy

Source len=128
145      clocks for szCopyXMM
134      clocks for szCopyMMX   mmx, trashes FPU
125      clocks for SzCpy10     mmx, trashes FPU
174      clocks for szCopy
301      clocks for lstrcpy
136      clocks for crt_strcpy
172      clocks for simplecopy

Source len=127
141      clocks for szCopyXMM
123      clocks for szCopyMMX   mmx, trashes FPU
124      clocks for SzCpy10     mmx, trashes FPU
167      clocks for szCopy
299      clocks for lstrcpy
135      clocks for crt_strcpy
168      clocks for simplecopy

Source len=31
67       clocks for szCopyXMM
57       clocks for szCopyMMX   mmx, trashes FPU
58       clocks for SzCpy10     mmx, trashes FPU
95       clocks for szCopy
108      clocks for lstrcpy
44       clocks for crt_strcpy
94       clocks for simplecopy

Source len=17
65       clocks for szCopyXMM
50       clocks for szCopyMMX   mmx, trashes FPU
34       clocks for SzCpy10     mmx, trashes FPU
53       clocks for szCopy
66       clocks for lstrcpy
26       clocks for crt_strcpy
53       clocks for simplecopy

Source len=15
91       clocks for szCopyXMM
44       clocks for szCopyMMX   mmx, trashes FPU
37       clocks for SzCpy10     mmx, trashes FPU
46       clocks for szCopy
59       clocks for lstrcpy
22       clocks for crt_strcpy
46       clocks for simplecopy

jj2007

Quote from: sinsi on February 10, 2009, 03:14:32 AM

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
simplecopy proc dest:DWORD, src:DWORD
    push ebx
    mov ecx,[esp+8]
    mov edx,[esp+12]
    sub ebx,ebx
@@: mov al,[edx+ebx]
    mov [ecx+ebx],al
    inc ebx
    test al,al
    jnz @b
    pop ebx
    ret 8
simplecopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

23 bytes, works on a 386

Cute. Celeron M:

Source len=511
454      clocks for szCopyXMM
1052     clocks for simplecopy

Source len=17
32       clocks for szCopyXMM
53       clocks for simplecopy

Source len=15
79       clocks for szCopyXMM
47       clocks for simplecopy


I am working on the ultimate solution:

.if len(src)>=32
    invoke szCopyXMM, ...
.else
    invoke simplecopy, ...
.endif


But joking apart: did you change computers? Your previous timings look a lot different.

EDIT: Sinsi, I was about to accuse you of pulling my leg because the timings are virtually identical to those of the Masm32lib szCopy algo, but no, the two routines are not identical:

szCopy proc src:DWORD,dst:DWORD

    push ebp
    push esi

    mov edx, [esp+12]
    mov ebp, [esp+16]
    mov eax, -1
    mov esi, 1

  @@:
    add eax, esi
    movzx ecx, BYTE PTR [edx+eax]
    mov [ebp+eax], cl
    test ecx, ecx
    jnz @B

    pop esi
    pop ebp

    ret 8

szCopy endp

jj2007

While chasing the ultimate lstrcpy replacement, I stumbled over an interesting question: In real life, source strings can be aligned to dwords, destinations can be aligned, but we can rarely do both simultaneously. My testbed says that not aligning at all is a no-no, so I decided to compare two versions of the same algo, one pre-aligning the source, the other the destination. To my surprise, there is a difference (MbCopy=src aligned, MbCopyD=dest aligned, timings for a P4):

Source len=512
1254     clocks for MbCopy
1022     clocks for MbCopyD

Source len=63
178      clocks for MbCopy
139      clocks for MbCopyD

Source len=55
163      clocks for MbCopy
134      clocks for MbCopyD

Source len=48
150      clocks for MbCopy
128      clocks for MbCopyD

Source len=42
144      clocks for MbCopy
119      clocks for MbCopyD

Source len=37
138      clocks for MbCopy
105      clocks for MbCopyD

Source len=15
38       clocks for MbCopy
62       clocks for MbCopyD              <----- the exception

Source len=7
29       clocks for MbCopy
31       clocks for MbCopyD


Now, is that a well-known phenomenon, and are there established rules to follow??

Here is the algo:
NoAlign= 0 ; clearly not a good option, but you can test it here
DestAlign= 0 ; choose if you want to align the source or the destination
MbCopy proc dest:DWORD, src:DWORD
push edi
mov edi, [esp+8]
mov ecx, [esp+12]
if NoAlign
jmp mbcMain ; neither source nor dest alignment??
endif
if DestAlign
  test edi, 3 ; edi=destination address
else
  test ecx, 3 ; ecx=source address
endif
je mbcMain ; dword aligned
@@: mov al, byte ptr [ecx] ; a byte from src
inc ecx
test al, al
mov byte ptr [edi], al ; does not change the flag, so we can say bye if al was zero
je mbcBye
inc edi
if DestAlign
  test edi, 3 ; edi=destination address
else
  test ecx, 3 ; ecx=source address
endif
jne @B
jmp mbcMain
; align 16 no good, costs cycles

@@: ; ------------ innermost loop ------------
mov [edi], eax
add edi, 4
mbcMain:
mov eax, 07EFEFEFFh
mov edx, [ecx]
add eax, edx
xor edx, eax
xor edx, 0FFFFFFFFh
mov eax, [ecx]
add ecx, 4
test edx, 81010100h
je @B ; ------------ innermost loop ------------

test al, al
je mbc1
test ah, ah
je mbc2
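test eax, 00FF0000h
je mbc3
test eax, 0FF000000h ; the 81010100h test also fires when byte 3 is 80h,
jne @B ; which is no terminator: store the dword and keep copying

mbc4:
mov [edi], eax
jmp mbcBye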
mbc3:
mov byte ptr [edi+2], 0
mbc2:
mov word ptr [edi], ax
mbc1:
mov byte ptr [edi], al
mbcBye:
mov edx, [esp+8] ; return start of buffer
pop edi
ret 8
MbCopy endp
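For comparison, the same idea in C: a sketch only, with a made-up name (word_copy), using the source-aligned variant. It copies bytes until the source address is dword aligned, then moves four bytes per step with the 7EFEFEFFh/81010100h test. Because that test can also fire when the top byte is 80h, a hit is confirmed byte by byte and the loop carries on if no zero was found. Like the assembly, it may read up to three bytes beyond the terminator inside the last aligned dword, which standard C does not promise is safe.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical C counterpart of a source-aligned dword copy. Returns dest. */
char *word_copy(char *dest, const char *src)
{
    char *d = dest;

    /* Byte-wise until the source address is a multiple of 4. */
    while (((uintptr_t)src & 3u) != 0) {
        if ((*d++ = *src++) == '\0')
            return dest;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, src, 4);                 /* aligned 4-byte load */

        if ((~(w ^ (w + 0x7EFEFEFFu)) & 0x81010100u) != 0) {
            /* Possible terminator: copy byte by byte and check exactly. */
            for (int i = 0; i < 4; i++) {
                if ((d[i] = src[i]) == '\0')
                    return dest;
            }
            /* False alarm (top byte was 80h): all 4 bytes already copied. */
        } else {
            memcpy(d, &w, 4);               /* no zero byte: store the dword */
        }
        src += 4;
        d += 4;
    }
}
```

Swapping the alignment test from src to d gives the destination-aligned flavour, at the price of unaligned loads instead of unaligned stores.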


By the way, on the P4 the algo beats my previously posted XMM/SSE2 algo hands down. Full testbed attached.

[attachment deleted by admin]

Mark Jones

Quote from: jj2007 on February 10, 2009, 04:25:35 PM
While chasing the ultimate lstrcpy replacement...

Careful JJ, you know that such a thing is impossible, right? :toothy
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on February 10, 2009, 04:50:22 PM
Quote from: jj2007 on February 10, 2009, 04:25:35 PM
While chasing the ultimate lstrcpy replacement...

Careful JJ, you know that such a thing is impossible, right? :toothy

Hmmm... like infinity, right? But I am approaching zero cycles asymptotically. Go ahead, post your timings :bg