SzCpy vs. lstrcpy

Mark_Larson · May 13, 2005, 04:24:09 AM

That is equivalent to doing an "ALIGN 16" in the inner loop, which is what I kept trying to get you to try :) Here's why it is the same as an align (without the jump, of course). Let's look at all the code before the first loop. That consists of exactly two lines of code.

Code Select


align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov	eax,[src_str] 
   mov	esi,[dst_str]


;THIS gets translated into the following.  The first two lines are the "prologue"
; the second two lines are the first two lines of the code.  You will notice that it
; also gives the opcodes for the instructions.  

00401260 55                   push        ebp
00401261 8B EC                mov         ebp,esp
00401263 8B 45 08             mov         eax,dword ptr [ebp+8]
00401266 8B 75 0C             mov         esi,dword ptr [ebp+0Ch]

push ebp = 55h
mov ebp,esp = 8Bh ECh
mov eax,dword ptr [ebp+8] = 8Bh 45h 08h
mov esi,dword ptr [ebp+0Ch] = 8Bh 75h 0Ch

If you add up the opcode bytes in those 4 instructions, it comes out to 9 (1 + 2 + 3 + 3). Now, if you remember, you added 7 bytes just before the label "qword_copy". 9 + 7 = 16 bytes. Since the first line of the routine is 16-byte aligned, this forces the "qword_copy" label to also be 16-byte aligned. It's just easier to use ALIGN 16 in front of the label.

Code Select


align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov	eax,[src_str] 
   mov	esi,[dst_str]

align 16
qword_copy:

Jimg · May 13, 2005, 04:43:59 AM

Right. But it measures 8 cycles longer to work through the nop's rather than the jump.

Mark_Larson · May 13, 2005, 01:23:25 PM

Quote from: Jimg on May 13, 2005, 04:43:59 AM
Right. But it measures 8 cycles longer to work through the nop's rather than the jump.

That's a common misconception about how ALIGN works in assemblers. They don't always stick single-byte NOPs in there to "pad" to a certain alignment. They have different-sized "NOP"s that they use depending on how many bytes are needed. One of the 7-byte ones looks like this. That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that. Basically, when I say "NOP" in regard to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction. The instruction below does nothing, but it uses up 7 bytes.

Code Select


00401269    8D A4 24 00 00 00 00    lea         esp,[esp]

So if it is coming out 8 cycles faster I'd try a few things.

1) Make sure your timings are accurate to within 8 cycles. Try two different things to make sure it is extremely accurate. Try moving the priority class from HIGH to REALTIME, and also bump up the number of loops by a factor of 10. I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more. It should be taking 5 to 10 or more seconds to run with the extra loops and REALTIME priority class. If it's still running a lot faster than that, bump up the number of loops again. It takes a long time to run, but it will give you an extremely accurate number of cycles.

2) do the ALIGN 16 at the loop but add a JMP right before the instruction. That does exactly the same thing you are doing with the JMP and DBs. You did TWO things to try and make it faster, not one. So I am removing one of them to see, which one made it faster. The advantage of ALIGN is if I go back and add any other instruction in between the first loop and the entry to the routine it will still work, whereas you'd have to re-tweak the number of DBs you have to get it to work right. Make sense? I just woke up, so my brain isn't awake yet. So I am not sure I am explaining it right.

Code Select


jmp qword_copy
align 16
qword_copy:

Getting back to optimization, also make sure the strings are aligned as well. I do it on a 16-byte boundary in case I do SSE/SSE2 later.

Code Select


align 16
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
      str2 db 128 dup(0),0

Jimg · May 13, 2005, 03:04:52 PM

Quote from: Mark_Larson on May 13, 2005, 01:23:25 PM
That's a common misconception about how ALIGN works in assemblers. They don't always stick single-byte NOPs in there to "pad" to a certain alignment. They have different-sized "NOP"s that they use depending on how many bytes are needed. One of the 7-byte ones looks like this. That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that. Basically, when I say "NOP" in regard to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction. The instruction below does nothing, but it uses up 7 bytes.

Code Select

00401269    8D A4 24 00 00 00 00    lea         esp,[esp]

Yes, that's exactly the code being inserted.

Quote
So if it is coming out 8 cycles faster I'd try a few things.

1) Make sure your timings are accurate to within 8 cycles. Try two different things to make sure it is extremely accurate. Try moving the priority class from HIGH to REALTIME, and also bump up the number of loops by a factor of 10. I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more. It should be taking 5 to 10 or more seconds to run with the extra loops and REALTIME priority class. If it's still running a lot faster than that, bump up the number of loops again. It takes a long time to run, but it will give you an extremely accurate number of cycles.

Yes, I was using Realtime, and went from 1000000 to 10000000 with no difference.

Quote
2) do the ALIGN 16 at the loop but add a JMP right before the instruction. That does exactly the same thing you are doing with the JMP and DBs. You did TWO things to try and make it faster, not one. So I am removing one of them to see, which one made it faster. The advantage of ALIGN is if I go back and add any other instruction in between the first loop and the entry to the routine it will still work, whereas you'd have to re-tweak the number of DBs you have to get it to work right. Make sense? I just woke up, so my brain isn't awake yet. So I am not sure I am explaining it right.
Code Select

jmp qword_copy
align 16
qword_copy:

Ok, I tried that. Still slower. In this case, it inserted the instruction
05 00000000 ADD EAX,0

Quote
Getting back to optimization, also make sure the strings are aligned as well. I do it on a 16-byte boundary in case I do SSE/SSE2 later.
Code Select

align 16
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
      str2 db 128 dup(0),0

Yes, I did that also. I've attached my test code so you can see what I'm doing. Sorry for the mess, it's a work in progress :wink

My results are printed in the description when the program runs.

align 16 only - 158

jmp and db 7 dup (0) - 149

jmp and align 16 - 159

The only explanation I can think of is that the Athlon loads the lea esp,[esp] instruction in the prefetch and thinks about it for a while, but the zeros require no thought :dazzled:

[attachment deleted by admin]

Mark_Larson · May 13, 2005, 07:31:30 PM

For shoots and grins I compared both the "JMP and DB" and the "JMP and ALIGN" to the normal code execution time. They all execute at the same speed on the P4, which was what I was expecting. I know the P4 extremely well, but I don't know AMD as well, since I have never owned an AMD. I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst. Have you tried either to see if you can find a clue in there?

Jimg · May 13, 2005, 07:38:35 PM

Quote
I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst. Have you tried either to see if you can find a clue in there?

Nope. I'll take a look tonight.

Mark_Larson · May 13, 2005, 07:42:43 PM

Found the optimization PDF, and an online HTML webpage for optimization. The one I grabbed was for AMD64, it should be similar to XP.

http://63.204.158.36/amd/optimization/wwhelp/wwhimpl/js/html/wwhelp.htm
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

Mark_Larson · May 13, 2005, 08:01:20 PM

I think I found the slowdown in the central loop for AMD. I'll post more shortly.

EDIT: Here's the update.

One of the things I do when trying to optimize for a specific processor is write down timing information (latency, execution unit, etc.). For AMD you don't want to use instructions that use VectorPath. Notice that PMOVMSKB uses VectorPath.

Code Select


qword_copy:
;  instruction                  decode      execution unit      latency
   pxor      mm1,mm1          ; DirectPath  FADD/FMUL           2
   movq      mm0,[eax]        ; DirectPath  FADD/FMUL/FSTORE    4
   pcmpeqb   mm1,mm0          ; DirectPath  FADD/FMUL           4
   add       eax,8            ; DirectPath  ALU                 1
   pmovmskb  ecx,mm1          ; VectorPath  FADD/FMUL           3
   or        ecx,ecx          ; DirectPath  ALU                 1
   jnz       finish_rest      ; DirectPath  ALU                 1
   movq      [esi],mm0        ; DirectPath  FSTORE              2
   add       esi,8            ; DirectPath  ALU                 1
   jmp       qword_copy       ; DirectPath  ALU                 1

finish_rest:

You can try using PSADBW instead. It is DirectPath and runs in 3 cycles. It subtracts two MMX registers byte by byte, takes the absolute value of each difference, and then sums those bytes. By making the second register all 0's, you basically get a way of doing a horizontal sum within a register (COOL!). After the PCMPEQB every single byte in MM1 is going to be FFh or 00h. Nothing else. So after the PSADBW you can move the result into a CPU register using MOVD. If the result is not 7F8h then you know you have a zero value in there somewhere. The 7F8h comes from adding 8 FF's together, which is the result if all the bytes are non-zero. I have not verified the code works, so it might need testing.

Code Select


   pcmpeqb   mm1,mm0          ; DirectPath  FADD/FMUL           4
; you can actually move the PXOR outside the loop.
   pxor      mm2,mm2          ; DirectPath  FADD/FMUL           2
   add       eax,8            ; DirectPath  ALU                 1
;  pmovmskb  ecx,mm1          ; VectorPath  FADD/FMUL           3
   psadbw    mm1,mm2          ; DirectPath  FADD                3
   movd      ecx,mm1          ; Double                          4
;  or        ecx,ecx          ; DirectPath  ALU                 1
   cmp       ecx,7F8h
;  jnz       finish_rest      ; DirectPath  ALU                 1
   jb        finish_rest

You can also use PACKUSWB, which is also DirectPath and takes 2 cycles. Follow that with a MOVD to a CPU register and compare the result to 0FFFFFFFFh, which is what it should be if no bytes are 0. I have not verified that this works either, so take it with a grain of salt.

Code Select


   pcmpeqb   mm1,mm0          ; DirectPath  FADD/FMUL           4
   add       eax,8            ; DirectPath  ALU                 1
;  pmovmskb  ecx,mm1          ; VectorPath  FADD/FMUL           3
   packuswb  mm1,mm2          ; DirectPath  FADD/FMUL           2
   movd      ecx,mm1          ; Double                          4
;  or        ecx,ecx          ; DirectPath  ALU                 1
   cmp       ecx,0FFFFFFFFh
   jnz       finish_rest      ; DirectPath  ALU                 1

Jimg · May 14, 2005, 01:05:35 AM

Quite entertaining, but several times slower :'(
Edited-
Scratch that, I need to do more testing.

Jimg · May 15, 2005, 03:45:19 AM

Ok, I think I got it now.

First example:

Code Select

    pxor      mm1,mm1
    pxor      mm2,mm2
qword_copy:
    movq      mm0,[eax]
    pcmpeqb   mm1,mm0        ; DirectPath FADD/FMUL 4
    add       eax,8          ; DirectPath ALU 1
;   pmovmskb  ecx,mm1        ; VectorPath FADD/FMUL 3
    psadbw    mm1,mm2        ; DirectPath FADD 3
    movd      ecx,mm1        ; Double 4
    or        ecx,ecx        ; DirectPath ALU 1
;   cmp       ecx,7F8h
    jnz       finish_rest    ; DirectPath ALU 1
;   jb        finish_rest
    movq      [esi],mm0
    add       esi,8
    jmp       qword_copy

The cmp ecx,7F8h/jb always jumped, so I changed it as above.
Runs in 156 vs. the previous best of 149.

Example 2:

Code Select

qword_copy:
    pxor      mm1,mm1
    movq      mm0,[eax]
    pcmpeqb   mm1,mm0        ; DirectPath FADD/FMUL 4
    add       eax,8          ; DirectPath ALU 1
;   pmovmskb  ecx,mm1        ; VectorPath FADD/FMUL 3
    packuswb  mm1,mm2        ; DirectPath FADD/FMUL 2
    movd      ecx,mm1        ; Double 4
    or        ecx,ecx        ; DirectPath ALU 1
;   cmp       ecx,0FFFFFFFFh
    jnz       finish_rest
    movq      [esi],mm0
    add       esi,8
    jmp       qword_copy

I either don't understand how packuswb is supposed to work or it doesn't work in this context.
packuswb of the word 7700h gives 00h, not 77h

Jimg · May 15, 2005, 04:18:57 PM

I analyzed your new ending, and here is the same thing rewritten as an alternate ending for you. It is probably slightly slower for strings whose length, including the 0-byte terminator, is not a multiple of 4 bytes. Runs in 130 here.

Code Select

finish_rest:
bsf ecx,ecx
cmp ecx,4
jb lowerx
mov edx,[eax-8]
mov [esi],edx

lowerx:
mov edx,[eax+ecx-11]   ; misaligned 3/4 of the time but a lot less code.
mov [esi+ecx-3],edx
ret

AeroASM · May 15, 2005, 05:06:03 PM

93 cycles with MMX
96 cycles with XMM

Why longer with XMM?

Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz

[attachment deleted by admin]

Jimg · May 15, 2005, 06:28:15 PM

I just realized my previous post of a different ending could corrupt the string preceding the destination for source strings less than 4 bytes long. Not good at all :(

MichaelW · May 15, 2005, 11:39:08 PM

Aero,

Did you try to run the code that you posted? On my system (a P3) the MMX procedure hangs, and when it does, with REALTIME_PRIORITY_CLASS, it takes Windows down with it. This is why I have recently been posting code with HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, to help the user recover from any mistakes I may have made.

BTW, you could save at least some of us some time and effort if you would indicate that the code requires MASM 7, or add-on macro support for the earlier versions.

Jimg · May 16, 2005, 02:30:49 AM

Ok, here's an ending without the bugs:

Code Select

finish_rest:
    bsf ecx,ecx
    cmp ecx,3
    jb lowerx
    mov edx,[eax-8]        ; save first 4 bytes
    mov [esi],edx

    mov edx,[eax+ecx-11]   ; do the rest
    mov [esi+ecx-3],edx    ; faster to do it than test
    ret

lowerx:		; was 2 or 1 or 0
    test cl,cl
    jnz ItsOneOrTwo
    mov byte ptr [esi],0    ; was zero, just save terminator
    ret
ItsOneOrTwo:
    movzx edx,word ptr [eax-8]  ; either xx0? or x0??; use edx so the index in ecx survives
    mov [esi],dx
    cmp ecx,2                   ; ecx still holds the position of the zero byte
    je do_2
    ret
do_2:            ; xx0? ????
    mov byte ptr [esi+2],0
    ret
