SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM

Mark_Larson

 That is equivalent to doing an "ALIGN 16" in the inner loop, which is what I kept trying to get you to try :)  Here's why it is the same as an ALIGN (without the jump, of course).  Let's look at all the code before the first loop, which consists of exactly two lines.


align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov eax,[src_str]
   mov esi,[dst_str]


;THIS gets translated into the following.  The first two lines are the "prologue"
; the second two lines are the first two lines of the code.  You will notice that it
; also gives the opcodes for the instructions. 

00401260 55                   push        ebp
00401261 8B EC                mov         ebp,esp
00401263 8B 45 08             mov         eax,dword ptr [ebp+8]
00401266 8B 75 0C             mov         esi,dword ptr [ebp+0Ch]

push ebp = 55h
mov ebp,esp = 8Bh ECh
mov eax,dword ptr [ebp+8] = 8Bh 45h 08h
mov esi,dword ptr [ebp+0Ch] = 8Bh 75h 0Ch





   If you add up the opcode bytes of those 4 instructions (1 + 2 + 3 + 3) it comes out to 9.  Now, if you remember, you added 7 bytes just before the label "qword_copy".  9 + 7 = 16 bytes.  Since the first line of the routine is 16-byte aligned, this forces the "qword_copy" label to be 16-byte aligned as well.  It's just easier to use ALIGN 16 in front of the label.



align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov eax,[src_str]
   mov esi,[dst_str]

align 16
qword_copy:
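
The byte counting above can be double-checked with a few lines of C (C rather than MASM only so it compiles anywhere). This is a sketch of the arithmetic, not anything the assembler exposes, and the name pad_to_align is made up for the example. It returns how many filler bytes are needed to reach the next boundary from a given address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of filler bytes needed to advance addr to the
   next multiple of align.  align must be a power of two. */
uint32_t pad_to_align(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Distance to the next boundary; the final mask turns "align" into 0
       when addr is already on a boundary. */
    return (align - (addr & (align - 1))) & (align - 1);
}
```

With the addresses from the listing, the routine starts at 00401260h and the first 9 bytes end at 00401269h, so pad_to_align(0x401269, 16) comes out to 7, the same 7 bytes that were added by hand.
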

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Jimg

Right.  But it measures 8 cycles longer to work through the nop's rather than the jump.

Mark_Larson

Quote from: Jimg on May 13, 2005, 04:43:59 AM
Right.  But it measures 8 cycles longer to work through the nop's rather than the jump.

  That's a common misconception about how ALIGN works in assemblers.  They don't always stick NOPs in there to "pad" it to a certain alignment.  They have different-sized "NOPs" that they use depending on how many bytes need filling.  One of the 7-byte ones looks like the line below.  That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that.  Basically, when I say "NOP" in regard to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction.  The instruction below does nothing, but it uses up 7 bytes.


00401269    8D A4 24 00 00 00 00    lea         esp,[esp]


  So if it is coming out 8 cycles faster I'd try a few things.

1) Make sure your timings are accurate to within 8 cycles.  Try two different things to make sure it is extremely accurate.  Try moving the priority class from HIGH to REALTIME, and also bump up the number of loops by a factor of 10.  I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more.  It should be taking 5 to 10 or more seconds to run with the extra loops and REALTIME priority class.  If it's still running a lot faster than that, bump up the number of loops again.  It takes a long time to run, but it will give you an extremely accurate number of cycles.

2)  Do the ALIGN 16 at the loop, but add a JMP right before it.  That does exactly the same thing you are doing with the JMP and DBs.  You did TWO things to try to make it faster, not one, so I am removing one of them to see which one made the difference.  The advantage of ALIGN is that if I go back and add another instruction between the entry to the routine and the first loop it will still work, whereas you'd have to re-tweak the number of DBs to get it to work right.  Make sense?  I just woke up, so my brain isn't awake yet and I am not sure I am explaining it right.


jmp qword_copy
align 16
qword_copy:   




  Getting back to optimization, also make sure the strings are aligned as well.  I do it on a 16-byte boundary in case I do SSE/SSE2 later.

align 16
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
      str2 db 128 dup(0),0



Jimg

Quote from: Mark_Larson on May 13, 2005, 01:23:25 PM
  That's a common misconception about how ALIGN works in assemblers.  They don't always stick NOPs in there to "pad" it to a certain alignment.  They have different-sized "NOPs" that they use depending on how many bytes need filling.  One of the 7-byte ones looks like the line below.  That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that.  Basically, when I say "NOP" in regard to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction.  The instruction below does nothing, but it uses up 7 bytes.


00401269    8D A4 24 00 00 00 00    lea         esp,[esp]

Yes, that's exactly the code being inserted.

Quote
So if it is coming out 8 cycles faster I'd try a few things.

1) Make sure your timings are accurate to within 8 cycles.  Try two different things to make sure it is extremely accurate.  Try moving the priority class from HIGH to REALTIME, and also bump up the number of loops by a factor of 10.  I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more.  It should be taking 5 to 10 or more seconds to run with the extra loops and REALTIME priority class.  If it's still running a lot faster than that, bump up the number of loops again.  It takes a long time to run, but it will give you an extremely accurate number of cycles.

Yes, I was using Realtime, and went from 1000000 to 10000000 with no difference.

Quote
2)  Do the ALIGN 16 at the loop, but add a JMP right before it.  That does exactly the same thing you are doing with the JMP and DBs.  You did TWO things to try to make it faster, not one, so I am removing one of them to see which one made the difference.  The advantage of ALIGN is that if I go back and add another instruction between the entry to the routine and the first loop it will still work, whereas you'd have to re-tweak the number of DBs to get it to work right.  Make sense?  I just woke up, so my brain isn't awake yet and I am not sure I am explaining it right.

jmp qword_copy
align 16
qword_copy:   


Ok, I tried that.  Still slower.  In this case, it inserted the instruction
05 00000000        ADD EAX,0

Quote
  Getting back to optimization, also make sure the strings are aligned as well.  I do it on a 16-byte boundary in case I do SSE/SSE2 later.

align 16
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
      str2 db 128 dup(0),0

Yes, I did that also.  I've attached my test code so you can see what I'm doing.  Sorry for the mess, it's a work in progress  :wink

My results are printed in the description when the program runs.

align 16 only - 158

jmp and db 7 dup (0) - 149

jmp and align 16 - 159

The only explanation I can think of is that the Athlon loads the     lea  esp,[esp]   instruction in the prefetch and thinks about it for a while, but the zeros require no thought :dazzled:






[attachment deleted by admin]

Mark_Larson


  For shoots and grins I compared both the "JMP and DB" and the "JMP and ALIGN" to the normal code execution time.  They all execute at the same speed on the P4, which was what I was expecting.  I know the P4 extremely well, but I don't know AMD as well, since I have never owned an AMD.  I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst.  Have you tried either to see if you can find a clue in there?

Jimg

Quote
I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst.  Have you tried either to see if you can find a clue in there?
Nope.  I'll take a look tonight.

Mark_Larson


  Found the optimization PDF, and an online HTML webpage for optimization.  The one I grabbed was for AMD64, it should be similar to XP.

http://63.204.158.36/amd/optimization/wwhelp/wwhimpl/js/html/wwhelp.htm
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

Mark_Larson

  I think I found the slowdown in the central loop for AMD.  I'll post more shortly.

EDIT: Here's the update.

   One of the things I do when trying to optimize for a specific processor is write down timing information (latency, execution unit, etc.).  For AMD you want to avoid instructions that decode as VectorPath.  Notice that PMOVMSKB is VectorPath.


; columns: decode path, execution unit(s), cycles
qword_copy:
   pxor       mm1,mm1         ; DirectPath  FADD/FMUL         2
   movq       mm0,[eax]       ; DirectPath  FADD/FMUL/FSTORE  4
   pcmpeqb    mm1,mm0         ; DirectPath  FADD/FMUL         4
   add        eax,8           ; DirectPath  ALU               1
   pmovmskb   ecx,mm1         ; VectorPath  FADD/FMUL         3
   or         ecx,ecx         ; DirectPath  ALU               1
   jnz        finish_rest     ; DirectPath  ALU               1
   movq       [esi],mm0       ; DirectPath  FSTORE            2
   add        esi,8           ; DirectPath  ALU               1
   jmp        qword_copy      ; DirectPath  ALU               1

finish_rest:
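
For anyone following along without an MMX reference handy, here is a rough C model of what the PXOR / PCMPEQB / PMOVMSKB sequence in the loop computes, plus the BSF that an ending can apply to the resulting mask. It only models the instruction semantics as I understand them, it is not the routine itself, and both function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of: pxor mm1,mm1 / pcmpeqb mm1,mm0 / pmovmskb ecx,mm1.
   Bit i of the result is set when byte i of the qword is zero. */
unsigned zero_byte_mask(const uint8_t q[8])
{
    unsigned mask = 0;
    for (size_t i = 0; i < 8; i++) {
        uint8_t cmp = (q[i] == 0) ? 0xFF : 0x00;  /* pcmpeqb against zero */
        mask |= (unsigned)(cmp >> 7) << i;        /* pmovmskb keeps each top bit */
    }
    return mask;
}

/* Model of bsf on a non-zero mask: index of the lowest set bit. */
unsigned first_set_bit(unsigned mask)
{
    assert(mask != 0);  /* bsf leaves its destination undefined for zero input */
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        i++;
    }
    return i;
}
```

A non-zero mask means the terminator is somewhere in the current 8 bytes, and the lowest set bit says where, which is why the loop can leave with a plain or ecx,ecx / jnz.
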


   You can try using PSADBW instead.  It is DirectPath and runs in 3 cycles.  It subtracts two MMX registers byte by byte, takes the absolute values, and then sums the resulting bytes.  By making the second register all 0's, you basically get a way of doing a horizontal sum within a register (COOL!).  After the PCMPEQB every single byte in MM1 is going to be either FFh or 00h, nothing else.  So after the PSADBW you can move the result into a CPU register using MOVD.  If the result is not 7F8h then you know you have a zero value in there somewhere.  The 7F8h comes from adding 8 FF's together, which is the result if all the bytes are non-zero.  I have not verified that the code works, so it needs testing.



   pcmpeqb    mm1,mm0         ; DirectPath  FADD/FMUL  4
;you can actually move the PXOR outside the loop.
   pxor       mm2,mm2         ; DirectPath  FADD/FMUL  2
   add        eax,8           ; DirectPath  ALU        1
;  pmovmskb   ecx,mm1         ; VectorPath  FADD/FMUL  3
   psadbw     mm1,mm2         ; DirectPath  FADD       3
   movd       ecx,mm1         ; Double                 4
;  or         ecx,ecx         ; DirectPath  ALU        1
   cmp        ecx,7F8h
;  jnz        finish_rest     ; DirectPath  ALU        1
   jb         finish_rest


   You can also use PACKUSWB, which is also DirectPath and takes 2 cycles.  Follow that with a MOVD to a CPU register and compare it to 0FFFFFFFFh, which is what it should be if no bytes are 0.  I have not verified that this works either, so take it with a grain of salt.


   pcmpeqb    mm1,mm0         ; DirectPath  FADD/FMUL  4
   add        eax,8           ; DirectPath  ALU        1
;  pmovmskb   ecx,mm1         ; VectorPath  FADD/FMUL  3
   packuswb   mm1,mm2         ; DirectPath  FADD/FMUL  2
   movd       ecx,mm1         ; Double                 4
;  or         ecx,ecx         ; DirectPath  ALU        1
   cmp        ecx,0FFFFFFFFh
   jnz        finish_rest     ; DirectPath  ALU        1


Jimg

Quite entertaining, but several times slower :'(
Edited-
Scratch that, I need to do more testing.

Jimg

Ok, I think I got it now.

First example:
    pxor mm1,mm1
    pxor mm2,mm2
qword_copy:
    movq mm0,[eax]
    pcmpeqb   mm1,mm0 ;DirectPath FADD/FMUL 4
    add   eax,8  ;       DirectPath ALU   1
;   pmovmskb ecx,mm1 VectorPath FADD/FMULl 3
    psadbw mm1,mm2 ;DirectPath FADD 3
    movd  ecx,mm1   ;Double 4
    or   ecx,ecx      ; DirectPath ALU 1
;   cmp  ecx,7F8h
    jnz  finish_rest ;DirectPath ALU 1
;   jb            finish_rest
    movq [esi],mm0
    add esi,8
    jmp qword_copy

The cmp ecx,7F8h / jb always jumped, so I changed it as above.
Runs in 156 vs. the previous best of 149.
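
A small C model of the PCMPEQB / PSADBW step may explain why the cmp ecx,7F8h / jb version always jumped. As far as the documented behavior goes, PCMPEQB against a zeroed register produces FFh for each byte that is zero (not for the non-zero ones), and PSADBW against zero then adds those bytes up. The function name is hypothetical and the sketch assumes mm1 and mm2 are both zero on entry, as in the first example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of: pcmpeqb mm1,mm0 (mm1 = 0 on entry) / psadbw mm1,mm2 (mm2 = 0)
   / movd ecx,mm1.  Returns the value that ends up in ecx. */
unsigned psadbw_zero_sum(const uint8_t q[8])
{
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++) {
        unsigned cmp = (q[i] == 0) ? 0xFFu : 0x00u;  /* pcmpeqb result byte */
        sum += cmp;            /* |cmp - 0|, accumulated across the 8 bytes */
    }
    assert(sum <= 0x7F8u);     /* 8 * FFh is the largest possible sum */
    return sum;
}
```

In this model the sum is 0 when the qword has no terminator and only reaches 7F8h when all eight bytes are zero. Since 0 is below 7F8h, jb is taken on every ordinary qword, while or ecx,ecx / jnz leaves the loop only when at least one byte matched.
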

Example 2:
qword_copy:
    pxor mm1,mm1
    movq mm0,[eax]
    pcmpeqb mm1,mm0  ;DirectPath FADD/FMUL  4
    add eax,8        ;DirectPath ALU 1
;   pmovmskb ecx,mm1 VectorPath FADD/FMULl 3
    packuswb mm1,mm2    ;DirectPath FADD/FMUL 2
    movd ecx,mm1        ;Double 4
    or ecx,ecx ;DirectPath ALU 1
;   cmp ecx,0FFFFFFFFh
    jnz finish_rest
    movq [esi],mm0
    add esi,8
    jmp qword_copy

I either don't understand how packuswb is supposed to work or it doesn't work in this context.
  packuswb of the word 7700h gives 00h, not 77h
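
Going by the instruction reference, PACKUSWB reads each source word as a signed value and saturates it to an unsigned byte, so negative words become 00h and anything above FFh becomes FFh. The C sketch below models that for a single word (the helper name is made up) so the word patterns PCMPEQB can leave behind are easy to try.

```c
#include <assert.h>
#include <stdint.h>

/* Model of how packuswb converts one 16-bit word: the word is read as
   signed and saturated to the unsigned byte range 0..255. */
uint8_t packuswb_word(uint16_t w)
{
    int32_t s = (w & 0x8000u) ? (int32_t)w - 0x10000 : (int32_t)w;
    if (s < 0)   return 0x00;   /* negative saturates to 0 */
    if (s > 255) return 0xFF;   /* too large saturates to FFh */
    assert(s >= 0 && s <= 255);
    return (uint8_t)s;
}
```

In this model the mask words FF00h and FFFFh both pack to 00h and only 00FFh survives as FFh, so a terminator sitting in the upper byte of a word would go unnoticed. That would fit the observation that the test does not work in this context.
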

Jimg

I analyzed your new ending, and here is the same thing rewritten as an alternate ending for you.  Probably slightly slower for strings whose length, including the 0 terminator, is not a multiple of 4 bytes.  Runs in 130 here.
finish_rest:
bsf ecx,ecx
cmp ecx,4
jb lowerx
mov edx,[eax-8]
mov [esi],edx

lowerx:
mov edx,[eax+ecx-11]   ; misaligned 3/4 of the time but a lot less code.
mov [esi+ecx-3],edx
ret


AeroASM

93 cycles with MMX
96 cycles with XMM


Why longer with XMM?

Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz

[attachment deleted by admin]

Jimg

I just realized my previous post of a different ending could corrupt the string preceding the destination for source strings less than 4 bytes long.  Not good at all :(

MichaelW

Aero,

Did you try to run the code that you posted? On my system (a P3) the MMX procedure hangs, and when it does, with REALTIME_PRIORITY_CLASS, it takes Windows down with it. This is why I have recently been posting code with HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, to help the user recover from any mistakes I may have made.

BTW, you could save at least some of us some time and effort if you would indicate that the code requires MASM 7, or add-on macro support for the earlier versions.
eschew obfuscation

Jimg

Ok, here's an ending without the bugs:
finish_rest:
    bsf ecx,ecx
    cmp ecx,3
    jb lowerx
    mov edx,[eax-8]        ; save first 4 bytes
    mov [esi],edx

    mov edx,[eax+ecx-11]   ; do the rest
    mov [esi+ecx-3],edx    ; faster to do it than test
    ret

lowerx: ; was 2 or 1 or 0
    test cl,cl
    jnz ItsOneOrTwo
    mov byte ptr [esi],0    ; was zero, just save terminator
    ret
ItsOneOrTwo:
    movzx edx,word ptr [eax-8]  ; either xx0?  or x0??  (edx, so ecx keeps the index for the cmp below)
    mov [esi],dx
    cmp ecx,2
    je do_2
    ret
do_2:            ; xx0? ????
    mov byte ptr [esi+2],0
    ret  ; ret
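
The overlapping-store idea in this ending can be modeled in C and checked against a plain byte copy. This is only a sketch of what the ending is meant to do: copy_tail, its parameters and the use of memcpy for the 4-byte and 2-byte moves are assumptions of the example, and idx stands for the BSF result, i.e. the position of the terminator within the current 8-byte block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* src points at the 8-byte block holding the terminator ([eax-8] in the
   asm), dst at the matching destination ([esi]), idx is the terminator's
   position 0..7.  Copies bytes 0..idx inclusive and nothing else. */
void copy_tail(uint8_t *dst, const uint8_t *src, unsigned idx)
{
    assert(idx < 8 && src[idx] == 0);
    if (idx >= 3) {
        memcpy(dst, src, 4);                      /* first 4 bytes */
        memcpy(dst + idx - 3, src + idx - 3, 4);  /* last 4 bytes, may rewrite part of the first store */
    } else if (idx == 0) {
        dst[0] = 0;                               /* empty tail: terminator only */
    } else {
        memcpy(dst, src, 2);                      /* covers "x0" completely */
        if (idx == 2)
            dst[2] = 0;                           /* "xx0" still needs its terminator */
    }
}
```

The point of the two 4-byte stores is that no length-dependent loop is needed for idx of 3 or more: the second store always ends exactly on the terminator, and any overlap with the first store just writes the same bytes twice.
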