SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM

lingo

AeroASM,  :lol
You can use my zip file...


OPTION   PROLOGUE:NONE
OPTION   EPILOGUE:NONE
comment  *MMX by Lingo*
align    16
@@:
        movq     MM1, [ecx+eax]
        movq     [edx+eax-8], MM0
@@2:
        movq     MM0, [ecx+eax]
        pcmpeqb  MM1, MM7
        packsswb MM1, MM1
        movd     ebx, MM1
        test     bl,  bl
        lea      eax, [eax+8]
        je       @B
@@:
        movzx    ebx, byte ptr [ecx+eax-8]
        add      eax, 1
        test     bl,  bl
        mov      [edx+eax-9], bl
        jnz      @B
        mov      ebx, [esp+2*4]
        ret      2*4
align    16
SzCpy10  proc    SzDest:DWORD, SzSource:DWORD
        mov      ecx, [esp+2*4]   ; ecx = source
        xor      eax, eax
        mov      edx, [esp+1*4]   ; edx = destination
        pxor     MM7, MM7
        movq     MM1, [ecx]
        mov      [esp+2*4], ebx   ; save ebx register
        je       @@2
SzCpy10  endp
OPTION   PROLOGUE:PrologueDef
OPTION   EPILOGUE:EpilogueDef


Here are the results:
P4 3.6GHz Prescott

Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.


128-byte string copy timing results:

SzCpy10 (Lingo -> MMX): 120 clocks
SzCpy19 (Mark Larson ->MMX):171 clocks
SzCpy11 (Lingo-> SSE): 116 clocks
SzCpy18 (Mark Larson ->SSE):132 clocks

Press ENTER to exit...


Regards,
Lingo

[attachment deleted by admin]

AeroASM

MichaelW: I did test it many times, and it takes about 15 seconds each time on a 1.5GHz.

MichaelW

Aero,

Correction, the procedure does not hang, but with REALTIME_PRIORITY_CLASS Windows did. After a change to HIGH_PRIORITY_CLASS the procedure takes ~40 seconds to run (P3-500), so perhaps this was just too long for REALTIME_PRIORITY_CLASS.

I added a function test using your source string, and another one that substituted a high-ASCII character at the start of the source string, and in both cases the string copied OK. The cycle count for the MMX version was a uniform 139. The XMM version generated errors, so I'm guessing it contains at least one instruction that my P3 does not support.
eschew obfuscation

Mark_Larson

Quote from: lingo on May 16, 2005, 02:35:51 AM


SzCpy10 (Lingo -> MMX): 120 clocks
SzCpy19 (Mark Larson ->MMX):171 clocks
SzCpy11 (Lingo-> SSE): 116 clocks
SzCpy18 (Mark Larson ->SSE):132 clocks

Press ENTER to exit...


Regards,
Lingo

  Three things Lingo.

  1) You modified my code, yet still list the timings as mine.  If you are going to benchmark my code and list it as my code, don't change the code (the SzCpy18 routine).  Not to mention that this throws off any other comparative timings people are doing with my code.

  2) I made an update on the previous page to my code.  With the update my code runs in 87 cycles.  You didn't pick that up either.

  3) You need to run your code in realtime with a few more loops.  There's no way the SSE code I posted is running in 132 cycles on your P4, considering that the original timing without the update was about 100 cycles.  When I updated your timing code to use REALTIME and increased the number of loops, the updated code I posted last week ran in 87 cycles, as I was expecting.

EDIT:  Here are the timings after I changed my code back to its original form and added the code update from last week, along with the more accurate timings. Now when I run it multiple times, I only get a 1-cycle variance.


128-byte string copy timing results:

SzCpy10 (Lingo -> MMX): 116 clocks
SzCpy19 (Mark Larson ->MMX):119 clocks
SzCpy11 (Lingo-> SSE): 81 clocks
SzCpy18 (Mark Larson ->SSE):87 clocks

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Jimg

Good morning Mark-

May I have a full copy of your latest 87-clock version?  The bits and pieces needed to construct it are confusing me.  Also, doesn't Lingo's version still suffer from the problem of writing outside the boundary of the destination string if the buffer is exactly the size needed to receive the source string and that size is not a multiple of 8 bytes?
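
To put numbers on that concern, here is a small C sketch. It is not code from this thread: it assumes a copy routine that only ever stores whole 8-byte blocks, and both helper names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: bytes stored by a routine that writes only whole
   8-byte blocks, for a string of `len` characters plus its terminator. */
static size_t qword_bytes_written(size_t len)
{
    size_t needed = len + 1;          /* characters + terminating zero */
    return (needed + 7) / 8 * 8;      /* rounded up to a multiple of 8 */
}

/* Hypothetical helper: bytes written past the end of a destination
   buffer that is exactly len + 1 bytes long. */
static size_t qword_overrun(size_t len)
{
    return qword_bytes_written(len) - (len + 1);
}
```

Under that assumption a 7-character string fits exactly (8 bytes written), while an 8-character string needs 9 bytes but gets 16 written, so 7 bytes land outside an exactly-sized buffer.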

lingo

Mark Larson,  :(

Quote
  1) You modified my code, yet still list the timings as mine.....

No, I didn't modify it, because your code is not that valuable to me and it is not in my style.

I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page

Quote
Why longer with XMM?

Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz

So it should be listed as Mark Larson@AeroASM, or not?


Quote
2) I made an update on the previous page to my code.  With the update my code runs in 87 cycles.

Congratulations

Quote
3) You need to run your code in realtime with a few more loops.  There's no way the SSE code I posted is running in 132 cycles on your P4

"There's no way the SSE code I posted"
I don't know your "original" code.
I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM on the previous page.


"is running in 132 cycles on your P4"

It is true... There are people here with a P4 Prescott and WinXP Pro SP2, and they can test my file too...

Regards,
Lingo




AeroASM

Quote from: lingo on May 16, 2005, 05:14:38 PM
There is an error in the usage of the source and destination parameters
at the beginning of "your" (or AeroASM's) code, and that is the reason
for the "Why longer with XMM?" question.

It is my implementation of Mark's algorithm.
What is the error?

hutch--

I have always found some humour in the direction that such topics develop and how far they wander from the original design. The first examples from Mark Jones were general purpose byte aligned source and destination that could be used under almost any circumstances as is characterised by unaligned string data.

I now see code designs that require complicated starting alignment correction on the source that still require aligned targets which reduce the algos to more or less novelties for general purpose byte aligned string data.

I would like to have the luxury of performing OWORD data transfers in an efficient manner but the latest hardware only barely supports it with a fudge in 64 bit and all of the later hardware requires at least natural alignment of the data size to run properly.

I would like to thank those who helped with the BYTE aligned code testing as I have ended up with an algo that is about 90% faster than the original I had in the library and it still satisfies the criterion of being a general purpose algorithm.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

 :lol

Quote
Hey Hutch--, there was one other problem with Lingo's code.  It writes in multiples of 8 bytes, and doesn't handle non-divisible by 8 string sizes.  So it copies extra data past the end of the source string to the destination string.  In addition if the destination string was dynamically allocated you might be writing to unowned memory, and cause a crash.  In either case you will be writing to variables that exist past the destination string if it isn't always 7 or more bytes longer than the source.


1. What is the length of the buffer holding the source string, and is it OK to read data past the end of the source string?
(We don't know the length of the source string.)

2. How long is the buffer for the destination string, and how was it allocated? Is it the same size as the source buffer or not?

3. Who decided that the length of the destination buffer should equal the length of the source string (not the buffer)?
  (We don't know the length of the source string.)
  A. Mark Larson, and I see a lack of programming experience here... Why?
      Because as a result we must write additional slow code (to copy the last bytes byte by byte). Who is guilty of the slower code?
      Mark Larson and all the people who believe him... :lol

Regards,
Lingo

AeroASM

Here is the latest incarnation of szCopyXMMX. MMX runs in 81 and XMM in 98 on my Pentium M 1.5GHz (have I said that too many times?)
I am still puzzling over alignment: align szCopyMMX to 16 slows it down to 107.
I am also still puzzling over why tf the XMM is far slower than the MMX.

hutch--

Aero,

I think it's because internally a 32 bit processor emulates 64 and 128 bit data transfers in 32 bit chunks.

AeroASM

Actually the later Pentiums have 64 bit data buses but I think you are right. Also I think that XMM is not yet very well developed and refined.

lingo

AeroASM,  :lol

It is the result from your "old" file szCopyXMMX.asm
Quote
C:\0A1>szcopy~1.exe
173 cycles with MMX
132 cycles with XMM

Regards,
Lingo

Jimg

Quote from: lingo on May 16, 2005, 06:14:04 PM

1. What is the length of the buffer holding the source string, and is it OK to read data past the end of the source string?

That's a very good question.  Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
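
One way to reason about that risk is to ask whether a given load can touch the next page at all. The C sketch below is only an illustration, not code from this thread; it assumes the usual 4 KB x86 page size, and the names `ASSUMED_PAGE_SIZE` and `load_crosses_page` are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u   /* assumption: standard 4 KB x86 pages */

/* Returns 1 if a load of `width` bytes starting at `addr` reaches into
   the page after the one that contains `addr`, otherwise 0. */
static int load_crosses_page(uintptr_t addr, size_t width)
{
    uintptr_t offset = addr & (ASSUMED_PAGE_SIZE - 1);  /* position inside the page */
    return offset + width > ASSUMED_PAGE_SIZE;
}
```

An 8-byte load from an 8-byte-aligned address never crosses a page, so it cannot fault on a following unmapped page as long as its first byte is readable. An unaligned 8-byte load that starts in the last 7 bytes of a page can.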

Mark_Larson

Quote from: AeroASM on May 16, 2005, 06:16:21 PM
Here is the latest incarnation of szCopyXMMX. MMX runs in 81 and XMM in 98 on my Pentium M 1.5GHz (have I said that too many times?)
I am still puzzling over alignment: align szCopyMMX to 16 slows it down to 107.
I am also still puzzling over why tf the XMM is far slower than the MMX.

  That's because the latency on the P4 for SSE2 is a lot higher than for MMX.  You can't convert it and always expect it to run faster.  There are additional tricks you have to do.  I'll take a look at it later.  I'm swamped at work at the moment.  The general trick is to do TWO of whatever you are doing in the main loop to help break dependencies and give a speed up.
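
As a rough illustration of the "do TWO per pass" idea, here is a portable C sketch rather than MMX or SSE code. It is not anyone's routine from this thread. The names `has_zero_byte` and `copy_two_blocks` are made up, and it assumes the source buffer stays readable for up to 15 bytes past the terminator. Whether the unrolling actually helps depends on the CPU and the compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes in v is zero (well-known bit trick). */
static uint64_t has_zero_byte(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Copies the zero-terminated string at src to dst, 16 bytes per loop pass.
   Assumption: src is readable for up to 15 bytes past its terminator.
   dst only needs room for the string and its terminator. */
static void copy_two_blocks(char *dst, const char *src)
{
    for (;;) {
        uint64_t a, b;
        memcpy(&a, src, 8);          /* two independent loads ... */
        memcpy(&b, src + 8, 8);
        if (has_zero_byte(a) | has_zero_byte(b))   /* ... one combined test */
            break;
        memcpy(dst, &a, 8);          /* no terminator seen: store both blocks */
        memcpy(dst + 8, &b, 8);
        src += 16;
        dst += 16;
    }
    while ((*dst++ = *src++) != '\0')   /* byte-wise tail, copies the zero too */
        ;
}
```

The two loads do not depend on each other, and the loop branches once per 16 bytes instead of once per 8. The byte-wise tail keeps every store inside the destination string, at the cost of a slower finish.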



Quote from: lingo on May 16, 2005, 05:14:38 PM

Quote
3) You need to run your code in realtime with a few more loops.  There's no way the SSE code I posted is running in 132 cycles on your P4


It is true... There are people here with a P4 Prescott and WinXP Pro SP2, and they can test my file too...


  You missed the point Lingo.  I compiled and ran your code multiple times and I got large differences in the number of cycles it took to execute.  In your own SSE code, I saw a 12 cycle variance.  If you are going to do code optimization for low cycle count procedures you shouldn't have more than a 1 or 2 cycle variance. 
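
A check along those lines could look like the C sketch below. It is only an illustration: the helper names `cycle_spread` and `timings_are_stable` are invented, and the caller supplies its own measured cycle counts.

```c
#include <stddef.h>

/* Difference between the slowest and the fastest run, in clock cycles.
   Requires count >= 1. */
static unsigned cycle_spread(const unsigned *samples, size_t count)
{
    unsigned lo = samples[0], hi = samples[0];
    for (size_t i = 1; i < count; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    return hi - lo;
}

/* Returns 1 if repeated runs agree to within `tolerance` cycles. */
static int timings_are_stable(const unsigned *samples, size_t count,
                              unsigned tolerance)
{
    return cycle_spread(samples, count) <= tolerance;
}
```

With a tolerance of 2, a set of runs that differ by 12 cycles would be rejected and the benchmark repeated with more loops or a higher priority class.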



Quote from: lingo on May 16, 2005, 05:14:38 PM
Mark Larson,  :(

Quote
  1) You modified my code, yet still list the timings as mine.....

No, I didn't modify it, because your code is not that valuable to me and it is not in my style.

I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page

  That's cuz Aero made changes to my code.  So when you post timings, you need to post that it's my code but modified by Aero.



  JimG, here's my latest.  I haven't had much chance to tweak since I've been super busy lately.  I should be able to drop another 10-15 cycles maybe more after playing with it.  I wanted to try PSADBW in place of PMOVMSKB on P4 and see if it's faster.  And a few other things.


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE


align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
   mov eax,[esp+4]
   mov esi,[esp+8]


align 16
qword_copy:
   pxor mm1,mm1
   movq mm0,[eax]
   pcmpeqb mm1,mm0
   add eax,8
   pmovmskb ecx,mm1
   or ecx,ecx
   jnz finish_rest
   movq [esi],mm0
   add esi,8
   jmp qword_copy

finish_rest:

;if 0
bsf ecx,ecx ; ecx = index (0-7) of the first zero byte in the block
cmp ecx,7
je do_7
cmp ecx,6
je do_6
cmp ecx,5
je do_5
cmp ecx,4
je do_4
cmp ecx,3
je do_3
cmp ecx,2
je do_2
cmp ecx,1
je do_1
mov byte ptr [esi],0 ; index 0: only the terminator is left
ret 8 ; ret

do_7:
mov ecx,[eax-8]
mov edx,[eax+4-8] ;really copy 8 bytes to include the 0
mov [esi],ecx
mov [esi+4],edx
ret 8 ; ret

do_6:
mov ecx,[eax-8]
movzx edx,word ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dx
mov byte ptr [esi+6],0
ret 8 ; ret

do_5:
mov ecx,[eax-8]
movzx edx,byte ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dl
mov byte ptr [esi+5],0
ret 8 ; ret


do_4:
mov ecx,[eax-8]
mov [esi],ecx
mov byte ptr [esi+4],0
ret 8 ; ret


do_3:
movzx ecx,word ptr[eax-8]
movzx edx,byte ptr[eax+2-8]
mov [esi],cx
mov [esi+2],dl
mov byte ptr [esi+3],0
ret 8 ; ret

do_2:
movzx ecx,word ptr[eax-8]
mov [esi],cx ; store the two characters (this store was missing)
mov byte ptr [esi+2],0
ret 8 ; ret

do_1:
movzx ecx,byte ptr [eax-8]
mov [esi],cl
mov byte ptr [esi+1],0 ; terminate after the single character
ret 8 ; ret

szCopyMMX endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

