
szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM


BlackVortex

Runs fine under Olly for me.

jj2007

Good and bad news:

First, the bad news: The "fast len() with SSE2" package attached below will not work with the ml.exe version 6.14 that gets installed when you download the Masm32 package. The reason is simply that the old Masm 6.14 (Copyright (C) Microsoft Corp 1981-1997) does not yet understand SSE2.

Now the good news:

1. It will work perfectly with JWasm (freeware), and with any later Masm version that comes with the various VC Express etc. downloads (see the thread "masm 6.14 or 6.15?"; I have tested it only with ml.exe versions 6.15 and 9.0).

2. The default algo is now fully compatible with the Masm32lib len() macro. This means in practice that you can speed up existing projects that use len() simply by adding the include line:

include \masm32\include\masm32rt.inc
include \masm32\include\slenSSE2.inc

I should explain why I put "now" in red. There was an exchange of views between Lingo and myself on the value of preserving edx and ecx (Lingo: Who preserves ecx and edx registers in "this sort of algo"?). In the end, I kept saving ecx (a valuable counter register) and trashed edx. And, bang, my RichMasm project misbehaved. Intense bug chasing revealed that I had previously and unwillingly relied on a non-documented feature of the Masm32lib szLen routine - the one that is behind the len() macro. It does preserve ecx and edx. Therefore, the new version attached below does the same, in order not to break existing code: ecx and edx are preserved. The same applies to NightWare's version (SlenUseAlgos = 4) but not to Lingo's version (SlenUseAlgos = 2).
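To illustrate the idea only (a minimal sketch, not the code in the attachment): a thin wrapper can save ecx and edx around any fast routine that trashes them. FastLenCore and LenPreserve are hypothetical names, and the byte loop is just a stand-in for a faster algo; the sketch assumes a standard Masm32 installation.

```asm
include \masm32\include\masm32rt.inc

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

; Hypothetical stand-in for a fast routine: returns the length in eax
; and deliberately trashes ecx and edx, as the convention allows.
FastLenCore proc src:DWORD
    mov edx, [esp+4]            ; edx = string pointer
    xor eax, eax
  @@:
    mov cl, [edx+eax]           ; read the next byte
    inc eax
    test cl, cl
    jnz @B
    dec eax                     ; do not count the terminating zero
    ret 4
FastLenCore endp

; Wrapper with the same contract, but ecx and edx survive the call.
LenPreserve proc src:DWORD
    push ecx                    ; save the two caller-saved registers
    push edx
    push DWORD PTR [esp+12]     ; src sits above edx, ecx and the return address
    call FastLenCore            ; stdcall: the callee removes its argument
    pop edx                     ; restore in reverse order
    pop ecx
    ret 4                       ; eax = length, ecx and edx unchanged
LenPreserve endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

start:
    mov ecx, 12345678h          ; marker value that must survive the call
    invoke LenPreserve, chr$("Hello")
    print str$(eax), 13, 10     ; length returned by the wrapper
    exit
end start
```

The two extra push/pop pairs cost a few cycles per call, which is the price for not breaking callers that silently depend on the old behaviour.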

Enjoy,
jj2007

[attachment deleted by admin]

hutch--

JJ,

The szLen algo is correct in its register usage. It only uses EAX and the stack pointer. If you have had problems using it with RichMASM it is because your register usage is non standard.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    .486
    .model flat, stdcall  ; 32 bit memory model
    option casemap :none  ; case sensitive

    .code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

szLen proc src:DWORD

    mov eax, [esp+4]
    sub eax, 4

  @@:
    add eax, 4
    cmp BYTE PTR [eax], 0
    je lb1
    cmp BYTE PTR [eax+1], 0
    je lb2
    cmp BYTE PTR [eax+2], 0
    je lb3
    cmp BYTE PTR [eax+3], 0
    jne @B

    sub eax, [esp+4]
    add eax, 3
    ret 4
  lb3:
    sub eax, [esp+4]
    add eax, 2
    ret 4
  lb2:
    sub eax, [esp+4]
    add eax, 1
    ret 4
  lb1:
    sub eax, [esp+4]
    ret 4

szLen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end

jj2007

Quote from: hutch-- on March 16, 2009, 01:36:56 PM
JJ,

The szLen algo is correct in its register usage. It only uses EAX and the stack pointer. If you have had problems using it with RichMASM it is because your register usage is non standard.


Hutch,

1. My register usage is standard,

2. I did not say that szLen was incorrect:

Quote from: jj2007 on March 15, 2009, 10:33:08 PM
I had previously and unwillingly relied on a non-documented feature of the Masm32lib szLen routine - the one that is behind the len() macro. It does preserve ecx and edx.

In contrast to most other Masm32lib functions, len() does preserve ecx and edx. But it is not documented. And when I wrote "unwillingly", I meant that I had forgotten (in only one of 55 uses of len) to follow the ABI convention saying you must preserve ecx and edx yourself if you need them after an API or library call. It also means that, not knowing that len preserves ecx and edx, I pondered 54 times, unnecessarily, whether I needed to preserve them myself :(
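For readers less familiar with the convention, here is a minimal sketch of what "preserve them yourself" looks like on the caller's side. The strings and the pointer table are made up for illustration; the sketch assumes a standard Masm32 installation.

```asm
include \masm32\include\masm32rt.inc

.data
txt1 db "one", 0
txt2 db "three", 0
ptrs dd offset txt1, offset txt2    ; hypothetical table of string pointers

.code
start:
    xor ebx, ebx                    ; running total (ebx is callee-saved)
    mov esi, offset ptrs            ; esi is callee-saved too
    mov ecx, 2                      ; loop counter, caller-saved
  @@:
    push ecx                        ; a library call may legally trash ecx
    invoke szLen, DWORD PTR [esi]   ; length comes back in eax
    pop ecx                         ; counter restored whatever szLen did
    add ebx, eax
    add esi, 4                      ; next pointer in the table
    dec ecx
    jnz @B
    print str$(ebx), 13, 10         ; sum of both lengths
    exit
end start
```

With the push/pop pair in place, the loop works the same whether the len() implementation behind it preserves ecx or not.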

"Feature" is a positive word, and there was no irony involved. For version 11, you might consider mentioning this in the documentation. It's good that a function so frequently used does preserve the registers, that's why in the end I chose to do the same in my implementation of len().

So can you please accept my friendly clap on the shoulder?
:U

PBrennick

JJ,

I am trying to understand so please help. Do you mean that szLen preserves ECX and EDX by virtue of the fact it does not use them? I am a little confused here.

Paul

ToutEnMasm


There is only one rule: esi, edi and ebx must be preserved when a proc uses them. That's all.
If your code uses registers other than these, you must preserve them yourself before a call to a subroutine.
If it does not do this, modify your code, not the subroutine. Changing the subroutine is bad practice.

jj2007

Quote from: ToutEnMasm on March 16, 2009, 03:40:46 PM

There is only one rule: esi, edi and ebx must be preserved when a proc uses them. That's all.
If your code uses registers other than these, you must preserve them yourself before a call to a subroutine.
If it does not do this, modify your code, not the subroutine. Changing the subroutine is bad practice.


You are right, in principle. However, since I wrote code that claims to be a replacement for len() aka invoke szLen, offset My$, and since there are lots of newbies and oldbies around who might have written code that relies on this undocumented feature of szLen, I think it's better to modify the subroutine rather than the code. I have added include \masm32\include\slenSSE2.inc as line 2 of my 9,500 lines of RichMasm source, and it works perfectly. That was the goal: give SSE2 speed to an existing application without rewriting it.

MichaelW

JJ,

There are multiple procedures in the MASM32 library that like szLen alter only EAX. Why should they be documented as preserving ECX and EDX when they are following the documented register-preservation conventions? If your code is depending on EAX, ECX, or EDX to be preserved, then your register usage is non-standard by the conventions of the mainstream 32-bit x86 world.

eschew obfuscation

jj2007

Quote from: MichaelW on March 16, 2009, 04:14:01 PM
JJ,

There are multiple procedures in the MASM32 library that like szLen alter only EAX. Why should they be documented as preserving ECX and EDX when they are following the documented register-preservation conventions? If your code is depending on EAX, ECX, or EDX to be preserved, then your register usage is non-standard by the conventions of the mainstream 32-bit x86 world.


Michael,

You are right. However, my normal register usage is standard. I had a bug in my source, but I would never have noticed it if my new version of len() had not trashed edx.

However, my goal was to be compatible with the current len() implementation, and be sure that it won't break any existing code.
I invite everybody who uses the len() macro to add a few lines at the top of their biggest source:

len MACRO ptr
  invoke szLen, ptr
  xor ecx, ecx  ; trash two registers that can be legally trashed
  xor edx, edx  ; according to the convention
  EXITM <eax>
ENDM

According to the convention, nobody should experience any problems :boohoo:

P.S.: In \masm32\include\slenSSE2.inc, I added a TestMasmVersion for those who try to assemble with ml 614 (it would assemble with 614, but the code may fail unexpectedly, so I decided to throw an error).

New code attached above.

ToutEnMasm


A little publicity for my IDE: if someone has too much trouble modifying a few lines, there is a tool in my IDE that can help with this.
Its name is cherche ("search" in English). For example, it can find a word in every header file of the SDK and report the name of each file and the line(s) where the word was found. A right-click on the file name is enough to open the file in Notepad and modify it.
The search takes about 30 seconds.
There are about 1200 header files in the SDK, and I haven't counted the lines.


herge

 Hi There:

Some interesting results:

Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32s=124strlen64B=84NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
strlen32s            1467 cycles
strlen64LingoB       1213 cycles
NWStrLen             1323 cycles
_strlen (Agner Fog)  2804 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32s            394 cycles
strlen64LingoB       321 cycles
NWStrLen             342 cycles
_strlen (Agner Fog)  712 cycles

-- test 1k, misaligned 15, 1024 bytes
  Masm32 lib szLen   1055 cycles
  crt strlen         618 cycles
strlen32s            114 cycles
strlen64LingoB       85 cycles
NWStrLen             113 cycles
_strlen (Agner Fog)  197 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   106 cycles
  crt strlen         69 cycles
strlen32s            17 cycles
strlen64LingoB       11 cycles
NWStrLen             20 cycles
_strlen (Agner Fog)  21 cycles

-- test 1, misaligned 1, 100 bytes
  Masm32 lib szLen   106 cycles
  crt strlen         105 cycles
strlen32s            17 cycles
strlen64LingoB       11 cycles
NWStrLen             18 cycles
_strlen (Agner Fog)  21 cycles

-- test 5, misaligned 5, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         17 cycles
strlen32s            4 cycles
strlen64LingoB       1 cycles
NWStrLen             9 cycles
_strlen (Agner Fog)  7 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         16 cycles
strlen32s            3 cycles
strlen64LingoB       2 cycles
NWStrLen             10 cycles
_strlen (Agner Fog)  7 cycles
-- hit any key --


And under WinDbg I can't wait until it finishes.

See Attachment.

It's VERY, VERY SLOW!

Regards herge



[attachment deleted by admin]
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

Mark Jones

Latest:

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
codesizes: strlen32s=132strlen64B=84NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
strlen32s            3206 cycles
strlen64LingoB       3188 cycles
NWStrLen             3198 cycles
_strlen (Agner Fog)  14239 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32s            842 cycles
strlen64LingoB       826 cycles
NWStrLen             842 cycles
_strlen (Agner Fog)  3560 cycles

-- test 1k, misaligned 15, 1024 bytes
  Masm32 lib szLen   1240 cycles
  crt strlen         843 cycles
strlen32s            254 cycles
strlen64LingoB       240 cycles
NWStrLen             255 cycles
_strlen (Agner Fog)  917 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   140 cycles
  crt strlen         99 cycles
strlen32s            55 cycles
strlen64LingoB       40 cycles
NWStrLen             53 cycles
_strlen (Agner Fog)  139 cycles

-- test 1, misaligned 1, 100 bytes
  Masm32 lib szLen   140 cycles
  crt strlen         109 cycles
strlen32s            58 cycles
strlen64LingoB       43 cycles
NWStrLen             56 cycles
_strlen (Agner Fog)  103 cycles

-- test 5, misaligned 5, 15 bytes
  Masm32 lib szLen   22 cycles
  crt strlen         26 cycles
strlen32s            25 cycles
strlen64LingoB       22 cycles
NWStrLen             38 cycles
_strlen (Agner Fog)  36 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   23 cycles
  crt strlen         21 cycles
strlen32s            24 cycles
strlen64LingoB       21 cycles
NWStrLen             40 cycles
_strlen (Agner Fog)  35 cycles
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Jimg

AMD Athlon(tm) XP 3000+ (SSE1)
ERROR in StrSizeA at ebx=4096: 20535 bytes instead of 4096
ERROR in strlen32c at ebx=4096: 0 bytes instead of 4096
ERROR in strlen64B at ebx=4096: 20535 bytes instead of 4096
ERROR in NWStrLen at ebx=4096: 20535 bytes instead of 4096
codesizes: strlen32s=132strlen64B=84NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
strlen32s            14573 cycles
strlen64LingoB       2782 cycles
NWStrLen             2783 cycles
_strlen (Agner Fog)  22914 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32s            3661 cycles
strlen64LingoB       3453 cycles
NWStrLen             3470 cycles
_strlen (Agner Fog)  28603 cycles

-- test 1k, misaligned 15, 1024 bytes
  Masm32 lib szLen   1574 cycles
  crt strlen         931 cycles
strlen32s            944 cycles
strlen64LingoB       225 cycles
NWStrLen             227 cycles
_strlen (Agner Fog)  1487 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   169 cycles
  crt strlen         112 cycles
strlen32s            125 cycles
strlen64LingoB       40 cycles
NWStrLen             51 cycles
_strlen (Agner Fog)  193 cycles

-- test 1, misaligned 1, 100 bytes
  Masm32 lib szLen   169 cycles
  crt strlen         121 cycles
strlen32s            132 cycles
strlen64LingoB       40 cycles
NWStrLen             50 cycles
_strlen (Agner Fog)  161 cycles

-- test 5, misaligned 5, 15 bytes
  Masm32 lib szLen   26 cycles
  crt strlen         29 cycles
strlen32s            40 cycles
strlen64LingoB       32 cycles
NWStrLen             38 cycles
_strlen (Agner Fog)  50 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   26 cycles
  crt strlen         26 cycles
strlen32s            37 cycles
strlen64LingoB       33 cycles
NWStrLen             46 cycles
_strlen (Agner Fog)  72 cycles
                                        -- hit any key --

jj2007

Quote from: Jimg on March 17, 2009, 04:29:48 PM
AMD Athlon(tm) XP 3000+ (SSE1)
ERROR in StrSizeA at ebx=4096: 20535 bytes instead of 4096
ERROR in strlen32c at ebx=4096: 0 bytes instead of 4096
ERROR in strlen64B at ebx=4096: 20535 bytes instead of 4096
ERROR in NWStrLen at ebx=4096: 20535 bytes instead of 4096


Yeah, it's mostly SSE2 only. My strlen32s is not in the error list because it reverts to crt_strlen for SSE<2 (compare the timings :bg)
But I wonder whether it would run on an SSE1 CPU...? The instructions I used (movups, pmovmskb, pcmpeqb) seem to be SSE1 ::)

Could you please make a test by adding CheckSSE2 = 0 before the include line, i.e.

CheckSSE2 =0
include \masm32\include\slenSSE2.inc
include \masm32\macros\timers.asm

in slen_timings.asm?
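For anyone who wants to check their own machine first, here is a minimal sketch of a runtime SSE2 test via CPUID. HasSSE2 is a hypothetical helper name, and this is not necessarily how the CheckSSE2 switch in slenSSE2.inc works. As far as I can tell from the Intel manuals, movups is SSE1, while pcmpeqb and pmovmskb on xmm registers are SSE2 encodings (only the mm-register forms are older).

```asm
include \masm32\include\masm32rt.inc
.586                                ; cpuid needs at least the .586 instruction set

.code

; Hypothetical helper: returns eax = 1 if the CPU reports SSE2, else 0.
; Assumes the CPU supports the cpuid instruction at all (Pentium or later).
HasSSE2 proc uses ebx               ; cpuid overwrites ebx, which is callee-saved
    mov eax, 1                      ; leaf 1: feature flags
    cpuid                           ; results in eax, ebx, ecx, edx
    xor eax, eax
    bt edx, 26                      ; CPUID.1:EDX bit 26 = SSE2 (bit 25 = SSE)
    setc al                         ; carry flag holds the tested bit
    ret
HasSSE2 endp

start:
    call HasSSE2
    .if eax
        print "SSE2 available", 13, 10
    .else
        print "No SSE2: use a non-SSE2 len()", 13, 10
    .endif
    exit
end start
```

On a CPU without SSE2, executing an xmm form of pcmpeqb would normally raise an invalid opcode exception, so a check like this belongs before the first call rather than inside the timing loop.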

Jimg

Sure-
AMD Athlon(tm) XP 3000+ (SSE1)
ERROR in StrSizeA at ebx=4096: 20535 bytes instead of 4096
ERROR in strlen32c at ebx=4096: 0 bytes instead of 4096
ERROR in strlen32s at ebx=4096: 2 bytes instead of 4096
ERROR in strlen64B at ebx=4096: 20535 bytes instead of 4096
ERROR in NWStrLen at ebx=4096: 20535 bytes instead of 4096
codesizes: strlen32s=88strlen64B=84NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 0 bytes
strlen32s            25 cycles
strlen64LingoB       2782 cycles
NWStrLen             2785 cycles
_strlen (Agner Fog)  22940 cycles

-- test 4k, misaligned 11, 0 bytes
strlen32s            29 cycles
strlen64LingoB       3453 cycles
NWStrLen             3474 cycles
_strlen (Agner Fog)  28719 cycles

-- test 1k, misaligned 15, 0 bytes
  Masm32 lib szLen   1577 cycles
  crt strlen         933 cycles
strlen32s            29 cycles
strlen64LingoB       226 cycles
NWStrLen             227 cycles
_strlen (Agner Fog)  1489 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   169 cycles
  crt strlen         112 cycles
strlen32s            25 cycles
strlen64LingoB       40 cycles
NWStrLen             51 cycles
_strlen (Agner Fog)  193 cycles

-- test 1, misaligned 1, 0 bytes
  Masm32 lib szLen   170 cycles
  crt strlen         122 cycles
strlen32s            29 cycles
strlen64LingoB       40 cycles
NWStrLen             50 cycles
_strlen (Agner Fog)  161 cycles

-- test 5, misaligned 5, 0 bytes
  Masm32 lib szLen   27 cycles
  crt strlen         29 cycles
strlen32s            29 cycles
strlen64LingoB       32 cycles
NWStrLen             38 cycles
_strlen (Agner Fog)  50 cycles

-- test 15, misaligned 15, 0 bytes
  Masm32 lib szLen   26 cycles
  crt strlen         25 cycles
strlen32s            29 cycles
strlen64LingoB       33 cycles
NWStrLen             46 cycles
_strlen (Agner Fog)  72 cycles
                                        -- hit any key --