szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

jj2007

#255
Quote from: lingo on March 13, 2009, 04:03:00 PM
It seems... you prefer to steal others' ideas and algos (for example: from lesson 35, Iczelion) rather than use your own automatic highlighting algo for .RTF files to resolve the problems. :wink

Tut 35 has 1265 lines, my RichMasm source has over 9500. No need to steal. Besides, I also wrote already that I don't like Xmas trees. I am beyond that age :bg

Quote
"and one or two cycles faster."
Read A.Fog: Which one is faster - jump to register or jump to memory

For me:  mov ecx, [esp-8] ; this instruction is for free!!! :lol
       .......   
               jmp  ecx

is faster than
   jmp dword ptr [esp-8]

If you disagree, just ask herge or sinsi to run the tests for you (due to the archaic type of your CPUs).  :lol
Their CPUs are OK.

I am a fan of Agner, but I am an even greater fan of MichaelW's timer.asm :U
47      cycles jmp directly
48      cycles mov+jmp

47      cycles jmp directly
48      cycles mov+jmp

47      cycles jmp directly
48      cycles mov+jmp


Divide by ten, i.e. 4.7 vs. 4.8 cycles per jump. And before you shoot from the hip again: I never said that jumping directly is much faster.

EDIT: Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3): 40 for both of them, i.e. 4.00 cycles

NightWare

hmm... someone here once said "reading asm related posts is better than smoking marijuana" or something like that... seriously, that was an understatement...

just a few things, before it degenerates further:

concerning the algos: even with 1000+ instructions in x86, the simd instructions that compare bytes are not numerous, so using them is quite logical. since we generally use the same programming schemes, it seems quite logical to end up with a similar conception/algo/result... nothing "strange".

plus, if we post an algo we give others the possibility to improve it... we implicitly encourage the "copy/use" of the original algo as a basis... (that's the purpose of the laboratory, no?).

to finish: as someone else said, the benefit of this sort of algo is quite limited... nothing serious to fight for...


jj2007

Quote from: NightWare on March 14, 2009, 04:12:12 AM
concerning the algos: even with 1000+ instructions in x86, the simd instructions that compare bytes are not numerous, so using them is quite logical. since we generally use the same programming schemes, it seems quite logical to end up with a similar conception/algo/result... nothing "strange".
Indeed, I was not hinting at any copyleft issues :wink - it was merely an observation that apparently we (you, Lingo, myself) have pushed the CPU to its limits; so our algos must look almost identical. Tonight, I managed to squeeze out a few cycles by moving a line up or down, and then had the bright idea to unroll the inner loop, but nope, not a single cycle less, this is the limit. What counts in the end is a factor 5 improvement on szLen and crt_strlen, and a factor 10 on lstrlenA. For my part, this thread can be closed peacefully.

herge

 Hi jj2007:

We can't close yet, we haven't got to 20!
We're at 18, we can do it!

Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

lingo

"so using them is quite logical. since we generally use the same programming schemes, it seems quite logical to end up with a similar conception/algo/result... nothing "strange".
plus, if we post an algo we give others the possibility to improve it... we implicitly encourage the "copy/use" of the original algo as a basis... (that's the purpose of the laboratory, no?)."


I implicitly encourage everyone to improve it too, but not to take it from bad to worse.. :wink
In this case, as a free human being, I have a human right to give my opinion too... :lol

What about the criterion for who is right or wrong?
The results!
But we get different results on different CPUs,
as jj respectfully stated: "Ever heard about hardware differences?" :wink
Who optimizes code for archaic CPUs?  IMO sick people... :lol
Who preserves the ecx and edx registers in "this sort of algo"? IMO lame people...
I can continue with who and IMO... :wink


"the benefit of this sort of algo is quite limited ..."

A lot of people have a similar opinion, but fortunately some people at Intel
created new, faster instructions exactly for "this sort of algo"...
Speed is never enough.

"nothing serious to fight for..."
As an engineer I believe in numbers rather than in emotions and empty words such as "serious",
"unserious", etc...


jj2007

#260
For those who have followed this thread, here is, finally, a "library package". All you really need to do is extract the file slenSSE2.inc to \masm32\include\slenSSE2.inc

Here is the most basic usage example:

include \masm32\include\masm32rt.inc
include \masm32\include\slenSSE2.inc

.code
ShortString db "My short string", 0

start:
print offset ShortString, " has "
print str$(len(offset ShortString)), " bytes"

print chr$(13, 10, 10, "-- hit any key --")

getkey
exit

end start


If you use the len macro in your code, then the only difference from ordinary Masm32 code is line 2 (the slenSSE2.inc include), i.e. you can make entire projects a bit faster just by adding this line.
By default, my own strlen32s algo will be used for len. Lingo's and NightWare's algos can be forced by adding...
SlenUseAlgos = 2 ; Lingo
SlenUseAlgos = 4 ; NightWare
...before the include (see strlenSSE2.asm for more detail, and benchmarks comparing all three).
These two are equally fast; however, only the default algo (strlen32s) checks whether the CPU supports SSE2 code. If that check fails, len will revert to crt_strlen - slow, but still a factor 2 faster than the standard Masm32lib szLen.
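
For readers who do not want to open the attachment, the "check SSE2, else fall back" idea can be sketched in C. This is only an illustration and not the MASM code from slenSSE2.inc: the names sketch_len, sketch_len_sse2 and sketch_cpu_has_sse2 are made up for this example, the run-time check uses the GCC/Clang builtin __builtin_cpu_supports instead of a hand-written CPUID test, and the scan is the generic "compare 16 bytes against zero, take the byte mask" technique rather than any of the three tuned algos.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build-time test: were SSE2 intrinsics enabled for this compile? */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_SSE2_BUILD 1
#else
#define SKETCH_SSE2_BUILD 0
#endif

#if SKETCH_SSE2_BUILD
/* Run-time test, standing in for the CPUID check mentioned in the post.
   __builtin_cpu_supports is a GCC/Clang builtin; with other compilers this
   sketch simply trusts the build flags (an assumption of the example). */
static int sketch_cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 1;
#endif
}

/* Index of the lowest set bit; m must be non-zero. */
static unsigned sketch_lowest_bit(unsigned m)
{
    unsigned n = 0;
    while ((m & 1u) == 0) { m >>= 1; ++n; }
    return n;
}

/* Returns a 16-bit mask with bit i set if byte i of the block at p is zero. */
static unsigned sketch_zero_mask(const char *p)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i block = _mm_load_si128((const __m128i *)p); /* p is 16-byte aligned */
    return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));
}

static size_t sketch_len_sse2(const char *s)
{
    /* Round down to a 16-byte boundary: an aligned 16-byte load never
       crosses a page, so reading a few bytes before or after the string
       cannot fault (it is still outside what ISO C guarantees). */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned mask = sketch_zero_mask(p) >> (unsigned)(s - p); /* drop bytes before s */
    if (mask != 0)
        return sketch_lowest_bit(mask);
    for (;;) {
        p += 16;
        mask = sketch_zero_mask(p);
        if (mask != 0)
            return (size_t)(p - s) + sketch_lowest_bit(mask);
    }
}
#endif

/* SSE2 scan when available, plain CRT strlen otherwise. */
size_t sketch_len(const char *s)
{
    assert(s != NULL);
#if SKETCH_SSE2_BUILD
    if (sketch_cpu_has_sse2())
        return sketch_len_sse2(s);
#endif
    return strlen(s);
}
```

Calling sketch_len("My short string") returns the same value as strlen, whichever path is taken; the point is only that the caller never needs to know which one ran.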

Cheers, jj


EDIT: I removed the attachment in favour of the new version posted on page 19. See remarks on preserving edx.

mitchi

Nice work, bit artisans  :bg

herge

 Hi jj2007:

Eh "Houston we Have Liftoff!".

Great work jj2007.

I almost used the wrong assembler, you have to
use the assembler that comes with VC2005 Express.

Regards herge

jj2007

Quote from: herge on March 14, 2009, 11:20:47 PM
Hi jj2007:

Eh "Houston we Have Liftoff!".

Great work jj2007.

I almost used the wrong assembler, you have to
use the assembler that comes with VC2005 Express.

Regards herge

Thanxalot, herge. The credits go also to NightWare and Lingo, of course, whose algos can be activated easily as shown above.
@NightWare & Lingo: If you consider adding the CheckSSE2 to your algos, please let me know. The check costs only about one cycle (see below, bottom of tests: 5 instead of 4 cycles for the 15 byte string), and makes sure that code works fine on whatever archaic CPU the user runs :green

Re VC2005 Express: SSE2 code should also work with Masm 6.15, and it definitely works fine with JWasm.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
codesizes: strlen32s=124, strlen64B=84, NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
strlen32s            2918 cycles
strlen64LingoB       2921 cycles
NWStrLen             2935 cycles
_strlen (Agner Fog)  4264 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32s            753 cycles
strlen64LingoB       740 cycles
NWStrLen             757 cycles
_strlen (Agner Fog)  1096 cycles

-- test 1k, misaligned 15, 1024 bytes
  Masm32 lib szLen   1308 cycles
  crt strlen         971 cycles
strlen32s            198 cycles
strlen64LingoB       192 cycles
NWStrLen             208 cycles
_strlen (Agner Fog)  272 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   132 cycles
  crt strlen         110 cycles
strlen32s            27 cycles
strlen64LingoB       25 cycles
NWStrLen             32 cycles
_strlen (Agner Fog)  34 cycles

-- test 1, misaligned 1, 100 bytes
  Masm32 lib szLen   132 cycles
  crt strlen         132 cycles
strlen32s            28 cycles
strlen64LingoB       25 cycles
NWStrLen             32 cycles
_strlen (Agner Fog)  34 cycles

-- test 5, misaligned 5, 15 bytes
  Masm32 lib szLen   24 cycles
  crt strlen         28 cycles
strlen32s            5 cycles
strlen64LingoB       4 cycles
NWStrLen             15 cycles
_strlen (Agner Fog)  14 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   25 cycles
  crt strlen         25 cycles
strlen32s            5 cycles
strlen64LingoB       4 cycles
NWStrLen             15 cycles
_strlen (Agner Fog)  14 cycles

ToutEnMasm

Hello,
That's very good work, in MASM syntax.
Just a few words about compiling it.
It needs a masm32rt_586.inc; the one in masm32 is .486.
Write "   include slenSSE2.inc  ;include it in your masm32\include directory " in lensse2.asm, to avoid having to search for it.
Compile it as a console application with at least ml 7.0.
That's all.

herge


Hi jj2007:

I seem to have problems debugging it in windbg.
The EXE works great from dos. But I don't think
windbg likes CPUID for some reason.


strslensse2!start+0x1ab [C:\Program Files\Microsoft Visual Studio 8\VC\bin\strslensse2.asm @ 235]:
00401330 33c0            xor     eax,eax
00401332 0fa2            cpuid
00401334 0f31            rdtsc
00401336 52              push    edx
00401337 50              push    eax
00401338 c705dcb7400020a10700 mov dword ptr [strslensse2!__counter__loop__counter__ (0040b7dc)],7A120h
00401342 33c0            xor     eax,eax
00401344 0fa2            cpuid


So it's not your code, it's windbg acting up?
It screws up on either a T (trace) or a P (step)?
C:\Documents and Settings\User\My Documents\My Pictures\401332.zip

See attachment JPG EXE ASM

Regards herge



[attachment deleted by admin]

jj2007

Quote from: herge on March 15, 2009, 11:06:10 AM

Hi jj2007:

I seem to have problems debugging it in windbg.
The EXE works great from dos. But I don't think
windbg likes CPUID for some reason.


I remember having the same problem with OllyDbg, but right now I can't reproduce it. Any Olly experts around who could explain what's going on?

herge

 Hi jj2007:

We got the wrong EXE there oops!

Attachment EXE ASM

Regards herge


herge

hi jj2007:

Will try that again.

Let me know if you got it.

I am not having much luck with WinRAR today, it's a pain in the butt.

Regards herge

[attachment deleted by admin]

herge

 Hi jj2007:

I am making some progress but I could be going backwards?

Application popup: windbg.exe - Application Error : The instruction at "0x65e36abb" referenced memory at "0x00106130".
The memory could not be "read".

Click on OK to terminate the program

Application popup: windbg.exe - Application Error : The instruction at "0x65e36abb" referenced memory at "0x000fe190".
The memory could not be "read".

And when you put a breakpoint on 65e36abb you get a 299 error.

which gets you this:


Details
Product: Windows Operating System
ID: 26
Source: Application Popup
Version: 5.2
Symbolic Name: STATUS_LOG_HARD_ERROR
Message: Application popup: %1 : %2
   
Explanation
The program could not load a driver because the program user doesn't have sufficient privileges to access
the driver or because the drive is missing or corrupt.

   
User Action
To correct this problem:

Ensure that the program user has sufficient privileges to access the directory in which the driver is installed.
Reinstall the program to restore the driver to the correct location.
If these solutions do not work, contact Product Support Services.

//
// MessageId: STATUS_SHARED_POLICY
//
// MessageText:
//
// The policy object is shared and can only be modified at the root
//
#define STATUS_SHARED_POLICY             ((NTSTATUS)0xC0000299L)

Unable to insert breakpoint 10000 at 65e36abb, Win32 error 0n299
    "Only part of a ReadProcessMemory or WriteProcessMemory request was completed."
The breakpoint was set with BP.  If you want breakpoints
to track module load/unload state you must use BU.
go bp10000 at 65e36abb failed


I will keep you posted if we can get help from Microsoft,
but I won't hold my breath.

Regards herge