The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: KeepingRealBusy on July 07, 2010, 12:57:11 AM

Title: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 07, 2010, 12:57:11 AM
I was looking at the SSE conversions we were developing and thought about some of the consequences of using SSE (or a general purpose register like eax) when scanning a string instead of just a BYTE (or WORD for Wide Character Unicode).

I wanted to check what would happen at the end of a buffer allocated by VirtualAlloc. I created this small test case and verified that movdqu from the start of a short (in my case 7 BYTES) string at the end of the buffer would fault before you can check for a null. I then added the code to allow this to work correctly.

Note:

This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.

If you allocate a VirtualAlloc Buffer, and reserve the last 16 BYTES and do not put any data there (not even the last trailing null), then you will never overrun the buffer - you will always find the null first. Thus, you do not need this type of initial checking.

Note further:

This test case is coded for BYTE compares, but the same solution can be used when checking 8 wide characters in an xmm reg.

The method:

The plan is to force the first load to be at a mod 16 bound by ANDing the real start with -16. This will never overrun a VirtualAlloc Buffer because the start is always on a mod page size bound, and the length is always a mod page size length. Page size is 4KB with XP (maybe larger on 64 bit systems) so as long as you start at a mod 16 bound within the buffer, you will never overrun the buffer. This gets you more data than you wanted (the BYTES or WORDS preceding the string), so you calculate the number of bytes to skip and the bit mask to apply to the bits extracted by pmovmskb. Using a bsf on the extracted and masked bits in the register will then give you the correct index of the first instance of the desired character. My test was a simple check for nulls in a buffer full of nulls.
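A minimal sketch of this method, written in C with SSE2 intrinsics rather than MASM so it can be compiled and checked. The names aligned_strlen and lowest_set_bit are made up for illustration, and this is not the attached test case. It assumes an x86 target with SSE2; the small loop stands in for bsf. Reading the bytes in front of the string is the same hardware-level trick the post describes, not something ISO C itself guarantees.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128 (movdqa), _mm_movemask_epi8 (pmovmskb) */
#include <stddef.h>
#include <stdint.h>

/* Stand-in for bsf: index of the lowest set bit (bits must be non-zero). */
static unsigned lowest_set_bit(unsigned bits)
{
    unsigned i = 0;
    assert(bits != 0);
    while ((bits & 1u) == 0) {
        bits >>= 1;
        ++i;
    }
    return i;
}

/* Hypothetical helper: strlen that only ever issues 16-byte aligned loads. */
size_t aligned_strlen(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    unsigned skip = (unsigned)(addr & 15u);               /* garbage bytes before s */
    const __m128i *p = (const __m128i *)(addr & ~(uintptr_t)15);
    const __m128i zero = _mm_setzero_si128();
    unsigned bits;

    /* First load: aligned, so it cannot cross into the next page. */
    bits = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(p), zero));
    bits &= 0xFFFFu << skip;                              /* ignore nulls before s */

    /* All later loads are aligned too and need no mask. */
    while (bits == 0) {
        ++p;
        bits = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(p), zero));
    }
    return (size_t)(((const char *)p + lowest_set_bit(bits)) - s);
}
```

The mask is only applied once, before the loop, which matches the point that the fix does not have to cost anything in the main loop.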

Lingo and JJ Note:

Your code did an unaligned load, then masked to the next aligned bound then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlapped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc Buffer if you are checking for nulls.

All:

The same problem can occur if you use mov eax,[esi] to scan a string, checking for nulls as you go. Instead, force esi to the prior mod 4 bound, load the DWORD containing the desired starting character, and use the difference between the actual start and the aligned start to shift eax right 8 bits per skipped character (x86 is little-endian, so the lowest-addressed byte lands in al), and then shr eax,8 for each remaining character to be checked in al looking for a null. From then on, just adjust esi by 4 bytes and you will not overrun the buffer. You have to find some way to not overrun a VirtualAlloc buffer on a first check.
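The DWORD variant can be sketched the same way in C. This is an illustration only: dword_strlen is a made-up name, and a little-endian CPU (as on x86) is assumed, so the byte at the lowest address sits in the low 8 bits of the loaded value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: strlen using only 4-byte aligned DWORD loads.
   Assumes little-endian byte order. */
size_t dword_strlen(const char *s)
{
    unsigned skip;
    unsigned left;
    const char *p;
    uint32_t w;
    size_t len = 0;

    assert(s != NULL);
    skip = (unsigned)((uintptr_t)s & 3u);  /* bytes between aligned start and s */
    p = s - skip;                          /* prior mod 4 bound */
    memcpy(&w, p, sizeof w);               /* the aligned DWORD load */
    w >>= 8u * skip;                       /* drop the skipped leading bytes */
    left = 4u - skip;                      /* valid bytes left in w */

    for (;;) {
        while (left != 0) {
            if ((w & 0xFFu) == 0)
                return len;                /* null found in the low byte */
            w >>= 8;
            ++len;
            --left;
        }
        p += 4;                            /* next aligned DWORD */
        memcpy(&w, p, sizeof w);
        left = 4;
    }
}
```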

For scanning Unicode strings, you are dealing with 2 WORDS and not 4 BYTES, so make adjustments and just deal with it.

This test case is a good place to test out your favorite code fragment to ensure that the first load of your routine will not overrun the buffer.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: drizz on July 07, 2010, 01:42:51 AM
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.

Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 07, 2010, 02:29:39 AM
Unicode strings are no different to ansi strings apart from being 2 byte instead of one byte. Scan its length for a word size 0 as terminator. The alternative is OLE string where the length is stored in the 4 bytes below the start address.
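To make the two options concrete, here is a small C sketch. The helper names are hypothetical and the layout is only modelled by hand; real OLE strings come from the system allocator. It assumes the 4-byte prefix holds the length in bytes, not counting the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the byte length stored in the 4 bytes
   just below the start of a length-prefixed wide string. */
uint32_t prefixed_byte_length(const uint16_t *text)
{
    uint32_t nbytes;
    assert(text != NULL);
    memcpy(&nbytes, (const unsigned char *)text - 4, sizeof nbytes);
    return nbytes;
}

/* Length in 16-bit characters, taken from the prefix (no scan needed). */
size_t prefixed_char_count(const uint16_t *text)
{
    return prefixed_byte_length(text) / 2;
}

/* Length in 16-bit characters, found by scanning for a WORD-sized 0. */
size_t scanned_char_count(const uint16_t *text)
{
    size_t n = 0;
    assert(text != NULL);
    while (text[n] != 0)
        ++n;
    return n;
}
```

With a length prefix a routine knows up front how far it may read, so the end-of-buffer question from the first post does not come up.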
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 07, 2010, 03:13:58 AM
Quote from: drizz on July 07, 2010, 01:42:51 AM
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.



I read some of the comments in your link.  The common thinking was that it is faster to use the bad algos than worry about an exception once in a great while. In my case, I have actually supplied the very few instructions it takes to eliminate this possibility, and these could be executed just once, not in the main loop (I did imply that they were in the main loop by including the or edx,0FFFFh to force acceptance of all matches after the first, but this could be dropped and a separate compare loop coded that had no extra code). This fix only needs to affect the first access, all others use aligned accesses which will not fault if you are checking for nulls.

You can have the best of both worlds, speed and safety.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 07, 2010, 03:22:14 AM
Quote from: hutch-- on July 07, 2010, 02:29:39 AM
Unicode strings are no different to ansi strings apart from being 2 byte instead of one byte. Scan its length for a word size 0 as terminator. The alternative is OLE string where the length is stored in the 4 bytes below the start address.

Hutch,

I haven't yet tried this with the crt__ routines, but the code seems to be all WORD oriented for Unicode, so if a Unicode string is set to start on an odd BYTE bound, the crt__ routines should work. I have been looking at this with SSE in mind, including my initial check. With odd BYTE alignment to avoid buffer overrun, loading an xmm reg at an aligned boundary would put the characters split between words, not too good for pcmpeqw compares. Is such alignment allowable?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 07, 2010, 03:59:22 AM
Instead of guessing, load a unicode string on a 2 byte boundary and you never have the problem. 4 byte alignment is even better.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 07, 2010, 04:10:10 AM
Hutch,

I agree, use aligned strings. But what should a generalized library function do if passed such a string?

I will test this and see if it will work at all for CRT__ routines, and get back.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 07, 2010, 04:50:59 AM
I think that a generalized library should throw an exception on unaligned data, and yes that would include 16-bit unicode, which should of course be 2-byte aligned.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 07, 2010, 07:49:34 AM
Quote from: KeepingRealBusy on July 07, 2010, 12:57:11 AM
Lingo and JJ Note:

Your code did an unaligned load, then masked to the next aligned bound then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlapped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc Buffer if you are checking for nulls.

Dave, your point is valid - see also this post for a concrete example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375). We did test the other method, i.e. anding the first address and masking out the leading bits, but fail to remember why we didn't continue that road. Maybe because the masking out costs some cycles? Does anybody have a better idea than a shr/shl pair?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
36      cycles for 10*shr/shl eax, cl
15      cycles for 10*shr/shl eax, 15
6       cycles for 10*and eax, nn
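For reference, the two ways of masking out the leading bits give the same result. A small C sketch with made-up function names; n plays the role of the count in cl and must be below 32:

```c
#include <assert.h>
#include <stdint.h>

/* shr/shl pair: push the low n bits out, then shift back. */
uint32_t clear_low_bits_shift(uint32_t x, unsigned n)
{
    assert(n < 32);
    return (x >> n) << n;
}

/* Single and with a mask; the mask needs to be computed or looked up
   when n is only known at run time. */
uint32_t clear_low_bits_and(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x & ~((UINT32_C(1) << n) - 1u);
}
```

With a variable count the and version still has to build its mask, so the 6-cycle figure for a constant nn is not the whole cost in that case.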

Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 07, 2010, 08:14:25 AM
If you are thinking of using SSE instructions on unicode data, make sure you use the required alignment as some SSE instructions require 16 BYTE alignment and will crash if the data is not 16 byte aligned. You check this on an instruction by instruction basis from the Intel manual.
Title: Re: Possible problems with SSE usage.
Post by: asmfan on July 07, 2010, 09:05:13 AM
Everything depends on nature of input data: strings/buffers/sizes. And what's known/unknown.
As an example is copying standard functions: memcpy, memmove. Some regions may/may not overlap.

In this case what if we have some aligned/unaligned buffer as input? In some cases we cannot "and eax, -16" and "movdqa/movaps" - we'll have to movdqu/movups. And is the length known or unknown? If the first zero byte is the terminator, then only byte access is allowed in this case. Boundaries before and after, crossing the 16 byte alignment and cache.
String start =15
size = 19
String end = 33
how to work this situation out? 2 boundaries crossed /16, 32/. How to load it? Using unaligned and most common way with tail cases processing or else... with common GPR processing for head and tail or else...?
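When the length is known, one common answer is to split the range into an unaligned head, whole aligned 16-byte blocks, and a tail. A C sketch of that arithmetic, with a hypothetical struct and function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: bytes before the first 16-byte bound,
   number of whole aligned blocks, and leftover bytes at the end. */
typedef struct {
    size_t head;
    size_t blocks;
    size_t tail;
} split16;

split16 split_for_aligned16(uintptr_t start, size_t size)
{
    split16 r;
    size_t total = size;
    size_t to_boundary = (size_t)((16u - (start & 15u)) & 15u);

    r.head = to_boundary < size ? to_boundary : size;  /* byte-wise or GPR */
    size -= r.head;
    r.blocks = size / 16;                              /* movdqa-sized chunks */
    r.tail = size % 16;                                /* byte-wise or GPR */

    assert(r.head + 16 * r.blocks + r.tail == total);
    return r;
}
```

For the example above (start 15, size 19) this gives a head of 1 byte (offset 15), one aligned block (offsets 16 to 31) and a tail of 2 bytes (offsets 32 and 33).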
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 09, 2010, 02:02:09 AM
Here is a procedure, WordAlign, that will align an unaligned (odd BYTE bound) Wide Character string.

I even tested with the wcs_ routines, at least CRT__wcscpy, CRT__wcscmp, CRT__wcschr, and they all worked correctly. I had examined the crt source code and it all appeared to be just Wide Character oriented. Where is there any restriction documented? It all seems to work on BYTE bounds.

This is not my final version of WordAlign, still some cleanup to remove stalls, but as coded here, the logic is more readable.
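The attachment is not reproduced in this archive. As a rough idea of what such a routine does, here is a C sketch; it is not the WordAlign code itself. The name word_align_copy and its return convention are invented for illustration. It copies a wide string from any byte address into a properly aligned buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a 16-bit character string that may start on an
   odd byte address into an aligned destination of dst_chars characters.
   Returns the number of characters copied (terminator not counted),
   or -1 if the destination is too small. */
long word_align_copy(uint16_t *dst, size_t dst_chars, const void *src)
{
    const unsigned char *in = (const unsigned char *)src;
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    for (;;) {
        uint16_t ch;
        if (n == dst_chars)
            return -1;                       /* ran out of room before the null */
        memcpy(&ch, in + 2 * n, sizeof ch);  /* byte-wise read, any alignment */
        dst[n] = ch;
        if (ch == 0)
            return (long)n;
        ++n;
    }
}
```

On the -1 path the destination is left without a terminator, so a caller has to check the return value before using the copy.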

Enjoy,

Dave.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 10, 2010, 01:22:37 AM
Why not just use SEH to check for faults? idk how slow SEH is, but it'd get'r done  :U
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.

If your code is *expecting* an exception, there is something wrong.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 10, 2010, 03:55:21 PM
i have noticed - that's the way C programmers do it - lol
they catch everything with an exception handler
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 11, 2010, 11:57:53 PM
Is there any way to force PROC stack variables to be 16 byte aligned? The best I can do is:


    Local Var1:OWORD
    Local Var2:OWORD
    Local Var3:OWORD

   movdqa OWORD PTR Var1,xmm0
   movdqa OWORD PTR Var2,xmm1
   movdqa OWORD PTR Var3,xmm2


This assembles and reserves space, but no attempt to align.

All I can see to do is:


    Local Var1[16]:DWORD    ; need 3 OWORDS (12 DWORDS) plus up to 15 BYTES of alignment slack

   lea edx,Var1
   lea edx,[edx+15]
   and edx,-16
   movdqa OWORD PTR [edx], xmm0
   movdqa OWORD PTR [edx+16], xmm1
   movdqa OWORD PTR [edx+32], xmm2


Note: This takes 3 instructions just to set the address of the aligned variables, just to save doing an unaligned movdqu (and later the same for the restore).
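The same round-up can be written in C as a sketch. The names align_up_16 and save_three_aligned are hypothetical, and SSE2 intrinsics are assumed: over-allocate the local by 15 bytes, then add 15 and mask with -16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_store_si128 compiles to an aligned store */
#include <stdint.h>

/* Round a pointer up to the next 16-byte bound (the lea / and -16 pair). */
void *align_up_16(void *p)
{
    return (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15);
}

/* Hypothetical demo: three aligned 16-byte slots carved out of a local. */
int save_three_aligned(void)
{
    unsigned char raw[3 * 16 + 15];          /* 3 OWORDs plus alignment slack */
    __m128i *slot = (__m128i *)align_up_16(raw);

    assert(((uintptr_t)slot & 15u) == 0);
    assert((unsigned char *)(slot + 3) <= raw + sizeof raw);

    _mm_store_si128(&slot[0], _mm_set1_epi32(1));   /* would fault if unaligned */
    _mm_store_si128(&slot[1], _mm_set1_epi32(2));
    _mm_store_si128(&slot[2], _mm_set1_epi32(3));

    return _mm_cvtsi128_si32(_mm_load_si128(&slot[0]))
         + _mm_cvtsi128_si32(_mm_load_si128(&slot[1]))
         + _mm_cvtsi128_si32(_mm_load_si128(&slot[2]));
}
```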

Just one more question. I have been using the following command line option, "/Sg" which I found documented in Kip's book, 4th ed. This was for MASM 6.15. The option said "Turn on listing of assembly-generated code." MASM 9.0 and JWASM both accept the option. This is not documented in the current MSDN that came with MASM 9.0 (Visual Studio 8.0), nor in the help file in MASM32 lib. Is this option just accepted and ignored, or is it just not documented? Does this option do anything?
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 02:55:55 AM
that's strange
i don't see /Sg in my list (6.14)
to generate assembler listings, i have been using /Fl  <--------- that's a lower case L
C:\ => ml /help
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997.  All rights reserved.

        ML [ /options ] filelist [ /link linkoptions ]

/AT Enable tiny model (.COM file)         /nologo Suppress copyright message
/Bl<linker> Use alternate linker          /Sa Maximize source listing
/c Assemble without linking               /Sc Generate timings in listing
/Cp Preserve case of user identifiers     /Sf Generate first pass listing
/Cu Map all identifiers to upper case     /Sl<width> Set line width
/Cx Preserve case in publics, externs     /Sn Suppress symbol-table listing
/coff generate COFF format object file    /Sp<length> Set page length
/D<name>[=text] Define text macro         /Ss<string> Set subtitle
/EP Output preprocessed listing to stdout /St<string> Set title
/F <hex> Set stack size (bytes)           /Sx List false conditionals
/Fe<file> Name executable                 /Ta<file> Assemble non-.ASM file
/Fl[file] Generate listing                /w Same as /W0 /WX
/Fm[file] Generate map                    /WX Treat warnings as errors
/Fo<file> Name object file                /W<number> Set warning level
/FPi Generate 80x87 emulator encoding     /X Ignore INCLUDE environment path
/Fr[file] Generate limited browser info   /Zd Add line number debug info
/FR[file] Generate full browser info      /Zf Make all symbols public
/G<c|d|z> Use Pascal, C, or Stdcall calls /Zi Add symbolic debug info
/H<number> Set max external name length   /Zm Enable MASM 5.10 compatibility
/I<name> Add include path                 /Zp[n] Set structure alignment
/link <linker options and libraries>      /Zs Perform syntax check only


wow - i am not using the version i thought i was   :lol
and i spent all that time patching it, too
now, i hafta go patch 6.15
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 12, 2010, 03:24:29 AM
Dave,

I think the difference is that /Fl produces a listing (if you have .list in the source), but if a macro is used, output of generated code could be suppressed, unless /Sg was used. I have not tried to use /Sg without .list (I just did - /Sg does not override a missing .list). I will check for macro expansion. Note: Still do not know whether /Sg is actually supported, or just tolerated.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 12, 2010, 03:29:26 AM
The reference in Kip's book said the information "came from the last printed documentation from the MASM 6.11 reference manual", ... "with updated from MASM 6.14 readme.txt file".

Dave
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 12, 2010, 03:31:33 AM
Dave,

Do you have any words of wisdom about stack alignment of OWORDS?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 12, 2010, 04:18:21 AM
It is listed in MASMREF.DOC that came with the Processor Pack for VC 6.0 that included ML.EXE 6.15.
Quote:
/Sg   Turns on listing of assembly-generated code.


Regarding alignment, how about using _aligned_malloc()?
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 04:28:05 AM
ok - patched 6.15 and put together a new ML615 package
it contains the document from VS6, as well as the ReadMe's and a few others

http://www.4shared.com/file/-QIUp-BF/ml615.html

be sure to read the ML_ver.txt file
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 04:38:00 AM
Quote:
Do you have any words of wisdom about stack alignment of OWORDS?

it's nice to be wanted, but i am probably not the guy to ask - lol
i would think that Jochen, MichaelW, qWord and the other guys are far more qualified to help on this one
the problem is - many of the guys that have experience writing alignment macros aren't using 64-bit machines   :P

but, i would think that a creatively designed macro could replace the INVOKE macro/functionality of ml32 in ml64
these guys are good at macros - i bet they could write one that would align the stack in and out
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 12, 2010, 04:59:15 AM
Well now I know where I stand.  :wink

I am not an expert at SSE, but what's wrong with ALIGN 16 to align the LOCALs?
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 05:08:46 AM
lol Greg - i meant nothing like that at all
i just happen to know those guys have played specifically with alignment macros   :P
and, i thought i covered my ass bases with...
Quote:
...and the other guys
Title: Re: Possible problems with SSE usage.
Post by: Queue on July 12, 2010, 05:30:26 AM
dedndave, I'm confused by what the patch to /help does; /? and /help both show the list of switches for vanilla ML.EXE 6.15. Is /help meant to do something else? And does it do something else post-patch, because trying your patched ML.EXE, /? and /help still both show the list of switches.

Queue
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 05:50:11 AM
be careful that you are executing the right copy of ML   :bg
try ML615 /?
if you have ML in the path, you are looking at whatever version you have in the bin folder (probably)
i tested version 6.15.8803 and it fails for "/?", but works for "/help"
i didn't really patch any code on that
all i did was change the displayed string from
usage: ML [ options ] filelist [ /link linkoptions]
Run "ML /help" or "ML /?" for more info

to
usage: ML [ options ] filelist [ /link linkoptions]
Run "ML /help" for more info

for some reason, the parser sees "/?" as "/r"

the original (unpatched) 6.15.8803 displays the following...
C:\=> ml /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000.  All rights reserved.

MASM : warning A4018: invalid command-line option : /R
MASM : fatal error A1017: missing source filename

"?" is a filenaming wildcard character, i guess - lol
i dunno
i looked at the code that parses it to see about fixing it
it was over-complicated, if you ask me - lol
so, i just changed the displayed string
that is a good way to verify you have the patched version, i suppose
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 12, 2010, 06:03:35 AM

C:\Utils>ml /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000.  All rights reserved.


        ML [ /options ] filelist [ /link linkoptions ]

/AT Enable tiny model (.COM file)         /omf generate OMF format object file
/Bl<linker> Use alternate linker          /Sa Maximize source listing
/c Assemble without linking               /Sc Generate timings in listing
/Cp Preserve case of user identifiers     /Sf Generate first pass listing
/Cu Map all identifiers to upper case     /Sl<width> Set line width
/Cx Preserve case in publics, externs     /Sn Suppress symbol-table listing
/coff generate COFF format object file    /Sp<length> Set page length
/D<name>[=text] Define text macro         /Ss<string> Set subtitle
/EP Output preprocessed listing to stdout /St<string> Set title
/F <hex> Set stack size (bytes)           /Sx List false conditionals
/Fe<file> Name executable                 /Ta<file> Assemble non-.ASM file
/Fl[file] Generate listing                /w Same as /W0 /WX
/Fm[file] Generate map                    /WX Treat warnings as errors
/Fo<file> Name object file                /W<number> Set warning level
/FPi Generate 80x87 emulator encoding     /X Ignore INCLUDE environment path
/Fr[file] Generate limited browser info   /Zd Add line number debug info
/FR[file] Generate full browser info      /Zf Make all symbols public
/G<c|d|z> Use Pascal, C, or Stdcall calls /Zi Add symbolic debug info
/H<number> Set max external name length   /Zm Enable MASM 5.10 compatibility
/I<name> Add include path                 /Zp[n] Set structure alignment
/link <linker options and libraries>      /Zs Perform syntax check only
/nologo Suppress copyright message

Title: Re: Possible problems with SSE usage.
Post by: Queue on July 12, 2010, 06:07:54 AM
I double-checked, and even with your patched ML615.EXE, /? shows the switches, so I don't think I'm even encountering the problem you are. I have a vanilla 6.15 and a patched 6.15; the difference between my patched 6.15 and yours (ignoring the changes to the string you modified and the version number you tweaked) is a single byte at F6B8; in mine it's 0E, and in yours it's 7F. I'm mainly curious as to what that single byte difference is.

Queue
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 06:11:12 AM
ok - i downloaded the original file from

http://win32assembly.online.fr/download.html

it is version 6.15.8803
is that what you have ?

the byte at offset F6B8 is 0E

i renamed ml.exe to mx.exe and ml.err to mx.err to avoid confusion with the patched copy...
C:\=> mx /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000.  All rights reserved.

MASM : warning A4018: invalid command-line option : /R
MASM : fatal error A1017: missing source filename


modifying that byte to 7F has no effect on the "/?" output
Title: Re: Possible problems with SSE usage.
Post by: Queue on July 12, 2010, 06:11:59 AM
Yes, and I've done hex comparisons, byte-for-byte. I'm definitely working with the same version of ML as you.

Queue
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 12, 2010, 06:43:20 AM
Any alignment of the stack beyond the default 4 bytes will have to be done with some sort of code, a custom prologue for example, you can't just automatically align it.
If that was the case then we poor ml64 users would have it a lot easier...
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 12, 2010, 02:54:38 PM
Sinsi,

Thank you for the information. That is exactly what I did, but wondered what else could be done to eliminate the 3 instructions it takes to do this.

Dave.

Quote from: Greg Lyon on July 12, 2010, 04:59:15 AM
Well now I know where I stand.  :wink

I am not an expert at SSE, but what's wrong with ALIGN 16 to align the LOCALs?

Greg,

ALIGN 16 is an assembly time directive, it just adjusts the address at which the next data or instruction will be located, and in the case of instructions, it pads the code with dummy instructions that do not affect either the registers or the flags (instructions like lea ebx,[ebx]). What I needed was something to reserve space in the stack, and also something to selectively, at execution time, select the actual stack address to use for an aligned store (movdqa). It turns out that you must do this yourself since the stack can be aligned differently at each call.

I guess I will have to time this both ways, using code to align for a movdqa, or no code and use movdqu.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 12, 2010, 03:17:17 PM
Dave,

I see, I thought you wanted to align the OWORD data variables. Sorry, I misunderstood.


dedndave Dave,

I was kidding, hence the wink.  I am not as sharp as I used to be, some of these guys run circles around me.

Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 12, 2010, 03:20:07 PM
shoot, Greg - some of the newbies can make me look bad - lol
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 12, 2010, 03:52:53 PM
Well, here is my final version of WordAlign, actually KRBWordAlign (both functions are there, KRBWordAlign is tested). The KRBWordAlign version has the code shuffled around as much as possible to confuse the reader, WordAlign might be easier to understand. The attached zip contains modified tests that test all 32768 different lengths of a wide character test string, all stuffed at the end of a VirtualAlloc buffer with an odd BYTE start. Three error return conditions are tested.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.

If your code is *expecting* an exception, there is something wrong.

that's not true, various protection methods are based off SEH such as nanomites, page guards, some vm engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs it blowing up in your face, and is used in detecting virtual machines like vmware. It's good for self debugging as well.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 12, 2010, 11:53:22 PM
Quote from: E^cube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.

If your code is *expecting* an exception, there is something wrong.

that's not true, various protection methods are based off SEH such as nanomites, page guards, some vm engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs it blowing up in your face, and is used in detecting virtual machines like vmware. It's good for self debugging as well.

You are not *expecting* an exception in normal code execution with those things. You are expecting exceptions in *abnormal* execution in most of them, and in the other are trying to intentionally create abnormal behavior.

Rare events are not part of normal flow control.

Exceptions don't help you debug anything when you are using them for flow control. They make it much much harder. You could have checked the size of the buffer, but instead you waited for an exception.. really? that's easier to debug?
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 13, 2010, 03:06:01 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
Quote from: E^cube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.

If your code is *expecting* an exception, there is something wrong.

that's not true, various protection methods are based off SEH such as nanomites, page guards, some vm engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs it blowing up in your face, and is used in detecting virtual machines like vmware. It's good for self debugging as well.

You are not *expecting* an exception in normal code execution with those things. You are expecting exceptions in *abnormal* execution in most of them, and in the other are trying to intentionally create abnormal behavior.

Rare events are not part of normal flow control.

Exceptions don't help you debug anything when you are using them for flow control. They make it much much harder. You could have checked the size of the buffer, but instead you waited for an exception.. really? that's easier to debug?

actually again that's not true, in the case of nanomites it intentionally puts exception code in the place of say calls so that when the program comes across it, it throws an exception which the SEH handles, looks up the location in its database and runs the correct code to continue on.

In the terms of debuggers how do you think it breaks on certain parts of code? it's not magic...no, it sets an int 3 on the address, which throws an exception when run, that the debugger automatically handles and allows you to pause the program flow and see what's in registers etc...

and you're missing the point of all this, ideally he should just check the size of the buffer, sure, but what if a buffer accidentally isn't given, and instead an integer is? CRASH! this is unacceptable in vital programs running at system level such as services or programs that need to continue to run. His code could be incorporated in such a scenario is my point. Also it's not just about size, to give the example of wsprintf, if you have countless %s and very few string inputs, there's a crash right there which can run shellcode and all that non-sense, they did it with ollydbg.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.

Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...

But I agree that, beyond ideology, you need a damn good justification to use SEH that way. By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 13, 2010, 07:16:40 AM
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
To be fair, the docs say (now, maybe not before?)
Quote:
Using this function incorrectly can compromise the security of your application. This function uses structured exception handling (SEH) to catch access violations and other errors. When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error. The caller is not safe to assume that insufficient space is the error condition.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 13, 2010, 07:22:51 AM
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.

Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...

But I agree that, beyond ideology, you need a damn good justification to use SEH that way. By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?

i've already explained why, if your program is of any importance as far as it running or being safe then SEH is required, otherwise you just contribute to the countless "exploits" skiddies discover in badly written software, which leaves users more open to attackers and makes windows look worse. Any kind of server software, any kind of service, or similar. Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 13, 2010, 08:13:39 AM
Quote from: sinsi on July 13, 2010, 07:16:40 AM
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
To be fair, the docs say (now, maybe not before?)
Quote:
Using this function incorrectly can compromise the security of your application. This function uses structured exception handling (SEH) to catch access violations and other errors. When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error. The caller is not safe to assume that insufficient space is the error condition.

Interesting: So they know it. On the other hand, it supports the "let it crash" line. Do you have an error check after each lstrcpy or lstrcat? I have 226 occurrences of lstrc*** in the RichMasm source, none has an error check. So their SEH is a safe recipe for an unchaseable bug that occurs in very rare circumstances - the chance is about 1:4096. Fortunately most if not all of these lstrc*** deal with short messages well below the size of a page, but imagine you use it for file handling on a huge number of files? And we are not talking about a hobby coder's algo here - lstrcpy is part of the OS, and it crashes silently because they decided to "handle" the exception ::)
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 13, 2010, 08:22:45 AM
Well, instead of using lstrcpy you can rename all instances to lstrcpyx and just have it be your own function with SEH that tries to call lstrcpy ;D 2 second fix. But yeah, for a hobby coder or code being used in a program that has no importance, SEH isn't needed. SEH is also nice in debugging because you can log to a text file the registers' contents, the params passed when it crashed and of course which function.
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 13, 2010, 08:37:44 AM
Well, it is a function, which returns a value, so some error-checking would be in order...
Does anyone check the return values of RegisterClassEx/CreateWindowEx? Only when your code doesn't show a window I'll bet, then you get rid of the check.

Return values are there for a reason, and how bad is it to branch after a "test eax,eax" anyway?
Then we can get one of those "unexpected error" or "internal error" message boxes  :bdg
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 13, 2010, 08:46:27 AM
well - i have just seen several C examples where the exception was the rule (slap me if that's a bad pun - lol)
i am kind of an old-school guy, i know
but, i have always tried to write code so that errors don't happen to begin with
in the case of peripherals or some other hardware, of course, you can't always do that
but, generally, i try to test for and force correction of an error before allowing it to occur

here is a simple example:
i want to use the DIV instruction
prior to using it, i ensure the dividend is within range and the divisor is non-zero
in most cases, the logic of the code is such that these error conditions cannot occur
if they do occur, i allow the user to alter input parameters to fix the problem or whatever steps are appropriate
if divide-by-zero does occur in my program, it indicates that i have a bug in my code
i don't use the exception handler to catch the mistakes - lol
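
The "check before DIV" idea above can be sketched in C. This is only an illustration under assumptions: checked_div64by32 is a hypothetical helper name, and it models the 32-bit DIV instruction, which divides the 64-bit value in EDX:EAX by a 32-bit divisor and faults when the divisor is zero or the quotient does not fit in 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): performs the division only
   when the 32-bit DIV instruction would not raise a divide error. */
bool checked_div64by32(uint64_t dividend, uint32_t divisor,
                       uint32_t *quotient, uint32_t *remainder)
{
    if (divisor == 0)
        return false;                        /* divide by zero */
    if ((uint32_t)(dividend >> 32) >= divisor)
        return false;                        /* quotient would not fit in 32 bits */
    *quotient  = (uint32_t)(dividend / divisor);
    *remainder = (uint32_t)(dividend % divisor);
    return true;
}
```

A caller tests the return value and, on false, asks for new input or reports the problem instead of relying on an exception handler.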

now - that isn't very modern, perhaps
but, when i see exception handlers used that way, the term that comes to mind is "lazy coder"
i suppose there are many modern cases where my perception is off-base   :P
Title: Re: Possible problems with SSE usage.
Post by: asmfan on July 13, 2010, 09:57:51 AM
Why do you need to use SEH if nobody except you will process the raised exceptions?
Your app is wrapped in SEH by Windows anyway before it starts (UnhandledExceptionFilter etc.).
If it is designed to crash, it will crash anyway, with self-made SEH or without it and with system-made SEH.
Bad design plus unhandled SEH is the reason for the known result.
Windows is smart enough to handle such apps properly. So whether you use SEH or not, Windows will terminate your (or someone else's) buggy app anyway.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 13, 2010, 10:31:39 AM
Quote from: E^cube on July 13, 2010, 07:22:51 AM
Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.

A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 13, 2010, 10:50:03 AM
Quote from: MichaelW on July 13, 2010, 10:31:39 AM
Quote from: E^cube on July 13, 2010, 07:22:51 AM
Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.

A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.


If you're just writing hobby code that you're not going to use in any program of importance then you don't have to use SEH. When you write programs that many users use, however, and your lack of SEH puts the user's system at risk, like the countless apps I've seen on the "exploits" lists, then that's a problem.

Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 13, 2010, 11:40:09 AM
Quote from: MichaelW on July 13, 2010, 10:31:39 AM
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.

IMHO the extra clocks are not the problem. The divide by zero is the result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 13, 2010, 12:32:23 PM
Microsoft already gets enough heat from skiddies exploiting some of their API functions, in turn exploiting users' systems; that's in part why they created managed code and the safe API. I think lstrcpy not crashing is a good thing. All errors should be handled gracefully, because keep in mind that general users can barely check their email, much less know about programming etc... they don't want to see a program crash... it scares them.
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 13, 2010, 02:25:28 PM
Rockoon is right here: exception handling is for code that must deal with events that cannot be predicted at compile/assembly time (hardware, internet connections and the like). If something is not physically available then you must have a way to deal with the lack of response, but outside of those circumstances you should write code that does not have faults in it for its target market. Better to write suicide code that explodes in your face with an error than to have undebuggable junk that hides the problem.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 13, 2010, 03:27:26 PM
Quote from: jj2007 on July 13, 2010, 11:40:09 AM
IMHO the extra clocks are not the problem. The divide by zero is the result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.

I agree. My point was that the few more clocks justification is not valid.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 13, 2010, 04:38:08 PM
Quote from: MichaelW on July 13, 2010, 03:27:26 PM
Quote from: jj2007 on July 13, 2010, 11:40:09 AM
IMHO the extra clocks are not the problem. The divide by zero is the result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.

I agree. My point was that the few more clocks justification is not valid.


The justifications have already been pointed out, but let me reiterate and also clarify something for you: the clock cycles you posted do NOT increase incrementally. That is, you can write a great deal of additional code inside your exception handler, for example, and it wouldn't be astronomical in cycles just because it's in an exception handler; it's just the initial setup that takes the cycles. Also I recommend VEH over SEH as it's a lot more intelligent and I bet it's faster too.

As far as its use, I'm aware a lot of you are seasoned programmers, but this is definitely not the early 90's anymore. A lot has changed, including the programmer's responsibility to write safe/reliable code, not just in terms of your program using it, but from outside influences as I mentioned earlier. Also SEH, VEH etc... set to log function/code crashes is a very nice/fast way to narrow down bugs in your program, much faster than debugging, especially on x64 where there aren't a lot of good debuggers out yet. And thinking of the future, how great would it be for a user to be able to run your program on windows 13 that you wrote for windows xp, and have a log generated of the apis/functions crashing so that you can easily write a fix :)
Title: Re: Possible problems with SSE usage.
Post by: oex on July 13, 2010, 05:01:39 PM
:lol By Windows 13 my computer will be fixing its own damn bugs
Title: Re: Possible problems with SSE usage.
Post by: asmfan on July 13, 2010, 05:13:47 PM
SEH is only needed when dealing with something system-wide or global, i.e. shared among all processes (system settings, events as said, etc.), which surely must be processed at app termination (cleanup time). The rest can be handled by checking return values and GetLastError, or by a separate thread & synchronization (wait/until) - the weirdest nonblocking case.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 13, 2010, 05:59:06 PM
Quote from: E^cube on July 13, 2010, 04:38:08 PM
The justifications have already been pointed out, but let me reiterate and also clarify something for you: the clock cycles you posted do NOT increase incrementally. That is, you can write a great deal of additional code inside your exception handler, for example, and it wouldn't be astronomical in cycles just because it's in an exception handler; it's just the initial setup that takes the cycles.

The cycles I posted were for handling the exception. My test consumed ~3000 cycles if there was no exception, or ~14000 cycles if there was an exception.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 13, 2010, 08:49:47 PM
Jeez, guys! I'm sorry I raised such a firestorm. I was just trying to ensure that I wouldn't walk off of a VirtualAlloc buffer using SSE loads.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 14, 2010, 06:54:14 AM
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.

Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...

If the "good" checks cost more than the exception, then it's a very rare event. You do know how costly exceptions are, right? :) First it goes to the OS's exception handler, then possibly it gets offloaded to yours, and then maybe back to the OS again for the ones still unhandled.

As far as the vast majority of page faults listed in task manager, they are being handled by the OS's virtual memory subsystem. I believe only two scenarios encompass the entire count:

memory that was swapped out
memory mapped files

Neither of these is handled by your application, so both actually fall under my other observation: not being handled by the local procedure.

If there are other faults included in the count, I'm all ears. I do not believe that the faults that your program catches are included in the count, but meh..
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 14, 2010, 07:30:30 AM

Ah, now here is the rub.

On the one hand we have code that is going to overshoot its buffer on purpose, and then from time to time it is going to just catch an exception if one is raised because it not only overshot the buffer, it also overshot the contiguous memory pages the buffer resides in.

On the other we have code that will divide by zero from time to time, where the programmer is going to just catch the exception if one is raised, and then execute default-value semantics, error out, or whatever.

One of these is not like the other. In the buffer case, not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too? Wow. Just wow. In the divide by zero case, it's accidental or incidental, and not on purpose.

One area where I normally just want to swallow exceptions is fsqrt. Luckily the FPU lets us do just that.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Quote from: Rockoon on July 14, 2010, 07:30:30 AM

Ah, now here is the rub.

On the one hand we have code that is going to overshoot its buffer on purpose, and then from time to time it is going to just catch an exception if one is raised because it not only overshot the buffer, it also overshot the contiguous memory pages the buffer resides in.

On the other we have code that will divide by zero from time to time, where the programmer is going to just catch the exception if one is raised, and then execute default-value semantics, error out, or whatever.

One of these is not like the other. In the buffer case, not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too? Wow. Just wow. In the divide by zero case, it's accidental or incidental, and not on purpose.

One area where I normally just want to swallow exceptions is fsqrt. Luckily the FPU lets us do just that.

Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 14, 2010, 05:26:22 PM
Quote from: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?

Dave.

I'm saying that you normally shouldn't read 16 bytes from a buffer that ends in less than 16 bytes, and absolutely never read any bytes beyond the end of your allocation space without something having gone terribly wrong.

If you are intent on reading 16 bytes at a time then clearly performance is your concern. Since performance is your concern..

(A) align your strings so that they are 16-byte aligned.
(B) allocate space in 16-byte multiples so that you never overshoot your buffer.
(C) stop relying on NULL to terminate your strings. Store the length instead. You can still use a NULL to make them compatible with other routines.
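
Points (A) to (C) above could look roughly like the following C sketch. All names here (padded_str, padded_str_init, padded_str_free) are hypothetical, and plain malloc with manual alignment is an assumption made only to keep the sketch portable; it is not how anyone in the thread allocates memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string type: stored length, 16-byte aligned data,
   capacity rounded up to a multiple of 16 and zero padded. */
typedef struct {
    size_t len;   /* (C) length is stored, no scan for the terminator needed */
    size_t cap;   /* (B) usable bytes, always a multiple of 16 */
    char  *data;  /* (A) 16-byte aligned start of the text */
    void  *raw;   /* pointer returned by malloc, kept for free() */
} padded_str;

int padded_str_init(padded_str *s, const char *src)
{
    size_t len = strlen(src);
    size_t cap = (len + 1 + 15) & ~(size_t)15;  /* room for the NUL, rounded up */
    void *raw = malloc(cap + 15);               /* slack for manual alignment */
    if (raw == NULL)
        return -1;
    char *p = (char *)(((uintptr_t)raw + 15) & ~(uintptr_t)15);
    memcpy(p, src, len);
    memset(p + len, 0, cap - len);              /* NUL terminator plus zero padding */
    s->len = len;
    s->cap = cap;
    s->data = p;
    s->raw = raw;
    return 0;
}

void padded_str_free(padded_str *s)
{
    free(s->raw);
    s->raw = NULL;
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

With this layout every 16-byte load starting at data + 16*k for 16*k < cap stays inside the allocation, and the trailing NUL keeps the text usable by ordinary string routines.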

Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 14, 2010, 05:32:55 PM
i think that "overshoot" happens quite often
a lot of functions are passed string/buffer pointers without buffer size values
they assume the null terminator to be valid, i guess   :bg

it probably also happens when functions try to dword align themselves inside a buffer
care isn't always taken to ensure that accesses a few bytes above and below the buffer are avoided

when i wrote the ling long kai fang routines, i was extra careful to avoid this, and you have to specify both in and out sizes as parms
those routines dword-align themselves inside both buffers
i may have specified an aligned input buffer base - i don't remember at the moment
but, there is some code in there to avoid overshoot above the input value buffer
and some more code to avoid overshoot at the end of the output buffer
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 14, 2010, 06:03:00 PM
Quote from: Rockoon on July 14, 2010, 05:26:22 PM
Quote from: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?

Dave.

I'm saying that you normally shouldn't read 16 bytes from a buffer that ends in less than 16 bytes, and absolutely never read any bytes beyond the end of your allocation space without something having gone terribly wrong.

If you are intent on reading 16 bytes at a time then clearly performance is your concern. Since performance is your concern..

(A) align your strings so that they are 16-byte aligned.
(B) allocate space in 16-byte multiples so that you never overshoot your buffer.
(C) stop relying on NULL to terminate your strings. Store the length instead. You can still use a NULL to make them compatible with other routines.

You missed the operative sentence in my first post:

Quote
Note:

This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 14, 2010, 06:11:53 PM
Quote from: KeepingRealBusy on July 14, 2010, 06:03:00 PM
You missed the operative sentence in my first post:

Quote
Note:

This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.

Dave.


No, I didnt. A general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.

You have made the decision to not be general purpose when you started reading 16-bytes at a time.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 14, 2010, 06:53:13 PM
Quote from: Rockoon on July 14, 2010, 06:11:53 PMA general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.

lstrcpy is a general purpose routine... and I fully agree, it should not swallow protection violations

QuoteYou have made the decision to not be general purpose when you started reading 16-bytes at a time.

Although there are still a few non-SSE2 machines around, it might be time to declare reading 16-bytes at a time "normal".
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 14, 2010, 07:32:00 PM
Quote from: Rockoon on July 14, 2010, 06:11:53 PM
Quote from: KeepingRealBusy on July 14, 2010, 06:03:00 PM
You missed the operative sentence in my first post:

Quote
Note:

This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.

Dave.


No, I didnt. A general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.

You have made the decision to not be general purpose when you started reading 16-bytes at a time.

My routines:

    Don't swallow protection violations.
    Read 16 bytes at a time.
    Require valid null terminated strings.
    Handle both Wide character (Unicode) and normal character strings.

There is nothing that says that a 16 BYTE reading function cannot be a general purpose routine, they are not mutually exclusive.

What, exactly, do you not like about  my routines?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 14, 2010, 08:39:15 PM
Quote from: jj2007 on July 14, 2010, 06:53:13 PM
Quote from: Rockoon on July 14, 2010, 06:11:53 PMA general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.

lstrcpy is a general purpose routine...

Not if it swallows protection violations.


Quote from: jj2007 on July 14, 2010, 06:53:13 PM
Although there are still a few non-SSE2 machines around, it might be time to declare reading 16-bytes at a time "normal".

It really isnt an issue of "support." This is about design. If you swallow the page faults, then you are special purpose.

My last post was in error, however, since clearly the routine could be constructed to only make aligned reads even for unaligned input (result: never cross a page boundary in error when valid data was supplied to it) and that would make it general purpose.
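
A rough C sketch of that "aligned reads only" approach, using SSE2 intrinsics instead of MASM. The function names are hypothetical and this is not KeepingRealBusy's or jj2007's code. It assumes an x86 target with SSE2, and it deliberately reads bytes before the string and after its terminator within the same 16-byte block, which strict ISO C does not define even though the hardware access cannot fault.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Index of the lowest set bit (what bsf would return); m must be non-zero. */
static unsigned lowest_set_bit(unsigned m)
{
    unsigned i = 0;
    while ((m & 1u) == 0) {
        m >>= 1;
        ++i;
    }
    return i;
}

/* Hypothetical strlen that only issues 16-byte aligned loads. */
size_t strlen_sse2_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t addr = (uintptr_t)s;
    unsigned skip = (unsigned)(addr & 15);      /* bytes in the block before s */
    const char *block = (const char *)(addr & ~(uintptr_t)15);

    /* First block: aligned load, then shift out matches that precede s. */
    __m128i v = _mm_load_si128((const __m128i *)block);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
    mask >>= skip;
    if (mask != 0)
        return lowest_set_bit(mask);

    /* Later blocks are aligned, so each load stays inside one page. */
    for (;;) {
        block += 16;
        v = _mm_load_si128((const __m128i *)block);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)(block - s) + lowest_set_bit(mask);
    }
}
```

Because an aligned 16-byte block never straddles a page boundary, a validly terminated string cannot cause a fault here, whatever its starting alignment.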
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 14, 2010, 08:41:56 PM
Quote from: KeepingRealBusy on July 14, 2010, 07:32:00 PM
What, exactly, do you not like about  my routines?

I never said that I didn't like your routines. I never even looked at them prior to just now. I said that swallowing the page faults is not general purpose, which some posters seemed to consider a valid strategy (that a page fault wasnt an error, that the string could still have been terminated validly)
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 14, 2010, 09:00:43 PM
I stand corrected. I thought you were addressing your comments to my code and not to the other side discussion about SEH and faults.

What you see here of my code was only a little piece to handle unaligned (odd BYTE aligned) wide characters. From what I have learned, I will redo all my character routines, and implement wide character routines as well.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 15, 2010, 12:43:14 AM
So here is an attempt to start a "16-byte safe collection": good ol' string len.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
44      cycles for StrLen1 (safe)
47      cycles for StrLen2 (safe)
25      cycles for MasmBasic (unsafe)
132     cycles for MasmLib

44      cycles for StrLen1
47      cycles for StrLen2
25      cycles for MasmBasic
132     cycles for MasmLib

Results:
100 bytes for StrLen1
100 bytes for StrLen2
100 bytes for MasmBasic

Code sizes:
75      for StrLen1
75      for StrLen2
87      for MasmBasic
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 15, 2010, 01:57:56 AM
JJ,

I have looked at the code, and have one question. You pop the first 2 stack parameters into eax, leaving the return address on the stack, but it is unprotected by the esp value. I am not familiar with what happens during interrupts, but I do not think this is safe. If an interrupt comes in, where can the CPU save the current ip or any regs that need to be used?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 15, 2010, 02:02:37 AM
JJ,

A second problem. In the iteration loop you get two xmm regs (from [eax] and [eax+16]) without checking the first for nulls. This is not safe at the end of a VirtualAlloc buffer.

Dave.
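
To make the page arithmetic behind this concrete, here is a small C sketch. The 4 KB page size is an assumption (the value mentioned for XP earlier in the thread), and read_crosses_page is a hypothetical helper, not part of anyone's posted code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KB pages */

/* True if the byte range [addr, addr + len) touches more than one page. */
bool read_crosses_page(uintptr_t addr, size_t len)
{
    if (len == 0)
        return false;
    return (addr / PAGE_SIZE_ASSUMED) != ((addr + len - 1) / PAGE_SIZE_ASSUMED);
}
```

A single aligned 16-byte load in the last 16 bytes of a page (page offset 4080) does not cross, but the pair [eax] and [eax+16] covers 32 bytes from the same address and does. If the pair started on a 32-byte boundary (offset 4064 at the latest), both loads would stay in the same page.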
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 15, 2010, 06:33:42 AM
Dave,
Thanks for looking at that.

Re lingo's pop the ret address technique: Interrupts seem not to be a problem, although it is apparently nowhere documented.

Re point 2: You are perfectly right, there is a risk at the end of a VirtualAlloc buffer. Any suggestions? The routine is already a bit slow ::)
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 15, 2010, 07:08:52 AM
try this



invoke AddVectoredExceptionHandler,1,handlexcept

;do everything...


handlexcept proc uses edi pExceptionInfo:DWORD ;edi must be preserved for the caller
mov edi, pExceptionInfo
mov eax, [edi].EXCEPTION_POINTERS.pExceptionRecord
mov edx, [edi].EXCEPTION_POINTERS.ContextRecord
cmp [eax].EXCEPTION_RECORD.ExceptionCode,STATUS_BREAKPOINT ;only int 3 is handled here
jne @F
;cmp [eax].EXCEPTION_RECORD.ExceptionAddress,; is it our code address
;jne @F ;if not let others have a go

add [edx].CONTEXT.regEip,1 ;step over the one-byte int 3
mov eax,EXCEPTION_CONTINUE_EXECUTION
ret
@@:
mov eax,EXCEPTION_CONTINUE_SEARCH ;let others have a go
ret
handlexcept endp


VEH is xp+ only but it's a beautiful thing, it gets exceptions before SEH and others do, and you can add as many different handlers as you like, but only 1 is really needed. When you do the EXCEPTION_CONTINUE_SEARCH it passes it on to the other handlers then onto SEH, etc... down the list.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 15, 2010, 01:50:59 PM
just a thought, here - it may or may not offer a speed advantage
copy the buffer contents into a "safe" buffer that is known to have adequate tail-end space for the over-shoot
i know it takes time to copy, but at least you wouldn't have to test inside the loop
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 15, 2010, 09:59:01 PM
Quote from: jj2007 on July 15, 2010, 06:33:42 AM
Dave,
Thanks for looking at that.

Re lingo's pop the ret address technique: Interrupts seem not to be a problem, although it is apparently nowhere documented.

Re point 2: You are perfectly right, there is a risk at the end of a VirtualAlloc buffer. Any suggestions? The routine is already a bit slow ::)

As far as the pop two args from the stack, I remember Lingo doing:


    pop ecx    ;    Get return.
    pop eax    ;    Get first.
    pop ebx    ;    Get second.
    push eax  ;     Save relocated  return.


You now have a protected return and two unprotected args but in eax and ebx. This will work, but don't count on the unprotected args on the stack. Another trick was


    mov    eax,[esp+4]    ;    Get arg
    mov    [esp+4],esi     ;    Save esi over the arg.


This will also work.

If you can wait for just a bit, I am working on an entire set of string routines that are safe and (mostly) SSE. Right now I have towlower towupper wcslwr wcsupr wcslwr_s wcsupr_s, and am working on wcscpy (a modification of WordAlign from my zip here), then wcslen then wcschr then wcscmp, then wcsstr, then the wcsn.... Then I'll work on the normal string versions. These are all for my own use, but I'll publish in a source zip for others to blatantly steal (right, Lingo, isn't that what they do to yours?).

I have a question about what to do with error returns such as the crt__ functions return. I was thinking about returning error codes in edx and the normal return in eax. The end of the functions would end with an "or edx,edx" so that the caller could just "jz Good" or "jnz Bad". Since these are not CDECL, the flags would not be destroyed by INVOKE's add esp,n.

I have even more questions about some of the crt_ comments in \crt\src like "the return string can be shorter or longer than the input string". Maybe for MBCS, but for Unicode?

The following are some of my times:

AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
558     cycles for wInstr (MasmBasic)
16444   cycles for StrStrIW
37      cycles for crt_towlower
10      cycles for KRBtowlower
1002    cycles for crt__wcslwr
723     cycles for KRBwcslwr
383     cycles for KRBwcslwr2
32      cycles for crt_towupper
10      cycles for KRBtowupper
576     cycles for crt__wcsupr
829     cycles for KRBwcsupr
411     cycles for KRBwcsupr2
--- done ---

Dave.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 15, 2010, 11:06:49 PM
Quote from: KeepingRealBusy on July 15, 2010, 09:59:01 PM
As far as the pop two args from the stack, I remember Lingo doing:


    pop ecx    ;    Get return.
    pop eax    ;    Get first.
    pop ebx    ;    Get second.
    push eax  ;     Save relocated  return.


Dave,
You probably meant push ecx, not eax.
However, is it really needed?

Q. Why does MS-DOS switch stacks for hardware interrupts? (http://support.microsoft.com/kb/82774)
QuoteAPPLIES TO Microsoft Windows 3.1 Standard Edition

http://en.wikipedia.org/wiki/Task_State_Segment#Inner_Level_Stack_Pointers
QuoteThe TSS contains 6 fields for specifying the new stack pointer when a privilege level change happens. The field SS0 contains the stack segment selector for CPL=0, and the field ESP0/RSP0 contains the new ESP/RSP value for CPL=0. When an interrupt happens in protected (32-bit) mode, the x86 CPU will look in the TSS for SS0 and ESP0 and load their values into SS and ESP respectively. This allows for the kernel to use a different stack than the user program, and also have this stack be unique for each user program.

http://stackoverflow.com/questions/866672/switching-stacks-in-c
QuoteOn 16-bit DOS, an interrupt could occur and this interrupt would be initially running on the same stack. If you got interrupted in the middle of the operation, the interrupt could crash because you only updated ss and not sp.

On Windows, and any other modern environment, each user mode thread gets its own stack. If your thread is interrupted for whatever reason, it's stack and context are safely preserved
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 15, 2010, 11:38:01 PM
JJ,

But how do you get into ring 0 from your program, and how does the system know how to get back to you? The CPU needs to save your return information somewhere, and that somewhere is your current stack; THEN it can swap the stacks and ensure that the interrupt stack is enough for the processing.

Anyone else, am I wrong here?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 16, 2010, 12:54:28 AM
Yes.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 16, 2010, 01:48:08 AM
Quote from: E^cube on July 16, 2010, 12:54:28 AM
Yes.
In what way? I mean, how does the hardware change the stack on the fly without destroying any registers?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 16, 2010, 01:52:12 AM
it doesn't, I just really felt like saying yes :) I apologize
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 16, 2010, 01:55:59 AM
Apology accepted, but not necessary.

Are there any experts around who understand and can explain a privilege level switch?
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 16, 2010, 02:23:55 AM
Intel manuals, especially volume 3a chapter 6.3 "task switching"
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 16, 2010, 04:05:51 AM
Quote from: sinsi on July 16, 2010, 02:23:55 AM
Intel manuals, especially volume 3a chapter 6.3 "task switching"


sinsi,

Thank you. I knew that someday I would have to go through all of this. About 40 pages of documentation and diagrams later (AMD PDFs), I can safely say that anything we are doing here will not be affected by a task switch. The first thing that happens is that the stack pointer is saved in the TSS (system) and loaded with an appropriate new stack pointer, then the flags and eip are pushed onto the NEW frame, then all regs are saved in the TSS. An opposite set of actions causes the task to be restarted.

Only something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.

So, JJ, your code is safe, and yes, I meant "push ecx", and I would use this instead of leaving the return address unprotected. With some of the MASM32 macros, I would not trust that some invocation wouldn't push a register for a calculation or a call and wipe out an unprotected return address ("print" comes to mind).

Dave.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 16, 2010, 04:42:09 AM
Dave...
this topic has been beat to death a few times
it seems the members are split (50-50 ?) on this issue
some say it is ok to use space under [ESP] - some say it is not
the best we seem to do is - we agree to disagree  :P

out of old-school habit, i avoid using stack space under the stack pointer
those who argue it is ok say that windows protects that space, as interrupts, other threads, etc, are never allowed to access it
you'll have to decide for yourself   :bg
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 16, 2010, 05:10:56 AM
I've used parameters as storage before with no problems

myproc:
  xchg ebx,[esp+4]
  xchg esi,[esp+8]
  ...
  pop ecx
  pop ebx
  pop esi
  jmp ecx

I figure that if you reserve space (sub esp,xxx) it's yours but pushing params ([esp+x]) means they are fair game.
In the same way, anything below esp ([esp-x]) is undefined and likely to get zapped at some stage (is that what you mean by 'under [esp]' dedndave?), especially using a proc with a stack frame or simply forgetting what you did 50 lines ago  :bdg

It's all personal, that's why we have the freedom of asm and not the constraints of a hll.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 16, 2010, 06:14:58 AM
Quote from: KeepingRealBusy on July 16, 2010, 04:05:51 AM
Only something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.

So, JJ, your code is safe, and yes, I meant "push ecx", and I would use this instead of leaving the return address unprotected. With some of the MASM32 macros, I would not trust that some invocation wouldn't push a register for a calculation or a call and wipe out an unprotected return address ("print" comes to mind).

Dave.

Dave, thanks for reading this up in the "official" manuals. My Wiki quote on TSS said something similar, but Intel is a more reliable source.
So it boils down to "yes, you can do it but make sure you know what you are doing in that proc". And, for example, print obviously pushes parameters.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 16, 2010, 06:29:57 AM
Quote
Only something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.

hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?
Title: Re: Possible problems with SSE usage.
Post by: clive on July 16, 2010, 09:17:40 AM
Quote from: sinsi
I've used parameters as storage before with no problems

myproc:
  xchg ebx,[esp+4]
  xchg esi,[esp+8]


I would worry about the speed of XCHG on memory: it is atomic and exposes the speed of the underlying memory (DRAM).

On my 3 GHz Prescott the "xchg ebx,[ebp+8]" takes ~100 machine cycles, or 33 ns; the memory access speed is ~17 ns.
Title: Re: Possible problems with SSE usage.
Post by: sinsi on July 16, 2010, 09:23:29 AM
I know xchg is slow but how does a push/mov compare?
I also used esp, not ebp, would that make a difference?

It's all voodoo anyway eh?
Title: Re: Possible problems with SSE usage.
Post by: clive on July 16, 2010, 09:57:34 AM
Quote from: sinsi
I know xchg is slow but how does a push/mov compare?

They would go via the write buffer, and cache. PUSH/POP pairs, figure 6 cycles. MOV EAX,[EBP+x]; XCHG EAX,EBX; MOV [EBP+x],EAX; also around 6 cycles (P4 Prescott) in some synthetic testing.

Quote
I also used esp, not ebp, would that make a difference?

No, XCHG reg,mem is intrinsically locked, ESP or EBP, etc all perform the same.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 16, 2010, 12:05:31 PM
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:

182 cycles, (xchg reg,reg)*100
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100
183 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
Title: Re: Possible problems with SSE usage.
Post by: clive on July 16, 2010, 02:28:01 PM
Quote from: MichaelW
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100

How fast is the P3 running?

I'll note that the encoding for both is XCHG mem,reg

00000000  87 45 08         xchg    eax,[ebp+8]
00000003  87 45 08         xchg    [ebp+8],eax



00000000 874508                 xchg    [ebp+8],eax
00000003 874508                 xchg    [ebp+8],eax

Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 16, 2010, 02:57:40 PM
Quote
How fast is the P3 running?

If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.

Quote
I'll note that the encoding for both is XCHG mem,reg

I did it both ways to see if there would be any significant difference in the cycle counts. On my P3 there wasn't, the difference in the results is within the run-to-run variation that is typical for cycle counts in the thousands.
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 16, 2010, 03:11:40 PM
I have not bothered to benchmark the following test piece but from memory within an algorithm XCHG was usually slow and could be replaced by MOV with a faster result. The 3 tests are mem-mem, reg-mem, reg-reg with the 1st being the slowest and the last being fastest. I have mainly seen this operation in exchange sorts (pointers or values) and usually XCHG is off the pace.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL var1  :DWORD
    LOCAL var2  :DWORD

    push esi
    push edi

  ; ---------
  ; mem - mem
  ; ---------
    mov var1, 1234
    mov var2, 5678

    mov eax, var1
    mov ecx, var2
    mov var1, ecx
    mov var2, eax

    print str$(var1),13,10
    print str$(var2),13,10

  ; ---------
  ; reg - mem
  ; ---------
    mov esi, 1234
    mov var1, 5678

    mov eax, var1
    mov var1, esi
    mov esi, eax

    print str$(esi),13,10
    print str$(var1),13,10

  ; ---------
  ; reg - reg
  ; ---------
    mov esi, 1234
    mov edi, 5678

    mov edx, esi
    mov esi, edi
    mov edi, edx

    print str$(esi),13,10
    print str$(edi),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 16, 2010, 03:22:58 PM
Quote from: dedndave on July 16, 2010, 04:42:09 AM
out of old-school habit, i avoid using stack space under the stack pointer

Are we sure that no debuggers trash the area under the stack?

I remember at one time back in 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 16, 2010, 03:26:36 PM
Quote from: MichaelW on July 16, 2010, 12:05:31 PM
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:

Prescott P4:
146 cycles, (xchg reg,reg)*100
9247 cycles, (xchg reg,mem)*100
9277 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
306 cycles, (exchange reg,mem)*100 using mov
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]


The latter are intermediate cases using the stack:
        push edx
        mov edx, [ebx]
        pop [ebx]
...
        push [ebx]
        mov [ebx], edx
        pop edx


Slower than exchange reg,mem using mov but a lot faster than xchg.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 16, 2010, 03:38:26 PM
Quote from: Rockoon on July 16, 2010, 03:22:58 PM
Are we sure that no debuggers trash the area under the stack?

I remember at one time back in 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.

In the 16-bit RM days hardware interrupts would use whatever stack was active when the interrupt occurred.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 16, 2010, 03:49:28 PM

IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    call main
    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL var1  :DWORD
    LOCAL var2  :DWORD

    push esi
    push edi

    invoke Sleep, 4000

  ; ---------
  ; mem - mem
  ; ---------
    mov var1, 1234
    mov var2, 5678

    counter_begin 1000, HIGH_PRIORITY_CLASS

    REPEAT 8
    mov eax, var1
    mov ecx, var2
    mov var1, ecx
    mov var2, eax
    ENDM

    counter_end
    print str$(eax)," cycles, mem - mem",13,10

    ;print str$(var1),13,10
    ;print str$(var2),13,10

  ; ---------
  ; reg - mem
  ; ---------
    mov esi, 1234
    mov var1, 5678

    counter_begin 1000, HIGH_PRIORITY_CLASS

    REPEAT 8
    mov eax, var1
    mov var1, esi
    mov esi, eax
    ENDM

    counter_end
    print str$(eax)," cycles, reg - mem",13,10

    ;print str$(esi),13,10
    ;print str$(var1),13,10

  ; ---------
  ; reg - reg
  ; ---------
    mov esi, 1234
    mov edi, 5678

    counter_begin 1000, HIGH_PRIORITY_CLASS

    REPEAT 8
    mov edx, esi
    mov esi, edi
    mov edi, edx
    ENDM

    counter_end
    print str$(eax)," cycles, reg - reg",13,10

    ;print str$(esi),13,10
    ;print str$(edi),13,10

    pop edi
    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


Running on a P3:

35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg

Title: Re: Possible problems with SSE usage.
Post by: Queue on July 16, 2010, 06:45:02 PM
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on an old Athlon 1.3 GHz:

147 cycles, (xchg reg,reg)*100
1630 cycles, (xchg reg,mem)*100
1631 cycles, (xchg mem,reg)*100
148 cycles, (exchange reg,reg)*100 using mov
270 cycles, (exchange reg,mem)*100 using mov
406 cycles, (exchange reg,mem)*100 using pop [ebx]
406 cycles, (exchange reg,mem)*100 using push [ebx]


Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a Core2Duo 2.8 GHz:

219 cycles, (xchg reg,reg)*100
1842 cycles, (xchg reg,mem)*100
1835 cycles, (xchg mem,reg)*100
184 cycles, (exchange reg,reg)*100 using mov
299 cycles, (exchange reg,mem)*100 using mov
507 cycles, (exchange reg,mem)*100 using pop [ebx]
507 cycles, (exchange reg,mem)*100 using push [ebx]


Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a P4 2.8 GHz:

146 cycles, (xchg reg,reg)*100
9271 cycles, (xchg reg,mem)*100
9158 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
312 cycles, (exchange reg,mem)*100 using mov
1005 cycles, (exchange reg,mem)*100 using pop [ebx]
497 cycles, (exchange reg,mem)*100 using push [ebx]


Why would xchg mem,reg be so extra costly on a P4?

Queue
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 16, 2010, 07:32:09 PM
that has always been that way - even on the 8088
i am a little surprised to see the xchg reg,reg comparison, though
for a long time, i have used XCHG EAX,reg32 (AX,reg16 in DOS) because it is a single byte op-code
still - it doesn't compare too badly against MOV

i see the test uses XCHG EDX,ECX - a 2-byte instruction
Title: Re: Possible problems with SSE usage.
Post by: clive on July 16, 2010, 07:44:28 PM
Quote from: MichaelW
Quote
How fast is the P3 running?

If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.

Trying to quantify the memory speed. The number of cycles relates to one SDRAM READ, followed by a WRITE, that occur back-to-back at the same address across the entire bit line width of the memory subsystem. In your case here about 19ns for the READ, and 19ns for the WRITE. Say 52 MHz

Quote from: Queue
Why would xchg mem,reg be so extra costly on a P4?

As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW) and a serializing event. Therefore the processor must complete/retire all pending operations (OoO, pipeline), entirely flush the write buffers (at whatever depth it has) in the CPU, flush out everything pending/deferred to memory in the chipset, and then complete an indivisible READ (setting up addresses, with CAS/RAS latencies) followed by a WRITE. This is pretty much the worst case for synchronous memories (SDRAM, DDR SDRAM, RAMBUS, etc.), exposing the nasty CL (CAS Latency) numbers printed on the DIMMs.

In order to allow the processor to speed along, almost everything sent to memory is buffered/deferred/delayed and written back lazily, and prefetching/cache line reads are prioritized so as not to stall forward motion of the processor.

It's not so much a cycles issue as a time issue.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 16, 2010, 07:51:14 PM
Which P4 core is that? A Northwood?

Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 16, 2010, 08:03:22 PM
Celeron M timings:
165 cycles, (xchg reg,reg)*100
1910 cycles, (xchg reg,mem)*100
1910 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
495 cycles, (exchange reg,mem)*100 using pop [ebx]
495 cycles, (exchange reg,mem)*100 using push [ebx]


Note the symmetry of the last two, in contrast to the Prescott P4:
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 16, 2010, 10:21:57 PM
Quote from: clive on July 16, 2010, 07:44:28 PM
Trying to quantify the memory speed.

PC133 SDRAM and IIRC I set it up to use the fastest supported timings.

Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 16, 2010, 11:08:52 PM
dedndave,

Quote
hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?

No, this is not a quote, but an observation from the documentation. If you are at a user task level, anything you do to get back to the OS must cause a task switch, and this automatically saves your stack pointer (and selector) in the TSS, then loads the stack pointer and selector with appropriate values depending on the reason for the switch (fault, interrupt, call), then starts saving your IP on the NEW (system) stack. Anything that happens to YOUR stack must happen at user task level, i.e., push, pop, mov and call (to your local procedure). With multiple threads, I believe, each thread has its own stack.

Could the OS possibly alter that? The OS is capable of putting anything anywhere in memory once it gets control. With a single core, a task switch must happen for the OS to get control, but with multi-core the other core may be running the OS. Yes, it could change something while you are running. Are we talking virus conditions here? If so, anything could happen; otherwise, I doubt it will. To quote "Pogo": "We have met the enemy and he is us."

Watch where you step, it gets pretty deep in some places.

Dave.

Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 16, 2010, 11:14:40 PM
Quote from: clive on July 16, 2010, 07:44:28 PM
Quote from: MichaelW
Quote
How fast is the P3 running?

If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.

Trying to quantify the memory speed. The number of cycles relates to one SDRAM READ, followed by a WRITE, that occur back-to-back at the same address across the entire bit line width of the memory subsystem. In your case here about 19ns for the READ, and 19ns for the WRITE. Say 52 MHz

Quote from: Queue
Why would xchg mem,reg be so extra costly on a P4?

As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW) and a serializing event. Therefore the processor must complete/retire all pending operations (OoO, pipeline), entirely flush the write buffers (at whatever depth it has) in the CPU, flush out everything pending/deferred to memory in the chipset, and then complete an indivisible READ (setting up addresses, with CAS/RAS latencies) followed by a WRITE. This is pretty much the worst case for synchronous memories (SDRAM, DDR SDRAM, RAMBUS, etc.), exposing the nasty CL (CAS Latency) numbers printed on the DIMMs.

In order to allow the processor to speed along, almost everything sent to memory is buffered/deferred/delayed and written back lazily, and prefetching/cache line reads are prioritized so as not to stall forward motion of the processor.

It's not so much a cycles issue as a time issue.

Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 17, 2010, 12:50:41 AM
as i mentioned, i am from the old-school side of the fence on this issue
so far, i have seen no harm come from using the stack that way
but, it seems to me that leaving the barn door open doesn't mean the horses are going to leave   :P
i feel more comfortable by adjusting the stack pointer
and, let's face it - it doesn't cost that much in terms of code size or clock cycles
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 01:00:13 AM
I totally agree.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: FORTRANS on July 17, 2010, 01:57:34 PM
Hi,

Quote
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?

   Those that can be used with the LOCK prefix come to mind.
And those are RMW instructions.  An old reference mentions
the following.


BT, BTC, BTR, BTS   mem, reg/imm
XCHG   reg, mem
ADD, ADC, AND, OR, SBB, SUB, XOR   mem, reg/imm
DEC, INC, NEG, NOT   mem


Regards,

Steve N.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 04:08:59 PM
Quote from: FORTRANS on July 17, 2010, 01:57:34 PM
Hi,

Quote
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?

   Those that can be used with the LOCK prefix come to mind.
And those are RMW instructions.  An old reference mentions
the following.


BT, BTC, BTR, BTS   mem, reg/imm
XCHG   reg, mem
ADD, ADC, AND, OR, SBB, SUB, XOR   mem, reg/imm
DEC, INC, NEG, NOT   mem


Regards,

Steve N.

Thank you, thank you, thank you. I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"

Dave.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 17, 2010, 04:27:27 PM
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"

You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 04:37:57 PM
Quote from: jj2007 on July 17, 2010, 04:27:27 PM
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"

You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink

JJ,

I think the restriction is mem, reg/imm, reg,reg should be ok. Another thing I have to read up in the specs.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 17, 2010, 04:49:50 PM
Here is a snippet:
     counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (inc mem)*100",13,10


... and various results:
170 cycles, (xchg reg,reg)*100
1909 cycles, (xchg reg,mem)*100
1909 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
307 cycles, (exchange reg,mem)*100 using mov
494 cycles, (exchange reg,mem)*100 using pop [ebx]
499 cycles, (exchange reg,mem)*100 using push [ebx]
594 cycles, (and mem)*100
594 cycles, (or mem)*100
594 cycles, (inc mem)*100
594 cycles, (inc dec mem)*100
594 cycles, (inc mem)*100


xchg seems to be the worst case.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 17, 2010, 04:50:02 PM
I think the ability of an instruction to take a LOCK prefix is not the problem; it's the actual presence of the prefix or, for

XCHG mem, reg

the lock that is implied. Per the Intel manual:

"If a memory operand is referenced, the processor's locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL."

Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 05:23:09 PM
Quote from: jj2007 on July 17, 2010, 04:49:50 PM
Here is a snippet:
     counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (inc mem)*100",13,10


... and various results:
170 cycles, (xchg reg,reg)*100
1909 cycles, (xchg reg,mem)*100
1909 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
307 cycles, (exchange reg,mem)*100 using mov
494 cycles, (exchange reg,mem)*100 using pop [ebx]
499 cycles, (exchange reg,mem)*100 using push [ebx]
594 cycles, (and mem)*100
594 cycles, (or mem)*100
594 cycles, (inc mem)*100
594 cycles, (inc dec mem)*100
594 cycles, (inc mem)*100


xchg seems to be the worst case.

JJ, could you post the .zip, I'll try on my AMD. Dave
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 17, 2010, 05:42:57 PM

;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================
    .data
        mem dd 0
    .code
;==============================================================================
start:
;==============================================================================
    invoke Sleep, 3000

    REPEAT 3

    counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (inc mem)*100",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      lea ebx, mem
      REPEAT 100
        lock inc dword ptr [ebx]
      ENDM
    counter_end
    print ustr$(eax)," cycles, (lock inc mem)*100",13,10

    ENDM

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start


627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
638 cycles, (inc mem)*100
2246 cycles, (lock inc mem)*100
627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100

Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 17, 2010, 06:10:16 PM
Quote from: KeepingRealBusy on July 17, 2010, 05:23:09 PM
JJ, could you post the .zip, I'll try on my AMD. Dave
Here it is.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 08:54:12 PM
JJ,

Here are my timings. I added my cpuid for identification - (why else would I add it?):

AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
144 cycles, (xchg reg,reg)*100
1853 cycles, (xchg reg,mem)*100
1819 cycles, (xchg mem,reg)*100
149 cycles, (exchange reg,reg)*100 using mov
506 cycles, (exchange reg,mem)*100 using mov
553 cycles, (exchange reg,mem)*100 using pop [ebx]
552 cycles, (exchange reg,mem)*100 using push [ebx]
793 cycles, (and mem)*100
777 cycles, (or mem)*100
819 cycles, (inc mem)*100
792 cycles, (inc mem)*100 using eax
808 cycles, (inc dec mem)*100
687 cycles, (inc mem)*100
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 10:17:32 PM
I tried to add a call to crt__wcslwr_s to my Timings. First, it would not assemble. I then added the code to msvcrt.inc (copy of crt__wcslwr with _s added). This fails to link. The source in VS at \vc\crt\src shows both of these functions defined in the same source module, both actually call a common subfunction, crt__wcslwr with a -1 for the size parameter and crt__wcslwr_s with the supplied size.

Why does crt__wcslwr_s not exist?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 17, 2010, 11:09:15 PM
if you look inside masm32\include\msvcrt.inc...
    externdef _imp___wcslwr:PTR c_msvcrt
    crt__wcslwr equ <_imp___wcslwr>

try adding this to the beginning of your program
    externdef _imp___wcslwr_s:PTR c_msvcrt
    crt__wcslwr_s equ <_imp___wcslwr_s>
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 11:21:48 PM
I did this, and got around the assembly error, but right into the linker error.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 17, 2010, 11:26:14 PM
It must be present in \masm32\lib\msvcrt.lib, too (and it's not there).
Add it to \masm32\tools\makecimp\msvcrt.txt...
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 17, 2010, 11:29:02 PM
JJ,

Thank you. I knew that someone here would have the answer.

Dave
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 17, 2010, 11:29:36 PM
oh i see - he's just making up names   :lol
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 17, 2010, 11:30:51 PM
It seems Hutch uses \masm32\tools\makecimp\makecimp.exe to modify the crt library.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 17, 2010, 11:55:44 PM
I think this is all correct:

;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================
    .data
      hmsvcr80   dd 0
      ws         dw "A","B","C",0
    .code
;==============================================================================
start:
;==============================================================================

    invoke LoadLibrary, chr$("msvcr80.dll")
    mov hmsvcr80, eax
    print hex$(eax),13,10

    invoke GetProcAddress, hmsvcr80, chr$("_wcslwr_s")
    mov esi, eax
    print hex$(eax),13,10

    invoke crt_printf,cfm$("%S\n"), ADDR ws
    push SIZEOF ws
    push OFFSET ws
    call esi
    add esp, 8
    invoke crt_printf,cfm$("%S\n"), ADDR ws

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
        invoke crt__wcslwr, ADDR ws
    counter_end
    print str$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
        push SIZEOF ws
        push OFFSET ws
        call esi
        add esp, 8
    counter_end
    print str$(eax)," cycles",13,10

    invoke FreeLibrary, hmsvcr80
    print str$(eax),13,10,13,10

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start


78130000
7817FCAB
ABC
abc
101 cycles
181 cycles
1


It should be no big deal to create an import library for msvcr80.dll.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 18, 2010, 01:00:38 AM
I don't know whether I blew it or not. I had tried to rebuild the library using "make" in \m32lib. Two modules had 2 errors, fptoa.asm and fptoa2.asm. I fixed the first error in each module from


    fbstp    [esp]


to


    fbstp    TBYTE PTR [esp]


That cleared up the first error in each, but the second error remains. A2006 undefined symbol PowerOf10. There is a proto for PowerOf10 in these modules, but the error remains.

Any good thoughts? Bad thoughts?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 18, 2010, 01:06:08 AM
OBTW, the make file has an error check for the assembly step, but apparently MASM does not set the error code for assembly errors using a response file.  I think I remember reporting this to MS long ago, but will have to dig to see what their response was.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 18, 2010, 01:26:19 AM
JJ,

> It seems Hutch uses \masm32\tools\makecimp\makecimp.exe to modify the crt library.

No, the tool just makes an import library that avoids naming conflicts with MASM reserved words. The content is determined by MSVCRT.DLL.

Now, with the function that KeepingRealBusy wants, the idea is to test whether it's there first; you do this with the normal LoadLibrary(), GetProcAddress() and see if you can call it that way. If so, then you simply add the name to the import list and build the import library.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 18, 2010, 02:03:19 AM
Well, I modified the crt library with makecimp.exe. No assembly or link problems now. When I try to execute the .exe, I get "The procedure entry point _wcslwr_s could not be located in the dynamic link library msvcrt.dll."

Now what?

Dave.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 18, 2010, 06:14:09 AM
Dave,
MichaelW found the reason: This is msvcr80.dll...
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 18, 2010, 06:31:11 AM
Check to see if the function you are after is in the standard MSVCRT.DLL
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 18, 2010, 12:02:33 PM
The attachment includes an import library and include file for msvcr80.dll and a small test app. Starting with the full module definition file (see msvcr80_full.def), I removed a number of exports that did not appear to me to be usable/useful. That left 1349 exports, versus 730 for msvcrt.lib. Note that I modified the generated include file substituting "cr8_" for "crt_" to avoid conflicts.

There is a msvcr80.dll version 8.0.50727.1433 on my Windows 2000 system. The file is dated Wednesday, October 24, 2007 so I have no idea how it got there, or if it will be present on other systems.
Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 18, 2010, 02:34:26 PM
Hutch, Michael,

I have the zip, will try it.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 18, 2010, 02:37:47 PM
After successful build, test.exe gives me a Visual C++ Runtime Library error:
R6034
An app has made an attempt to load the C runtime library incorrectly.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 18, 2010, 03:06:50 PM
There is an R6034 and that same error message in the DLL. The app runs no problem under Windows 2000. How did you build the EXE, and what happens for my previous example where I used run-time dynamic linking?

I think the solution is here (http://msdn.microsoft.com/en-us/library/ms235560.aspx), but I have no way to test it short of installing Windows XP on a spare system.

Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 18, 2010, 03:15:03 PM
Both versions show the same error. The msvcr80.dll sits in the same folder and is version 8.0.50727.42 of 23 sept 2005
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 18, 2010, 04:03:30 PM
msvcr80.dll is the C run-time library for Visual C++ 2005. It is a "side-by-side" DLL. Your executable should have a manifest to use it.

If you ask me, Microsoft really screwed things up when they went this route. It's the same situation in VC 2008 and VC 2010.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 18, 2010, 05:32:37 PM
Including a manifest eliminated the R6034 message, but now I get:

"The application failed to initialize properly (0xc0000142). Click..."

0xc0000142 is STATUS_DLL_INIT_FAILED

Any ideas?
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 18, 2010, 06:06:06 PM
You need to run mainCRTStartup?    ...    I don't know.

I ran your program (http://www.masm32.com/board/index.php?topic=14353.msg115460#msg115460) MichaelW, and I was surprised when it ran with no errors for me. I either use MSVCRT (usually) or the MSVCRxxx DLLs; I have never tried mixing them, as I figured it would just be too problematic.

Title: Re: Possible problems with SSE usage.
Post by: KeepingRealBusy on July 18, 2010, 09:00:44 PM
I know this is an interesting process, and I am interested in its conclusion. At least I (we) will know how to do this in the future.

BUT, I was just trying to drain the swamp! I do not really need crt__wcslwr_s. I was just trying to time it for comparison to my functions.

Dave.
Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 19, 2010, 01:06:44 AM
I think the problem may be with the manifest that I am using, but since I don't have VS I have no good idea what the manifest should contain for this specific purpose, and I just don't have enough interest to delve very far into Microsoft's ridiculous manifest thing.
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 19, 2010, 02:46:40 AM
The manifest needs to contain the following:

<dependency>
    <dependentAssembly>
      <assemblyIdentity type='win32' name='Microsoft.VC80.CRT' version='8.0.50608.0' processorArchitecture='x86' publicKeyToken='1fc8b3b9a1e18e3b' />
    </dependentAssembly>
</dependency>


The version must match your DLL version exactly and the publicKeyToken changes with the version.

I agree, it is ridiculous.
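
For reference, a minimal sketch of a complete manifest file wrapped around the dependency fragment above. The assembly root element and XML declaration follow the standard side-by-side manifest layout; the file name test.exe.manifest (placed next to test.exe) is an assumption for illustration, and the version and publicKeyToken values are simply copied from the fragment, not verified against any particular installed DLL.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed file name: test.exe.manifest, in the same folder as test.exe -->
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <dependency>
    <dependentAssembly>
      <!-- Values copied from the fragment above; adjust version to your DLL -->
      <assemblyIdentity type="win32"
                        name="Microsoft.VC80.CRT"
                        version="8.0.50608.0"
                        processorArchitecture="x86"
                        publicKeyToken="1fc8b3b9a1e18e3b" />
    </dependentAssembly>
  </dependency>
</assembly>
```

The manifest can also be embedded in the EXE as a resource of type RT_MANIFEST (24) with ID 1 instead of being shipped as a separate file.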


Title: Re: Possible problems with SSE usage.
Post by: MichaelW on July 19, 2010, 04:06:01 AM
So basically it looks like, for Windows XP and later, to use the DLL you must have VS.
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 19, 2010, 11:19:20 PM
MichaelW,

Not necessarily, you just need the correct entry in the manifest.

[Edit] And having VC++ makes things a lot easier.  :wink
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 21, 2010, 02:57:42 PM
useless dll, I dread even having to link to msvcrt.dll occasionally to cater to C code. We just need equivalent ASM functions for all the C ones to save the hassle. Unfortunately, while MASM32 and other ASM SDKs are great, they still lack a lot of formatting functions.
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 22, 2010, 01:22:05 AM
Cube,

The trick is to use MSVCRT, not the later side-by-side DLL. MSVCRT has been a standard "known" DLL since Win9x, whereas the side-by-side versions are all over the place on later versions.
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 22, 2010, 02:39:32 AM
E^cube,

I really disagree. msvcrt.dll is a standard system DLL and has a ton of useful and time-tested functions.  If they meet your needs, there is no reason to not use them.  Plus, if  you have ever programmed C, you are already familiar with them.  I would be more inclined to use a CRT function than to use some function Joe Blow came up with.

Title: Re: Possible problems with SSE usage.
Post by: ecube on July 22, 2010, 03:30:55 AM
Greg,
I disagree with your disagreement, msvcrt is fine in the C/C++ world, but we're part of the ASM world gentlemen, a better, more efficient world where mediocrity is left at the door. Sure if you just want subpar functions you can use msvcrt, but myself, i'd rather use a blazing fast hand written ASM equivalent. Why? Well, partly because I enjoy the speed, but also because it's green technology: faster speed means fewer clock cycles, which means more energy conserved.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
Title: Re: Possible problems with SSE usage.
Post by: oex on July 22, 2010, 04:04:16 AM
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?

What!!!!!! Who invented that????
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 22, 2010, 04:04:40 AM
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?

Far less than if I were doing it in a bloated C/C++ IDE... Fact of the matter is there are a billion C/C++ forums and sites, if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners ASM still flourishes.
Title: Re: Possible problems with SSE usage.
Post by: dedndave on July 22, 2010, 04:14:20 AM
the msvcrt does some under-the-hood stuff that is difficult to emulate or replace
partly because it has been around long enough to have the bugs worked out
partly because ms may have used proprietary knowledge in some of the code
most of the functions i have played with seem to compare well, performance wise, against anything i can write
all in all, it's fast and well-behaved - and.... it's already written !!!   :P

i suppose, now that most CPU's support SSE2, it could be improved upon for some time-intensive functions
but - i think most of that has been (or is being) hashed over in the forum
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 22, 2010, 04:26:11 AM
E^cube,

I love programming with MASM too, but that's no reason to exclude other languages. I like to program in C, (Power)BASIC and shudder, oh my God, C# and even PowerShell. The idea that since you're an ASM programmer you must exclude all other programming languages just doesn't sit well with me at all. The CRT functions are far from mediocre, they are usually slower than ASM written functions, but that's because they do a lot of error checking etc. that the ASM functions don't do. Think about it, these functions have been beaten to death and tested over the years. They are very reliable and very stable functions. It has been very infrequently that I have needed the speed of a hand tuned assembly procedure, but when I do, I can do it. ASM and C just go together, most C compilers include an assembler. You have every right to use only ASM and no CRT functions, knock yourself out, but don't push your limitations on everyone else.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 22, 2010, 04:51:16 AM
Quote from: Greg Lyon on July 22, 2010, 04:26:11 AM
E^cube,

I love programming with MASM too, but that's no reason to exclude other languages. I like to program in C, (Power)BASIC and shudder, oh my God, C# and even PowerShell. The idea that since you're an ASM programmer you must exclude all other programming languages just doesn't sit well with me at all. The CRT functions are far from mediocre, they are usually slower than ASM written functions, but that's because they do a lot of error checking etc. Think about it, these functions have been beaten to death and tested over the years. They are very reliable and very stable functions. It has been very infrequently that I have needed the speed of a hand tuned assembly procedure, but when I do, I can do it. ASM and C just go together, most C compilers include an assembler. You have every right to use only ASM and no CRT functions, knock yourself out, but don't push your limitations on everyone else.


i'm not pushing "my limitations" on anyone, i'm simply voicing my opinion, as you are. And the fact is, you and I are a completely different breed: myself, i'm an ASM warrior to the core, with unrelenting dedication and devotion to the language I love. You, as stated, are not; you are more of a neutral programmer. The problem is this is not a time of neutrality, this is a time of war, language war. Microsoft has already fired the first shot by horrifically crippling masm64, and by completely removing inline asm support in Visual Studio x64. In response the community has fired back with projects like jwasm, fasm and goasm, which all work great on x64. So you enjoy hopping around to different languages, that's your prerogative, but don't push your nonchalant attitude on those who just want to code in ASM, on an asm forum...
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 22, 2010, 06:44:42 AM
Quote from: E^cube on July 22, 2010, 04:04:40 AM
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?

Far less than if I were doing it in a bloated C/C++ IDE... Fact of the matter is there are a billion C/C++ forums and sites, if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners ASM still flourishes.

sigh...

why so defensive all of a sudden?

...was it because you don't really write in asm to be "green"?

I mean, it's a good "point" .. but it's not why you write in ASM. Really. It's not. You know it. I know it.
Title: Re: Possible problems with SSE usage.
Post by: Rockoon on July 22, 2010, 06:46:34 AM
Quote from: E^cube on July 22, 2010, 04:51:16 AM
i'm not pushing "my limitations" on anyone, i'm simply voicing my opinion, as you are. And the fact is, you and I are a completely different breed: myself, i'm an ASM warrior to the core, with unrelenting dedication and devotion to the language I love. You, as stated, are not; you are more of a neutral programmer. The problem is this is not a time of neutrality, this is a time of war, language war. Microsoft has already fired the first shot by horrifically crippling masm64, and by completely removing inline asm support in Visual Studio x64. In response the community has fired back with projects like jwasm, fasm and goasm, which all work great on x64. So you enjoy hopping around to different languages, that's your prerogative, but don't push your nonchalant attitude on those who just want to code in ASM, on an asm forum...

Come on now, fasm and goasm were not started for the reasons you are declaring. In fact, they were started before those reasons could even have existed.
Title: Re: Possible problems with SSE usage.
Post by: ecube on July 22, 2010, 07:49:13 AM
Quote from: Rockoon on July 22, 2010, 06:44:42 AM
Quote from: E^cube on July 22, 2010, 04:04:40 AM
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?

Far less than if I were doing it in a bloated C/C++ IDE... Fact of the matter is there are a billion C/C++ forums and sites, if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners ASM still flourishes.

sigh...

why so defensive all of a sudden?

...was it because you don't really write in asm to be "green"?

I mean, it's a good "point" .. but it's not why you write in ASM. Really. It's not. You know it. I know it.


Actually in part it is, i'm a big fan of the green, money, green tea, various plants... and yes green tech. i'm defensive because people are on the offense.

Also Jeremy is a smart guy, he recognized the ASM community had a large void that he could fill with Goasm, and chose to do so. That may not be his primary reason, but he took the time to write all that documentation etc, and spent years of his life on it, all for the public, so clearly he cares. Unlike most C/C++ GPL projects, where the authors are disrespectful and rude, looking only for self promotion and most of the time donations, Jeremy on the other hand has asked for nothing, and IMO deserves everything. RadASM and EasyASM are similar. These are prime examples of the power of ASM, the power of devoted developers and users who don't give up when faced with difficult engineering challenges. These are the kind of people I respect and appreciate.  :thumbu
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 22, 2010, 07:57:44 AM
Folks,

Can we avoid the "assembler wars" here? Different people have their reasons for using different tools; I simply respect their choice and don't inflict this stuff on them.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 22, 2010, 09:08:09 AM
;-)

Just in case somebody is still interested in the original topic: I have made the Recall macro "SSE2 safe".

Quote
include \masm32\MasmBasic\MasmBasic.inc   ; Download (http://www.masm32.com/board/index.php?topic=12460.0)
   Init
   Recall "\masm32\include\windows.inc", MyArray$(), -1, lc
   Print Str$("%i lines found in Windows.inc\n", lc)
   For_ n=0 To Min(lc-1, 15)
      mov ecx, n                   ; we need some proof that
      lea ecx, [2*ecx+27]        ; this is assembler ;-)
      Print Str$("\nLine %i\t", n+1), Left$(MyArray$(n), ecx)
   Next
   
Exit
end start

Output:
22272 lines found in Windows.inc

Line 1  comment * -=-=-=-=-=-=-=-=-
Line 2
Line 3        WINDOWS.INC for 32 bit MA
Line 4
Line 5        This version is compatible wi
Line 6
Line 7        Project WINDOWS.INC at the Masm F
Line 8
Line 9        http://www.masm32.com/board/index.php
Line 10
Line 11       WINDOWS.INC is copyright software licence
Line 12       MASM32 project. It is available completely
Line 13       for any person to use for purposes including
Line 14       commercial software but the file must not be in
Line 15       commercial package and the file may not be redist
Line 16       without express permission from the MASM32 project.


> This version is compatible with ML.EXE Version 8.0

Hutch,
It's compatible with 6.14, 6.15, 9.0 and JWasm, too.
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 22, 2010, 02:42:58 PM
JJ,

> It's compatible with 6.14, 6.15, 9.0 and JWasm, too.

Do you think you can expand on this? I use all versions of ML.EXE from 6.14 to 10.0 and don't have any problems with the current form of Windows.inc. Noting that I don't get paid for maintaining this file, perhaps you could share with me what the problem is.
Title: Re: Possible problems with SSE usage.
Post by: jj2007 on July 22, 2010, 04:29:10 PM
> Noting that I don't get paid for maintaining this file, perhaps you could share with me what the problem is.

Cool down young friend :bg

I just wanted to encourage you to mention that it is compatible not only with 8.0. It's kind of a compliment, Sir Hutch :U
Title: Re: Possible problems with SSE usage.
Post by: hutch-- on July 22, 2010, 04:51:43 PM
 :bg

After midnight readings,  :U
Title: Re: Possible problems with SSE usage.
Post by: GregL on July 22, 2010, 05:28:11 PM
E^cube,

You should probably avoid the Windows API also, as it was written mostly in C.

Title: Re: Possible problems with SSE usage.
Post by: ecube on July 23, 2010, 02:17:13 AM
Quote from: Greg Lyon on July 22, 2010, 05:28:11 PM
E^cube,

You should probably avoid the Windows API also, as it was written mostly in C.



Hutch has spoken, that means you don't, unless it's about the thread topic. Thanks
Title: Re: Possible problems with SSE usage.
Post by: Antariy on August 12, 2010, 07:42:49 PM
Quote
After successful build, test.exe gives me a Visual C++ Runtime Library error:
R6034
An app has made an attempt to load the C runtime library incorrectly.

After 2 weeks nobody is interested in one version of MSVCR80.DLL, so... :P