I was looking at the SSE conversions we were developing and thought about some of the consequences of using SSE (or a general purpose register like eax) when scanning a string instead of just a BYTE (or WORD for Wide Character Unicode).
I wanted to check what would happen at the end of a buffer allocated by VirtualAlloc. I created this small test case and verified that a movdqu from the start of a short (in my case 7 BYTE) string at the end of the buffer would fault before you can check for a null. I then added the code to allow this to work correctly.
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
If you allocate a VirtualAlloc Buffer, and reserve the last 16 BYTES and do not put any data there (not even the last trailing null), then you will never overrun the buffer - you will always find the null first. Thus, you do not need this type of initial checking.
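For example (the buffer size and variable names are arbitrary - just to illustrate the layout):

BUF_SIZE  equ 10000h                  ; whatever you need - VirtualAlloc is page granular anyway
DATA_MAX  equ BUF_SIZE - 16           ; usable bytes, the last 16 stay empty

    invoke  VirtualAlloc, NULL, BUF_SIZE, MEM_COMMIT, PAGE_READWRITE
    mov     pBuffer, eax              ; strings (and their terminating nulls) are only
                                      ; ever written to the first DATA_MAX bytes, so a
                                      ; 16 byte load that starts inside the data can
                                      ; never run past the committed pages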
Note further:
This test case is coded for BYTE compares, but the same solution can be used when checking 8 wide characters in an xmm reg.
The method:
The plan is to force the first load to be at a mod 16 bound by ANDing the real start with -16. This will never overrun a VirtualAlloc buffer because the start is always on a mod page size bound and the length is always a mod page size length. Page size is 4KB with XP (maybe larger on 64-bit systems), so as long as you start at a mod 16 bound within the buffer, you will never overrun the buffer. This gets you more data than you wanted (the BYTEs or WORDs preceding the string), so you calculate the number of bytes to skip and the bit mask to apply to the bits extracted by pmovmskb. A bsf on the extracted and masked bits will then give you the correct position of the first instance of the desired character. My test was a simple check for nulls in a buffer full of nulls.
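Here is a minimal sketch of the idea (the proc name, labels, and register choices are just for illustration; it assumes an SSE2 build with .xmm enabled, and it returns the offset of the first null relative to the passed start):

FindNull proc pString:DWORD
    mov     eax, pString
    mov     ecx, eax
    and     eax, -16            ; round the start down to a 16 byte bound
    and     ecx, 15             ; ecx = number of leading garbage bytes
    pxor    xmm1, xmm1          ; 16 zero bytes to compare against
    movdqa  xmm0, [eax]         ; aligned load - cannot cross a page bound
    pcmpeqb xmm0, xmm1          ; FFh in every byte position that held a null
    pmovmskb edx, xmm0          ; one bit per byte
    shr     edx, cl             ; throw away the bits for the garbage bytes
    shl     edx, cl             ; (an AND with a prebuilt mask would also work)
    bsf     edx, edx            ; position of the first null, if any
    jnz     found
nextBlock:
    add     eax, 16
    movdqa  xmm0, [eax]         ; all further loads are aligned
    pcmpeqb xmm0, xmm1
    pmovmskb edx, xmm0
    bsf     edx, edx
    jz      nextBlock
found:
    add     eax, edx            ; absolute address of the null
    sub     eax, pString        ; offset relative to the caller's start
    ret
FindNull endp

The shr/shl pair only runs once, before the scan loop, so it adds next to nothing to the total cost.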
Lingo and JJ Note:
Your code did an unaligned load, then masked to the next aligned bound, then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlapped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores the leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc buffer if you are checking for nulls.
All:
The same problem can occur if you use mov eax,[esi] to scan a string, checking for nulls as you go. Instead, force esi to the prior mod 4 bound, load the DWORD containing the desired starting character, and use the difference between the actual start and the aligned start to position eax (8 bits per skipped character), then rol eax,8 for each remaining character, checking al for a null. From then on, just adjust esi by 4 bytes and you will not overrun the buffer. You have to find some way to not overrun a VirtualAlloc buffer on the first check.
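A minimal GPR-only sketch along the same lines (not the exact shift/rol choreography described above - the name and register choices are hypothetical, and it assumes the terminating null is somewhere inside the buffer):

FindNullGPR proc uses esi pString:DWORD
    mov     esi, pString
    mov     ecx, esi
    and     esi, -4               ; back up to the prior DWORD bound
    and     ecx, 3                ; leading garbage bytes in the first DWORD
    mov     eax, [esi]            ; aligned load - never crosses a page bound
    mov     edx, 4
    sub     edx, ecx              ; real bytes present in this first DWORD
    shl     ecx, 3
    shr     eax, cl               ; drop the garbage bytes (little endian)
checkFirst:
    test    al, al
    jz      found
    shr     eax, 8
    dec     edx
    jnz     checkFirst
scanLoop:
    add     esi, 4                ; from here on, plain aligned DWORD scanning
    mov     eax, [esi]
    mov     edx, 4
checkByte:
    test    al, al
    jz      found
    shr     eax, 8
    dec     edx
    jnz     checkByte
    jmp     scanLoop
found:
    lea     eax, [esi+4]
    sub     eax, edx              ; address of the null
    sub     eax, pString          ; -> length of the string
    ret
FindNullGPR endp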
For scanning Unicode strings, you are dealing with 2 WORDs and not 4 BYTEs, so make the adjustments and just deal with it.
This test case is a good place to test out your favorite code fragment to insure that the first load of your routine will not overrun the buffer.
Dave.
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.
Unicode strings are no different to ANSI strings apart from being 2 bytes instead of one byte. Scan its length for a WORD-size 0 as the terminator. The alternative is an OLE string, where the length is stored in the 4 bytes below the start address.
Quote from: drizz on July 07, 2010, 01:42:51 AM
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.
I read some of the comments in your link. The common thinking was that it is faster to use the bad algos than worry about an exception once in a great while. In my case, I have actually supplied the very few instructions it takes to eliminate this possibility, and these could be executed just once, not in the main loop (I did imply that they were in the main loop by including the or edx,0FFFFh to force acceptance of all matches after the first, but this could be dropped and a separate compare loop coded that has no extra code). This fix only needs to affect the first access; all others use aligned accesses, which will not fault if you are checking for nulls.
You can have the best of both worlds, speed and safety.
Dave.
Quote from: hutch-- on July 07, 2010, 02:29:39 AM
Unicode strings are no different to ansi strings apart from being 2 byte instead of one byte. Scan its length for a word size 0 as terminator. The alternative is OLE string where the length is stored b 4 bytes below the start address.
Hutch,
I haven't yet tried this with the crt__ routines, but the code seems to be all WORD oriented for Unicode, so if a Unicode string is set to start on an odd BYTE bound, the crt__ routines should work. I have been looking at this with SSE in mind, including my initial check. With odd BYTE alignment, loading an xmm reg at an aligned boundary (to avoid buffer overrun) would leave the characters split across WORD lanes - not too good for pcmpeqw compares. Is such alignment allowable?
Dave.
Instead of guessing, load a unicode string on a 2 byte boundary and you never have the problem. 4 byte alignment is even better.
Hutch,
I agree, use aligned strings. But what should a generalized library function do if passed such a string?
I will test this and see if it will work at all for the CRT__ routines, and get back.
Dave.
I think that a generalized library should throw an exception on unaligned data, and yes, that would include 16-bit Unicode, which should of course be 2-byte aligned.
Quote from: KeepingRealBusy on July 07, 2010, 12:57:11 AM
Lingo and JJ Note:
Your code did an unaligned load, then masked to the next aligned bound, then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlapped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores the leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc buffer if you are checking for nulls.
Dave, your point is valid - see also this post for a concrete example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375). We did test the other method, i.e. ANDing the first address and masking out the leading bits, but I fail to remember why we didn't continue down that road. Maybe because the masking out costs some cycles? Does anybody have a better idea than a shr/shl pair?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
36 cycles for 10*shr/shl eax, cl
15 cycles for 10*shr/shl eax, 15
6 cycles for 10*and eax, nn
If you are thinking of using SSE instructions on Unicode data, make sure you use the required alignment: some SSE instructions require 16 BYTE alignment and will crash if the data is not 16-byte aligned. Check this on an instruction-by-instruction basis in the Intel manual.
Everything depends on the nature of the input data - strings, buffers, sizes - and on what is known or unknown.
An example is the standard copying functions, memcpy and memmove: some regions may or may not overlap.
In this case, what if we have an aligned or unaligned buffer as input? In some cases we cannot "and eax, -16" and use movdqa/movaps - we have to use movdqu/movups. And is the length known or unknown? If the first zero byte is the signal, then only byte access is allowed in that case. There are boundaries before and after - crossing the 16-byte alignment and cache lines.
String start =15
size = 19
String end = 33
How do you work this situation out? Two 16-byte boundaries are crossed (16 and 32). How do you load it? Using unaligned loads, the most common way, with tail-case processing, or with GPR processing for the head and tail, or something else?
Here is a procedure, WordAlign, that will align an unaligned (odd BYTE bound) Wide Character string.
I even tested with the wcs_ routines, at least CRT__wcscpy, CRT__wcscmp, CRT__wcschr, and they all worked correctly. I had examined the crt source code and it all appeared to be just Wide Character oriented. Where is any such restriction documented? It all seems to work on BYTE bounds.
This is not my final version of WordAlign, still some cleanup to remove stalls, but as coded here, the logic is more readable.
Enjoy,
Dave.
Why not just use SEH to check for faults? idk how slow SEH is, but it'd get'r done :U
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
i have noticed - that's the way C programmers do it - lol
they catch everything with an exception handler
Is there any way to force PROC stack variables to be 16 byte aligned? The best I can do is:
Local Var1:OWORD
Local Var2:OWORD
Local Var3:OWORD
movdqa OWORD PTR Var1,xmm0
movdqa OWORD PTR Var2,xmm1
movdqa OWORD PTR Var3,xmm2
This assembles and reserves space, but no attempt to align.
All I can see to do is:
Local Var1[16]:DWORD ; need 3 OWORDS (12 DWORDS) plus slack for alignment
lea edx,Var1
lea edx,[edx+15]
and edx,-16
movdqa OWORD PTR [edx], xmm0
movdqa OWORD PTR [edx+16], xmm1
movdqa OWORD PTR [edx+32], xmm2
Note: This takes 3 instructions just to set the address of the aligned variables, just to save doing an unaligned movdqu (and later the same for the restore).
Just one more question. I have been using the following command line option, "/Sg", which I found documented in Kip's book, 4th ed. This was for MASM 6.15. The option said "Turn on listing of assembly-generated code." MASM 9.0 and JWASM both accept the option. This is not documented in the current MSDN that came with MASM 9.0 (Visual Studio 8.0), nor in the help file in the MASM32 lib. Is this option just accepted and ignored, or is it just not documented? Does this option do anything?
that's strange
i don't see /Sg in my list (6.14)
to generate assembler listings, i have been using /Fl <--------- that's a lower case L
C:\ => ml /help
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997. All rights reserved.
ML [ /options ] filelist [ /link linkoptions ]
/AT Enable tiny model (.COM file) /nologo Suppress copyright message
/Bl<linker> Use alternate linker /Sa Maximize source listing
/c Assemble without linking /Sc Generate timings in listing
/Cp Preserve case of user identifiers /Sf Generate first pass listing
/Cu Map all identifiers to upper case /Sl<width> Set line width
/Cx Preserve case in publics, externs /Sn Suppress symbol-table listing
/coff generate COFF format object file /Sp<length> Set page length
/D<name>[=text] Define text macro /Ss<string> Set subtitle
/EP Output preprocessed listing to stdout /St<string> Set title
/F <hex> Set stack size (bytes) /Sx List false conditionals
/Fe<file> Name executable /Ta<file> Assemble non-.ASM file
/Fl[file] Generate listing /w Same as /W0 /WX
/Fm[file] Generate map /WX Treat warnings as errors
/Fo<file> Name object file /W<number> Set warning level
/FPi Generate 80x87 emulator encoding /X Ignore INCLUDE environment path
/Fr[file] Generate limited browser info /Zd Add line number debug info
/FR[file] Generate full browser info /Zf Make all symbols public
/G<c|d|z> Use Pascal, C, or Stdcall calls /Zi Add symbolic debug info
/H<number> Set max external name length /Zm Enable MASM 5.10 compatibility
/I<name> Add include path /Zp[n] Set structure alignment
/link <linker options and libraries> /Zs Perform syntax check only
wow - i am not using the version i thought i was :lol
and i spent all that time patching it, too
now, i hafta go patch 6.15
Dave,
I think the difference is that /Fl produces a listing (if you have .list in the source), but if a macro is used, output of generated code could be suppressed, unless /Sg was used. I have not tried to use /Sg without .list (I just did - /Sg does not override a missing .list). I will check for macro expansion. Note: Still do not know whether /Sg is actually supported, or just tolerated.
Dave.
The reference in Kip's book said the information "came from the last printed documentation from the MASM 6.11 reference manual", ... "with updates from the MASM 6.14 readme.txt file".
Dave
Dave,
Do you have any words of wisdom about stack alignment of OWORDS?
Dave.
It is listed in MASMREF.DOC that came with the Processor Pack for VC 6.0 that included ML.EXE 6.15.
Quote/Sg Turns on listing of assembly-generated code.
Regarding alignment, how about using malloc_align().
ok - patched 6.15 and put together a new ML615 package
it contains the document from VS6, as well as the ReadMe's and a few others
http://www.4shared.com/file/-QIUp-BF/ml615.html
be sure to read the ML_ver.txt file
QuoteDo you have any words of wisdom about stack alignment of OWORDS?
it's nice to be wanted, but i am probably not the guy to ask - lol
i would think that Jochen, MichaelW, qWord and the other guys are far more qualified to help on this one
the problem is - many of the guys that have experience writing alignment macros aren't using 64-bit machines :P
but, i would think that a creatively designed macro could replace the INVOKE macro/functionality of ml32 in ml64
these guys are good at macros - i bet they could write one that would align the stack in and out
Well now I know where I stand. :wink
I am not an expert at SSE, but what's wrong with ALIGN 16 to align the LOCALs?
lol Greg - i meant nothing like that at all
i just happen to know those guys have played specifically with alignment macros :P
and, i thought i covered my ass bases with...
Quote...and the other guys
dedndave, I'm confused by what the patch to /help does; /? and /help both show the list of switches for vanilla ML.EXE 6.15. Is /help meant to do something else? And does it do something else post-patch, because trying your patched ML.EXE, /? and /help still both show the list of switches.
Queue
be careful that you are executing the right copy of ML :bg
try ML615 /?
if you have ML in the path, you are looking at whatever version you have in the bin folder (probably)
i tested version 6.15.8803 and it fails for "/?", but works for "/help"
i didn't really patch any code on that
all i did was change the displayed string from
usage: ML [ options ] filelist [ /link linkoptions]
Run "ML /help" or "ML /?" for more info
to
usage: ML [ options ] filelist [ /link linkoptions]
Run "ML /help" for more info
for some reason, the parser sees "/?" as "/r"
the original (unpatched) 6.15.8803 displays the following...
C:\=> ml /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
MASM : warning A4018: invalid command-line option : /R
MASM : fatal error A1017: missing source filename
"?" is a filenaming wildcard character, i guess - lol
i dunno
i looked at the code that parses it to see about fixing it
it was over-complicated, if you ask me - lol
so, i just changed the displayed string
that is a good way to verify you have the patched version, i suppose
C:\Utils>ml /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
ML [ /options ] filelist [ /link linkoptions ]
/AT Enable tiny model (.COM file) /omf generate OMF format object file
/Bl<linker> Use alternate linker /Sa Maximize source listing
/c Assemble without linking /Sc Generate timings in listing
/Cp Preserve case of user identifiers /Sf Generate first pass listing
/Cu Map all identifiers to upper case /Sl<width> Set line width
/Cx Preserve case in publics, externs /Sn Suppress symbol-table listing
/coff generate COFF format object file /Sp<length> Set page length
/D<name>[=text] Define text macro /Ss<string> Set subtitle
/EP Output preprocessed listing to stdout /St<string> Set title
/F <hex> Set stack size (bytes) /Sx List false conditionals
/Fe<file> Name executable /Ta<file> Assemble non-.ASM file
/Fl[file] Generate listing /w Same as /W0 /WX
/Fm[file] Generate map /WX Treat warnings as errors
/Fo<file> Name object file /W<number> Set warning level
/FPi Generate 80x87 emulator encoding /X Ignore INCLUDE environment path
/Fr[file] Generate limited browser info /Zd Add line number debug info
/FR[file] Generate full browser info /Zf Make all symbols public
/G<c|d|z> Use Pascal, C, or Stdcall calls /Zi Add symbolic debug info
/H<number> Set max external name length /Zm Enable MASM 5.10 compatibility
/I<name> Add include path /Zp[n] Set structure alignment
/link <linker options and libraries> /Zs Perform syntax check only
/nologo Suppress copyright message
I double-checked, and even with your patched ML615.EXE, /? shows the switches, so I don't think I'm even encountering the problem you are. I have a vanilla 6.15 and a patched 6.15; the difference between my patched 6.15 and yours (ignoring the changes to the string you modified and the version number you tweaked) is a single byte at F6B8; in mine it's 0E, and in yours it's 7F. I'm mainly curious as to what that single byte difference is.
Queue
ok - i downloaded the original file from
http://win32assembly.online.fr/download.html
it is version 6.15.8803
is that what you have ?
the byte at offset F6B8 is 0E
i renamed ml.exe to mx.exe and ml.err to mx.err to avoid confusion with the patched copy...
C:\=> mx /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
MASM : warning A4018: invalid command-line option : /R
MASM : fatal error A1017: missing source filename
modifying that byte to 7F has no effect on the "/?" output
Yes, and I've done hex comparisons, byte-for-byte. I'm definitely working with the same version of ML as you.
Queue
Any alignment of the stack beyond the default 4 bytes will have to be done with some sort of code, a custom prologue for example, you can't just automatically align it.
If that was the case then we poor ml64 users would have it a lot easier...
Sinsi,
Thank you for the information. That is exactly what I did, but wondered what else could be done to eliminate the 3 instructions it takes to do this.
Dave.
Quote from: Greg Lyon on July 12, 2010, 04:59:15 AM
Well now I know where I stand. :wink
I am not an expert at SSE, but what's wrong with ALIGN 16 to align the LOCALs?
Greg,
ALIGN 16 is an assembly-time directive: it just adjusts the address at which the next data or instruction will be located, and in the case of instructions, it pads the code with dummy instructions that do not affect either the registers or the flags (instructions like lea ebx,[ebx]). What I needed was something to reserve space on the stack, and also something to select, at execution time, the actual stack address to use for an aligned store (movdqa). It turns out that you must do this yourself, since the stack can be aligned differently at each call.
I guess I will have to time this both ways, using code to align for a movdqa, or no code and use movdqu.
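In the meantime, here is how those three instructions could be wrapped up in a macro so the clutter only appears once - just a sketch, untested, and the macro name and the scratch register are arbitrary:

LocalAlign16 MACRO reg:REQ, localname:REQ
    lea     reg, localname
    add     reg, 15
    and     reg, -16                ;; reg -> first 16-byte aligned address inside localname
ENDM

SomeProc proc
    LOCAL buffer[63]:BYTE           ; 48 bytes for 3 OWORDs + up to 15 for alignment
    LocalAlign16 edx, buffer
    movdqa  OWORD PTR [edx], xmm0
    movdqa  OWORD PTR [edx+16], xmm1
    movdqa  OWORD PTR [edx+32], xmm2
    ; ... rest of the proc, restore with movdqa loads from the same addresses
    ret
SomeProc endp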
Dave.
Dave,
I see, I thought you wanted to align the OWORD data variables. Sorry, I misunderstood.
dedndave Dave,
I was kidding, hence the wink. I am not as sharp as I used to be, some of these guys run circles around me.
shoot, Greg - some of the newbies can make me look bad - lol
Well, here is my final version of WordAlign, actually KRBWordAlign (both functions are there; KRBWordAlign is tested). The KRBWordAlign version has the code shuffled around as much as possible (to remove stalls), which may confuse the reader; WordAlign might be easier to understand. The attached zip contains modified tests that test all 32768 different lengths of a wide character test string, all stuffed at the end of a VirtualAlloc buffer with an odd BYTE start. Three error return conditions are tested.
Dave.
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
that's not true, various protection methods are based on SEH, such as nanomites, page guards, some VM engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs. blowing up in your face, and it is used in detecting VMs like VMware. It's good for self-debugging as well.
Quote from: E^cube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
that's not true, various protection methods are based on SEH, such as nanomites, page guards, some VM engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs. blowing up in your face, and it is used in detecting VMs like VMware. It's good for self-debugging as well.
You are not *expecting* an exception in normal code execution with those things. You are expecting exceptions in *abnormal* execution in most of them, and in the other are trying to intentionally create abnormal behavior.
Rare events are not part of normal flow control. Exceptions don't help you debug anything when you are using them for flow control. They make it much, much harder. You could have checked the size of the buffer, but instead you waited for an exception... really? That's easier to debug?
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
Quote from: E^cube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
that's not true, various protection methods are based on SEH, such as nanomites, page guards, some VM engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs. blowing up in your face, and it is used in detecting VMs like VMware. It's good for self-debugging as well.
You are not *expecting* an exception in normal code execution with those things. You are expecting exceptions in *abnormal* execution in most of them, and in the other are trying to intentionally create abnormal behavior.
Rare events are not part of normal flow control.
Exceptions don't help you debug anything when you are using them for flow control. They make it much, much harder. You could have checked the size of the buffer, but instead you waited for an exception... really? That's easier to debug?
actually again that's not true; in the case of nanomites it intentionally puts exception code in the place of, say, calls, so that when the program comes across it, it throws an exception which the SEH handles, looks up the location in its database, and runs the correct code to continue on.
In terms of debuggers, how do you think one breaks on certain parts of code? It's not magic... it sets an int 3 on the address, which throws an exception when run, which the debugger automatically handles, allowing you to pause the program flow and see what's in the registers, etc...
and you're missing the point of all this. Ideally he should just check the size of the buffer, sure, but what if a buffer accidentally isn't given, and instead an integer is? CRASH! This is unacceptable in vital programs running at system level, such as services or programs that need to continue to run. His code could be incorporated in such a scenario is my point. Also it's not just about size; to give the example of wsprintf, if you have countless %s and very few string inputs, there's a crash right there which can run shellcode and all that nonsense - they did it with ollydbg.
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.
Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...
But I agree that, beyond ideology, you need a damn good justification to use SEH that way. By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
To be fair, the docs say (now, maybe not before?)
QuoteUsing this function incorrectly can compromise the security of your application. This function uses structured exception handling (SEH) to catch access violations and other errors. When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error. The caller is not safe to assume that insufficient space is the error condition.
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.
Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...
But I agree that, beyond ideology, you need a damn good justification to use SEH that way. By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
i've already explained why: if your program is of any importance as far as it running or being safe, then SEH is required, otherwise you just contribute to the countless "exploits" skiddies discover in badly written software, which leaves users more open to attackers and makes Windows look worse. Any kind of server software, any kind of service, or similar. Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
Quote from: sinsi on July 13, 2010, 07:16:40 AM
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
To be fair, the docs say (now, maybe not before?)
QuoteUsing this function incorrectly can compromise the security of your application. This function uses structured exception handling (SEH) to catch access violations and other errors. When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error. The caller is not safe to assume that insufficient space is the error condition.
Interesting: So they know it. On the other hand, it supports the "let it crash" line. Do you have an error check after each lstrcpy or lstrcat? I have 226 occurrences of lstrc*** in the RichMasm source, none has an error check. So their SEH is a safe recipe for an unchaseable bug that occurs in very rare circumstances - the chance is about 1:4096. Fortunately most if not all of these lstrc*** deal with short messages well below the size of a page, but imagine you use it for file handling on a huge number of files? And we are not talking about a hobby coder's algo here - lstrcpy is part of the OS, and it crashes silently because they decided to "handle" the exception ::)
well, instead of using lstrcpy you can rename it to lstrcpyx for all instances and just have it be your own function with SEH that tries to call lstrcpy ;D 2 second fix. But yeah, for a hobby coder or code being used in a program that has no importance, SEH isn't needed. SEH is also nice in debugging because you can log to a text file the register contents, the params passed when it crashed, and of course which function.
Well, it is a function, which returns a value, so some error-checking would be in order...
Does anyone check the return values of RegisterClassEx/CreateWindowEx? Only when your code doesn't show a window I'll bet, then you get rid of the check.
Return values are there for a reason, and how bad is it to branch after a "test eax,eax" anyway?
Then we can get one of those "unexpected error" or "internal error" message boxes :bdg
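Something as simple as this after each call would at least turn the silent failure into a visible one (per the docs quoted above, lstrcpy returns NULL when its internal SEH fires - the names and label are just for illustration):

    invoke  lstrcpy, ADDR dest, ADDR source
    test    eax, eax
    jz      CopyFailed              ; NULL return - the copy did not complete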
well - i have just seen several C examples where the exception was the rule (slap me if that's a bad pun - lol)
i am kind of an old-school guy, i know
but, i have always tried to write code so that errors don't happen to begin with
in the case of peripherals or some other hardware, of course, you can't always do that
but, generally, i try to test for and force correction of an error before allowing it to occur
here is a simple example:
i want to use the DIV instruction
prior to using it, i insure the dividend is within range and the divisor is non-zero
in most cases, the logic of the code is such that these error conditions cannot occur
if they do occur, i allow the user to alter input parameters to fix the problem or whatever steps are appropriate
if divide-by-zero does occur in my program, it indicates that i have a bug in my code
i don't use the exception handler to catch the mistakes - lol
now - that isn't very modern, perhaps
but, when i see exception handlers used that way, the term that comes to mind is "lazy coder"
i suppose there are many modern cases where my perception is off-base :P
Why do you need to use SEH if nobody except you will process the raised exceptions?
Your app is SEHed by Windows anyway before it starts - UnhandledExceptionFilter etc.
If it is designed to crash, it will crash anyway, with self-made SEH or without, under the system-made SEH.
Bad design plus an unhandled exception leads to a known result.
Windows is smart enough to handle such apps properly. So whether you use SEH or not, Windows will terminate your (or someone's) buggy app anyway.
Quote from: E^cube on July 13, 2010, 07:22:51 AM
Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
Quote from: MichaelW on July 13, 2010, 10:31:39 AM
Quote from: E^cube on July 13, 2010, 07:22:51 AM
Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
If you're just writing hobby code that you're not going to use in any program of importance, then you don't have to use SEH. When you write programs that many users use, however, and your lack of SEH puts the users' systems at risk, like the countless apps I've seen on the "exploits" lists, then that's a problem.
Quote from: MichaelW on July 13, 2010, 10:31:39 AM
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
IMHO the extra clocks are not the problem. The divide by zero is result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
Microsoft already gets enough heat from skiddies exploiting some of their API functions, in turn exploiting users' systems; that's in part why they created managed code and the safe API. I think lstrcpy not crashing is a good thing, all errors should be handled gracefully, because keep in mind the general users can barely check their email, much less know about programming etc... they don't want to see a program crash... it scares them.
Rockoon is right here, exception handling is for code that must deal with events that cannot be predicted at compile/assembly time - hardware, internet connections and the like. If something is not physically available then you must have a way to deal with the lack of response, but outside of those circumstances you should write code that does not have faults in it for its target market. Better to write suicide code that explodes in your face with an error than to have undebuggable junk that hides the problem.
Quote from: jj2007 on July 13, 2010, 11:40:09 AM
IMHO the extra clocks are not the problem. The divide by zero is result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
I agree. My point was that the few more clocks justification is not valid.
Quote from: MichaelW on July 13, 2010, 03:27:26 PM
Quote from: jj2007 on July 13, 2010, 11:40:09 AM
IMHO the extra clocks are not the problem. The divide by zero is result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
I agree. My point was that the few more clocks justification is not valid.
The justifications have already been pointed out, but let me reiterate and also clarify something for you: the clock cycles you posted do NOT increase incrementally. That is, you can write a great deal of additional code inside your exception handler, for example, and it wouldn't be astronomical in cycles just because it's in an exception handler - it's just the initial setup that takes the cycles. Also, I recommend VEH over SEH as it's a lot more intelligent, and I bet it's faster too.
As far as its use, I'm aware a lot of you are seasoned programmers, but this is definitely not the early 90's anymore; a lot has changed, including the programmer's responsibility to write safe/reliable code, not just in terms of your program using it, but from outside influences as I mentioned earlier. Also SEH, VEH etc... set to log function/code crashes is a very nice/fast way to narrow down bugs in your program, much faster than debugging, especially on x64 where there aren't a lot of good debuggers out yet. And thinking of the future, how great would it be for a user to be able to run your program on Windows 13 that you wrote for Windows XP, and have a log generated of the APIs/functions crashing so that you can easily write a fix :)
:lol By Windows 13 my computer will be fixing its own damn bugs
SEH is only needed when dealing with something system-wide or global - that is, shared among all processes: system settings, events as said, etc., which surely must be processed at app termination (cleanup time). The rest can be handled by checking return values and GetLastError, or by a separate thread and synchronization (wait/until) - the weirdest nonblocking case.
Quote from: E^cube on July 13, 2010, 04:38:08 PM
The justifications have already been pointed out, but let me reiterate and also clarify something for you: the clock cycles you posted do NOT increase incrementally. That is, you can write a great deal of additional code inside your exception handler, for example, and it wouldn't be astronomical in cycles just because it's in an exception handler - it's just the initial setup that takes the cycles. Also, I recommend VEH over SEH as it's a lot more intelligent, and I bet it's faster too.
The cycles I posted were for handling the exception. My test consumed ~3000 cycles if there was no exception, or ~14000 cycles if there was an exception.
Jeez, guys! I'm sorry I raised such a firestorm. I was just trying to insure that I wouldn't walk off of a VirtualAlloc buffer using SSE loads.
Dave.
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.
Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...
If the "good" checks cost more than the exception, then its a very rare event. You do know how costly exceptions are, right? :) First it goes to the OS's exception handler, then possibly it gets offloaded to yours, and then maybe back to the OS again for the ones still unhandled.
As far as the vast majority of page faults listed in task manager, they are being handled by the OS's virtual memory subsystem. I believe only two scenarios encompass the entire count:
memory that was swapped out
memory mapped files
Neither of these is handled by your application, so it actually falls under my other observation: not being handled by the local procedure.
If there are other faults included in the count, I'm all ears. I do not believe that the faults that your program catches are included in the count, but meh..
Ah, now here is the rub.
On the one hand we have code that is going to overshoot its buffer on purpose, and then from time to time it is going to just catch an exception if one is raised because it not only overshot the buffer, it also overshot the contiguous memory pages the buffer resides in.
On the other we have code that will divide by zero from time to time, where the programmer is going to just catch the exception if one is raised, and then execute default-value semantics, error out, or whatever.
One of these is not like the other. In the buffer case, not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too? Wow. Just wow. In the divide by zero case, it's accidental or incidental, and not on purpose.
One area where I normally just want to swallow exceptions is fsqrt. Luckily the FPU lets us do just that.
Quote from: Rockoon on July 14, 2010, 07:30:30 AM
Ah, now here is the rub.
On the one hand we have code that is going to overshoot its buffer on purpose, and then from time to time it is going to just catch an exception if one is raised because it not only overshot the buffer, it also overshot the contiguous memory pages the buffer resides in.
On the other we have code that will divide by zero from time to time, where the programmer is going to just catch the exception if one is raised, and then execute default-value semantics, error out, or whatever.
One of these is not like the other. In the buffer case, not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too? Wow. Just wow. In the divide by zero case, it's accidental or incidental, and not on purpose.
One area where I normally just want to swallow exceptions is fsqrt. Luckily the FPU lets us do just that.
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?
Dave.
Quote from: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?
Dave.
I'm saying that you normally shouldn't read 16 bytes from a buffer that ends in less than 16 bytes, and absolutely never read any bytes beyond the end of your allocation space without something having gone terribly wrong.
If you are intent on reading 16 bytes at a time then clearly performance is your concern. Since performance is your concern..
(A) align your strings so that they are 16-byte aligned.
(B) allocate space in 16-byte multiples so that you never overshoot your buffer.
(C) stop relying on NULL to terminate your strings. Store the length instead. You can still use a NULL to make them compatible with other routines.
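For (B), rounding a requested size up to the next 16-byte multiple before allocating is only a couple of instructions (a sketch, with the requested size assumed to be in eax):

    add     eax, 15
    and     eax, -16                ; eax is now the next multiple of 16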
i think that "overshoot" happens quite often
a lot of functions are passed string/buffer pointers without buffer size values
they assume the null terminator to be valid, i guess :bg
it probably also happens when functions try to dword align themselves inside a buffer
care isn't always taken to insure that accesses a few bytes above and below the buffer are avoided
when i wrote the ling long kai fang routines, i was extra careful to avoid this, and you have to specify both in and out sizes as parms
those routines dword-align themselves inside both buffers
i may have specified an aligned input buffer base - i don't remember at the moment
but, there is some code in there to avoid overshoot above the input value buffer
and some more code to avoid overshoot at the end of the output buffer
Quote from: Rockoon on July 14, 2010, 05:26:22 PM
Quote from: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?
Dave.
I'm saying that you normally shouldn't read 16 bytes from a buffer that ends in less than 16 bytes, and absolutely never read any bytes beyond the end of your allocation space without something having gone terribly wrong.
If you are intent on reading 16 bytes at a time then clearly performance is your concern. Since performance is your concern..
(A) align your strings so that they are 16-byte aligned.
(B) allocate space in 16-byte multiples so that you never overshoot your buffer.
(C) stop relying on NULL to terminate your strings. Store the length instead. You can still use a NULL to make them compatible with other routines.
You missed the operative sentence in my first post:
Quote
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
Dave.
Quote from: KeepingRealBusy on July 14, 2010, 06:03:00 PM
You missed the operative sentence in my first post:
Quote
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
Dave.
No, I didn't. A general purpose routine wouldn't be swallowing protection violations. That's decidedly not general purpose.
You have made the decision to not be general purpose when you started reading 16-bytes at a time.
Quote from: Rockoon on July 14, 2010, 06:11:53 PMA general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.
lstrcpy
is a general purpose routine... and I fully agree, it should not swallow protection violations
QuoteYou have made the decision to not be general purpose when you started reading 16-bytes at a time.
Although there are still a few non-SSE2 machines around, it might be time to declare reading 16-bytes at a time "normal".
Quote from: Rockoon on July 14, 2010, 06:11:53 PM
Quote from: KeepingRealBusy on July 14, 2010, 06:03:00 PM
You missed the operative sentence in my first post:
Quote
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
Dave.
No, I didn't. A general purpose routine wouldn't be swallowing protection violations. That's decidedly not general purpose.
You have made the decision to not be general purpose when you started reading 16-bytes at a time.
My routines:
Don't swallow protection violations.
Read 16 bytes at a time.
Require valid null terminated strings.
Handle both Wide character (Unicode) and normal character strings.
There is nothing that says that a 16 BYTE reading function cannot be a general purpose routine; they are not mutually exclusive.
What, exactly, do you not like about my routines?
Dave.
Quote from: jj2007 on July 14, 2010, 06:53:13 PM
Quote from: Rockoon on July 14, 2010, 06:11:53 PMA general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.
lstrcpy is a general purpose routine...
Not if it swallows protection violations.
Quote from: jj2007 on July 14, 2010, 06:53:13 PM
Although there are still a few non-SSE2 machines around, it might be time to declare reading 16-bytes at a time "normal".
It really isn't an issue of "support." This is about design. If you swallow the page faults, then you are special purpose.
My last post was in error, however, since clearly the routine could be constructed to only make aligned reads even for unaligned input (result: never cross a page boundary in error when valid data was supplied to it) and that would make it general purpose.
Quote from: KeepingRealBusy on July 14, 2010, 07:32:00 PM
What, exactly, do you not like about my routines?
I never said that I didn't like your routines. I never even looked at them prior to just now. I said that swallowing the page faults is not general purpose, which some posters seemed to consider a valid strategy (that a page fault wasn't an error, that the string could still have been terminated validly).
I stand corrected. I thought you were addressing your comments to my code and not to the other side discussion about SEH and faults.
What you see here of my code was only a little piece to handle unaligned (odd BYTE aligned) wide characters. From what I have learned, I will redo all my character routines, and implement wide character routines as well.
Dave.
So here is an attempt to start a "16-byte safe collection": good ol' string len.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
44 cycles for StrLen1 (safe)
47 cycles for StrLen2 (safe)
25 cycles for MasmBasic (unsafe)
132 cycles for MasmLib
44 cycles for StrLen1
47 cycles for StrLen2
25 cycles for MasmBasic
132 cycles for MasmLib
Results:
100 bytes for StrLen1
100 bytes for StrLen2
100 bytes for MasmBasic
Code sizes:
75 for StrLen1
75 for StrLen2
87 for MasmBasic
JJ,
I have looked at the code, and have one question. You pop the first 2 stack parameters into eax, leaving the return address on the stack, but it is unprotected by the esp value. I am not familiar with what happens during interrupts, but I do not think this is safe. If an interrupt comes in, where can the CPU save the current ip or any regs that need to be used?
Dave.
JJ,
A second problem. In the iteration loop you get two xmm regs (from [eax] and [eax+16]) without checking the first for nulls. This is not safe for the end of a VirtualAlloc buffer.
Dave.
Dave,
Thanks for looking at that.
Re lingo's pop-the-return-address technique: Interrupts seem not to be a problem, although it is apparently nowhere documented.
Re point 2: You are perfectly right, there is a risk at the end of a VirtualAlloc buffer. Any suggestions? The routine is already a bit slow ::)
try this
invoke  AddVectoredExceptionHandler, 1, handlexcept
;do everything...

handlexcept proc pExceptionInfo
    mov     edi, pExceptionInfo
    mov     eax, [edi].EXCEPTION_POINTERS.pExceptionRecord
    mov     edx, [edi].EXCEPTION_POINTERS.ContextRecord
    cmp     [eax].EXCEPTION_RECORD.ExceptionCode, STATUS_BREAKPOINT
    jne     @F
  ;cmp     [eax].EXCEPTION_RECORD.ExceptionAddress,    ; is it our code address
  ;jne     @F                                          ; if not, let others have a go
    add     [edx].CONTEXT.regEip, 1
    mov     eax, EXCEPTION_CONTINUE_EXECUTION
    ret
@@:
    mov     eax, EXCEPTION_CONTINUE_SEARCH             ; let others have a go
    ret
handlexcept endp
VEH is xp+ only but it's a beautiful thing, it gets exceptions before SEH and others do, and you can add as many different handlers as you like, but only 1 is really needed. When you do the EXCEPTION_CONTINUE_SEARCH it passes it on to the other handlers then onto SEH, etc... down the list.
just a thought, here - it may or may not offer a speed advantage
copy the buffer contents into a "safe" buffer that is known to have adequate tail-end space for the over-shoot
i know it takes time to copy, but at least you wouldn't have to test inside the loop
Quote from: jj2007 on July 15, 2010, 06:33:42 AM
Dave,
Thanks for looking at that.
Re lingo's pop-the-return-address technique: Interrupts seem not to be a problem, although it is apparently nowhere documented.
Re point 2: You are perfectly right, there is a risk at the end of a VirtualAlloc buffer. Any suggestions? The routine is already a bit slow ::)
As far as the pop two args from the stack, I remember Lingo doing:
pop ecx ; Get return.
pop eax ; Get first.
pop ebx ; Get second.
push eax ; Save relocated return.
You now have a protected return and two unprotected args but in eax and ebx. This will work, but don't count on the unprotected args on the stack. Another trick was
mov eax,[esp+4] ; Get arg
mov [esp+4],esi ; Save esi over the arg.
This will also work.
If you can wait for just a bit, I am working on an entire set of string routines that are safe and (mostly) SSE. Right now I have towlower towupper wcslwr wcsupr wcslwr_s wcsupr_s, and am working on wcscpy (a modification of WordAlign from my zip here), then wcslen then wcschr then wcscmp, then wcsstr, then the wcsn.... Then I'll work on the normal string versions. These are all for my own use, but I'll publish in a source zip for others to blatantly steal (right, Lingo, isn't that what they do to yours?).
I have a question about what to do with error returns such as the crt__ functions return. I was thinking about returning error codes in edx and the normal return in eax. The functions would end with an "or edx,edx" so that the caller could just "jz Good" or "jnz Bad". Since these are not CDECL, the flags would not be destroyed by INVOKE's add esp,n.
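Something like this is what I have in mind - just a sketch of the convention, with hypothetical names:

; at the end of each routine (edx = 0 on success, an error code otherwise)
    or      edx, edx                ; set the flags from the error code
    ret                             ; stdcall ret n does not touch the flags

; so the caller can branch directly after the call
    invoke  KRBwcscpy, ADDR dest, ADDR source
    jnz     BadCall                 ; edx <> 0 - an error occurred
    ; eax holds the normal return value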
I have even more questions about some of the crt_ comments in \crt\src like "the return string can be shorter or longer than the input string". Maybe for MBCS, but for Unicode?
The following are some of my times:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
558 cycles for wInstr (MasmBasic)
16444 cycles for StrStrIW
37 cycles for crt_towlower
10 cycles for KRBtowlower
1002 cycles for crt__wcslwr
723 cycles for KRBwcslwr
383 cycles for KRBwcslwr2
32 cycles for crt_towupper
10 cycles for KRBtowupper
576 cycles for crt__wcsupr
829 cycles for KRBwcsupr
411 cycles for KRBwcsupr2
--- done ---
Dave.
Quote from: KeepingRealBusy on July 15, 2010, 09:59:01 PM
As far as the pop two args from the stack, I remember Lingo doing:
pop ecx ; Get return.
pop eax ; Get first.
pop ebx ; Get second.
push eax ; Save relocated return.
Dave,
You probably meant push ecx, not eax.
However, is it really needed?
Q. Why does MS-DOS switch stacks for hardware interrupts? (http://support.microsoft.com/kb/82774)
QuoteAPPLIES TO Microsoft Windows 3.1 Standard Edition
http://en.wikipedia.org/wiki/Task_State_Segment#Inner_Level_Stack_Pointers
QuoteThe TSS contains 6 fields for specifying the new stack pointer when a privilege level change happens. The field SS0 contains the stack segment selector for CPL=0, and the field ESP0/RSP0 contains the new ESP/RSP value for CPL=0. When an interrupt happens in protected (32-bit) mode, the x86 CPU will look in the TSS for SS0 and ESP0 and load their values into SS and ESP respectively. This allows for the kernel to use a different stack than the user program, and also have this stack be unique for each user program.
http://stackoverflow.com/questions/866672/switching-stacks-in-c
QuoteOn 16-bit DOS, an interrupt could occur and this interrupt would be initially running on the same stack. If you got interrupted in the middle of the operation, the interrupt could crash because you only updated ss and not sp.
On Windows, and any other modern environment, each user mode thread gets its own stack. If your thread is interrupted for whatever reason, its stack and context are safely preserved.
JJ,
But how do you get into ring 0 from your program, and how does the system know how to get back to you? The CPU needs to save your return information somewhere, and that somewhere is your current stack; THEN it can swap the stacks and insure that the interrupt stack is enough for the processing.
Anyone else, am I wrong here?
Dave.
Yes.
Quote from: E^cube on July 16, 2010, 12:54:28 AM
Yes.
In what way? I mean, how does the hardware change the stack on the fly without destroying any registers?
Dave.
it doesn't, I just really felt like saying yes :) I apologize
Apology accepted, but not necessary.
Any experts around that understand and can explain a privilege level switch?
Intel manuals, especially volume 3a chapter 6.3 "task switching"
Quote from: sinsi on July 16, 2010, 02:23:55 AM
Intel manuals, especially volume 3a chapter 6.3 "task switching"
sinsi,
Thank you. I knew that someday I would have to go through all of this. About 40 pages of documentation and diagrams later (AMD PDFs), I can safely say that anything we are doing here will not be affected by a task switch. The first thing that happens is that the stack pointer is saved in the TSS (system) and loaded with an appropriate new stack pointer, then the flags and eip are pushed onto the NEW frame, then all regs are saved in the TSS. An opposite set of actions causes the task to be restarted.
Only something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.
So, JJ, your code is safe, and yes, I meant "push ecx", and I would use this instead of leaving the return address unprotected. With some of the MASM32 macros, I would not trust that some invocation wouldn't push a register for a calculation or a call and wipe out an unprotected return address ("print" comes to mind).
Dave.
Dave...
this topic has been beat to death a few times
it seems the members are split (50-50 ?) on this issue
some say it is ok to use space under [ESP] - some say it is not
the best we seem to do is - we agree to disagree :P
out of old-school habit, i avoid using stack space under the stack pointer
those who argue it is ok say that windows protects that space, as interrupts, other threads, etc, are never allowed to access it
you'll have to decide for yourself :bg
I've used parameters as storage before with no problems
myproc:
xchg ebx,[esp+4]
xchg esi,[esp+8]
...
pop ecx
pop ebx
pop esi
jmp ecx
I figure that if you reserve space (sub esp,xxx) it's yours but pushing params ([esp+x]) means they are fair game.
In the same way, anything below esp ([esp-x]) is undefined and likely to get zapped at some stage (is that what you mean by 'under [esp]' dedndave?), especially using a proc with a stack frame or simply forgetting what you did 50 lines ago :bdg
It's all personal, that's why we have the freedom of asm and not the constraints of a hll.
Quote from: KeepingRealBusy on July 16, 2010, 04:05:51 AMOnly something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.
So, JJ, your code is safe, and yes, I meant "push ecx", and I would use this instead of leaving the return address unprotected. With some of the MASM32 macros, I would not trust that some invocation wouldn't push a register for a calculation or a call and wipe out a unprotected return address ("print" comes to mind).
Dave.
Dave, thanks for reading this up in the "official" manuals. My Wiki quote on TSS said something similar, but Intel is a more reliable source.
So it boils down to "yes, you can do it but make sure you know what you are doing in that proc". And, for example, print obviously pushes parameters.
QuoteOnly something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.
hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?
Quote from: sinsi
I've used parameters as storage before with no problems
myproc:
xchg ebx,[esp+4]
xchg esi,[esp+8]
I would worry about the speed of XCHG on memory; it is atomic and exposes the speed of the underlying memory (DRAM).
On my 3 GHz Prescott, "xchg ebx,[ebp+8]" takes ~100 machine cycles, or 33 ns; the memory access speed is ~17 ns.
I know xchg is slow but how does a push/mov compare?
I also used esp, not ebp, would that make a difference?
It's all voodoo anyway eh?
Quote from: sinsi
I know xchg is slow but how does a push/mov compare?
They would go via the write buffer and cache. PUSH/POP pairs: figure 6 cycles. MOV EAX,[EBP+x] / XCHG EAX,EBX / MOV [EBP+x],EAX: also around 6 cycles (P4 Prescott) in some synthetic testing.
Quote
I also used esp, not ebp, would that make a difference?
No, XCHG reg,mem is intrinsically locked; ESP, EBP, etc. all perform the same.
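For reference, a sketch of the three variants being compared (frame-pointer addressing chosen just for illustration):
xchg ebx, [ebp+8]        ; one instruction, but implicitly LOCKed RMW - exposes DRAM latency
push ebx                 ; push/pop pair - no lock, goes through the write buffer and cache
mov ebx, [ebp+8]
pop dword ptr [ebp+8]
mov eax, [ebp+8]         ; mov/xchg/mov - also no lock, eax used as scratch
xchg eax, ebx            ; the reg,reg form of xchg never locks the bus
mov [ebp+8], eax
All three leave the old ebx in the stack slot and the old slot contents in ebx; only the first one pays the locked-RMW price.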
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:
182 cycles, (xchg reg,reg)*100
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100
183 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
Quote from: MichaelW
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100
How fast is the P3 running?
I'll note that the encoding for both is XCHG mem,reg
00000000 87 45 08 xchg eax,[ebp+8]
00000003 87 45 08 xchg [ebp+8],eax
00000000 874508 xchg [ebp+8],eax
00000003 874508 xchg [ebp+8],eax
Quote
How fast is the P3 running?
If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.
Quote
I'll note that the encoding for both is XCHG mem,reg
I did it both ways to see if there would be any significant difference in the cycle counts. On my P3 there wasn't, the difference in the results is within the run-to-run variation that is typical for cycle counts in the thousands.
I have not bothered to benchmark the following test piece, but from memory, within an algorithm XCHG was usually slow and could be replaced by MOV with a faster result. The 3 tests are mem-mem, reg-mem, and reg-reg, with the 1st being the slowest and the last the fastest. I have mainly seen this operation in exchange sorts (pointers or values), and usually XCHG is off the pace.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.data?
value dd ?
.data
item dd 0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL var1 :DWORD
LOCAL var2 :DWORD
push esi
push edi
; ---------
; mem - mem
; ---------
mov var1, 1234
mov var2, 5678
mov eax, var1
mov ecx, var2
mov var1, ecx
mov var2, eax
print str$(var1),13,10
print str$(var2),13,10
; ---------
; reg - mem
; ---------
mov esi, 1234
mov var1, 5678
mov eax, var1
mov var1, esi
mov esi, eax
print str$(esi),13,10
print str$(var1),13,10
; ---------
; reg - reg
; ---------
mov esi, 1234
mov edi, 5678
mov edx, esi
mov esi, edi
mov edi, edx
print str$(esi),13,10
print str$(edi),13,10
pop edi
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Quote from: dedndave on July 16, 2010, 04:42:09 AM
out of old-school habit, i avoid using stack space under the stack pointer
Are we sure that no debuggers trash the area under the stack?
I remember at one time back in the 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.
Quote from: MichaelW on July 16, 2010, 12:05:31 PM
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:
Prescott P4:
146 cycles, (xchg reg,reg)*100
9247 cycles, (xchg reg,mem)*100
9277 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
306 cycles, (exchange reg,mem)*100 using mov
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]
The latter are intermediate cases using the stack:
push edx          ; save old edx
mov edx, [ebx]    ; edx = old [ebx]
pop [ebx]         ; [ebx] = old edx
...
push [ebx]        ; save old [ebx]
mov [ebx], edx    ; [ebx] = edx
pop edx           ; edx = old [ebx]
Slower than exchange reg,mem using mov but a lot faster than xchg.
Quote from: Rockoon on July 16, 2010, 03:22:58 PM
Are we sure that no debuggers trash the area under the stack?
I remember at one time back in the 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.
In the 16-bit RM days hardware interrupts would use whatever stack was active when the interrupt occurred.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
.data?
value dd ?
.data
item dd 0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
call main
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL var1 :DWORD
LOCAL var2 :DWORD
push esi
push edi
invoke Sleep, 4000
; ---------
; mem - mem
; ---------
mov var1, 1234
mov var2, 5678
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov eax, var1
mov ecx, var2
mov var1, ecx
mov var2, eax
ENDM
counter_end
print str$(eax)," cycles, mem - mem",13,10
;print str$(var1),13,10
;print str$(var2),13,10
; ---------
; reg - mem
; ---------
mov esi, 1234
mov var1, 5678
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov eax, var1
mov var1, esi
mov esi, eax
ENDM
counter_end
print str$(eax)," cycles, reg - mem",13,10
;print str$(esi),13,10
;print str$(var1),13,10
; ---------
; reg - reg
; ---------
mov esi, 1234
mov edi, 5678
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov edx, esi
mov esi, edi
mov edi, edx
ENDM
counter_end
print str$(eax)," cycles, reg - reg",13,10
;print str$(esi),13,10
;print str$(edi),13,10
pop edi
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Running on a P3:
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on an old Athlon 1.3 GHz:
147 cycles, (xchg reg,reg)*100
1630 cycles, (xchg reg,mem)*100
1631 cycles, (xchg mem,reg)*100
148 cycles, (exchange reg,reg)*100 using mov
270 cycles, (exchange reg,mem)*100 using mov
406 cycles, (exchange reg,mem)*100 using pop [ebx]
406 cycles, (exchange reg,mem)*100 using push [ebx]
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a Core2Duo 2.8 Ghz:
219 cycles, (xchg reg,reg)*100
1842 cycles, (xchg reg,mem)*100
1835 cycles, (xchg mem,reg)*100
184 cycles, (exchange reg,reg)*100 using mov
299 cycles, (exchange reg,mem)*100 using mov
507 cycles, (exchange reg,mem)*100 using pop [ebx]
507 cycles, (exchange reg,mem)*100 using push [ebx]
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a P4 2.8 Ghz:
146 cycles, (xchg reg,reg)*100
9271 cycles, (xchg reg,mem)*100
9158 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
312 cycles, (exchange reg,mem)*100 using mov
1005 cycles, (exchange reg,mem)*100 using pop [ebx]
497 cycles, (exchange reg,mem)*100 using push [ebx]
Why would xchg mem,reg be so much more costly on a P4?
Queue
that has always been that way - even on the 8088
i am a little surprised to see the xchg reg,reg comparison, though
for a long time, i have used XCHG EAX,reg32 (AX,reg16 in DOS) because it is a single byte op-code
still - it doesn't compare too badly against MOV
i see the test uses XCHG EDX,ECX - a 2-byte instruction
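for anyone curious, the encodings in question (bytes hand-assembled from the opcode maps; an assembler's listing may pick the other ModRM form for the two-byte cases):
91        xchg eax,ecx   ; accumulator form, one byte (90h + reg)
87 CA     xchg edx,ecx   ; general form, 87 /r, two bytes
8B D1     mov  edx,ecx   ; for comparison, MOV reg,reg is also two bytes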
Quote from: MichaelW
Quote
How fast is the P3 running?
If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.
Trying to quantify the memory speed. The number of cycles relates to one SDRAM READ followed by a WRITE, occurring back-to-back at the same address across the entire bit-line width of the memory subsystem. In your case that's about 19 ns for the READ and 19 ns for the WRITE, say 52 MHz.
Quote from: Queue
Why would xchg mem,reg be so much more costly on a P4?
As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW), and a serializing event. Therefore the processor must complete/retire all pending operations (OOO, pipeline), entirely flush the write buffers (at whatever depth it has) in the CPU, flush out everything pending/deferred to memory in the chipset, and then complete an indivisible READ (setting up addresses, with CAS/RAS latencies) followed by a WRITE. This is pretty much the worst case for synchronous memories (SDRAM, DDRAM, RAMBUS, etc), exposing the nasty CL (CAS Latency) numbers printed on the DIMMs.
In order to allow the processor to speed along, most everything sent to memory is buffered/deferred/delayed to write back in a lazy manner, prioritizing prefetching/cache-line reads so as not to stall forward motion of the processor.
It's not so much a cycles issue as a time issue.
Which P4 core is that? A Northwood?
Celeron M timings:
165 cycles, (xchg reg,reg)*100
1910 cycles, (xchg reg,mem)*100
1910 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
495 cycles, (exchange reg,mem)*100 using pop [ebx]
495 cycles, (exchange reg,mem)*100 using push [ebx]
Note the symmetry of the last two, in contrast to the Prescott P4:
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]
Quote from: clive on July 16, 2010, 07:44:28 PM
Trying to quantify the memory speed.
PC133 SDRAM and IIRC I set it up to use the fastest supported timings.
dedndave,
Quote
hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?
No, this is not a quote, but an observation from the documentation. If you are at user task level, anything you do to get back to the OS must cause a task switch, and this automatically saves your stack pointer (and selector) in the TSS, then loads the stack pointer and selector with appropriate values depending on the reason for the switch (fault, interrupt, call), then starts saving your IP on the NEW (system) stack. Anything that happens to YOUR stack must happen at user task level, i.e., push, pop, mov and call (to your local procedure). With multiple threads, I believe, each thread has its own stack.
Could the OS possibly alter that? The OS is capable of putting anything anywhere in memory once it gets control. With a single core, a task switch must happen for the OS to get control, but with multi-core the other core may be running the OS. Yes, it could change something while you are running. Are we talking virus conditions here? If so, anything could happen; otherwise, I doubt it will. To quote "Pogo": "We have met the enemy, and he is us."
Watch where you step, it gets pretty deep in some places.
Dave.
Quote from: clive on July 16, 2010, 07:44:28 PM
As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW), and a serializing event. ...
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?
Dave.
as i mentioned, i am from the old-school side of the fence on this issue
so far, i have seen no harm come from using the stack that way
but, it seems to me that leaving the barn door open doesn't mean the horses are going to leave :P
i feel more comfortable by adjusting the stack pointer
and, let's face it - it doesn't cost that much in terms of code size or clock cycles
I totally agree.
Dave.
Hi,
Quote
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?
Those that can be used with the LOCK prefix come to mind.
And those are RMW instructions. An old reference mentions
the following.
BT, BTC, BTR, BTS mem, reg/imm
XCHG reg, mem
ADD, ADC, AND, OR, SBB, SUB, XOR mem, reg/imm
DEC, INC, NEG, NOT mem
Regards,
Steve N.
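Worth stressing, with a small sketch ([ebx] pointing at some DWORD variable): the instructions in that list only lock the bus when the LOCK prefix is actually written, XCHG with a memory operand being the one exception:
inc dword ptr [ebx]         ; plain RMW - not atomic across CPUs, no lock penalty
lock inc dword ptr [ebx]    ; explicitly locked - atomic, and much slower
xchg eax, [ebx]             ; the exception: locked even without a prefix
xchg eax, ecx               ; register-only forms never lock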
Quote from: FORTRANS on July 17, 2010, 01:57:34 PM
Those that can be used with the LOCK prefix come to mind. And those are RMW instructions. ...
Thank you, thank you, thank you. I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"
Dave.
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"
You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink
Quote from: jj2007 on July 17, 2010, 04:27:27 PM
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"
You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink
JJ,
I think the restriction is on the mem, reg/imm forms; reg,reg should be ok. Another thing I have to read up on in the specs.
Dave.
Here is a snippet:
counter_begin 1000, HIGH_PRIORITY_CLASS
lea ebx, mem
REPEAT 100
inc dword ptr [ebx]
ENDM
counter_end
print ustr$(eax)," cycles, (inc mem)*100",13,10
... and various results:
170 cycles, (xchg reg,reg)*100
1909 cycles, (xchg reg,mem)*100
1909 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
307 cycles, (exchange reg,mem)*100 using mov
494 cycles, (exchange reg,mem)*100 using pop [ebx]
499 cycles, (exchange reg,mem)*100 using push [ebx]
594 cycles, (and mem)*100
594 cycles, (or mem)*100
594 cycles, (inc mem)*100
594 cycles, (inc dec mem)*100
594 cycles, (inc mem)*100
xchg seems to be the worst case.
I think the ability of an instruction to take a lock prefix is not the problem; it's the presence of the prefix, or, for XCHG mem, reg, the implicit lock.
Per the Intel manual:
"If a memory operand is referenced, the processor's locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL."
Quote from: jj2007 on July 17, 2010, 04:49:50 PM
Here is a snippet: ... xchg seems to be the worst case.
JJ, could you post the .zip, I'll try on my AMD. Dave
;==============================================================================
include \masm32\include\masm32rt.inc
.586
include \masm32\macros\timers.asm
;==============================================================================
.data
mem dd 0
.code
;==============================================================================
start:
;==============================================================================
invoke Sleep, 3000
REPEAT 3
counter_begin 1000, HIGH_PRIORITY_CLASS
lea ebx, mem
REPEAT 100
inc dword ptr [ebx]
ENDM
counter_end
print ustr$(eax)," cycles, (inc mem)*100",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
lea ebx, mem
REPEAT 100
lock inc dword ptr [ebx]
ENDM
counter_end
print ustr$(eax)," cycles, (lock inc mem)*100",13,10
ENDM
inkey "Press any key to exit..."
exit
;==============================================================================
end start
627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
638 cycles, (inc mem)*100
2246 cycles, (lock inc mem)*100
627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
Quote from: KeepingRealBusy on July 17, 2010, 05:23:09 PM
JJ, could you post the .zip, I'll try on my AMD. Dave
Here it is.
JJ,
Here are my timings. I added my cpuid for identification - (why else would I add it?):
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
144 cycles, (xchg reg,reg)*100
1853 cycles, (xchg reg,mem)*100
1819 cycles, (xchg mem,reg)*100
149 cycles, (exchange reg,reg)*100 using mov
506 cycles, (exchange reg,mem)*100 using mov
553 cycles, (exchange reg,mem)*100 using pop [ebx]
552 cycles, (exchange reg,mem)*100 using push [ebx]
793 cycles, (and mem)*100
777 cycles, (or mem)*100
819 cycles, (inc mem)*100
792 cycles, (inc mem)*100 using eax
808 cycles, (inc dec mem)*100
687 cycles, (inc mem)*100
I tried to add a call to crt__wcslwr_s to my timings. First, it would not assemble. I then added the declaration to msvcrt.inc (a copy of the crt__wcslwr line with _s added); that fails to link. The source in VS at \vc\crt\src shows both of these functions defined in the same source module; both actually call a common subfunction, crt__wcslwr passing -1 for the size parameter and crt__wcslwr_s passing the supplied size.
Why does crt__wcslwr_s not exist?
Dave.
if you look inside masm32\include\msvcrt.inc...
externdef _imp___wcslwr:PTR c_msvcrt
crt__wcslwr equ <_imp___wcslwr>
try adding this to the beginning of your program
externdef _imp___wcslwr_s:PTR c_msvcrt
crt__wcslwr_s equ <_imp___wcslwr_s>
I did this, and got around the assembly error, but right into the linker error.
It must be present in \masm32\lib\msvcrt.lib, too (and it's not there).
Add it to \masm32\tools\makecimp\msvcrt.txt...
JJ,
Thank you. I knew that someone here would have the answer.
Dave
oh i see - he's just making up names :lol
It seems Hutch uses \masm32\tools\makecimp\makecimp.exe to modify the crt library.
I think this is all correct:
;==============================================================================
include \masm32\include\masm32rt.inc
.586
include \masm32\macros\timers.asm
;==============================================================================
.data
hmsvcr80 dd 0
ws dw "A","B","C",0
.code
;==============================================================================
start:
;==============================================================================
invoke LoadLibrary, chr$("msvcr80.dll")
mov hmsvcr80, eax
print hex$(eax),13,10
invoke GetProcAddress, hmsvcr80, chr$("_wcslwr_s")
mov esi, eax
print hex$(eax),13,10
invoke crt_printf,cfm$("%S\n"), ADDR ws
push SIZEOF ws
push OFFSET ws
call esi
add esp, 8
invoke crt_printf,cfm$("%S\n"), ADDR ws
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
invoke crt__wcslwr, ADDR ws
counter_end
print str$(eax)," cycles",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
push SIZEOF ws
push OFFSET ws
call esi
add esp, 8
counter_end
print str$(eax)," cycles",13,10
invoke FreeLibrary, hmsvcr80
print str$(eax),13,10,13,10
inkey "Press any key to exit..."
exit
;==============================================================================
end start
78130000
7817FCAB
ABC
abc
101 cycles
181 cycles
1
It should be no big deal to create an import library for msvcr80.dll.
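One way (a sketch, assuming you first write a module definition file listing the exports; the name msvcr80.def below is hypothetical) is to let Microsoft's LIB tool build the import library:
lib /def:msvcr80.def /out:msvcr80.lib /machine:x86
After that, an includelib msvcr80.lib plus the externdef/equ lines dedndave showed should resolve at link time.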
I don't know whether I blew it or not. I had tried to rebuild the library using "make" in \m32lib. Two modules had 2 errors, fptoa.asm and fptoa2.asm. I fixed the first error in each module from
fbstp [esp]
to
fbstp TBYTE PTR [esp]
That cleared up the first error in each, but the second error remains. A2006 undefined symbol PowerOf10. There is a proto for PowerOf10 in these modules, but the error remains.
Any good thoughts? Bad thoughts?
Dave.
OBTW, the make file has an error check for the assembly step, but apparently MASM does not set the error code for assembly errors when using a response file. I think I remember reporting this to MS long ago, but will have to dig to see what their response was.
Dave.
JJ,
> It seems Hutch uses \masm32\tools\makecimp\makecimp.exe to modify the crt library.
No, the tool just makes an import library that avoids naming conflicts with MASM reserved words. The content is determined by MSVCRT.DLL.
Now, with the function that KeepingRealBusy wants, the idea is to test if it's there first; you do this with the normal LoadLibrary() and GetProcAddress() and see if you can call it that way. If so, then you simply add the name to the import list and build the import library.
Well, I modified the crt library with makecimp.exe. No assembly or link problems now. But when I try to execute the .exe, I get "The procedure entry point _wcslwr_s could not be located in the dynamic link library msvcrt.dll."
Now what?
Dave.
Dave,
MichaelW found the reason: This is msvcr80.dll...
Check to see if the function you are after is in the standard MSVCRT.DLL
The attachment includes an import library and include file for msvcr80.dll and a small test app. Starting with the full module definition file (see msvcr80_full.def), I removed a number of exports that did not appear to me to be usable/useful. That left 1349 exports, versus 730 for msvcrt.lib. Note that I modified the generated include file substituting "cr8_" for "crt_" to avoid conflicts.
There is a msvcr80.dll version 8.0.50727.1433 on my Windows 2000 system. The file is dated Wednesday, October 24, 2007 so I have no idea how it got there, or if it will be present on other systems.
Hutch, Michael,
I have the zip, will try it.
Dave.
After successful build, test.exe gives me a Visual C++ Runtime Library error:
R6034
An app has made an attempt to load the C runtime library incorrectly.
There is an R6034 and that same error message in the DLL. The app runs with no problem under Windows 2000. How did you build the EXE, and what happens with my previous example where I used run-time dynamic linking?
I think the solution is here (http://msdn.microsoft.com/en-us/library/ms235560.aspx), but I have no way to test it short of installing Windows XP on a spare system.
Both versions show the same error. The msvcr80.dll sits in the same folder and is version 8.0.50727.42 of 23 sept 2005
msvcr80.dll is the C run-time library for Visual C++ 2005. It is a "side-by-side" DLL. Your executable should have a manifest to use it.
If you ask me, Microsoft really screwed things up when they went this route. It's the same situation in VC 2008 and VC 2010.
Including a manifest eliminated the R6034 message, but now I get:
"The application failed to initialize properly (0xc0000142). Click..."
0xc0000142 is STATUS_DLL_INIT_FAILED
Any ideas?
You need to run mainCRTStartup? ... I don't know.
I ran your program (http://www.masm32.com/board/index.php?topic=14353.msg115460#msg115460), MichaelW, and I was surprised when it ran with no errors for me. I either use MSVCRT (usually) or the MSVCRxxx DLLs; I have never tried mixing them, as I figured it would just be too problematic.
I know this is an interesting process, and I am interested in its conclusion. At least I (we) will know how to do this in the future.
BUT, I was just trying to drain the swamp! I do not really need crt__wcslwr_s. I was just trying to time it for comparison to my functions.
Dave.
I think the problem may be with the manifest that I am using, but since I don't have VS I have no good idea what the manifest should contain for this specific purpose, and I just don't have enough interest to delve very far into Microsoft's ridiculous manifest thing.
The manifest needs to contain the following:
<dependency>
<dependentAssembly>
<assemblyIdentity type='win32' name='Microsoft.VC80.CRT' version='8.0.50608.0' processorArchitecture='x86' publicKeyToken='1fc8b3b9a1e18e3b' />
</dependentAssembly>
</dependency>
The version must match your DLL version exactly and the publicKeyToken changes with the version.
I agree, it is ridiculous.
So basically it looks like, for Windows XP and later, to use the DLL you must have VS.
MichaelW,
Not necessarily, you just need the correct entry in the manifest.
[Edit] And having VC++ makes things a lot easier. :wink
Useless DLL. I dread even having to link to msvcrt.dll occasionally to cater to C code. We just need equivalent ASM functions for all the C ones to save the hassle. Unfortunately, while MASM32 and other ASM SDKs are great, they still lack a lot of formatting functions.
Cube,
The trick is to use MSVCRT, not the later side-by-side DLLs. MSVCRT has been a standard "known" DLL since Win9x, whereas the side-by-side versions are all over the place on later versions.
E^cube,
I really disagree. msvcrt.dll is a standard system DLL and has a ton of useful and time-tested functions. If they meet your needs, there is no reason to not use them. Plus, if you have ever programmed C, you are already familiar with them. I would be more inclined to use a CRT function than to use some function Joe Blow came up with.
Greg,
I disagree with your disagreement. msvcrt is fine in the C/C++ world, but we're part of the ASM world, gentlemen, a better, more efficient world where mediocrity is left at the door. Sure, if you just want sub-par functions you can use msvcrt, but myself, I'd rather use a blazing fast hand-written ASM equivalent. Why? Well, partly because I enjoy the speed, but also because it's green technology: faster speed means fewer clock cycles, which means more energy conserved.
..and how much energy are you burning while reinventing the wheel?
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
What!!!!!! Who invented that????
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
Far less than if I were doing it in a bloated C/C++ IDE... The fact of the matter is there are a billion C/C++ forums and sites; if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners where ASM still flourishes.
the msvcrt does some under-the-hood stuff that is difficult to emulate or replace
partly because it has been around long enough to have the bugs worked out
partly because ms may have used proprietary knowledge in some of the code
most of the functions i have played with seem to compare well, performance wise, against anything i can write
all in all, it's fast and well-behaved - and.... it's already written !!! :P
i suppose, now that most CPUs support SSE2, it could be improved upon for some time-intensive functions
but - i think most of that has been (or is being) hashed over in the forum
E^cube,
I love programming with MASM too, but that's no reason to exclude other languages. I like to program in C, (Power)BASIC and, shudder, oh my God, C# and even PowerShell. The idea that since you're an ASM programmer you must exclude all other programming languages just doesn't sit well with me at all. The CRT functions are far from mediocre; they are usually slower than ASM-written functions, but that's because they do a lot of error checking etc. that the ASM functions don't do. Think about it, these functions have been beaten to death and tested over the years. They are very reliable and very stable functions. It has been very infrequent that I have needed the speed of a hand-tuned assembly procedure, but when I do, I can do it. ASM and C just go together; most C compilers include an assembler. You have every right to use only ASM and no CRT functions, knock yourself out, but don't push your limitations on everyone else.
Quote from: Greg Lyon on July 22, 2010, 04:26:11 AM
E^cube,
I love programming with MASM too, but that's no reason to exclude other languages. ... You have every right to use only ASM and no CRT functions, knock yourself out, but don't push your limitations on everyone else.
i'm not pushing "my limitations" on anyone, i'm simply voicing my opinion, as you are. And the fact is, you and I are a completely different breed: myself, i'm an ASM warrior to the core, unrelenting dedication and devotion to the language I love. You, as stated, are not; you're more of a neutral programmer. The problem is this is not a time of neutrality, this is a time of war, language war. Microsoft has already fired the first shot by horrifically crippling masm64, and by completely removing inline asm support in visual studio x64. In response the community has fired back with projects like jwasm, fasm and goasm, which all work great on x64. So you enjoy hopping around to different languages, that's your prerogative, but don't push your nonchalant attitude on those who just want to code in ASM, on an asm forum...
Quote from: E^cube on July 22, 2010, 04:04:40 AM
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
Far less than if I were doing it in a bloated C/C++ IDE... The fact of the matter is there are a billion C/C++ forums and sites; if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners where ASM still flourishes.
sigh...
why so defensive all of a sudden?
...was it because you don't really write in asm to be "green"?
I mean, it's a good "point"... but it's not why you write in ASM. Really. It's not. You know it. I know it.
Quote from: E^cube on July 22, 2010, 04:51:16 AM
... In response the community has fired back with projects like jwasm, fasm and goasm, which all work great on x64. ...
Come on now, fasm, and goasm were not started for the reasons you are declaring. In fact, they were started before those reasons could even have existed.
Quote from: Rockoon on July 22, 2010, 06:44:42 AM
sigh...
why so defensive all of a sudden?
...was it because you dont really write in asm to be "green?"
I mean, its a good "point" .. but its not why you write in ASM. Really. Its not. You know it. I know it.
Actually, in part it is: i'm a big fan of the green - money, green tea, various plants... and yes, green tech. i'm defensive because people are on the offensive.
Also, Jeremy is a smart guy; he recognized the ASM community had a large void that he could fill with Goasm, and chose to do so. That may not be his primary reason, but he took the time to write all that documentation etc., and spent years of his life on it, all for the public, so clearly he cares. Unlike most C/C++ GPL projects, where the authors are disrespectful and rude, looking only for self-promotion and most of the time donations, Jeremy on the other hand has asked for nothing, and IMO deserves everything. RadASM and EasyASM are similar. These are prime examples of the power of ASM, the power of devoted developers and users who don't give up when faced with difficult engineering challenges. These are the kind of people I respect and appreciate. :thumbu
Folks,
Can we avoid the "assembler wars" here, different people have their reasons for using different tools, I simply respect their choice and don't inflict this stuff on them.
;-)
Just in case somebody is still interested in the original topic: I have made the Recall macro "SSE2 safe".
Quote
include \masm32\MasmBasic\MasmBasic.inc ; Download (http://www.masm32.com/board/index.php?topic=12460.0)
Init
Recall "\masm32\include\windows.inc", MyArray$(), -1, lc
Print Str$("%i lines found in Windows.inc\n", lc)
For_ n=0 To Min(lc-1, 15)
mov ecx, n ; we need some proof that
lea ecx, [2*ecx+27] ; this is assembler ;-)
Print Str$("\nLine %i\t", n+1), Left$(MyArray$(n), ecx)
Next
Exit
end start
Output:
22272 lines found in Windows.inc
Line 1 comment * -=-=-=-=-=-=-=-=-
Line 2
Line 3 WINDOWS.INC for 32 bit MA
Line 4
Line 5 This version is compatible wi
Line 6
Line 7 Project WINDOWS.INC at the Masm F
Line 8
Line 9 http://www.masm32.com/board/index.php
Line 10
Line 11 WINDOWS.INC is copyright software licence
Line 12 MASM32 project. It is available completely
Line 13 for any person to use for purposes including
Line 14 commercial software but the file must not be in
Line 15 commercial package and the file may not be redist
Line 16 without express permission from the MASM32 project.
> This version is compatible with ML.EXE Version 8.0
Hutch,
It's compatible with 6.14, 6.15, 9.0 and JWasm, too.
JJ,
> It's compatible with 6.14, 6.15, 9.0 and JWasm, too.
Do you think you can expand on this? I use all versions of ML.EXE from 6.14 to 10.0 and don't have any problems with the current form of Windows.inc. Noting that I don't get paid for maintaining this file, perhaps you could share with me what the problem is.
> Noting that I don't get paid for maintaining this file, perhaps you could share with me what the problem is.
Cool down young friend :bg
I just wanted to encourage you to mention that it is compatible not only with 8.0. It's kind of a compliment, Sir Hutch :U
:bg
After midnight readings, :U
E^cube,
You should probably avoid the Windows API also, as it was written mostly in C.
Quote from: Greg Lyon on July 22, 2010, 05:28:11 PM
E^cube,
You should probably avoid the Windows API also, as it was written mostly in C.
Hutch has spoken, that means you don't, unless it's about the thread topic. Thanks
Quote
After successful build, test.exe gives me a Visual C++ Runtime Library error:
R6034
An app has made an attempt to load the C runtime library incorrectly.
After 2 weeks nobody is interested in one version of MSVCR80.DLL, so... :P