News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Possible problems with SSE usage.

Started by KeepingRealBusy, July 07, 2010, 12:57:11 AM

Previous topic - Next topic

KeepingRealBusy

I was looking at the SSE conversions we were developing and thought about some of the consequences of using SSE (or a general purpose register like eax) when scanning a string instead of just a BYTE (or WORD for Wide Character Unicode).

I wanted to check what would happen at the end of a buffer allocated by VirtualAlloc. I created this small test case and verified that movdqu from the start of a short (in my case 7 BYTES) string at the end of the buffer would fault before you can check for a null. I then add the code to allow this to work correctly.

Note:

This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.

If you allocate a VirtualAlloc Buffer, and reserve the last 16 BYTES and do not put any data there (not even the last trailing null), then you will never overrun the buffer - you will always find the null first. Thus, you do not need this type of initial checking.

Note further:

This test case is coded for BYTE compares, but the same solution can be used when checking 8 wide characters in an xmm reg.

The method:

The plan is to force the first load to be at a mod 16 bound by ANDing the real start by -16. This will never overrun a VirtualAlloc Buffer because the start is always on a mod page size bound, and the length is always a mod page size length. Page size is 4KB with XP (maybe larger on 64 bit systems) so as long as you start at a mod 16 bound within the buffer, you will never overrun the buffer. This gets you more data than you wanted (the BYTES or WORDS preceeding the string) so you can calculate the number of bytes to skip and the bit mask to use on the extracted bits from pmovmskb. Using a bsf on the extracted and masked bits in the register will then give you a correct value or the first instance of the desired character. My test was a simple check for nulls in a buffer full of nulls.

Lingo and JJ Note:

Your code did an unaligned load, then masked to the next aligned bound then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlaped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc Buffer if you are checking for nulls

All:

The same problem can occur if you use mov eax,[esi] to scan a string, checking for nulls as you go. Instead, force the esi to the prior mod 4 bound, load the DWORD containing the desired starting character, and use the difference of actual start vs aligned start to position eax left 8 bits per skip character, and then rol eax,8 for each remaining character to be checked in al looking for a null. From then on, just adjust esi by 4 bytes and you will not overrun the buffer. You have to find some way to not overrun a VirtualAlloc buffer on a first check.

For scanning a Unicode strings, you are dealing with 2 WORDS and not 4 BYTES, so make adjustments and just deal with it.

This test case is a good place to test out your favorite code fragment to insure that the first load of your routine will not overrun the buffer.

Dave.

drizz

#1
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.

The truth cannot be learned ... it can only be recognized.

hutch--

Unicode strings are no different to ansi strings apart from being 2 byte instead of one byte. Scan its length for a word size 0 as terminator. The alternative is OLE string where the length is stored b 4 bytes below the start address.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

KeepingRealBusy

Quote from: drizz on July 07, 2010, 01:42:51 AM
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.



I read some of the comments in your link.  The common thinking was that it is faster to use the bad algos than worry about an exception once in a great while. In my case, I have acrually supplied the very few instructions it takes to eliminate this possibility, and these could be executed just once, not in the main loop (I did imply that they were in the main loop by including the or edx,0FFFFh to force acceptance of all matches after the first, but this could be dropped and a separate compare loop coded that had no extra code). This fix only needs to affect the first access, all others use aligned accesses which will not fault if you are checking for nulls.

You can have the best of both worlds, speed and safety.

Dave.

KeepingRealBusy

Quote from: hutch-- on July 07, 2010, 02:29:39 AM
Unicode strings are no different to ansi strings apart from being 2 byte instead of one byte. Scan its length for a word size 0 as terminator. The alternative is OLE string where the length is stored b 4 bytes below the start address.

Hutch,

I haven't yet tried this with the crt__ routines, but the code seems to be all WORD oriented for Unicode, so if a Unicode string is set to start on an odd BYTE bound, the crt__ routines should work. I have been looking at this with SSE in mind, including my initial check. With odd BYTE alignment to avoid buffer overrun , loading an xmm reg at an aligned boundary would put the characters split between words, not too good for pcmpeqw compares. Is such alignment allowable?

Dave.

hutch--

Instead of guessing, load a unicode string on a 2 byte boundary and you never have the problem. 4 byte alignment is even better.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

KeepingRealBusy

Hutch,

I agree, use aligned strings. But what should a generalized library function do if passed such a string?

I will test this and see if itwill work at all for CRT__ routines, and get back.

Dave.

Rockoon

I think that a generalized library should throw an exception on unaligned data, and yes that would included 16-bit unicode, which should of course be 2-byte aligned.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Quote from: KeepingRealBusy on July 07, 2010, 12:57:11 AM
Lingo and JJ Note:

Your code did an unaligned load, then masked to the next aligned bound then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlaped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc Buffer if you are checking for nulls

Dave, your point is valid - see also this post for a concrete example. We did test the other method, i.e. anding the first address and masking out the leading bits, but fail to remember why we didn't continue that road. Maybe because the masking out costs some cycles? Does anybody have a better idea than a shr/shl pair?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
36      cycles for 10*shr/shl eax, cl
15      cycles for 10*shr/shl eax, 15
6       cycles for 10*and eax, nn


hutch--

If you are thinking of using SSE instructions on unicode data, make sure you use the required alignment as some SSE instructions require 16 BYTE alignment and will crash if the data is not 16 byte aligned. You check this on an instruction by instruction basis from the Intel manual.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

asmfan

Everything depends on nature of input data: strings/buffers/sizes. And what's known/unknown.
As an example is copying standard functions: memcpy, memmove. Some regions may/may not overlap.

In this case what if we have some aligned/unaligned buffer as input? in some cases we cannot "and eax, -16" and "movdqa/movaps" - we'll have to movdqu/movups. And what's length known/unknown? The firs zero byte is signaling - then only byte access allowed in this case. Boundaries before and after, crossing the 16 byte alignment and cache.
String start =15
size = 19
String end = 33
how to work this situation out? 2 boundaries crossed /16, 32/. How to load it? Using unaligned and most common way with tail cases processing or else... with common GPR processing for head and tail or else...?
Russia is a weird place

KeepingRealBusy

Here is a procedure, WordAlign, that will align an unaligned (odd BYTE bound) Wide Character string.

I even tested with the wcs_ routine, at least CRT__wcscpy, CRT__wcscmp, CRT__wcschr, and they all worked correctly. I had examined the crt source code and it all appeared to be just Wide Character oriented. Where is there any restriction documented. It all seems to work on BYTE bounds.

This is not my final version of WordAlign, still some cleanup to remove stalls, but as coded here, the logic is more readable.

Enjoy,

Dave.

ecube

Why not just use SEH to check for faults? idk how slow SEH is, but it'd get'r done  :U

Rockoon

Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.

If your code is *expecting* an exception, there is something wrong.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

dedndave

i have noticed - that's the way C programmers do it - lol
they catch everything with an exception handler