I was looking at the SSE conversions we were developing and thought about some of the consequences of using SSE (or a general purpose register like eax) when scanning a string instead of just a BYTE (or WORD for Wide Character Unicode).
I wanted to check what would happen at the end of a buffer allocated by VirtualAlloc. I created this small test case and verified that a movdqu from the start of a short (in my case 7 BYTE) string at the end of the buffer would fault before you can check for a null. I then added the code to allow this to work correctly.
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
If you allocate a VirtualAlloc Buffer, and reserve the last 16 BYTES and do not put any data there (not even the last trailing null), then you will never overrun the buffer - you will always find the null first. Thus, you do not need this type of initial checking.
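For example (the buffer size and variable names are arbitrary - just to illustrate the layout):

BUF_SIZE  equ 10000h                  ; whatever you need - VirtualAlloc is page granular anyway
DATA_MAX  equ BUF_SIZE - 16           ; usable bytes, the last 16 stay empty

    invoke  VirtualAlloc, NULL, BUF_SIZE, MEM_COMMIT, PAGE_READWRITE
    mov     pBuffer, eax              ; strings (and their terminating nulls) are only
                                      ; ever written to the first DATA_MAX bytes, so a
                                      ; 16 byte load that starts inside the data can
                                      ; never run past the committed pages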
Note further:
This test case is coded for BYTE compares, but the same solution can be used when checking 8 wide characters in an xmm reg.
The method:
The plan is to force the first load to be at a mod 16 bound by ANDing the real start with -16. This will never overrun a VirtualAlloc buffer because the start is always on a mod page size bound and the length is always a mod page size length. Page size is 4KB with XP (maybe larger on 64-bit systems), so as long as you start at a mod 16 bound within the buffer, you will never overrun the buffer. This gets you more data than you wanted (the BYTEs or WORDs preceding the string), so you calculate the number of bytes to skip and the bit mask to apply to the bits extracted by pmovmskb. A bsf on the extracted and masked bits will then give you the correct position of the first instance of the desired character. My test was a simple check for nulls in a buffer full of nulls.
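Here is a minimal sketch of the idea (the proc name, labels, and register choices are just for illustration; it assumes an SSE2 build with .xmm enabled, and it returns the offset of the first null relative to the passed start):

FindNull proc pString:DWORD
    mov     eax, pString
    mov     ecx, eax
    and     eax, -16            ; round the start down to a 16 byte bound
    and     ecx, 15             ; ecx = number of leading garbage bytes
    pxor    xmm1, xmm1          ; 16 zero bytes to compare against
    movdqa  xmm0, [eax]         ; aligned load - cannot cross a page bound
    pcmpeqb xmm0, xmm1          ; FFh in every byte position that held a null
    pmovmskb edx, xmm0          ; one bit per byte
    shr     edx, cl             ; throw away the bits for the garbage bytes
    shl     edx, cl             ; (an AND with a prebuilt mask would also work)
    bsf     edx, edx            ; position of the first null, if any
    jnz     found
nextBlock:
    add     eax, 16
    movdqa  xmm0, [eax]         ; all further loads are aligned
    pcmpeqb xmm0, xmm1
    pmovmskb edx, xmm0
    bsf     edx, edx
    jz      nextBlock
found:
    add     eax, edx            ; absolute address of the null
    sub     eax, pString        ; offset relative to the caller's start
    ret
FindNull endp

The shr/shl pair only runs once, before the scan loop, so it adds next to nothing to the total cost.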
Lingo and JJ Note:
Your code did an unaligned load, then masked to the next aligned bound, then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlapped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores the leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc buffer if you are checking for nulls.
All:
The same problem can occur if you use mov eax,[esi] to scan a string, checking for nulls as you go. Instead, force esi to the prior mod 4 bound, load the DWORD containing the desired starting character, and use the difference between the actual start and the aligned start to position eax (8 bits per skipped character), then rol eax,8 for each remaining character, checking al for a null. From then on, just adjust esi by 4 bytes and you will not overrun the buffer. You have to find some way to not overrun a VirtualAlloc buffer on the first check.
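A minimal GPR-only sketch along the same lines (not the exact shift/rol choreography described above - the name and register choices are hypothetical, and it assumes the terminating null is somewhere inside the buffer):

FindNullGPR proc uses esi pString:DWORD
    mov     esi, pString
    mov     ecx, esi
    and     esi, -4               ; back up to the prior DWORD bound
    and     ecx, 3                ; leading garbage bytes in the first DWORD
    mov     eax, [esi]            ; aligned load - never crosses a page bound
    mov     edx, 4
    sub     edx, ecx              ; real bytes present in this first DWORD
    shl     ecx, 3
    shr     eax, cl               ; drop the garbage bytes (little endian)
checkFirst:
    test    al, al
    jz      found
    shr     eax, 8
    dec     edx
    jnz     checkFirst
scanLoop:
    add     esi, 4                ; from here on, plain aligned DWORD scanning
    mov     eax, [esi]
    mov     edx, 4
checkByte:
    test    al, al
    jz      found
    shr     eax, 8
    dec     edx
    jnz     checkByte
    jmp     scanLoop
found:
    lea     eax, [esi+4]
    sub     eax, edx              ; address of the null
    sub     eax, pString          ; -> length of the string
    ret
FindNullGPR endp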
For scanning Unicode strings, you are dealing with 2 WORDs and not 4 BYTEs, so make the adjustments and just deal with it.
This test case is a good place to test out your favorite code fragment to insure that the first load of your routine will not overrun the buffer.
Dave.
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.
Unicode strings are no different to ANSI strings apart from being 2 bytes instead of one byte. Scan its length for a WORD-size 0 as the terminator. The alternative is an OLE string, where the length is stored in the 4 bytes below the start address.
Quote from: drizz on July 07, 2010, 01:42:51 AM
Good thinking! I tried to explain this in this thread:
http://www.masm32.com/board/index.php?topic=10925.0
If people don't want to consider this when writing algos there is nothing we can do.
I read some of the comments in your link. The common thinking was that it is faster to use the bad algos than worry about an exception once in a great while. In my case, I have actually supplied the very few instructions it takes to eliminate this possibility, and these could be executed just once, not in the main loop (I did imply that they were in the main loop by including the or edx,0FFFFh to force acceptance of all matches after the first, but this could be dropped and a separate compare loop coded that has no extra code). This fix only needs to affect the first access; all others use aligned accesses, which will not fault if you are checking for nulls.
You can have the best of both worlds, speed and safety.
Dave.
Quote from: hutch-- on July 07, 2010, 02:29:39 AM
Unicode strings are no different to ansi strings apart from being 2 byte instead of one byte. Scan its length for a word size 0 as terminator. The alternative is OLE string where the length is stored b 4 bytes below the start address.
Hutch,
I haven't yet tried this with the crt__ routines, but the code seems to be all WORD oriented for Unicode, so if a Unicode string is set to start on an odd BYTE bound, the crt__ routines should work. I have been looking at this with SSE in mind, including my initial check. With odd BYTE alignment, loading an xmm reg at an aligned boundary (to avoid buffer overrun) would leave the characters split across WORD lanes - not too good for pcmpeqw compares. Is such alignment allowable?
Dave.
Instead of guessing, load a unicode string on a 2 byte boundary and you never have the problem. 4 byte alignment is even better.
Hutch,
I agree, use aligned strings. But what should a generalized library function do if passed such a string?
I will test this and see if it will work at all for the CRT__ routines, and get back.
Dave.
I think that a generalized library should throw an exception on unaligned data, and yes, that would include 16-bit Unicode, which should of course be 2-byte aligned.
Quote from: KeepingRealBusy on July 07, 2010, 12:57:11 AM
Lingo and JJ Note:
Your code did an unaligned load, then masked to the next aligned bound, then continued with aligned loads (the first check was 8 or 16 BYTES, but the second possibly overlapped some of the first, thereafter checking 8 or 16 bytes each time). My check always does aligned loads but ignores the leading garbage bytes the first time. Aligned loads will not overrun a VirtualAlloc buffer if you are checking for nulls.
Dave, your point is valid - see also this post for a concrete example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375). We did test the other method, i.e. ANDing the first address and masking out the leading bits, but I fail to remember why we didn't continue down that road. Maybe because the masking out costs some cycles? Does anybody have a better idea than a shr/shl pair?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
36 cycles for 10*shr/shl eax, cl
15 cycles for 10*shr/shl eax, 15
6 cycles for 10*and eax, nn
If you are thinking of using SSE instructions on Unicode data, make sure you use the required alignment: some SSE instructions require 16 BYTE alignment and will crash if the data is not 16-byte aligned. Check this on an instruction-by-instruction basis in the Intel manual.
Everything depends on the nature of the input data - strings, buffers, sizes - and on what is known or unknown.
An example is the standard copying functions, memcpy and memmove: some regions may or may not overlap.
In this case, what if we have an aligned or unaligned buffer as input? In some cases we cannot "and eax, -16" and use movdqa/movaps - we have to use movdqu/movups. And is the length known or unknown? If the first zero byte is the signal, then only byte access is allowed in that case. There are boundaries before and after - crossing the 16-byte alignment and cache lines.
String start =15
size = 19
String end = 33
How do you work this situation out? Two 16-byte boundaries are crossed (16 and 32). How do you load it? Using unaligned loads, the most common way, with tail-case processing, or with GPR processing for the head and tail, or something else?
Here is a procedure, WordAlign, that will align an unaligned (odd BYTE bound) Wide Character string.
I even tested with the wcs_ routines, at least CRT__wcscpy, CRT__wcscmp, CRT__wcschr, and they all worked correctly. I had examined the crt source code and it all appeared to be just Wide Character oriented. Where is any such restriction documented? It all seems to work on BYTE bounds.
This is not my final version of WordAlign, still some cleanup to remove stalls, but as coded here, the logic is more readable.
Enjoy,
Dave.
Why not just use SEH to check for faults? idk how slow SEH is, but it'd get'r done :U
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
i have noticed - that's the way C programmers do it - lol
they catch everything with an exception handler
Is there any way to force PROC stack variables to be 16 byte aligned? The best I can do is:
Local Var1:OWORD
Local Var2:OWORD
Local Var3:OWORD
movdqa OWORD PTR Var1,xmm0
movdqa OWORD PTR Var2,xmm1
movdqa OWORD PTR Var3,xmm2
This assembles and reserves space, but no attempt to align.
All I can see to do is:
Local Var1[16]:DWORD ; need 3 OWORDS (12 DWORDS) plus slack for alignment
lea edx,Var1
lea edx,[edx+15]
and edx,-16
movdqa OWORD PTR [edx], xmm0
movdqa OWORD PTR [edx+16], xmm1
movdqa OWORD PTR [edx+32], xmm2
Note: This takes 3 instructions just to set the address of the aligned variables, just to save doing an unaligned movdqu (and later the same for the restore).
Just one more question. I have been using the following command line option, "/Sg", which I found documented in Kip's book, 4th ed. This was for MASM 6.15. The option said "Turn on listing of assembly-generated code." MASM 9.0 and JWASM both accept the option. This is not documented in the current MSDN that came with MASM 9.0 (Visual Studio 8.0), nor in the help file in the MASM32 lib. Is this option just accepted and ignored, or is it just not documented? Does this option do anything?
that's strange
i don't see /Sg in my list (6.14)
to generate assembler listings, i have been using /Fl <--------- that's a lower case L
C:\ => ml /help
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997. All rights reserved.
ML [ /options ] filelist [ /link linkoptions ]
/AT Enable tiny model (.COM file) /nologo Suppress copyright message
/Bl<linker> Use alternate linker /Sa Maximize source listing
/c Assemble without linking /Sc Generate timings in listing
/Cp Preserve case of user identifiers /Sf Generate first pass listing
/Cu Map all identifiers to upper case /Sl<width> Set line width
/Cx Preserve case in publics, externs /Sn Suppress symbol-table listing
/coff generate COFF format object file /Sp<length> Set page length
/D<name>[=text] Define text macro /Ss<string> Set subtitle
/EP Output preprocessed listing to stdout /St<string> Set title
/F <hex> Set stack size (bytes) /Sx List false conditionals
/Fe<file> Name executable /Ta<file> Assemble non-.ASM file
/Fl[file] Generate listing /w Same as /W0 /WX
/Fm[file] Generate map /WX Treat warnings as errors
/Fo<file> Name object file /W<number> Set warning level
/FPi Generate 80x87 emulator encoding /X Ignore INCLUDE environment path
/Fr[file] Generate limited browser info /Zd Add line number debug info
/FR[file] Generate full browser info /Zf Make all symbols public
/G<c|d|z> Use Pascal, C, or Stdcall calls /Zi Add symbolic debug info
/H<number> Set max external name length /Zm Enable MASM 5.10 compatibility
/I<name> Add include path /Zp[n] Set structure alignment
/link <linker options and libraries> /Zs Perform syntax check only
wow - i am not using the version i thought i was :lol
and i spent all that time patching it, too
now, i hafta go patch 6.15
Dave,
I think the difference is that /Fl produces a listing (if you have .list in the source), but if a macro is used, output of generated code could be suppressed, unless /Sg was used. I have not tried to use /Sg without .list (I just did - /Sg does not override a missing .list). I will check for macro expansion. Note: Still do not know whether /Sg is actually supported, or just tolerated.
Dave.
The reference in Kip's book said the information "came from the last printed documentation from the MASM 6.11 reference manual", ... "with updates from the MASM 6.14 readme.txt file".
Dave
Dave,
Do you have any words of wisdom about stack alignment of OWORDS?
Dave.
It is listed in MASMREF.DOC that came with the Processor Pack for VC 6.0 that included ML.EXE 6.15.
Quote/Sg Turns on listing of assembly-generated code.
Regarding alignment, how about using malloc_align().
ok - patched 6.15 and put together a new ML615 package
it contains the document from VS6, as well as the ReadMe's and a few others
http://www.4shared.com/file/-QIUp-BF/ml615.html
be sure to read the ML_ver.txt file
QuoteDo you have any words of wisdom about stack alignment of OWORDS?
it's nice to be wanted, but i am probably not the guy to ask - lol
i would think that Jochen, MichaelW, qWord and the other guys are far more qualified to help on this one
the problem is - many of the guys that have experience writing alignment macros aren't using 64-bit machines :P
but, i would think that a creatively designed macro could replace the INVOKE macro/functionality of ml32 in ml64
these guys are good at macros - i bet they could write one that would align the stack in and out
Well now I know where I stand. :wink
I am not an expert at SSE, but what's wrong with ALIGN 16 to align the LOCALs?
lol Greg - i meant nothing like that at all
i just happen to know those guys have played specifically with alignment macros :P
and, i thought i covered my ass bases with...
Quote...and the other guys
dedndave, I'm confused by what the patch to /help does; /? and /help both show the list of switches for vanilla ML.EXE 6.15. Is /help meant to do something else? And does it do something else post-patch, because trying your patched ML.EXE, /? and /help still both show the list of switches.
Queue
be careful that you are executing the right copy of ML :bg
try ML615 /?
if you have ML in the path, you are looking at whatever version you have in the bin folder (probably)
i tested version 6.15.8803 and it fails for "/?", but works for "/help"
i didn't really patch any code on that
all i did was change the displayed string from
usage: ML [ options ] filelist [ /link linkoptions]
Run "ML /help" or "ML /?" for more info
to
usage: ML [ options ] filelist [ /link linkoptions]
Run "ML /help" for more info
for some reason, the parser sees "/?" as "/r"
the original (unpatched) 6.15.8803 displays the following...
C:\=> ml /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
MASM : warning A4018: invalid command-line option : /R
MASM : fatal error A1017: missing source filename
"?" is a filenaming wildcard character, i guess - lol
i dunno
i looked at the code that parses it to see about fixing it
it was over-complicated, if you ask me - lol
so, i just changed the displayed string
that is a good way to verify you have the patched version, i suppose
C:\Utils>ml /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
ML [ /options ] filelist [ /link linkoptions ]
/AT Enable tiny model (.COM file) /omf generate OMF format object file
/Bl<linker> Use alternate linker /Sa Maximize source listing
/c Assemble without linking /Sc Generate timings in listing
/Cp Preserve case of user identifiers /Sf Generate first pass listing
/Cu Map all identifiers to upper case /Sl<width> Set line width
/Cx Preserve case in publics, externs /Sn Suppress symbol-table listing
/coff generate COFF format object file /Sp<length> Set page length
/D<name>[=text] Define text macro /Ss<string> Set subtitle
/EP Output preprocessed listing to stdout /St<string> Set title
/F <hex> Set stack size (bytes) /Sx List false conditionals
/Fe<file> Name executable /Ta<file> Assemble non-.ASM file
/Fl[file] Generate listing /w Same as /W0 /WX
/Fm[file] Generate map /WX Treat warnings as errors
/Fo<file> Name object file /W<number> Set warning level
/FPi Generate 80x87 emulator encoding /X Ignore INCLUDE environment path
/Fr[file] Generate limited browser info /Zd Add line number debug info
/FR[file] Generate full browser info /Zf Make all symbols public
/G<c|d|z> Use Pascal, C, or Stdcall calls /Zi Add symbolic debug info
/H<number> Set max external name length /Zm Enable MASM 5.10 compatibility
/I<name> Add include path /Zp[n] Set structure alignment
/link <linker options and libraries> /Zs Perform syntax check only
/nologo Suppress copyright message
I double-checked, and even with your patched ML615.EXE, /? shows the switches, so I don't think I'm even encountering the problem you are. I have a vanilla 6.15 and a patched 6.15; the difference between my patched 6.15 and yours (ignoring the changes to the string you modified and the version number you tweaked) is a single byte at F6B8; in mine it's 0E, and in yours it's 7F. I'm mainly curious as to what that single byte difference is.
Queue
ok - i downloaded the original file from
http://win32assembly.online.fr/download.html
it is version 6.15.8803
is that what you have ?
the byte at offset F6B8 is 0E
i renamed ml.exe to mx.exe and ml.err to mx.err to avoid confusion with the patched copy...
C:\=> mx /?
Microsoft (R) Macro Assembler Version 6.15.8803
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
MASM : warning A4018: invalid command-line option : /R
MASM : fatal error A1017: missing source filename
modifying that byte to 7F has no effect on the "/?" output
Yes, and I've done hex comparisons, byte-for-byte. I'm definitely working with the same version of ML as you.
Queue
Any alignment of the stack beyond the default 4 bytes will have to be done with some sort of code, a custom prologue for example, you can't just automatically align it.
If that was the case then we poor ml64 users would have it a lot easier...
Sinsi,
Thank you for the information. That is exactly what I did, but wondered what else could be done to eliminate the 3 instructions it takes to do this.
Dave.
Quote from: Greg Lyon on July 12, 2010, 04:59:15 AM
Well now I know where I stand. :wink
I am not an expert at SSE, but what's wrong with ALIGN 16 to align the LOCALs?
Greg,
ALIGN 16 is an assembly-time directive: it just adjusts the address at which the next data or instruction will be located, and in the case of instructions, it pads the code with dummy instructions that do not affect either the registers or the flags (instructions like lea ebx,[ebx]). What I needed was something to reserve space on the stack, and also something to select, at execution time, the actual stack address to use for an aligned store (movdqa). It turns out that you must do this yourself, since the stack can be aligned differently at each call.
I guess I will have to time this both ways, using code to align for a movdqa, or no code and use movdqu.
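In the meantime, here is how those three instructions could be wrapped up in a macro so the clutter only appears once - just a sketch, untested, and the macro name and the scratch register are arbitrary:

LocalAlign16 MACRO reg:REQ, localname:REQ
    lea     reg, localname
    add     reg, 15
    and     reg, -16                ;; reg -> first 16-byte aligned address inside localname
ENDM

SomeProc proc
    LOCAL buffer[63]:BYTE           ; 48 bytes for 3 OWORDs + up to 15 for alignment
    LocalAlign16 edx, buffer
    movdqa  OWORD PTR [edx], xmm0
    movdqa  OWORD PTR [edx+16], xmm1
    movdqa  OWORD PTR [edx+32], xmm2
    ; ... rest of the proc, restore with movdqa loads from the same addresses
    ret
SomeProc endp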
Dave.
Dave,
I see, I thought you wanted to align the OWORD data variables. Sorry, I misunderstood.
dedndave Dave,
I was kidding, hence the wink. I am not as sharp as I used to be, some of these guys run circles around me.
shoot, Greg - some of the newbies can make me look bad - lol
Well, here is my final version of WordAlign, actually KRBWordAlign (both functions are there; KRBWordAlign is tested). The KRBWordAlign version has the code shuffled around as much as possible (to remove stalls), which may confuse the reader; WordAlign might be easier to understand. The attached zip contains modified tests that test all 32768 different lengths of a wide character test string, all stuffed at the end of a VirtualAlloc buffer with an odd BYTE start. Three error return conditions are tested.
Dave.
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
that's not true, various protection methods are based on SEH, such as nanomites, page guards, some VM engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs. blowing up in your face, and it is used in detecting VMs like VMware. It's good for self-debugging as well.
Quote from: E^cube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
that's not true, various protection methods are based on SEH, such as nanomites, page guards, some VM engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs. blowing up in your face, and it is used in detecting VMs like VMware. It's good for self-debugging as well.
You are not *expecting* an exception in normal code execution with those things. You are expecting exceptions in *abnormal* execution in most of them, and in the other are trying to intentionally create abnormal behavior.
Rare events are not part of normal flow control. Exceptions don't help you debug anything when you are using them for flow control. They make it much, much harder. You could have checked the size of the buffer, but instead you waited for an exception... really? That's easier to debug?
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
Quote from: E^cube on July 12, 2010, 10:18:50 PM
Quote from: Rockoon on July 10, 2010, 02:32:39 PM
Exceptions are for exceptional events, and only ones that the local procedure can't handle and are otherwise unwieldy to transmit back to the caller through normal return mechanics.
If your code is *expecting* an exception, there is something wrong.
that's not true, various protection methods are based on SEH, such as nanomites, page guards, some VM engines, etc... also it can help prevent malicious attacks in security software by handling errors calmly vs. blowing up in your face, and it is used in detecting VMs like VMware. It's good for self-debugging as well.
You are not *expecting* an exception in normal code execution with those things. You are expecting exceptions in *abnormal* execution in most of them, and in the other are trying to intentionally create abnormal behavior.
Rare events are not part of normal flow control.
Exceptions don't help you debug anything when you are using them for flow control. They make it much, much harder. You could have checked the size of the buffer, but instead you waited for an exception... really? That's easier to debug?
actually again that's not true; in the case of nanomites it intentionally puts exception code in the place of, say, calls, so that when the program comes across it, it throws an exception which the SEH handles, looks up the location in its database, and runs the correct code to continue on.
In terms of debuggers, how do you think one breaks on certain parts of code? It's not magic... it sets an int 3 on the address, which throws an exception when run, which the debugger automatically handles, allowing you to pause the program flow and see what's in the registers, etc...
and you're missing the point of all this. Ideally he should just check the size of the buffer, sure, but what if a buffer accidentally isn't given, and instead an integer is? CRASH! This is unacceptable in vital programs running at system level, such as services or programs that need to continue to run. His code could be incorporated in such a scenario is my point. Also it's not just about size; to give the example of wsprintf, if you have countless %s and very few string inputs, there's a crash right there which can run shellcode and all that nonsense - they did it with ollydbg.
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.
Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...
But I agree that, beyond ideology, you need a damn good justification to use SEH that way. By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
To be fair, the docs say (now, maybe not before?)
QuoteUsing this function incorrectly can compromise the security of your application. This function uses structured exception handling (SEH) to catch access violations and other errors. When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error. The caller is not safe to assume that insufficient space is the error condition.
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.
Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...
But I agree that, beyond ideology, you need a damn good justification to use SEH that way. By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
i've already explained why: if your program is of any importance as far as it running or being safe, then SEH is required, otherwise you just contribute to the countless "exploits" skiddies discover in badly written software, which leaves users more open to attackers and makes Windows look worse. Any kind of server software, any kind of service, or similar. Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
Quote from: sinsi on July 13, 2010, 07:16:40 AM
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
By the way, has anybody looked at this example (http://www.masm32.com/board/index.php?topic=10925.msg80375#msg80375) where good ol' lstrcpy fails clamorously?
To be fair, the docs say (now, maybe not before?)
QuoteUsing this function incorrectly can compromise the security of your application. This function uses structured exception handling (SEH) to catch access violations and other errors. When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error. The caller is not safe to assume that insufficient space is the error condition.
Interesting: So they know it. On the other hand, it supports the "let it crash" line. Do you have an error check after each lstrcpy or lstrcat? I have 226 occurrences of lstrc*** in the RichMasm source, none has an error check. So their SEH is a safe recipe for an unchaseable bug that occurs in very rare circumstances - the chance is about 1:4096. Fortunately most if not all of these lstrc*** deal with short messages well below the size of a page, but imagine you use it for file handling on a huge number of files? And we are not talking about a hobby coder's algo here - lstrcpy is part of the OS, and it crashes silently because they decided to "handle" the exception ::)
well, instead of using lstrcpy you can rename it to lstrcpyx for all instances and just have it be your own function with SEH that tries to call lstrcpy ;D 2 second fix. But yeah, for a hobby coder or code being used in a program that has no importance, SEH isn't needed. SEH is also nice in debugging because you can log to a text file the register contents, the params passed when it crashed, and of course which function.
Well, it is a function, which returns a value, so some error-checking would be in order...
Does anyone check the return values of RegisterClassEx/CreateWindowEx? Only when your code doesn't show a window I'll bet, then you get rid of the check.
Return values are there for a reason, and how bad is it to branch after a "test eax,eax" anyway?
Then we can get one of those "unexpected error" or "internal error" message boxes :bdg
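Something as simple as this after each call would at least turn the silent failure into a visible one (per the docs quoted above, lstrcpy returns NULL when its internal SEH fires - the names and label are just for illustration):

    invoke  lstrcpy, ADDR dest, ADDR source
    test    eax, eax
    jz      CopyFailed              ; NULL return - the copy did not complete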
well - i have just seen several C examples where the exception was the rule (slap me if that's a bad pun - lol)
i am kind of an old-school guy, i know
but, i have always tried to write code so that errors don't happen to begin with
in the case of peripherals or some other hardware, of course, you can't always do that
but, generally, i try to test for and force correction of an error before allowing it to occur
here is a simple example:
i want to use the DIV instruction
prior to using it, i insure the dividend is within range and the divisor is non-zero
in most cases, the logic of the code is such that these error conditions cannot occur
if they do occur, i allow the user to alter input parameters to fix the problem or whatever steps are appropriate
if divide-by-zero does occur in my program, it indicates that i have a bug in my code
i don't use the exception handler to catch the mistakes - lol
now - that isn't very modern, perhaps
but, when i see exception handlers used that way, the term that comes to mind is "lazy coder"
i suppose there are many modern cases where my perception is off-base :P
Why do you need to use SEH if nobody except you will process the raised exceptions?
Your app is SEHed by Windows anyway before it starts - UnhandledExceptionFilter etc.
If it is designed to crash, it will crash anyway, with self-made SEH or without, under the system-made SEH.
Bad design plus an unhandled exception leads to a known result.
Windows is smart enough to handle such apps properly. So whether you use SEH or not, Windows will terminate your (or someone's) buggy app anyway.
Quote from: E^cube on July 13, 2010, 07:22:51 AM
Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
Quote from: MichaelW on July 13, 2010, 10:31:39 AM
Quote from: E^cube on July 13, 2010, 07:22:51 AM
Computers are getting so fast now that a few more clocks to handle SEH isn't detrimental like back in the day.
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
If you're just writing hobby code that you're not going to use in any program of importance, then you don't have to use SEH. When you write programs that many users use, however, and your lack of SEH puts the users' systems at risk, like the countless apps I've seen on the "exploits" lists, then that's a problem.
Quote from: MichaelW on July 13, 2010, 10:31:39 AM
A "few" more clocks? In my test it took ~11000 more clocks just to bypass a divide by zero and continue execution.
IMHO the extra clocks are not the problem. The divide by zero is result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
Microsoft already gets enough heat from skiddies exploiting some of their API functions, in turn exploiting users' systems; that's in part why they created managed code and the safe API. I think lstrcpy not crashing is a good thing, all errors should be handled gracefully, because keep in mind the general users can barely check their email, much less know about programming etc... they don't want to see a program crash... it scares them.
Rockoon is right here, exception handling is for code that must deal with events that cannot be predicted at compile/assembly time - hardware, internet connections and the like. If something is not physically available then you must have a way to deal with the lack of response, but outside of those circumstances you should write code that does not have faults in it for its target market. Better to write suicide code that explodes in your face with an error than to have undebuggable junk that hides the problem.
Quote from: jj2007 on July 13, 2010, 11:40:09 AM
IMHO the extra clocks are not the problem. The divide by zero is result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
I agree. My point was that the few more clocks justification is not valid.
Quote from: MichaelW on July 13, 2010, 03:27:26 PM
Quote from: jj2007 on July 13, 2010, 11:40:09 AM
IMHO the extra clocks are not the problem. The divide by zero is result of bad design, so let it crash properly, with a slap in the coder's face. As Dave put it, "lazy coders" use the handler to avoid reflecting on proper design. I wish lstrcpy would crash instead of silently "handling" an access violation.
I agree. My point was that the few more clocks justification is not valid.
The justifications have already been pointed out, but let me reiterate and also clarify something for you: the clock cycles you posted do NOT increase incrementally. That is, you can write a great deal of additional code inside your exception handler, for example, and it wouldn't be astronomical in cycles just because it's in an exception handler - it's just the initial setup that takes the cycles. Also, I recommend VEH over SEH as it's a lot more intelligent, and I bet it's faster too.
As far as its use, I'm aware a lot of you are seasoned programmers, but this is definitely not the early 90's anymore; a lot has changed, including the programmer's responsibility to write safe/reliable code, not just in terms of your program using it, but from outside influences as I mentioned earlier. Also SEH, VEH etc... set to log function/code crashes is a very nice/fast way to narrow down bugs in your program, much faster than debugging, especially on x64 where there aren't a lot of good debuggers out yet. And thinking of the future, how great would it be for a user to be able to run your program on Windows 13 that you wrote for Windows XP, and have a log generated of the APIs/functions crashing so that you can easily write a fix :)
:lol By Windows 13 my computer will be fixing its own damn bugs
SEH is only needed when dealing with something system-wide or global - that is, shared among all processes: system settings, events as said, etc., which surely must be processed at app termination (cleanup time). The rest can be handled by checking return values and GetLastError, or by a separate thread and synchronization (wait/until) - the weirdest nonblocking case.
Quote from: E^cube on July 13, 2010, 04:38:08 PM
The justifications have already been pointed out, but let me reiterate and also clarify something for you: the clock cycles you posted do NOT increase incrementally. That is, you can write a great deal of additional code inside your exception handler, for example, and it wouldn't be astronomical in cycles just because it's in an exception handler - it's just the initial setup that takes the cycles. Also, I recommend VEH over SEH as it's a lot more intelligent, and I bet it's faster too.
The cycles I posted were for handling the exception. My test consumed ~3000 cycles if there was no exception, or ~14000 cycles if there was an exception.
Jeez, guys! I'm sorry I raised such a firestorm. I was just trying to insure that I wouldn't walk off of a VirtualAlloc buffer using SSE loads.
Dave.
Quote from: jj2007 on July 13, 2010, 06:38:45 AM
Quote from: Rockoon on July 12, 2010, 11:53:22 PM
If your code is *expecting* an exception, there is something wrong.
...
Rare events are not part of normal flow control.
Normally I would strongly agree - but what if the "good" checks cost so much more time than letting it crash, in a controlled way, into an "exception"? Guard pages do that all the time - have a look at the page faults column in Task Manager...
If the "good" checks cost more than the exception, then its a very rare event. You do know how costly exceptions are, right? :) First it goes to the OS's exception handler, then possibly it gets offloaded to yours, and then maybe back to the OS again for the ones still unhandled.
As far as the vast majority of page faults listed in task manager, they are being handled by the OS's virtual memory subsystem. I believe only two scenarios encompass the entire count:
memory that was swapped out
memory mapped files
Neither of these is handled by your application, so it actually falls under my other observation: not being handled by the local procedure.
If there are other faults included in the count, I'm all ears. I do not believe that the faults that your program catches are included in the count, but meh..
Ah, now here is the rub.
On the one hand we have code that is going to overshoot its buffer on purpose, and then from time to time it is going to just catch an exception if one is raised because it not only overshot the buffer, it also overshot the contiguous memory pages the buffer resides in.
On the other we have code that will divide by zero from time to time, where the programmer is going to just catch the exception if one is raised, and then execute default-value semantics, error out, or whatever.
One of these is not like the other. In the buffer case, not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too? Wow. Just wow. In the divide by zero case, it's accidental or incidental, and not on purpose.
One area where I normally just want to swallow exceptions is fsqrt. Luckily the FPU lets us do just that.
Quote from: Rockoon on July 14, 2010, 07:30:30 AM
Ah, now here is the rub.
On the one hand we have code that is going to overshoot its buffer on purpose, and then from time to time it is going to just catch an exception if one is raised because it not only overshot the buffer, it also overshot the contiguous memory pages the buffer resides in.
On the other we have code that will divide by zero from time to time, where the programmer is going to just catch the exception if one is raised, and then execute default-value semantics, error out, or whatever.
One of these is not like the other. In the buffer case, not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too? Wow. Just wow. In the divide by zero case, it's accidental or incidental, and not on purpose.
One area where I normally just want to swallow exceptions is fsqrt. Luckily the FPU lets us do just that.
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?
Dave.
Quote from: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?
Dave.
I'm saying that you normally shouldn't read 16 bytes from a buffer that ends in less than 16 bytes, and absolutely never read any bytes beyond the end of your allocation space without something having gone terribly wrong.
If you are intent on reading 16 bytes at a time then clearly performance is your concern. Since performance is your concern..
(A) align your strings so that they are 16-byte aligned.
(B) allocate space in 16-byte multiples so that you never overshoot your buffer.
(C) stop relying on NULL to terminate your strings. Store the length instead. You can still use a NULL to make them compatible with other routines.
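For (B), rounding a requested size up to the next 16-byte multiple before allocating is only a couple of instructions (a sketch, with the requested size assumed to be in eax):

    add     eax, 15
    and     eax, -16                ; eax is now the next multiple of 16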
i think that "overshoot" happens quite often
a lot of functions are passed string/buffer pointers without buffer size values
they assume the null terminator to be valid, i guess :bg
it probably also happens when functions try to dword align themselves inside a buffer
care isn't always taken to insure that accesses a few bytes above and below the buffer are avoided
when i wrote the ling long kai fang routines, i was extra careful to avoid this, and you have to specify both in and out sizes as parms
those routines dword-align themselves inside both buffers
i may have specified an aligned input buffer base - i don't remember at the moment
but, there is some code in there to avoid overshoot above the input value buffer
and some more code to avoid overshoot at the end of the output buffer
Quote from: Rockoon on July 14, 2010, 05:26:22 PM
Quote from: KeepingRealBusy on July 14, 2010, 01:23:41 PM
Specifically "not only are we overshooting the buffer on purpose, but sometimes we also overshoot the process space too". Are you saying that the exception handler should correct this? In the case of loading 16 bytes of a string using SSE, the string is absolutely valid and null terminated and in the buffer, but if it is short and at the end of the buffer, then you would get the fault. What should the exception handler do, or are you saying that this should not be handled by an exception handler?
Dave.
I'm saying that you normally shouldn't read 16 bytes from a buffer that ends in less than 16 bytes, and absolutely never read any bytes beyond the end of your allocation space without something having gone terribly wrong.
If you are intent on reading 16 bytes at a time then clearly performance is your concern. Since performance is your concern..
(A) align your strings so that they are 16-byte aligned.
(B) allocate space in 16-byte multiples so that you never overshoot your buffer.
(C) stop relying on NULL to terminate your strings. Store the length instead. You can still use a NULL to make them compatible with other routines.
You missed the operative sentence in my first post:
Quote
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
Dave.
Quote from: KeepingRealBusy on July 14, 2010, 06:03:00 PM
You missed the operative sentence in my first post:
Quote
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
Dave.
No, I didn't. A general purpose routine wouldn't be swallowing protection violations. That's decidedly not general purpose.
You have made the decision to not be general purpose when you started reading 16-bytes at a time.
Quote from: Rockoon on July 14, 2010, 06:11:53 PMA general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.
lstrcpy
is a general purpose routine... and I fully agree, it should not swallow protection violations
QuoteYou have made the decision to not be general purpose when you started reading 16-bytes at a time.
Although there are still a few non-SSE2 machines around, it might be time to declare reading 16-bytes at a time "normal".
Quote from: Rockoon on July 14, 2010, 06:11:53 PM
Quote from: KeepingRealBusy on July 14, 2010, 06:03:00 PM
You missed the operative sentence in my first post:
Quote
Note:
This coding fix only applies to a general purpose library routine in which the library routine has no knowledge of how the data was created or where it was saved.
Dave.
No, I didn't. A general purpose routine wouldn't be swallowing protection violations. That's decidedly not general purpose.
You have made the decision to not be general purpose when you started reading 16-bytes at a time.
My routines:
Don't swallow protection violations.
Read 16 bytes at a time.
Require valid null terminated strings.
Handle both Wide character (Unicode) and normal character strings.
There is nothing that says that a 16 BYTE reading function cannot be a general purpose routine; they are not mutually exclusive.
What, exactly, do you not like about my routines?
Dave.
Quote from: jj2007 on July 14, 2010, 06:53:13 PM
Quote from: Rockoon on July 14, 2010, 06:11:53 PMA general purpose routine wouldnt be swallowing protection violations. Thats decidedly not general purpose.
lstrcpy is a general purpose routine...
Not if it swallows protection violations.
Quote from: jj2007 on July 14, 2010, 06:53:13 PM
Although there are still a few non-SSE2 machines around, it might be time to declare reading 16-bytes at a time "normal".
It really isn't an issue of "support." This is about design. If you swallow the page faults, then you are special purpose.
My last post was in error, however, since clearly the routine could be constructed to only make aligned reads even for unaligned input (result: never cross a page boundary in error when valid data was supplied to it) and that would make it general purpose.
Quote from: KeepingRealBusy on July 14, 2010, 07:32:00 PM
What, exactly, do you not like about my routines?
I never said that I didn't like your routines. I never even looked at them prior to just now. I said that swallowing the page faults is not general purpose, which some posters seemed to consider a valid strategy (that a page fault wasn't an error, that the string could still have been terminated validly).
I stand corrected. I thought you were addressing your comments to my code and not to the other side discussion about SEH and faults.
What you see here of my code was only a little piece to handle unaligned (odd BYTE aligned) wide characters. From what I have learned, I will redo all my character routines, and implement wide character routines as well.
Dave.
So here is an attempt to start a "16-byte safe collection": good ol' string len.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
44 cycles for StrLen1 (safe)
47 cycles for StrLen2 (safe)
25 cycles for MasmBasic (unsafe)
132 cycles for MasmLib
44 cycles for StrLen1
47 cycles for StrLen2
25 cycles for MasmBasic
132 cycles for MasmLib
Results:
100 bytes for StrLen1
100 bytes for StrLen2
100 bytes for MasmBasic
Code sizes:
75 for StrLen1
75 for StrLen2
87 for MasmBasic
JJ,
I have looked at the code, and have one question. You pop the first 2 stack parameters into eax, leaving the return address on the stack, but it is unprotected by the esp value. I am not familiar with what happens during interrupts, but I do not think this is safe. If an interrupt comes in, where can the CPU save the current ip or any regs that need to be used?
Dave.
JJ,
A second problem. In the iteration loop you get two xmm regs (from [eax] and [eax+16]) without checking the first for nulls. This is not safe for the end of a VirtualAlloc buffer.
Dave.
Dave,
Thanks for looking at that.
Re lingo's pop-the-return-address technique: Interrupts seem not to be a problem, although it is apparently nowhere documented.
Re point 2: You are perfectly right, there is a risk at the end of a VirtualAlloc buffer. Any suggestions? The routine is already a bit slow ::)
try this
invoke  AddVectoredExceptionHandler, 1, handlexcept
;do everything...

handlexcept proc pExceptionInfo
    mov     edi, pExceptionInfo
    mov     eax, [edi].EXCEPTION_POINTERS.pExceptionRecord
    mov     edx, [edi].EXCEPTION_POINTERS.ContextRecord
    cmp     [eax].EXCEPTION_RECORD.ExceptionCode, STATUS_BREAKPOINT
    jne     @F
  ;cmp     [eax].EXCEPTION_RECORD.ExceptionAddress,    ; is it our code address
  ;jne     @F                                          ; if not, let others have a go
    add     [edx].CONTEXT.regEip, 1
    mov     eax, EXCEPTION_CONTINUE_EXECUTION
    ret
@@:
    mov     eax, EXCEPTION_CONTINUE_SEARCH             ; let others have a go
    ret
handlexcept endp
VEH is xp+ only but it's a beautiful thing, it gets exceptions before SEH and others do, and you can add as many different handlers as you like, but only 1 is really needed. When you do the EXCEPTION_CONTINUE_SEARCH it passes it on to the other handlers then onto SEH, etc... down the list.
just a thought, here - it may or may not offer a speed advantage
copy the buffer contents into a "safe" buffer that is known to have adequate tail-end space for the over-shoot
i know it takes time to copy, but at least you wouldn't have to test inside the loop
Quote from: jj2007 on July 15, 2010, 06:33:42 AM
Dave,
Thanks for looking at that.
Re lingo's pop-the-return-address technique: Interrupts seem not to be a problem, although it is apparently nowhere documented.
Re point 2: You are perfectly right, there is a risk at the end of a VirtualAlloc buffer. Any suggestions? The routine is already a bit slow ::)
As far as the pop two args from the stack, I remember Lingo doing:
pop ecx ; Get return.
pop eax ; Get first.
pop ebx ; Get second.
push eax ; Save relocated return.
You now have a protected return and two unprotected args but in eax and ebx. This will work, but don't count on the unprotected args on the stack. Another trick was
mov eax,[esp+4] ; Get arg
mov [esp+4],esi ; Save esi over the arg.
This will also work.
If you can wait for just a bit, I am working on an entire set of string routines that are safe and (mostly) SSE. Right now I have towlower towupper wcslwr wcsupr wcslwr_s wcsupr_s, and am working on wcscpy (a modification of WordAlign from my zip here), then wcslen then wcschr then wcscmp, then wcsstr, then the wcsn.... Then I'll work on the normal string versions. These are all for my own use, but I'll publish in a source zip for others to blatantly steal (right, Lingo, isn't that what they do to yours?).
I have a question about what to do with error returns such as the crt__ functions return. I was thinking about returning error codes in edx and the normal return in eax. The functions would end with an "or edx,edx" so that the caller could just "jz Good" or "jnz Bad". Since these are not CDECL, the flags would not be destroyed by INVOKE's add esp,n.
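Something like this is what I have in mind - just a sketch of the convention, with hypothetical names:

; at the end of each routine (edx = 0 on success, an error code otherwise)
    or      edx, edx                ; set the flags from the error code
    ret                             ; stdcall ret n does not touch the flags

; so the caller can branch directly after the call
    invoke  KRBwcscpy, ADDR dest, ADDR source
    jnz     BadCall                 ; edx <> 0 - an error occurred
    ; eax holds the normal return value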
I have even more questions about some of the crt_ comments in \crt\src like "the return string can be shorter or longer than the input string". Maybe for MBCS, but for Unicode?
The following are some of my times:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
558 cycles for wInstr (MasmBasic)
16444 cycles for StrStrIW
37 cycles for crt_towlower
10 cycles for KRBtowlower
1002 cycles for crt__wcslwr
723 cycles for KRBwcslwr
383 cycles for KRBwcslwr2
32 cycles for crt_towupper
10 cycles for KRBtowupper
576 cycles for crt__wcsupr
829 cycles for KRBwcsupr
411 cycles for KRBwcsupr2
--- done ---
Dave.
Quote from: KeepingRealBusy on July 15, 2010, 09:59:01 PM
As far as the pop two args from the stack, I remember Lingo doing:
pop ecx ; Get return.
pop eax ; Get first.
pop ebx ; Get second.
push eax ; Save relocated return.
Dave,
You probably meant push ecx, not eax.
However, is it really needed?
Q. Why does MS-DOS switch stacks for hardware interrupts? (http://support.microsoft.com/kb/82774)
QuoteAPPLIES TO Microsoft Windows 3.1 Standard Edition
http://en.wikipedia.org/wiki/Task_State_Segment#Inner_Level_Stack_Pointers
QuoteThe TSS contains 6 fields for specifying the new stack pointer when a privilege level change happens. The field SS0 contains the stack segment selector for CPL=0, and the field ESP0/RSP0 contains the new ESP/RSP value for CPL=0. When an interrupt happens in protected (32-bit) mode, the x86 CPU will look in the TSS for SS0 and ESP0 and load their values into SS and ESP respectively. This allows for the kernel to use a different stack than the user program, and also have this stack be unique for each user program.
http://stackoverflow.com/questions/866672/switching-stacks-in-c
QuoteOn 16-bit DOS, an interrupt could occur and this interrupt would be initially running on the same stack. If you got interrupted in the middle of the operation, the interrupt could crash because you only updated ss and not sp.
On Windows, and any other modern environment, each user mode thread gets its own stack. If your thread is interrupted for whatever reason, its stack and context are safely preserved.
JJ,
But how do you get into ring 0 from your program, and how does the system know how to get back to you? The CPU needs to save your return information somewhere, and that somewhere is your current stack; THEN it can swap the stacks and insure that the interrupt stack is enough for the processing.
Anyone else, am I wrong here?
Dave.
Yes.
Quote from: E^cube on July 16, 2010, 12:54:28 AM
Yes.
In what way? I mean, how does the hardware change the stack on the fly without destroying any registers?
Dave.
it doesn't, I just really felt like saying yes :) I apologize
Apology accepted, but not necessary.
Any experts around that understand and can explain a privilege level switch?
Intel manuals, especially volume 3a chapter 6.3 "task switching"
Quote from: sinsi on July 16, 2010, 02:23:55 AM
Intel manuals, especially volume 3a chapter 6.3 "task switching"
sinsi,
Thank you. I knew that someday I would have to go through all of this. About 40 pages of documentation and diagrams later (AMD PDFs), I can safely say that anything we are doing here will not be affected by a task switch. The first thing that happens is that the stack pointer is saved in the TSS (system) and loaded with an appropriate new stack pointer, then the flags and eip are pushed onto the NEW frame, then all regs are saved in the TSS. An opposite set of actions causes the task to be restarted.
Only something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.
So, JJ, your code is safe, and yes, I meant "push ecx", and I would use this instead of leaving the return address unprotected. With some of the MASM32 macros, I would not trust that some invocation wouldn't push a register for a calculation or a call and wipe out an unprotected return address ("print" comes to mind).
Dave.
Dave...
this topic has been beat to death a few times
it seems the members are split (50-50 ?) on this issue
some say it is ok to use space under [ESP] - some say it is not
the best we seem to do is - we agree to disagree :P
out of old-school habit, i avoid using stack space under the stack pointer
those who argue it is ok say that windows protects that space, as interrupts, other threads, etc, are never allowed to access it
you'll have to decide for yourself :bg
I've used parameters as storage before with no problems
myproc:
xchg ebx,[esp+4]
xchg esi,[esp+8]
...
pop ecx
pop ebx
pop esi
jmp ecx
I figure that if you reserve space (sub esp,xxx) it's yours but pushing params ([esp+x]) means they are fair game.
In the same way, anything below esp ([esp-x]) is undefined and likely to get zapped at some stage (is that what you mean by 'under [esp]' dedndave?), especially using a proc with a stack frame or simply forgetting what you did 50 lines ago :bdg
It's all personal, that's why we have the freedom of asm and not the constraints of a hll.
Quote from: KeepingRealBusy on July 16, 2010, 04:05:51 AMOnly something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.
So, JJ, your code is safe, and yes, I meant "push ecx", and I would use this instead of leaving the return address unprotected. With some of the MASM32 macros, I would not trust that some invocation wouldn't push a register for a calculation or a call and wipe out a unprotected return address ("print" comes to mind).
Dave.
Dave, thanks for reading this up in the "official" manuals. My Wiki quote on TSS said something similar, but Intel is a more reliable source.
So it boils down to "yes, you can do it but make sure you know what you are doing in that proc". And, for example, print obviously pushes parameters.
QuoteOnly something you do in your task (push, mov [esp+n],DataOrReg, etc) would wipe out an unprotected stack location.
hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?
Quote from: sinsi
I've used parameters as storage before with no problems
myproc:
xchg ebx,[esp+4]
xchg esi,[esp+8]
I would worry about the speed of XCHG on memory; it is atomic and exposes the speed of the underlying memory (DRAM).
On my 3 GHz Prescott, "xchg ebx,[ebp+8]" takes ~100 machine cycles, or 33 ns; the memory access speed is ~17 ns.
I know xchg is slow but how does a push/mov compare?
I also used esp, not ebp, would that make a difference?
It's all voodoo anyway eh?
Quote from: sinsi
I know xchg is slow but how does a push/mov compare?
They would go via the write buffer and cache. PUSH/POP pairs: figure 6 cycles. MOV EAX,[EBP+x] / XCHG EAX,EBX / MOV [EBP+x],EAX: also around 6 cycles (P4 Prescott) in some synthetic testing.
Quote
I also used esp, not ebp, would that make a difference?
No, XCHG reg,mem is intrinsically locked; ESP, EBP, etc. all perform the same.
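For reference, a sketch of the three variants being compared (frame-pointer addressing chosen just for illustration):
xchg ebx, [ebp+8]        ; one instruction, but implicitly LOCKed RMW - exposes DRAM latency
push ebx                 ; push/pop pair - no lock, goes through the write buffer and cache
mov ebx, [ebp+8]
pop dword ptr [ebp+8]
mov eax, [ebp+8]         ; mov/xchg/mov - also no lock, eax used as scratch
xchg eax, ebx            ; the reg,reg form of xchg never locks the bus
mov [ebp+8], eax
All three leave the old ebx in the stack slot and the old slot contents in ebx; only the first one pays the locked-RMW price.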
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:
182 cycles, (xchg reg,reg)*100
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100
183 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
Quote from: MichaelW
1919 cycles, (xchg reg,mem)*100
1908 cycles, (xchg mem,reg)*100
How fast is the P3 running?
I'll note that the encoding for both is XCHG mem,reg
00000000 87 45 08 xchg eax,[ebp+8]
00000003 87 45 08 xchg [ebp+8],eax
00000000 874508 xchg [ebp+8],eax
00000003 874508 xchg [ebp+8],eax
Quote
How fast is the P3 running?
If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.
Quote
I'll note that the encoding for both is XCHG mem,reg
I did it both ways to see if there would be any significant difference in the cycle counts. On my P3 there wasn't, the difference in the results is within the run-to-run variation that is typical for cycle counts in the thousands.
I have not bothered to benchmark the following test piece, but from memory, within an algorithm XCHG was usually slow and could be replaced by MOV with a faster result. The 3 tests are mem-mem, reg-mem, and reg-reg, with the 1st being the slowest and the last the fastest. I have mainly seen this operation in exchange sorts (pointers or values), and usually XCHG is off the pace.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.data?
value dd ?
.data
item dd 0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL var1 :DWORD
LOCAL var2 :DWORD
push esi
push edi
; ---------
; mem - mem
; ---------
mov var1, 1234
mov var2, 5678
mov eax, var1
mov ecx, var2
mov var1, ecx
mov var2, eax
print str$(var1),13,10
print str$(var2),13,10
; ---------
; reg - mem
; ---------
mov esi, 1234
mov var1, 5678
mov eax, var1
mov var1, esi
mov esi, eax
print str$(esi),13,10
print str$(var1),13,10
; ---------
; reg - reg
; ---------
mov esi, 1234
mov edi, 5678
mov edx, esi
mov esi, edi
mov edi, edx
print str$(esi),13,10
print str$(edi),13,10
pop edi
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Quote from: dedndave on July 16, 2010, 04:42:09 AM
out of old-school habit, i avoid using stack space under the stack pointer
Are we sure that no debuggers trash the area under the stack?
I remember at one time back in the 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.
Quote from: MichaelW on July 16, 2010, 12:05:31 PM
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe running on a P3:
Prescott P4:
146 cycles, (xchg reg,reg)*100
9247 cycles, (xchg reg,mem)*100
9277 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
306 cycles, (exchange reg,mem)*100 using mov
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]
The latter are intermediate cases using the stack:
push edx          ; save old edx
mov edx, [ebx]    ; edx = old [ebx]
pop [ebx]         ; [ebx] = old edx
...
push [ebx]        ; save old [ebx]
mov [ebx], edx    ; [ebx] = edx
pop edx           ; edx = old [ebx]
Slower than exchange reg,mem using mov but a lot faster than xchg.
Quote from: Rockoon on July 16, 2010, 03:22:58 PM
Are we sure that no debuggers trash the area under the stack?
I remember at one time back in the 16-bit days that you absolutely had to add some extra stack space in order to accommodate debuggers, otherwise the debugger would happily start overwriting your code or data segment when stepping through your deepest function nesting.
In the 16-bit RM days hardware interrupts would use whatever stack was active when the interrupt occurred.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
.data?
value dd ?
.data
item dd 0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
call main
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL var1 :DWORD
LOCAL var2 :DWORD
push esi
push edi
invoke Sleep, 4000
; ---------
; mem - mem
; ---------
mov var1, 1234
mov var2, 5678
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov eax, var1
mov ecx, var2
mov var1, ecx
mov var2, eax
ENDM
counter_end
print str$(eax)," cycles, mem - mem",13,10
;print str$(var1),13,10
;print str$(var2),13,10
; ---------
; reg - mem
; ---------
mov esi, 1234
mov var1, 5678
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov eax, var1
mov var1, esi
mov esi, eax
ENDM
counter_end
print str$(eax)," cycles, reg - mem",13,10
;print str$(esi),13,10
;print str$(var1),13,10
; ---------
; reg - reg
; ---------
mov esi, 1234
mov edi, 5678
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 8
mov edx, esi
mov esi, edi
mov edi, edx
ENDM
counter_end
print str$(eax)," cycles, reg - reg",13,10
;print str$(esi),13,10
;print str$(edi),13,10
pop edi
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Running on a P3:
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
35 cycles, mem - mem
19 cycles, reg - mem
8 cycles, reg - reg
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on an old Athlon 1.3 GHz:
147 cycles, (xchg reg,reg)*100
1630 cycles, (xchg reg,mem)*100
1631 cycles, (xchg mem,reg)*100
148 cycles, (exchange reg,reg)*100 using mov
270 cycles, (exchange reg,mem)*100 using mov
406 cycles, (exchange reg,mem)*100 using pop [ebx]
406 cycles, (exchange reg,mem)*100 using push [ebx]
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a Core2Duo 2.8 Ghz:
219 cycles, (xchg reg,reg)*100
1842 cycles, (xchg reg,mem)*100
1835 cycles, (xchg mem,reg)*100
184 cycles, (exchange reg,reg)*100 using mov
299 cycles, (exchange reg,mem)*100 using mov
507 cycles, (exchange reg,mem)*100 using pop [ebx]
507 cycles, (exchange reg,mem)*100 using push [ebx]
Results for \masm32\examples\exampl10\timer_demos\xchg\xchg_test.exe (plus jj2007 additions) running on a P4 2.8 Ghz:
146 cycles, (xchg reg,reg)*100
9271 cycles, (xchg reg,mem)*100
9158 cycles, (xchg mem,reg)*100
146 cycles, (exchange reg,reg)*100 using mov
312 cycles, (exchange reg,mem)*100 using mov
1005 cycles, (exchange reg,mem)*100 using pop [ebx]
497 cycles, (exchange reg,mem)*100 using push [ebx]
Why would xchg mem,reg be so much more costly on a P4?
Queue
that has always been that way - even on the 8088
i am a little surprised to see the xchg reg,reg comparison, though
for a long time, i have used XCHG EAX,reg32 (AX,reg16 in DOS) because it is a single byte op-code
still - it doesn't compare too badly against MOV
i see the test uses XCHG EDX,ECX - a 2-byte instruction
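for anyone curious, the encodings in question (bytes hand-assembled from the opcode maps; an assembler's listing may pick the other ModRM form for the two-byte cases):
91        xchg eax,ecx   ; accumulator form, one byte (90h + reg)
87 CA     xchg edx,ecx   ; general form, 87 /r, two bytes
8B D1     mov  edx,ecx   ; for comparison, MOV reg,reg is also two bytes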
Quote from: MichaelW
Quote
How fast is the P3 running?
If you mean the clock speed, it's 500MHz. If you mean subjectively, it's plenty fast for what I do.
Trying to quantify the memory speed. The number of cycles relates to one SDRAM READ followed by a WRITE, occurring back-to-back at the same address across the entire bit-line width of the memory subsystem. In your case that's about 19 ns for the READ and 19 ns for the WRITE, say 52 MHz.
Quote from: Queue
Why would xchg mem,reg be so much more costly on a P4?
As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW), and a serializing event. Therefore the processor must complete/retire all pending operations (OOO, pipeline), entirely flush the write buffers (at whatever depth it has) in the CPU, flush out everything pending/deferred to memory in the chipset, and then complete an indivisible READ (setting up addresses, with CAS/RAS latencies) followed by a WRITE. This is pretty much the worst case for synchronous memories (SDRAM, DDRAM, RAMBUS, etc), exposing the nasty CL (CAS Latency) numbers printed on the DIMMs.
In order to allow the processor to speed along, most everything sent to memory is buffered/deferred/delayed to write back in a lazy manner, prioritizing prefetching/cache-line reads so as not to stall forward motion of the processor.
It's not so much a cycles issue as a time issue.
Which P4 core is that? A Northwood?
Celeron M timings:
165 cycles, (xchg reg,reg)*100
1910 cycles, (xchg reg,mem)*100
1910 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
310 cycles, (exchange reg,mem)*100 using mov
495 cycles, (exchange reg,mem)*100 using pop [ebx]
495 cycles, (exchange reg,mem)*100 using push [ebx]
Note the symmetry of the last two, in contrast to the Prescott P4:
1078 cycles, (exchange reg,mem)*100 using pop [ebx]
460 cycles, (exchange reg,mem)*100 using push [ebx]
Quote from: clive on July 16, 2010, 07:44:28 PM
Trying to quantify the memory speed.
PC133 SDRAM and IIRC I set it up to use the fastest supported timings.
dedndave,
Quote
hang on - is that a quote from the intel manual ?
and - if so - the OS could possibly alter that, no ?
No, this is not a quote, but an observation from the documentation. If you are at user task level, anything you do to get back to the OS must cause a task switch, and this automatically saves your stack pointer (and selector) in the TSS, then loads the stack pointer and selector with appropriate values depending on the reason for the switch (fault, interrupt, call), then starts saving your IP on the NEW (system) stack. Anything that happens to YOUR stack must happen at user task level, i.e., push, pop, mov and call (to your local procedure). With multiple threads, I believe, each thread has its own stack.
Could the OS possibly alter that? The OS is capable of putting anything anywhere in memory once it gets control. With a single core, a task switch must happen for the OS to get control, but with multi-core the other core may be running the OS. Yes, it could change something while you are running. Are we talking virus conditions here? If so, anything could happen; otherwise, I doubt it will. To quote "Pogo": "We have met the enemy, and he is us."
Watch where you step, it gets pretty deep in some places.
Dave.
Quote from: clive on July 16, 2010, 07:44:28 PM
As indicated above, it exposes the speed of the memory subsystem. It is an atomic event (i.e. RMW), and a serializing event. ...
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?
Dave.
as i mentioned, i am from the old-school side of the fence on this issue
so far, i have seen no harm come from using the stack that way
but, it seems to me that leaving the barn door open doesn't mean the horses are going to leave :P
i feel more comfortable by adjusting the stack pointer
and, let's face it - it doesn't cost that much in terms of code size or clock cycles
I totally agree.
Dave.
Hi,
Quote
Thank you for reminding me about the atomic nature of xchg. What other normal user instructions fall in this class?
Those that can be used with the LOCK prefix come to mind.
And those are RMW instructions. An old reference mentions
the following.
BT, BTC, BTR, BTS mem, reg/imm
XCHG reg, mem
ADD, ADC, AND, OR, SBB, SUB, XOR mem, reg/imm
DEC, INC, NEG, NOT mem
Regards,
Steve N.
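Worth stressing, with a small sketch ([ebx] pointing at some DWORD variable): the instructions in that list only lock the bus when the LOCK prefix is actually written, XCHG with a memory operand being the one exception:
inc dword ptr [ebx]         ; plain RMW - not atomic across CPUs, no lock penalty
lock inc dword ptr [ebx]    ; explicitly locked - atomic, and much slower
xchg eax, [ebx]             ; the exception: locked even without a prefix
xchg eax, ecx               ; register-only forms never lock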
Quote from: FORTRANS on July 17, 2010, 01:57:34 PM
Those that can be used with the LOCK prefix come to mind. And those are RMW instructions. ...
Thank you, thank you, thank you. I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"
Dave.
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"
You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink
Quote from: jj2007 on July 17, 2010, 04:27:27 PM
Quote from: KeepingRealBusy on July 17, 2010, 04:08:59 PM
I'm going to print this list out and post it right in front of my workstation with the caption "Don't even think about it!"
You might find it, ehm, challenging to code without AND, OR, SUB, XOR, DEC, INC... :wink
JJ,
I think the restriction is on the mem, reg/imm forms; reg,reg should be ok. Another thing I have to read up on in the specs.
Dave.
Here is a snippet:
counter_begin 1000, HIGH_PRIORITY_CLASS
lea ebx, mem
REPEAT 100
inc dword ptr [ebx]
ENDM
counter_end
print ustr$(eax)," cycles, (inc mem)*100",13,10
... and various results:
170 cycles, (xchg reg,reg)*100
1909 cycles, (xchg reg,mem)*100
1909 cycles, (xchg mem,reg)*100
165 cycles, (exchange reg,reg)*100 using mov
307 cycles, (exchange reg,mem)*100 using mov
494 cycles, (exchange reg,mem)*100 using pop [ebx]
499 cycles, (exchange reg,mem)*100 using push [ebx]
594 cycles, (and mem)*100
594 cycles, (or mem)*100
594 cycles, (inc mem)*100
594 cycles, (inc dec mem)*100
594 cycles, (inc mem)*100
xchg seems to be the worst case.
I think the ability of an instruction to take a lock prefix is not the problem; it's the presence of the prefix, or, for XCHG mem, reg, the implicit lock.
Per the Intel manual:
"If a memory operand is referenced, the processor's locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL."
Quote from: jj2007 on July 17, 2010, 04:49:50 PM
Here is a snippet: ... xchg seems to be the worst case.
JJ, could you post the .zip, I'll try on my AMD. Dave
;==============================================================================
include \masm32\include\masm32rt.inc
.586
include \masm32\macros\timers.asm
;==============================================================================
.data
mem dd 0
.code
;==============================================================================
start:
;==============================================================================
invoke Sleep, 3000
REPEAT 3
counter_begin 1000, HIGH_PRIORITY_CLASS
lea ebx, mem
REPEAT 100
inc dword ptr [ebx]
ENDM
counter_end
print ustr$(eax)," cycles, (inc mem)*100",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
lea ebx, mem
REPEAT 100
lock inc dword ptr [ebx]
ENDM
counter_end
print ustr$(eax)," cycles, (lock inc mem)*100",13,10
ENDM
inkey "Press any key to exit..."
exit
;==============================================================================
end start
627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
638 cycles, (inc mem)*100
2246 cycles, (lock inc mem)*100
627 cycles, (inc mem)*100
2239 cycles, (lock inc mem)*100
Quote from: KeepingRealBusy on July 17, 2010, 05:23:09 PM
JJ, could you post the .zip, I'll try on my AMD. Dave
Here it is.
JJ,
Here are my timings. I added my cpuid for identification - (why else would I add it?):
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
144 cycles, (xchg reg,reg)*100
1853 cycles, (xchg reg,mem)*100
1819 cycles, (xchg mem,reg)*100
149 cycles, (exchange reg,reg)*100 using mov
506 cycles, (exchange reg,mem)*100 using mov
553 cycles, (exchange reg,mem)*100 using pop [ebx]
552 cycles, (exchange reg,mem)*100 using push [ebx]
793 cycles, (and mem)*100
777 cycles, (or mem)*100
819 cycles, (inc mem)*100
792 cycles, (inc mem)*100 using eax
808 cycles, (inc dec mem)*100
687 cycles, (inc mem)*100
I tried to add a call to crt__wcslwr_s to my timings. First, it would not assemble. I then added the declaration to msvcrt.inc (a copy of the crt__wcslwr line with _s added); that fails to link. The source in VS at \vc\crt\src shows both of these functions defined in the same source module; both actually call a common subfunction, crt__wcslwr passing -1 for the size parameter and crt__wcslwr_s passing the supplied size.
Why does crt__wcslwr_s not exist?
Dave.
if you look inside masm32\include\msvcrt.inc...
externdef _imp___wcslwr:PTR c_msvcrt
crt__wcslwr equ <_imp___wcslwr>
try adding this to the beginning of your program
externdef _imp___wcslwr_s:PTR c_msvcrt
crt__wcslwr_s equ <_imp___wcslwr_s>
I did this, and got around the assembly error, but right into the linker error.
It must be present in \masm32\lib\msvcrt.lib, too (and it's not there).
Add it to \masm32\tools\makecimp\msvcrt.txt...
JJ,
Thank you. I knew that someone here would have the answer.
Dave
oh i see - he's just making up names :lol
It seems Hutch uses \masm32\tools\makecimp\makecimp.exe to modify the crt library.
I think this is all correct:
;==============================================================================
include \masm32\include\masm32rt.inc
.586
include \masm32\macros\timers.asm
;==============================================================================
.data
hmsvcr80 dd 0
ws dw "A","B","C",0
.code
;==============================================================================
start:
;==============================================================================
invoke LoadLibrary, chr$("msvcr80.dll")
mov hmsvcr80, eax
print hex$(eax),13,10
invoke GetProcAddress, hmsvcr80, chr$("_wcslwr_s")
mov esi, eax
print hex$(eax),13,10
invoke crt_printf,cfm$("%S\n"), ADDR ws
push SIZEOF ws
push OFFSET ws
call esi
add esp, 8
invoke crt_printf,cfm$("%S\n"), ADDR ws
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
invoke crt__wcslwr, ADDR ws
counter_end
print str$(eax)," cycles",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
push SIZEOF ws
push OFFSET ws
call esi
add esp, 8
counter_end
print str$(eax)," cycles",13,10
invoke FreeLibrary, hmsvcr80
print str$(eax),13,10,13,10
inkey "Press any key to exit..."
exit
;==============================================================================
end start
78130000
7817FCAB
ABC
abc
101 cycles
181 cycles
1
It should be no big deal to create an import library for msvcr80.dll.
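One way (a sketch, assuming you first write a module definition file listing the exports; the name msvcr80.def below is hypothetical) is to let Microsoft's LIB tool build the import library:
lib /def:msvcr80.def /out:msvcr80.lib /machine:x86
After that, an includelib msvcr80.lib plus the externdef/equ lines dedndave showed should resolve at link time.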
I don't know whether I blew it or not. I had tried to rebuild the library using "make" in \m32lib. Two modules had 2 errors, fptoa.asm and fptoa2.asm. I fixed the first error in each module from
fbstp [esp]
to
fbstp TBYTE PTR [esp]
That cleared up the first error in each, but the second error remains. A2006 undefined symbol PowerOf10. There is a proto for PowerOf10 in these modules, but the error remains.
Any good thoughts? Bad thoughts?
Dave.
OBTW, the make file has an error check for the assembly step, but apparently MASM does not set the error code for assembly errors when using a response file. I think I remember reporting this to MS long ago, but will have to dig to see what their response was.
Dave.
JJ,
> It seems Hutch uses \masm32\tools\makecimp\makecimp.exe to modify the crt library.
No, the tool just makes an import library that avoids naming conflicts with MASM reserved words. The content is determined by MSVCRT.DLL.
Now, with the function that KeepingRealBusy wants, the idea is to test if it's there first; you do this with the normal LoadLibrary() and GetProcAddress() and see if you can call it that way. If so, then you simply add the name to the import list and build the import library.
Well, I modified the crt library with makecimp.exe. No assembly or link problems now. But when I try to execute the .exe, I get "The procedure entry point _wcslwr_s could not be located in the dynamic link library msvcrt.dll."
Now what?
Dave.
Dave,
MichaelW found the reason: This is msvcr80.dll...
Check to see if the function you are after is in the standard MSVCRT.DLL
The attachment includes an import library and include file for msvcr80.dll and a small test app. Starting with the full module definition file (see msvcr80_full.def), I removed a number of exports that did not appear to me to be usable/useful. That left 1349 exports, versus 730 for msvcrt.lib. Note that I modified the generated include file substituting "cr8_" for "crt_" to avoid conflicts.
There is a msvcr80.dll version 8.0.50727.1433 on my Windows 2000 system. The file is dated Wednesday, October 24, 2007 so I have no idea how it got there, or if it will be present on other systems.
Hutch, Michael,
I have the zip, will try it.
Dave.
After successful build, test.exe gives me a Visual C++ Runtime Library error:
R6034
An app has made an attempt to load the C runtime library incorrectly.
There is an R6034 and that same error message in the DLL. The app runs with no problem under Windows 2000. How did you build the EXE, and what happens with my previous example where I used run-time dynamic linking?
I think the solution is here (http://msdn.microsoft.com/en-us/library/ms235560.aspx), but I have no way to test it short of installing Windows XP on a spare system.
Both versions show the same error. The msvcr80.dll sits in the same folder and is version 8.0.50727.42 of 23 sept 2005
msvcr80.dll is the C run-time library for Visual C++ 2005. It is a "side-by-side" DLL. Your executable should have a manifest to use it.
If you ask me, Microsoft really screwed things up when they went this route. It's the same situation in VC 2008 and VC 2010.
Including a manifest eliminated the R6034 message, but now I get:
"The application failed to initialize properly (0xc0000142). Click..."
0xc0000142 is STATUS_DLL_INIT_FAILED
Any ideas?
You need to run mainCRTStartup? ... I don't know.
I ran your program (http://www.masm32.com/board/index.php?topic=14353.msg115460#msg115460), MichaelW, and I was surprised when it ran with no errors for me. I either use MSVCRT (usually) or the MSVCRxxx DLLs; I have never tried mixing them, as I figured it would just be too problematic.
I know this is an interesting process, and I am interested in its conclusion. At least I (we) will know how to do this in the future.
BUT, I was just trying to drain the swamp! I do not really need crt__wcslwr_s. I was just trying to time it for comparison to my functions.
Dave.
I think the problem may be with the manifest that I am using, but since I don't have VS I have no good idea what the manifest should contain for this specific purpose, and I just don't have enough interest to delve very far into Microsoft's ridiculous manifest thing.
The manifest needs to contain the following:
<dependency>
<dependentAssembly>
<assemblyIdentity type='win32' name='Microsoft.VC80.CRT' version='8.0.50608.0' processorArchitecture='x86' publicKeyToken='1fc8b3b9a1e18e3b' />
</dependentAssembly>
</dependency>
The version must match your DLL version exactly and the publicKeyToken changes with the version.
I agree, it is ridiculous.
So basically it looks like, for Windows XP and later, to use the DLL you must have VS.
MichaelW,
Not necessarily, you just need the correct entry in the manifest.
[Edit] And having VC++ makes things a lot easier. :wink
Useless DLL. I dread even having to link to msvcrt.dll occasionally to cater to C code. We just need equivalent ASM functions for all the C ones to save the hassle. Unfortunately, while MASM32 and other ASM SDKs are great, they still lack a lot of formatting functions.
Cube,
The trick is to use MSVCRT, not the later side-by-side DLLs. MSVCRT has been a standard "known" DLL since Win9x, whereas the side-by-side versions are all over the place on later versions.
E^cube,
I really disagree. msvcrt.dll is a standard system DLL and has a ton of useful and time-tested functions. If they meet your needs, there is no reason to not use them. Plus, if you have ever programmed C, you are already familiar with them. I would be more inclined to use a CRT function than to use some function Joe Blow came up with.
Greg,
I disagree with your disagreement. msvcrt is fine in the C/C++ world, but we're part of the ASM world, gentlemen, a better, more efficient world where mediocrity is left at the door. Sure, if you just want sub-par functions you can use msvcrt, but myself, I'd rather use a blazing fast hand-written ASM equivalent. Why? Well, partly because I enjoy the speed, but also because it's green technology: faster speed means fewer clock cycles, which means more energy conserved.
..and how much energy are you burning while reinventing the wheel?
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
What!!!!!! Who invented that????
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
Far less than if I were doing it in a bloated C/C++ IDE... The fact of the matter is there are a billion C/C++ forums and sites; if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners where ASM still flourishes.
the msvcrt does some under-the-hood stuff that is difficult to emulate or replace
partly because it has been around long enough to have the bugs worked out
partly because ms may have used proprietary knowledge in some of the code
most of the functions i have played with seem to compare well, performance wise, against anything i can write
all in all, it's fast and well-behaved - and.... it's already written !!! :P
i suppose, now that most CPUs support SSE2, it could be improved upon for some time-intensive functions
but - i think most of that has been (or is being) hashed over in the forum
E^cube,
I love programming with MASM too, but that's no reason to exclude other languages. I like to program in C, (Power)BASIC and, shudder, oh my God, C# and even PowerShell. The idea that since you're an ASM programmer you must exclude all other programming languages just doesn't sit well with me at all. The CRT functions are far from mediocre; they are usually slower than ASM-written functions, but that's because they do a lot of error checking etc. that the ASM functions don't do. Think about it, these functions have been beaten to death and tested over the years. They are very reliable and very stable functions. It has been very infrequent that I have needed the speed of a hand-tuned assembly procedure, but when I do, I can do it. ASM and C just go together; most C compilers include an assembler. You have every right to use only ASM and no CRT functions, knock yourself out, but don't push your limitations on everyone else.
Quote from: Greg Lyon on July 22, 2010, 04:26:11 AM
E^cube,
I love programming with MASM too, but that's no reason to exclude other languages. ... You have every right to use only ASM and no CRT functions, knock yourself out, but don't push your limitations on everyone else.
i'm not pushing "my limitations" on anyone, i'm simply voicing my opinion, as you are. And the fact is, you and I are a completely different breed: myself, i'm an ASM warrior to the core, unrelenting dedication and devotion to the language I love. You, as stated, are not; you're more of a neutral programmer. The problem is this is not a time of neutrality, this is a time of war, language war. Microsoft has already fired the first shot by horrifically crippling masm64, and by completely removing inline asm support in visual studio x64. In response the community has fired back with projects like jwasm, fasm and goasm, which all work great on x64. So you enjoy hopping around to different languages, that's your prerogative, but don't push your nonchalant attitude on those who just want to code in ASM, on an asm forum...
Quote from: E^cube on July 22, 2010, 04:04:40 AM
Quote from: Rockoon on July 22, 2010, 04:00:06 AM
..and how much energy are you burning while reinventing the wheel?
Far less than if I were doing it in a bloated C/C++ IDE... The fact of the matter is there are a billion C/C++ forums and sites; if you wanna be a fanboy that's great, but don't knock ASM in the process, especially in one of the few corners where ASM still flourishes.
sigh...
why so defensive all of a sudden?
...was it because you don't really write in asm to be "green"?
I mean, it's a good "point"... but it's not why you write in ASM. Really. It's not. You know it. I know it.
Quote from: E^cube on July 22, 2010, 04:51:16 AM
... In response the community has fired back with projects like jwasm, fasm and goasm, which all work great on x64. ...
Come on now, fasm, and goasm were not started for the reasons you are declaring. In fact, they were started before those reasons could even have existed.
Quote from: Rockoon on July 22, 2010, 06:44:42 AM
sigh...
why so defensive all of a sudden?
...was it because you dont really write in asm to be "green?"
I mean, its a good "point" .. but its not why you write in ASM. Really. Its not. You know it. I know it.
Actually, in part it is: i'm a big fan of the green - money, green tea, various plants... and yes, green tech. i'm defensive because people are on the offensive.
Also, Jeremy is a smart guy; he recognized the ASM community had a large void that he could fill with Goasm, and chose to do so. That may not be his primary reason, but he took the time to write all that documentation etc., and spent years of his life on it, all for the public, so clearly he cares. Unlike most C/C++ GPL projects, where the authors are disrespectful and rude, looking only for self-promotion and most of the time donations, Jeremy on the other hand has asked for nothing, and IMO deserves everything. RadASM and EasyASM are similar. These are prime examples of the power of ASM, the power of devoted developers and users who don't give up when faced with difficult engineering challenges. These are the kind of people I respect and appreciate. :thumbu
Folks,
Can we avoid the "assembler wars" here, different people have their reasons for using different tools, I simply respect their choice and don't inflict this stuff on them.
;-)
Just in case somebody is still interested in the original topic: I have made the Recall macro "SSE2 safe".
Quote
include \masm32\MasmBasic\MasmBasic.inc ; Download (http://www.masm32.com/board/index.php?topic=12460.0)
Init
Recall "\masm32\include\windows.inc", MyArray$(), -1, lc
Print Str$("%i lines found in Windows.inc\n", lc)
For_ n=0 To Min(lc-1, 15)
mov ecx, n ; we need some proof that
lea ecx, [2*ecx+27] ; this is assembler ;-)
Print Str$("\nLine %i\t", n+1), Left$(MyArray$(n), ecx)
Next
Exit
end start
Output:
22272 lines found in Windows.inc
Line 1 comment * -=-=-=-=-=-=-=-=-
Line 2
Line 3 WINDOWS.INC for 32 bit MA
Line 4
Line 5 This version is compatible wi
Line 6
Line 7 Project WINDOWS.INC at the Masm F
Line 8
Line 9 http://www.masm32.com/board/index.php
Line 10
Line 11 WINDOWS.INC is copyright software licence
Line 12 MASM32 project. It is available completely
Line 13 for any person to use for purposes including
Line 14 commercial software but the file must not be in
Line 15 commercial package and the file may not be redist
Line 16 without express permission from the MASM32 project.
> This version is compatible with ML.EXE Version 8.0
Hutch,
It's compatible with 6.14, 6.15, 9.0 and JWasm, too.
JJ,
> It's compatible with 6.14, 6.15, 9.0 and JWasm, too.
Do you think you can expand on this? I use all versions of ML.EXE from 6.14 to 10.0 and don't have any problems with the current form of Windows.inc. Noting that I don't get paid for maintaining this file, perhaps you could share with me what the problem is.
> Noting that I don't get paid for maintaining this file, perhaps you could share with me what the problem is.
Cool down young friend :bg
I just wanted to encourage you to mention that it is compatible not only with 8.0. It's kind of a compliment, Sir Hutch :U
:bg
After midnight readings, :U
E^cube,
You should probably avoid the Windows API also, as it was written mostly in C.
Quote from: Greg Lyon on July 22, 2010, 05:28:11 PM
E^cube,
You should probably avoid the Windows API also, as it was written mostly in C.
Hutch has spoken, that means you don't, unless it's about the thread topic. Thanks
Quote
After successful build, test.exe gives me a Visual C++ Runtime Library error:
R6034
An app has made an attempt to load the C runtime library incorrectly.
After 2 weeks nobody is interested in one version of MSVCR80.DLL, so... :P