repne scasb

BlackVortex · October 15, 2009, 11:31:08 AM

I'm making an algorithm to search for some bytes in an unknown size chunk of memory. I'm using "repne scasb" to find the first byte and then "repe cmpsb" to see if it's all the bytes I want, or else I continue from the next byte from where "repne scasb" left off

Reading Agner's optimization manuals it seems that these commands are't really fast. Is it worth the trouble to replace them with the usual instructions ? I mean, "repne scasb" does it all, the comparison, decrementing the ecx counter, incrementing the target "pointer" and repeats itself so no need for jumping, will I really save cycles ?

Will there be a noticeable speed improvement, or is the main bottleneck the"scanning" through memory ? On my PC it goes through 870mb memory in 1.3 seconds by the way :U

P.S.: It's easy to time it and see any speed differences, but isn't there an emulator or something that provides a profiling of a code snippet ?

dedndave · October 15, 2009, 11:44:38 AM

i was surprised to find that scanning with a loop was much faster
the trick is to always access memory as dwords on 4-aligned addresses
after you have found the "key" byte, using repz scasb might be good
(actually - if the string is 5 bytes or longer, a dword CMP or a single CMPSD might be good to eliminate boggies)
so, you want to load aligned dwords with MOV EAX,[EDI], then make the 4 tests for the key byte in register
it will make for bigger code, getting aligned and all that - but should be worth the effort
let us know how it turns out

dedndave · October 15, 2009, 12:16:13 PM

in my application, i wanted to scan an input BigNum integer for unused bytes (from the top down)
the original code used STD | REPZ SCASB | CLD - i scanned for 0 in unsigned or positives or FF in signed negatives
i found STD to be very slow, so i wanted to come up with something different
even without the STD problem, this code is still faster
i checked it over a wide range of lengths and found it to be ~3 times faster on a P4
as you can see, it handles up to 3 single bytes at the beginning and end, but aligned dwords otherwise
i know - lol - it looks clumsy - but it hauls ass
your application is a little different
you will want to get the dword into eax and compare the 4 bytes against the key
if the key is found, you have to adjust the address according to mod 4 position and compare the next 4 bytes
you might be able to improve it by comparing n bytes already in register

;reduce input length to ignore unused high-order bytes

;eax = sign XOR mask (0 or -1)
;ebx = 3
;ecx = input size in bytes
;edi = offset of last input byte + 1

scan00: test edi,ebx ;
jz scan01 ;

dec edi ;
cmp al,[edi] ;
jnz scan06 ;

dec ecx ;
jnz scan00 ;

jmp short scan07 ;

scan01: inc ebx ;

scan02: sub ecx,ebx ;
jb scan04 ;

sub edi,ebx ;
cmp eax,[edi] ;
jnz scan03 ;

jecxz scan07 ;

jmp scan02 ;

scan03: add edi,ebx ;

scan04: add ecx,ebx ;

scan05: dec edi ;
cmp al,[edi] ;
jnz scan06 ;

dec ecx ;
jnz scan05 ;

inc ecx ;

scan06: dec ecx ;

scan07: mov ebx,eax ;

i have since modified my code to remove the count in ecx
Japheth reminded me that wasn't necessary in another topic
this stuff i know - just somewhat forgotten - lol - old age, ya know

hutch-- · October 15, 2009, 12:50:20 PM

BlackVortex,

On most hardware, treat REP[E] ANYTHING like the plague, there are a couple of exceptions, REP MOVSD for block memory copy, a straight linear scanner using less registers generally is a lot faster. If you are doing binary search, try a boyer moore algo, if the binary pattern you are searching for is over about 6 bytes in length and you are scanning data over a couple of K, it really is a lot faster.

LATER: I just had a quick play on this quad that runs at about the same speed as your duo and the byte scanner is not that much faster than the time you mentioned, I get about 890ms just on the bare linear scan for 1 character. Unrolled it by 8 and it did not make any difference so that many memory reads seems to take a finite amount of time, no matter what. None the less its about 1.1 gig a second, how much faster do you need it ?

jj2007 · October 15, 2009, 01:26:28 PM

I can lend you the MasmBasic Len() algo. A very fast byte scanner, but you have to replace the pxor xmm0, xmm0 in line 3 with the byte you want to find. Apart from that, there is a thread treating the alternatives here.

Code Select

MbStrLen proc ; src:DWORD (pseudo arg; uncommenting adds frame!) - with lots of inspiration from Lingo, NightWare and Agner Fog
	pop eax			; trash the return address (still on stack, though)
	pop eax			; get the src pointer
	pxor xmm0, xmm0		; zero for comparison
	mov [esp-12], edx		; edx preserved because Masm32 szLen preserves it
	movups xmm1, [eax]	; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
	pcmpeqb xmm1, xmm0	; set bytes in xmm1 to FF if nullbytes found in xmm1
	mov edx, eax		; save pointer to string
	pmovmskb eax, xmm1	; set byte mask in eax
	bsf eax, eax		; bit scan forward
	jne Lt16			; less than 16 bytes, we are done
	and edx, -16		; align initial pointer to 16-byte boundary
	lea eax, [edx+16]		; aligned pointer + 16 (first 0..15 dealt with by movups above)

@@:	pcmpeqb xmm0, [eax]	; ---- inner loop inspired by Lingo, with adaptions -----
	pcmpeqb xmm1, [eax+16]	; compare packed bytes in [m128] and xmm1 for equality
	lea eax, [eax+32]		; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
	por xmm1, xmm0		; or them: one of the mem locations may contain a nullbyte
	pmovmskb edx, xmm1	; set byte mask in edx
	test edx, edx
	jz @B

	sub eax, [esp-4]		; subtract original src pointer
	shl edx, 16		; create space for the ecx bytes
	push ecx			; this push trashes the src pointer (all registers preserved, except eax = return value)
	pmovmskb ecx, xmm0	; set byte mask in ecx (has to be repeated, sorry)
	or edx, ecx		; combine xmm0 and xmm1 results
	bsf edx, edx		; bit scan for the index
	pop ecx
	lea eax, [eax+edx-32]	; add scan index
Lt16:	mov edx, [esp-12]		; for saving edx, we use a free stack slot; -8 is the ret address
	jmp dword ptr [esp-2*4]	; ret address, one arg
MbStrLen endp

BlackVortex · October 15, 2009, 02:01:10 PM

Sorry jj, I can't understand mmx instructions :(

I made an alternative algo and replaced the rep command but I've screwed up it seems, it's much slower and it doesn't find the pattern. But I don't want to fix it because the "first byte scanning" too slow, so ...

My original (100% working) proc takes 4 parameters :
starting offset
search length
offset to the bytes to search for
number of the bytes to search for (can also be 1, algo still works)

And it returns the offset of the first match or -1 if not found

I've made it into a dll and call it from another language, is there some equivalent library I can test against ? Is there an equivalent procedure I can quickly use in its place and compare ?

EDIT: The Boyer–Moore algorithm seems great, is it implemented somewhere so I can test ?

hutch-- · October 15, 2009, 02:17:15 PM

3 versions in the MASM32 library and from memory a member IanB coded and improvement on the fastest of them.

News:

repne scasb

BlackVortex

dedndave

dedndave

hutch--

jj2007

BlackVortex

hutch--